Friday, March 8, 2019

Word2Vec Implementation in Rust: Continuous Bag of Words


I recently started to get serious about learning Rust and needed a small to medium sized project to practice on.
One method I have explained and walked through code for on this blog before is Word2Vec, a method to learn word embeddings. In this post I will explain the method again, this time using my new Rust code on GitHub.

Word Similarity - Basic Idea 

In the following, I will describe the basic idea behind continuous bag of words (CBOW), one way to estimate word vectors. Word vectors are a dense, vectorised word representation that captures similarity between words in Euclidean space: there is one vector per word in the vocabulary, and the Euclidean distance between the vectors of two similar words is smaller than the distance between the vectors of two less similar words. The assumption in Word2Vec is that words appearing in similar contexts are similar. We shift a sliding window over the text and try to predict the center word of the window from the word vectors of the surrounding words in the window.

// A window over a document: a slice of word ids plus the position
// of the word we want to predict (the center of the window).
pub struct Window<'a> {
    pub words:       &'a [usize],
    pub predict_pos: usize
}

// A document is simply a sequence of word ids.
pub struct Document {
    pub words: Vec<usize>
}

impl<'a> Window<'a> {

    // The label is the center word we try to predict.
    pub fn label(&self) -> usize {
        self.words[self.predict_pos]
    }

}

impl Document {

    // Extract the window of `size` words on each side of `pos`.
    // Near the edges of the document no full window exists, so we return None.
    pub fn window(&self, pos: usize, size: usize) -> Option<Window> {
        let start = pos.saturating_sub(size);
        let stop  = usize::min(pos + size + 1, self.words.len());
        if stop - start == 2 * size + 1 {
            Some(Window {words: &self.words[start .. stop], predict_pos: size})
        } else {
            None
        }
    }

}

The code above defines a document as a vector of word ids and lets us shift a sliding window over it. The window holds a slice rather than a new vector, which is convenient since nothing is copied. In the following, the center word of the window is called the label $l$ and the other words are called the context: $c_1 .. c_m$.
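
As a quick illustration with made-up word ids, sliding a window of size 2 over a short document could look like this:

// Hypothetical usage of Document::window; the word ids are made up.
let doc = Document { words: vec![3, 17, 42, 5, 9, 23, 8] };
let size = 2;
for pos in 0 .. doc.words.len() {
    if let Some(window) = doc.window(pos, size) {
        // window.words is the full window of 2 * size + 1 word ids,
        // window.label() is the center word we want to predict.
        println!("window = {:?}, label = {}", window.words, window.label());
    }
}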

Word Vectors

Now that we have defined the window, we can start implementing a representation for the word vectors. There will be one vector with $d$ dimensions, or columns, per word in the dictionary.
In code we store all word vectors in one flat vector of f64 values. We can then access the $i$-th vector as the slice $[i * cols .. (i + 1) * cols]$.



use rand::Rng;
use serde::{Serialize, Deserialize};

// A flat store of row vectors: the vector for word i lives at
// weights[i * cols .. (i + 1) * cols].
#[derive(Serialize, Deserialize)]
pub struct ParameterStore {
    pub cols: usize,
    pub weights: Vec<f64>
}

impl ParameterStore {

    // Initialise `rows` vectors of `cols` dimensions with Uniform([-0.5, 0.5]) / cols.
    pub fn seeded(rows: usize, cols: usize) -> ParameterStore {
        let mut rng = rand::thread_rng();
        let mut weights = Vec::new();
        for _i in 0 .. (rows * cols) {
            let uniform = (rng.gen_range(0.0, 1.0) - 0.5) / cols as f64;
            weights.push(uniform);
        }
        ParameterStore { cols, weights }
    }

    // The vector for word i, returned as a slice.
    pub fn at(&self, i: usize) -> &[f64] {
        let from = i * self.cols;
        let to = (i + 1) * self.cols;
        &self.weights[from .. to]
    }

    // Add a gradient to the vector for word i.
    pub fn update(&mut self, i: usize, v: &Vec<f64>) {
        let from = i * self.cols;
        for col in 0 .. self.cols {
            self.weights[col + from] += v[col];
        }
    }

    // Write the element-wise average of the vectors at `indices` into `result`.
    pub fn avg(&self, result: &mut Vec<f64>, indices: Vec<usize>) {
        for col in 0 .. self.cols {
            result[col] = 0.0;
            for row in indices.iter() {
                let from = row * self.cols;
                result[col] += self.weights[col + from];
            }
            result[col] /= indices.len() as f64;
        }
    }

}


We initialise each entry with a random number drawn from $\frac{Uniform([-0.5, 0.5])}{d}$, which is the standard way to initialise Word2Vec. The update method simply adds a gradient to a word vector. There is also a method to compute the average of multiple word vectors, which we will need later during learning.
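
To make the access pattern concrete, here is a small usage sketch with made-up sizes (1000 words, 50 dimensions):

// Hypothetical usage of ParameterStore.
let mut embed = ParameterStore::seeded(1000, 50);

// Read the vector for word 42 as a slice.
let v42 = embed.at(42);
assert_eq!(v42.len(), 50);

// Average the vectors of a few context words into a preallocated buffer.
let mut h = vec![0.0; 50];
embed.avg(&mut h, vec![3, 17, 23]);

// Add a gradient to the vector for word 42.
let gradient = vec![0.01; 50];
embed.update(42, &gradient);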

Putting It All Together: Computing the Gradients

So now we can extract windows from the text and average the word vectors of the context words.
The averaged vector is our hidden layer's activation. From there we compute the prediction and then back-propagate the error. We therefore need two sets of parameters: one for the word vectors, initialised as above, and one for the predictions, which is initialised to zeros.

#[derive(Serialize, Deserialize)]
pub struct CBOW {
    pub embed:   ParameterStore, // the word vectors we are learning
    pub predict: ParameterStore  // the output vectors used for prediction
}
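
The training code below also calls two small helper methods on CBOW, ids and embedding, that are not shown in this post. Assuming ids collects the context word ids (every word in the window except the center) and embedding averages their vectors with ParameterStore::avg, they could look roughly like this:

impl CBOW {

    // Sketch: collect the ids of the context words, i.e. every word
    // in the window except the one at predict_pos.
    pub fn ids(&self, window: &Window) -> Vec<usize> {
        let mut ids = Vec::new();
        for pos in 0 .. window.words.len() {
            if pos != window.predict_pos {
                ids.push(window.words[pos]);
            }
        }
        ids
    }

    // Sketch: average the embedding vectors of the given context words.
    // This average is the hidden layer activation h.
    pub fn embedding(&self, ids: Vec<usize>) -> Vec<f64> {
        let mut h = vec![0.0; self.embed.cols];
        self.embed.avg(&mut h, ids);
        h
    }

}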

Given the averaged context embedding $h$ and the prediction vector $w_i$ for word $i$, the forward pass of the network is:

$a_i = \sigma(h * w_i)$.

We are predicting the center word of the window, which means we model the output layer as a multinomial where the center word is one-hot encoded. So the output should be $1$ for the center word
and $0$ for all others. The error for the center word is then:

$e_i = -\log(a_i)$

and for all other words it is

$e_i = -\log(1 - a_i)$.

The gradient step is simply $\delta_i = (y_i - a_i) * \lambda$, with $y_i$ being the target ($1$ or $0$) and $\lambda$ the learning rate.

    // Gradient and loss for a single output word: `truth` is 1.0 for the
    // center word and 0.0 for a negative sample. Returns the update scale
    // `d` and the loss `e`.
    fn gradient(&self, label: usize, h: &Vec<f64>, truth: f64, rate: f64) -> (f64, f64) {
        let w = self.predict.at(label);
        let a = sigmoid(dot(&h, &w));
        let d = (truth - a) * rate;
        let e = -f64::ln(if truth == 1.0 {a} else {1.0 - a});
        (d, e)
    }
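
The gradient function relies on two numeric helpers, sigmoid and dot, which live elsewhere in the repository and are not shown in this post; minimal versions could look like this:

// Minimal helper sketches; the repository versions may differ.
// Logistic function sigma(x) = 1 / (1 + e^(-x)).
fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + f64::exp(-x))
}

// Dot product of two equally long vectors.
fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}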

The gradient applied to the output parameters is $\delta_i * h$. The gradient for the embeddings is $\delta_i * w_i$, which is applied to all context words. So all words appearing in the same context are updated with the same gradient, which in turn makes them more similar, the main goal of word embeddings. There is one more detail: normally we would have to compute the losses over the complete dictionary, but for a vocabulary of a million words that would take far too long. Instead we use negative sampling: we compute the gradient for the center word as a positive example and then sample a few words from the unigram distribution as negative examples.
The final learning step is shown below:

    // One learning step with negative sampling: a positive update for the
    // center word and `n_samples` negative updates for words drawn from the
    // unigram distribution. Returns the accumulated loss.
    pub fn negative_sampeling(&mut self, window: &Window, rate: f64, n_samples: usize, unigrams: &Sampler) -> f64 {
        let mut gradient_embed = vec![0.0; self.embed.cols];
        // Hidden layer: the averaged embeddings of the context words.
        let h = self.embedding(self.ids(window));
        // Positive example: the center word of the window.
        let (pos, pos_err) = self.gradient(window.label(), &h, 1.0, rate);
        let mut error = pos_err;
        let mut gradient_pos_predict = vec![0.0; self.predict.cols];
        for i in 0 .. self.embed.cols {
            gradient_embed[i] += pos * self.predict.at(window.label())[i];
            gradient_pos_predict[i] += pos * h[i];
        }
        self.predict.update(window.label(), &gradient_pos_predict);
        // Negative examples: words sampled from the unigram distribution.
        for _sample in 0 .. n_samples {
            let token = unigrams.multinomial();
            let (neg, neg_error) = self.gradient(token, &h, 0.0, rate);
            error += neg_error;
            let mut gradient_neg_predict = vec![0.0; self.predict.cols];
            for i in 0 .. self.embed.cols {
                gradient_embed[i]       += neg * self.predict.at(token)[i];
                gradient_neg_predict[i] += neg * h[i];
            }
            self.predict.update(token, &gradient_neg_predict);
        }
        // Apply the accumulated embedding gradient to every context word.
        let ids = self.ids(window);
        for i in 0 .. ids.len() {
            self.embed.update(ids[i], &gradient_embed);
        }
        error
    }
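
To give an idea of how this is driven, a training pass could slide a window over every position of every document and call negative_sampeling on each complete window. The train_epoch function below is only a sketch; the Sampler holding the unigram noise distribution and the hyper-parameter values come from the rest of the repository and are assumed here:

// Sketch of one training epoch; `Sampler` construction is not shown in this post.
pub fn train_epoch(model: &mut CBOW, docs: &[Document], unigrams: &Sampler,
                   window_size: usize, rate: f64, n_samples: usize) -> f64 {
    let mut total_error = 0.0;
    let mut n_windows = 0usize;
    for doc in docs {
        for pos in 0 .. doc.words.len() {
            if let Some(window) = doc.window(pos, window_size) {
                total_error += model.negative_sampeling(&window, rate, n_samples, unigrams);
                n_windows += 1;
            }
        }
    }
    // Average loss per window, useful to monitor convergence across epochs.
    total_error / n_windows as f64
}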

Results

I trained the network on Quora questions. Searching for the vectors closest to a query word returns related words:

Searching: India
kashmir: 11.636631273392045
singapore: 12.155846288647625
kerala: 12.31677547739736
dubai: 12.597659271137639
syria: 12.630908898646624
pradesh: 12.657158343619573
pok: 12.667834993874745
malaysia: 12.708228117809442
taiwan: 12.75190112716203
afghanistan: 12.772290465531222
africa: 12.772372266369985
bjp: 12.802207718902666
karnataka: 12.817502941769622
europe: 12.825095372204384
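
The nearest-neighbour search itself is not part of the snippets above. A simple version computes the Euclidean distance between the query word's embedding and every other embedding and sorts by it; the sketch below assumes a vocab slice mapping word ids back to strings, which is not shown in this post:

// Euclidean distance between two vectors of equal length.
fn euclidean(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b.iter()).map(|(x, y)| (x - y) * (x - y)).sum::<f64>().sqrt()
}

// Sketch: return the `k` words whose embeddings are closest to `query_id`.
// `vocab` (id -> word string) is an assumption for illustration.
fn nearest(model: &CBOW, vocab: &[String], query_id: usize, k: usize) -> Vec<(String, f64)> {
    let q = model.embed.at(query_id);
    let mut dists: Vec<(String, f64)> = (0 .. vocab.len())
        .filter(|&i| i != query_id)
        .map(|i| (vocab[i].clone(), euclidean(q, model.embed.at(i))))
        .collect();
    dists.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());
    dists.truncate(k);
    dists
}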
