Friday, October 28, 2016

Reading through the word2vec code: Continuouse Bag Of Word with Negative Sampling

While working at Xing, I got interested in word and document embeddings. In our initial experiments we used the average word vectors to rerank job recommendations for the job recommenders. However, I started implementing and experimenting with my own implementation since I needed to integrate into our current technology stack better (scala) and also needed to find embeddings for whole documents. Therefore, since the original papers [1,2,3] are a little vague about the implementation details, I started reading the source code [4,5] and other details provided by an explanation paper[6] as well as a slide deck. In the following I will go over some details.
For readers that know word to vec I focused on the implementation of continuous bag of words with negative sampling and document embedding.

Continuous Bag of Words

Continuous bag of words uses a small neural network to predict the center word in a sliding window over a text. In the example above we use a version of the Xing slogan "Xing enables professionals to grow" as the window and try to predict the word professionals from the context.

The bottom layer is a matrix of word vectors, that is for every word in a dictionary of all the words in our corpus, we have a low dimensional (typically 100-500 dimensions) vector representing said word.
To compute the hidden layer, we calculate the average word vector over all the words in the context.
In the example we would average the word vectors for: Xing, enables, to and grow. In order to compute the probability for the target word( here professionals ) we use the resulting average
as input to a softmax layer. Using back propagation, the prediction error is applied to the softmax layer AND the word vectors.

From the c code provided original by google we can see that the window size is not defined as the actual size of the window but the number of words to the left and right of the center word. Furthermore, the window size is not fixed but is down sampled (variable b). The reason for down sampling is that the significance of a word in a context should decrease the further it appears.

      b = next_random % window;
      if (cbow) {  //train the cbow architecture
      // in -> hidden
      cw = 0;
      for (a = b; a < window * 2 + 1 - b; a++) if (a != window) {
        c = sentence_position - window + a;
        if (c < 0) continue;
        if (c >= sentence_length) continue;
        last_word = sen[c];
        if (last_word == -1) continue;
        for (c = 0; c < layer1_size; c++) neu1[c] += syn0[c + last_word * layer1_size];
     for (c = 0; c < layer1_size; c++) neu1[c] /= cw;

Afterwards, the algorithm averages the word vectors for the context window and uses the result as the hidden "activation". The hidden activation is therefore the same size as the word vectors.

Subsampling and Preprocessing

Before the actual learning, the sentences are down sampled. That is, words that appear more often in the whole corpus are deleted  with probability proportional to their frequency. In the gensim code, this is implemented as:
for sentence in sentences:
   word_vocabs = [model.vocab[w] for w in sentence if w in model.vocab and
   model.vocab[w].sample_int > model.random.rand() * 2**32]

In each input sentence, the words are subsampled proportional to their frequency. The confusing part might be the (2**32). If we look at how sample_int is created we find:
self.vocab[w].sample_int = int(round(word_probability * 2**32))

Basically, the sampling is mapped into integers for sampling. Since the code is ported from the c version, I suspect this code is an artifact from the port. Normally (in java / scala) I would either stay in double range and use:

Random.nextDouble() < prob[word]

Or if the integer range is required:

Random.nextInt() < prob[word]

A further preprocessing step is to remove all words that appear less then 5 times in the corpus.

Negative sampling

As stated earlier the average word vector is used to predict the center word using a softmax regressor.

However, in problems with large output spaces (for example > 1m labels), computing the desired probabilities becomes computationally challenging since we have to compute the scaler by summing over the complete output space. In the word2vec setup we 1-hot code the words to predict. That means our output vectors is all zeros except the position for the word (that is 1). We only need
the error for the predicted word to start back propagation. However the scaler computation forces us
to compute all the probabilities. Negative sampling, as the name suggests, samples words not in the window as negative samples. In particular, the algorithm samples again proportional to the unigram counts. In that way, common words (e.g. low discriminative power) are more often negative samples
by a similar argument as the idf term in TF-IDF.
    if model.negative:
        # use this word (label = 1) + `negative` other random words not from this sentence (label = 0)
        word_indices = [predict_word.index]
        while len(word_indices) < model.negative + 1:
            w = model.cum_table.searchsorted(model.random.randint(model.cum_table[-1]))
            if w != predict_word.index:
        l2b = model.syn1neg[word_indices]  # 2d matrix, k+1 x layer1_size
        fb = 1. / (1. + exp(-dot(l1, l2b.T)))  # propagate hidden -> output
        gb = (model.neg_labels - fb) * alpha  # vector of error gradients multiplied by the learning rate
        if learn_hidden:
            model.syn1neg[word_indices] += outer(gb, l1)  # learn hidden -> output
        neu1e += dot(gb, l2b)  # save error
The above code first samples the negative word and then computes an approximate
gradient using the positive example and the negative samples. The approximation used by negative sampling is:

The lower part of the code basically computes the sigmoid function based on a pre computed lookup table and computes the gradient as :

  • (1 - p) * learning rate  - For the positive example
  • (0 - p) * learning rate  - For the negative example
The weights for the prediction layer (l1) can then be updated directly using the hidden layers
input. The sum of all the gradients is gives the update for the word vectors, which is applied to all word vectors involved in the initial average of the context words.


Finally to document embeddings. Instead of using only vectors for words in the context and the averaging, we can also add 'labels' such as a document id. A document id can be regarded as a special word or token that is added to every window belonging to that document. Then the vector for this token also contributes to the average and also receives the gradient update. In this way the document token should be close to the content words of its documents.


[1] Mikolav, Chen, Corrado, Dean "Efficient Estimation of Word Representations in Vector Space", ICLR, 2013
[2] Mikolav, Sutskever, Chen, Corrado, Dean: "Distributed Representations of Words and Phrases and their Compositionality", Nips, 2013
[3] Quoc Le and Tomas Mikolov. Distributed Representations of Sentences and Documents.
[4] Original google code fork on github:

No comments:

Post a Comment