Friday, October 28, 2016

Reading through the word2vec code: Continuouse Bag Of Word with Negative Sampling

While working at Xing, I got interested in word and document embeddings. In our initial experiments we used the average word vectors to rerank job recommendations for the job recommenders. However, I started implementing and experimenting with my own implementation since I needed to integrate into our current technology stack better (scala) and also needed to find embeddings for whole documents. Therefore, since the original papers [1,2,3] are a little vague about the implementation details, I started reading the source code [4,5] and other details provided by an explanation paper[6] as well as a slide deck. In the following I will go over some details.
For readers that know word to vec I focused on the implementation of continuous bag of words with negative sampling and document embedding.

Continuous Bag of Words

Continuous bag of words uses a small neural network to predict the center word in a sliding window over a text. In the example above we use a version of the Xing slogan "Xing enables professionals to grow" as the window and try to predict the word professionals from the context.

The bottom layer is a matrix of word vectors, that is for every word in a dictionary of all the words in our corpus, we have a low dimensional (typically 100-500 dimensions) vector representing said word.
To compute the hidden layer, we calculate the average word vector over all the words in the context.
In the example we would average the word vectors for: Xing, enables, to and grow. In order to compute the probability for the target word( here professionals ) we use the resulting average
as input to a softmax layer. Using back propagation, the prediction error is applied to the softmax layer AND the word vectors.

From the c code provided original by google we can see that the window size is not defined as the actual size of the window but the number of words to the left and right of the center word. Furthermore, the window size is not fixed but is down sampled (variable b). The reason for down sampling is that the significance of a word in a context should decrease the further it appears.

      b = next_random % window;
      if (cbow) {  //train the cbow architecture
      // in -> hidden
      cw = 0;
      for (a = b; a < window * 2 + 1 - b; a++) if (a != window) {
        c = sentence_position - window + a;
        if (c < 0) continue;
        if (c >= sentence_length) continue;
        last_word = sen[c];
        if (last_word == -1) continue;
        for (c = 0; c < layer1_size; c++) neu1[c] += syn0[c + last_word * layer1_size];
     for (c = 0; c < layer1_size; c++) neu1[c] /= cw;

Afterwards, the algorithm averages the word vectors for the context window and uses the result as the hidden "activation". The hidden activation is therefore the same size as the word vectors.

Subsampling and Preprocessing

Before the actual learning, the sentences are down sampled. That is, words that appear more often in the whole corpus are deleted  with probability proportional to their frequency. In the gensim code, this is implemented as:
for sentence in sentences:
   word_vocabs = [model.vocab[w] for w in sentence if w in model.vocab and
   model.vocab[w].sample_int > model.random.rand() * 2**32]

In each input sentence, the words are subsampled proportional to their frequency. The confusing part might be the (2**32). If we look at how sample_int is created we find:
self.vocab[w].sample_int = int(round(word_probability * 2**32))

Basically, the sampling is mapped into integers for sampling. Since the code is ported from the c version, I suspect this code is an artifact from the port. Normally (in java / scala) I would either stay in double range and use:

Random.nextDouble() < prob[word]

Or if the integer range is required:

Random.nextInt() < prob[word]

A further preprocessing step is to remove all words that appear less then 5 times in the corpus.

Negative sampling

As stated earlier the average word vector is used to predict the center word using a softmax regressor.

However, in problems with large output spaces (for example > 1m labels), computing the desired probabilities becomes computationally challenging since we have to compute the scaler by summing over the complete output space. In the word2vec setup we 1-hot code the words to predict. That means our output vectors is all zeros except the position for the word (that is 1). We only need
the error for the predicted word to start back propagation. However the scaler computation forces us
to compute all the probabilities. Negative sampling, as the name suggests, samples words not in the window as negative samples. In particular, the algorithm samples again proportional to the unigram counts. In that way, common words (e.g. low discriminative power) are more often negative samples
by a similar argument as the idf term in TF-IDF.
    if model.negative:
        # use this word (label = 1) + `negative` other random words not from this sentence (label = 0)
        word_indices = [predict_word.index]
        while len(word_indices) < model.negative + 1:
            w = model.cum_table.searchsorted(model.random.randint(model.cum_table[-1]))
            if w != predict_word.index:
        l2b = model.syn1neg[word_indices]  # 2d matrix, k+1 x layer1_size
        fb = 1. / (1. + exp(-dot(l1, l2b.T)))  # propagate hidden -> output
        gb = (model.neg_labels - fb) * alpha  # vector of error gradients multiplied by the learning rate
        if learn_hidden:
            model.syn1neg[word_indices] += outer(gb, l1)  # learn hidden -> output
        neu1e += dot(gb, l2b)  # save error
The above code first samples the negative word and then computes an approximate
gradient using the positive example and the negative samples. The approximation used by negative sampling is:

The lower part of the code basically computes the sigmoid function based on a pre computed lookup table and computes the gradient as :

  • (1 - p) * learning rate  - For the positive example
  • (0 - p) * learning rate  - For the negative example
The weights for the prediction layer (l1) can then be updated directly using the hidden layers
input. The sum of all the gradients is gives the update for the word vectors, which is applied to all word vectors involved in the initial average of the context words.


Finally to document embeddings. Instead of using only vectors for words in the context and the averaging, we can also add 'labels' such as a document id. A document id can be regarded as a special word or token that is added to every window belonging to that document. Then the vector for this token also contributes to the average and also receives the gradient update. In this way the document token should be close to the content words of its documents.


[1] Mikolav, Chen, Corrado, Dean "Efficient Estimation of Word Representations in Vector Space", ICLR, 2013
[2] Mikolav, Sutskever, Chen, Corrado, Dean: "Distributed Representations of Words and Phrases and their Compositionality", Nips, 2013
[3] Quoc Le and Tomas Mikolov. Distributed Representations of Sentences and Documents.
[4] Original google code fork on github:

ACM Recsys Challange hosted by Xing.

A month or so ago Xing hosted the ACM RecSys Challenge Workshop in Boston, concluding a great challenge on predicting Job Postings for our jobs market. A selected group of participants including the top 5 winners presented their solution in the workshop. In the following I will present some of my impressions on the challenge and the workshop.

The winners and organisers of the challenge

The Challenge

The details of the challenge can be found in our paper [1], however I will still describe some of the task for the convenience of the reader.  

As some of you know I currently work as a data scientist Xing is a social network in Hamburg for professionals. The network also includes a job market where users can browse one million job offers
from other companies. For the challenge we created a semi- synthetic dataset, handing out anonymised user profiles, job postings as well as interaction data.

The task of the challenge was to predict a users future interactions using the provided dataset.
In order to anonymise the dataset, we created synthetic (random) users and their interactions.

From Xing's perspective the challenge was a huge success with the top solutions providing new insight and solutions into our data. In my mind the "coolest" approach to the challenge was that multiple teams used gradient boosting of trees (xgboost [2]) to predict the users clicks. In both cases
the algorithm was fed user item pairs and their features from the interaction data and then the tree
learned to predict users clicks.  Another team with a high rank used the temporal data to train a recurrent neural network (LSTM).

Happily we are also confirmed to host the next years challenge. For 2017 we plan to release the dataset periodically and evaluate high scoring solutions in our live recommendation system.

All information about the 2016 challenge and the upcoming 2017 challenge can be found at [3] 


[1] Abel, Benczur, Kohlsdorf, Larson, Palovics: "RecSys Challenge 2016: Job Recommendations"