## Wednesday, November 16, 2016

### Journal Paper from Thesis: Less Math More Context

The paper "A Case Study with Wild Atlantic Spotted Dolphins, Animal Behavior and Cognition" is published now in the Journal of Animal Behavior and Cognition. The paper includes less technical detail than my thesis and is written for the Biology Crowd. The main focus is on developing software and tools that enables biologists to perform pattern discovery experiments on behavioral data.

## Friday, October 28, 2016

### Reading through the word2vec code: Continuouse Bag Of Word with Negative Sampling

While working at Xing, I got interested in word and document embeddings. In our initial experiments we used the average word vectors to rerank job recommendations for the job recommenders. However, I started implementing and experimenting with my own implementation since I needed to integrate into our current technology stack better (scala) and also needed to find embeddings for whole documents. Therefore, since the original papers [1,2,3] are a little vague about the implementation details, I started reading the source code [4,5] and other details provided by an explanation paper[6] as well as a slide deck. In the following I will go over some details.
For readers that know word to vec I focused on the implementation of continuous bag of words with negative sampling and document embedding.

## Continuous Bag of Words

Continuous bag of words uses a small neural network to predict the center word in a sliding window over a text. In the example above we use a version of the Xing slogan "Xing enables professionals to grow" as the window and try to predict the word professionals from the context.

The bottom layer is a matrix of word vectors, that is for every word in a dictionary of all the words in our corpus, we have a low dimensional (typically 100-500 dimensions) vector representing said word.
To compute the hidden layer, we calculate the average word vector over all the words in the context.
In the example we would average the word vectors for: Xing, enables, to and grow. In order to compute the probability for the target word( here professionals ) we use the resulting average
as input to a softmax layer. Using back propagation, the prediction error is applied to the softmax layer AND the word vectors.

From the c code provided original by google we can see that the window size is not defined as the actual size of the window but the number of words to the left and right of the center word. Furthermore, the window size is not fixed but is down sampled (variable b). The reason for down sampling is that the significance of a word in a context should decrease the further it appears.


b = next_random % window;
if (cbow) {  //train the cbow architecture
// in -> hidden
cw = 0;
for (a = b; a < window * 2 + 1 - b; a++) if (a != window) {
c = sentence_position - window + a;
if (c < 0) continue;
if (c >= sentence_length) continue;
last_word = sen[c];
if (last_word == -1) continue;
for (c = 0; c < layer1_size; c++) neu1[c] += syn0[c + last_word * layer1_size];
cw++;
}
for (c = 0; c < layer1_size; c++) neu1[c] /= cw;
...


Afterwards, the algorithm averages the word vectors for the context window and uses the result as the hidden "activation". The hidden activation is therefore the same size as the word vectors.

## Subsampling and Preprocessing

Before the actual learning, the sentences are down sampled. That is, words that appear more often in the whole corpus are deleted  with probability proportional to their frequency. In the gensim code, this is implemented as:

for sentence in sentences:
word_vocabs = [model.vocab[w] for w in sentence if w in model.vocab and
model.vocab[w].sample_int > model.random.rand() * 2**32]


In each input sentence, the words are subsampled proportional to their frequency. The confusing part might be the (2**32). If we look at how sample_int is created we find:

self.vocab[w].sample_int = int(round(word_probability * 2**32))


Basically, the sampling is mapped into integers for sampling. Since the code is ported from the c version, I suspect this code is an artifact from the port. Normally (in java / scala) I would either stay in double range and use:

Random.nextDouble() < prob[word]

Or if the integer range is required:

Random.nextInt() < prob[word]

A further preprocessing step is to remove all words that appear less then 5 times in the corpus.

## Negative sampling

As stated earlier the average word vector is used to predict the center word using a softmax regressor.

However, in problems with large output spaces (for example > 1m labels), computing the desired probabilities becomes computationally challenging since we have to compute the scaler by summing over the complete output space. In the word2vec setup we 1-hot code the words to predict. That means our output vectors is all zeros except the position for the word (that is 1). We only need
the error for the predicted word to start back propagation. However the scaler computation forces us
to compute all the probabilities. Negative sampling, as the name suggests, samples words not in the window as negative samples. In particular, the algorithm samples again proportional to the unigram counts. In that way, common words (e.g. low discriminative power) are more often negative samples
by a similar argument as the idf term in TF-IDF.

if model.negative:
# use this word (label = 1) + negative other random words not from this sentence (label = 0)
word_indices = [predict_word.index]
while len(word_indices) < model.negative + 1:
w = model.cum_table.searchsorted(model.random.randint(model.cum_table[-1]))
if w != predict_word.index:
word_indices.append(w)
l2b = model.syn1neg[word_indices]  # 2d matrix, k+1 x layer1_size
fb = 1. / (1. + exp(-dot(l1, l2b.T)))  # propagate hidden -> output
gb = (model.neg_labels - fb) * alpha  # vector of error gradients multiplied by the learning rate
if learn_hidden:
model.syn1neg[word_indices] += outer(gb, l1)  # learn hidden -> output
neu1e += dot(gb, l2b)  # save error

The above code first samples the negative word and then computes an approximate
gradient using the positive example and the negative samples. The approximation used by negative sampling is:

The lower part of the code basically computes the sigmoid function based on a pre computed lookup table and computes the gradient as :

• (1 - p) * learning rate  - For the positive example
• (0 - p) * learning rate  - For the negative example
The weights for the prediction layer (l1) can then be updated directly using the hidden layers
input. The sum of all the gradients is gives the update for the word vectors, which is applied to all word vectors involved in the initial average of the context words.

## Doc2Vec

Finally to document embeddings. Instead of using only vectors for words in the context and the averaging, we can also add 'labels' such as a document id. A document id can be regarded as a special word or token that is added to every window belonging to that document. Then the vector for this token also contributes to the average and also receives the gradient update. In this way the document token should be close to the content words of its documents.

## References

[1] Mikolav, Chen, Corrado, Dean "Efficient Estimation of Word Representations in Vector Space", ICLR, 2013
[2] Mikolav, Sutskever, Chen, Corrado, Dean: "Distributed Representations of Words and Phrases and their Compositionality", Nips, 2013
[3] Quoc Le and Tomas Mikolov. Distributed Representations of Sentences and Documents.
[4] Original google code fork on github: https://github.com/dav/word2vec

### ACM Recsys Challange hosted by Xing.

A month or so ago Xing hosted the ACM RecSys Challenge Workshop in Boston, concluding a great challenge on predicting Job Postings for our jobs market. A selected group of participants including the top 5 winners presented their solution in the workshop. In the following I will present some of my impressions on the challenge and the workshop.

 The winners and organisers of the challenge

## The Challenge

The details of the challenge can be found in our paper [1], however I will still describe some of the task for the convenience of the reader.

As some of you know I currently work as a data scientist Xing is a social network in Hamburg for professionals. The network also includes a job market where users can browse one million job offers
from other companies. For the challenge we created a semi- synthetic dataset, handing out anonymised user profiles, job postings as well as interaction data.

The task of the challenge was to predict a users future interactions using the provided dataset.
In order to anonymise the dataset, we created synthetic (random) users and their interactions.

From Xing's perspective the challenge was a huge success with the top solutions providing new insight and solutions into our data. In my mind the "coolest" approach to the challenge was that multiple teams used gradient boosting of trees (xgboost [2]) to predict the users clicks. In both cases
the algorithm was fed user item pairs and their features from the interaction data and then the tree
learned to predict users clicks.  Another team with a high rank used the temporal data to train a recurrent neural network (LSTM).

Happily we are also confirmed to host the next years challenge. For 2017 we plan to release the dataset periodically and evaluate high scoring solutions in our live recommendation system.

All information about the 2016 challenge and the upcoming 2017 challenge can be found at [3]

## References

[1] Abel, Benczur, Kohlsdorf, Larson, Palovics: "RecSys Challenge 2016: Job Recommendations" http://dl.acm.org/citation.cfm?id=2959207

## Thursday, March 24, 2016

### POS Tagging with Word Embeddings

I found an interesting paper on Part-Of-Speech (POS) tagging with word embedding. Through my current work at Xing and my previous work on hidden Markov Models, I find this new model very interesting for several reasons. As mentioned in previous posts, word embeddings such as word2vec map all words in a dictionary into a d-dimensional, real valued vector space. In this new space, the similarity of two words decreases with distance of their respective word vectors. The idea is now to
see a document as a d- dimensional time series and use a hidden Markov model with Gaussian observations to model this real valued sequence. Now each state of the hidden Markov model
can be regarded as a POS category. Now this model can be trained using the Baum Welch algorithm and decoding can be performed using the Viterbi path. In my opinion this model is superior to classic hidden Markov models with multinomial observations and word embeddings alone in several ways.
First, the model's observation space has way fewer dimensions. While the multinomial observation for text can have thousands of parameters to estimate (one dimension / word), the word embeddings
only need several hundred dimensions and the semantic representation of each state can be captured more efficiently. Second, a recent article on kaggle suggested summing up all word vectors in a document as a representation. However, the average word vector might not be a sufficient representation, since a lot of the fine differences in a document are lost. In the hidden Markov model,
we still have to average vectors but the average is taken per state. In this way, the model introduces structure. In the POS paper, the temporal model with word embeddings outperformed the other multinomial approach by a large margin. It would be interesting to see such a model used in different NLP tasks.

## Sunday, February 28, 2016

### Speed Bake-Off Tensorflow / (Numpy | SciPy | Matplotlib) / OpenCV

I recently got interested into scaling some of the algorithms from my PhD thesis. Especially the feature learning part. I used a K-Means to learn a convolutional codebook over audio spectrogram.
Why you can use K-Means to learn something like a convolutional neural net is described in: Coates, Lee and NG as well as my thesis. Anyways in my thesis experiments the bottle neck was the feature learning. So I need a fast clustering implementation as well as a fast convolution in order to resolve the speed issues. The three main implementations I look at are:
• SciKit Learn
• OpenCV
All three provide a python interface but use a low level, speedy C or Fortran implementation under the hood. My goal is to evaluate these three libraries for their performance (speed) under the following conditions:

The library is fastest when testing on a single Macbook. It can use the available parallel devices such as a GPU and multi threading. Furthermore, the library should be fastest on medium sized problems.

I choose these conditions since I plan to design programs that animal communication researchers can use in the field. Meaning, if you are to analyse dolphin communication on a boat in the Bahamas or
you observe Birds in the rainforest, the only thing you might have is your laptop with no internet connection.

### Convolution

The convolution test is based on the three functions:

• Tensorflow: tensorflow.nn.conv2d
• SciPy: scipy.signal.convolve2d
• OpenCV: cv2.filter2d
My test is to compute the edge image using a sobel filter in x-direction and on in y-direction:

I am using the following image and expect the following edge image:

The results in seconds:
• Tensorflow: 0.125 [s]
• SKLearn:     0.049 [s]
• OpenCV:     0.019 [s]
Here it looks like the OpenCV implementation is the fastest. So openCV it is. For the clustering there are a lot of libraries and even the sklearn implementation is very fast.

Update:

After Himanshu's comment I chose to check Tensorflow not including the variable assignment
and then it took 0.021 seconds. However, the image copying into the variable and the session setup
matter in my use case, too. And it is still lower than Open CV. It is also interesting that for these problems, the speed between the libraries is not that different. However, I belief that for larger problems, the tensor flow version that can run on a cluster will show way better performance. Also I don't know right now if the current tensor flow version works with opencl.

## Wednesday, February 17, 2016

### Word Embeddings

Since my grad school interest was focused on machine learning for perception,  I did not notice a class of methods called word embeddings. However, recently I got interested more into text mining so I started to read up on these method and implement some.
A word embedding maps words into a multi dimensional euclidean space in which semantically similar words are close. In other words, each word in your dictionary is represented by a multi dimensional vector.

A word embedding can capture many semantics. For example, on the word2vec webpage,
an embedding from google, it is noticed that:

1. "vector('king') - vector('man') + vector('woman') is close to vector('queen')"
2. "vector('Paris') - vector('France') + vector('Italy') [...] is very close to vector('Rome')"

In other words, the representation is capturing concepts such as gender and captial city.
The two most prominent methods so far seem to be word2vec (by google), glove (by stanford's nlp group). Furthermore, there is a very recent combination of the two called swivel (again by google).

All the methods are based on the idea that the usage of a word gives insight into the words meaning or that similar words are used in a similar context. Here context can be defined as a small neighbourhood. For example, a context definition could be defined as the words to the left of the target word and the three words to the right.

Google's Word2Vec obtains the vectors by using a simple neural net that predicts the target word from it's context (Continuous Bag Of Words) or vice versa (Skip Gram). As usual the neural net can be trained using stochastic gradient descent (back propagation). The neural net can be seen as learning a representation for words (word vectors) and a representation for contexts (context vectors).
Glove sloves a similar problem. However, instead of predicting the actual context around a word, glove learns vectors predictive of the a global coocurance matrix, extracted from the complete corups. Glove is trained using adagrad. In general we solve an optimisation problem of the form:

Basically the two methods differ in the way f(.) and C are defined. C is the target function and we aim to minimise the difference between the prediction and the target, scaled by a function of the co-occurrence count f(.). For word2vec the target function is the pointwise mutual information between two words and for glove it is the log  co-occurrence count.

Both method come with several implementations already. Semantic similarity can be used for several
NLP tasks. For example, sentiment analysis or query expansion.