Wednesday, February 17, 2016

Word Embeddings

Since my grad school interest was focused on machine learning for perception, I did not notice a class of methods called word embeddings. Recently, however, I became more interested in text mining, so I started to read up on these methods and implement some of them.
A word embedding maps words into a multi-dimensional Euclidean space in which semantically similar words are close together. In other words, each word in your dictionary is represented by a multi-dimensional vector.

A word embedding can capture a surprising amount of semantics. For example, the webpage for word2vec, an embedding method from Google, notes that:

  1. "vector('king') - vector('man') + vector('woman') is close to vector('queen')"
  2. "vector('Paris') - vector('France') + vector('Italy') [...] is very close to vector('Rome')"

In other words, the representation is capturing concepts such as gender and capital city.
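
To make the vector arithmetic concrete, here is a minimal sketch in Python. The vectors are hand-made toy values that I chose for illustration (an assumption, not real word2vec output), and the `analogy` helper is a hypothetical name, not part of any library.

```python
import numpy as np

# Toy, hand-made 3-d vectors (NOT real word2vec output). The axes are
# loosely "royalty", "maleness" and "femaleness".
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.0, 0.2, 0.2]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c, vectors):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = vectors[a] - vectors[b] + vectors[c]
    # Exclude the three input words, as is common for analogy queries.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman", vectors))  # queen
```

With real embeddings the same query runs over the whole vocabulary, and the answer is only approximately right, which is why the quotes above say "close to".
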
The two most prominent methods so far seem to be word2vec (by Google) and GloVe (by Stanford's NLP group). Furthermore, there is a very recent combination of the two called Swivel (again by Google).

All the methods are based on the idea that the usage of a word gives insight into the word's meaning, or put differently, that similar words are used in similar contexts. Here a context can be defined as a small neighbourhood around the target word, for example the three words to its left and the three words to its right.
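
A context window like this is easy to sketch. The function name `context_window` and the example sentence are my own choices for illustration.

```python
def context_window(tokens, index, size=3):
    """Return up to `size` words on each side of tokens[index]."""
    left = tokens[max(0, index - size):index]
    right = tokens[index + 1:index + 1 + size]
    return left + right

tokens = "the quick brown fox jumps over the lazy dog".split()
print(context_window(tokens, 4))
# ['quick', 'brown', 'fox', 'over', 'the', 'lazy']
```

Near the start or end of a sentence the window is simply truncated, so the first word only has a right-hand context.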

Google's word2vec obtains the vectors by using a simple neural net that predicts the target word from its context (Continuous Bag of Words) or vice versa (Skip-gram). As usual, the neural net can be trained using stochastic gradient descent, with the gradients computed by backpropagation. The neural net can be seen as learning a representation for words (word vectors) and a representation for contexts (context vectors).
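
The difference between the two variants is mostly the direction of the prediction. The sketch below only builds the training examples each variant would see. It does not train a network, and `training_examples` is a hypothetical helper name.

```python
def training_examples(tokens, size=2):
    """Build CBOW and skip-gram examples from one tokenised sentence.

    CBOW:      (list of context words) -> target word
    Skip-gram: target word -> one context word per example
    """
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - size):i] + tokens[i + 1:i + 1 + size]
        cbow.append((context, target))
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram

cbow, skipgram = training_examples("the cat sat on the mat".split(), size=1)
print(cbow[1])       # (['the', 'sat'], 'cat')
print(skipgram[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
```
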
GloVe solves a similar problem. However, instead of predicting the actual context around a word, GloVe learns vectors that are predictive of a global co-occurrence matrix extracted from the complete corpus. GloVe is trained using AdaGrad. In general, both methods can be seen as solving an optimisation problem of the same form.
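
Written out, that general form can be sketched roughly as follows. The notation is an assumption on my part: w_i is the word vector, c_j the context vector, x_{ij} the co-occurrence count of word i and context j, and I use a squared difference and leave out bias terms for brevity.

```latex
\min_{w,\,c} \; \sum_{i,j} f(x_{ij}) \left( w_i^{\top} c_j - C(i, j) \right)^{2}
```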

Basically, the two methods differ in how f(.) and C are defined. C is the target function, and we aim to minimise the difference between the prediction and the target, weighted by a function f(.) of the co-occurrence count. For word2vec the target is (implicitly) the pointwise mutual information between the two words, and for GloVe it is the log co-occurrence count.
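
To see how the two targets differ, here is a small sketch that computes both from a toy corpus, together with a GloVe-style weighting function. The helper names are hypothetical. The defaults `x_max=100` and `alpha=0.75` are the values reported in the GloVe paper, and the PMI here is the plain, unshifted version.

```python
import math
from collections import Counter

def cooccurrence_counts(tokens, size=2):
    """Count (word, context) pairs that are at most `size` words apart."""
    counts = Counter()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - size), min(len(tokens), i + size + 1)):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts

def pmi(counts, word, context):
    """Pointwise mutual information; assumes the pair was observed."""
    total = sum(counts.values())
    word_total = sum(n for (w, _), n in counts.items() if w == word)
    context_total = sum(n for (_, c), n in counts.items() if c == context)
    return math.log(counts[(word, context)] * total / (word_total * context_total))

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe-style weight: grows with the count, capped at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

counts = cooccurrence_counts("the cat sat on the mat".split(), size=1)
pair = ("cat", "sat")
print(math.log(counts[pair]))     # GloVe-style target: log co-occurrence count
print(pmi(counts, *pair))         # word2vec-style target: PMI
print(glove_weight(counts[pair])) # weight given to this pair
```

On a corpus this small the numbers mean little, but the sketch shows that both targets are derived from the same co-occurrence counts.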

Both methods already come with several implementations. Semantic similarity can be used for several NLP tasks, for example sentiment analysis or query expansion.
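
As a rough idea of how query expansion could use such vectors, the sketch below adds the nearest neighbour of each query term. The vectors are again hand-made toy values (an assumption, not trained embeddings), and `most_similar` and `expand_query` are hypothetical helpers.

```python
import numpy as np

# Toy, hand-made vectors for illustration only.
vectors = {
    "car":        np.array([0.9, 0.1, 0.0]),
    "automobile": np.array([0.85, 0.15, 0.05]),
    "truck":      np.array([0.7, 0.3, 0.1]),
    "banana":     np.array([0.0, 0.1, 0.9]),
}

def most_similar(word, vectors, k=1):
    """Return the k words with the highest cosine similarity to `word`."""
    query = vectors[word]

    def cosine(v):
        return float(query @ v / (np.linalg.norm(query) * np.linalg.norm(v)))

    others = [w for w in vectors if w != word]
    return sorted(others, key=lambda w: cosine(vectors[w]), reverse=True)[:k]

def expand_query(terms, vectors, k=1):
    """Append the k nearest neighbours of every known term to the query."""
    expanded = list(terms)
    for term in terms:
        if term not in vectors:
            continue  # unknown words are kept but not expanded
        for neighbour in most_similar(term, vectors, k):
            if neighbour not in expanded:
                expanded.append(neighbour)
    return expanded

print(expand_query(["cheap", "car"], vectors))
# ['cheap', 'car', 'automobile']
```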
