I found an interesting paper on part-of-speech (POS) tagging with word embeddings. Given my current work at Xing and my previous work on hidden Markov models, I find this model interesting for several reasons. As mentioned in previous posts, word embeddings such as word2vec map every word in a dictionary into a d-dimensional, real-valued vector space. In this space, the similarity of two words decreases with the distance between their word vectors. The idea is to treat a document as a d-dimensional time series and to model this real-valued sequence with a hidden Markov model with Gaussian observations. Each state of the hidden Markov model can then be regarded as a POS category. The model can be trained with the Baum-Welch algorithm, and decoding can be performed with the Viterbi algorithm. In my opinion, this model is superior both to classic hidden Markov models with multinomial observations and to word embeddings alone in several ways.
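
To make the decoding step concrete, here is a minimal sketch of Viterbi decoding over a sequence of word vectors with Gaussian observations. It is not the paper's implementation: the diagonal covariance, the function names (`log_gaussian`, `viterbi`) and the toy two-state parameters are all my own assumptions for illustration. In practice the parameters would come from Baum-Welch training and the vectors from a trained embedding.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def viterbi(vectors, start, trans, means, variances):
    """Return the most likely hidden state sequence for a list of word vectors.

    start[k]     : probability of starting in state k
    trans[j][k]  : probability of moving from state j to state k
    means[k], variances[k] : per-state Gaussian parameters (assumed diagonal)
    """
    n_states = len(means)
    # score[k] is the best log probability of any path ending in state k.
    score = [
        math.log(start[k]) + log_gaussian(vectors[0], means[k], variances[k])
        for k in range(n_states)
    ]
    backpointers = []
    for x in vectors[1:]:
        new_score, pointers = [], []
        for k in range(n_states):
            # Best predecessor state for landing in state k at this position.
            best = max(range(n_states),
                       key=lambda j: score[j] + math.log(trans[j][k]))
            pointers.append(best)
            new_score.append(score[best] + math.log(trans[best][k])
                             + log_gaussian(x, means[k], variances[k]))
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy example (made-up numbers): two states, 2-dimensional "embeddings".
means = [[0.0, 0.0], [5.0, 5.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
start = [0.5, 0.5]
trans = [[0.8, 0.2], [0.2, 0.8]]
document = [[0.1, -0.2], [0.3, 0.1], [4.8, 5.2], [5.1, 4.9]]

path = viterbi(document, start, trans, means, variances)
print(path)  # [0, 0, 1, 1]
```

Each entry of the returned path is the index of a hidden state, which in the POS setting would be read as the tag category assigned to the word at that position.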

First, the model's observation space has far fewer dimensions. While a multinomial observation model for text can have thousands of parameters to estimate per state (one dimension per word in the vocabulary), word embeddings need only several hundred dimensions, so the semantic representation of each state can be captured more efficiently. Second, a recent article on Kaggle suggested summing up all word vectors in a document as a representation. However, a single summed or averaged word vector might not be a sufficient representation, since many of the fine differences within a document are lost. In the hidden Markov model we still average vectors, but the average is taken per state. In this way, the model introduces structure. In the POS paper, the temporal model with word embeddings outperformed the multinomial approach by a large margin. It would be interesting to see such a model used in other NLP tasks.
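
The difference between one document-level average and per-state averages can be sketched in a few lines. This is only an illustration under my own assumptions: `document_mean` and `per_state_means` are hypothetical helper names, the numbers are made up, and the state posteriors are given as hard 0/1 assignments to keep the arithmetic readable. In Baum-Welch they would be soft probabilities from the forward-backward pass, and the per-state mean would be the same weighted average.

```python
def document_mean(vectors):
    """Plain average of all word vectors in a document."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def per_state_means(vectors, posteriors):
    """Weighted average of word vectors for each hidden state.

    posteriors[t][k] is the probability of being in state k at position t.
    Assumes every state receives some non-zero total weight.
    """
    n_states = len(posteriors[0])
    n_dims = len(vectors[0])
    means = []
    for k in range(n_states):
        weight = sum(gamma[k] for gamma in posteriors)
        means.append([
            sum(gamma[k] * x[i] for gamma, x in zip(posteriors, vectors)) / weight
            for i in range(n_dims)
        ])
    return means


# Toy document: two words near the origin, two words far away from it.
vectors = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]]
posteriors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

print(document_mean(vectors))                # [6.0, 5.0]
print(per_state_means(vectors, posteriors))  # [[1.0, 0.0], [11.0, 10.0]]
```

The single average lands between the two groups and resembles neither, while the per-state averages keep the two groups apart.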