Thursday, October 10, 2013

My thoughts on the value of data processing in practical applications.

Just a quick post about a few thoughts about data engineering vs machine learning.
While I definitely like complex algorithms for Machine Learning, my recent results on applications such as Dolphin Communications or Typing Error Detection show me that most of the time the data preprocessing is doing a lot of the work while I get away with more or less simple models such as Gaussian Mixture Models, Hidden Markov Models, Decision Trees or Random Forests (okay the last one just works great in general). Now the question is where the tradeoff is between data engineering and actual learning. Deep Learning promises to extract features on their own and are on the learning heavy sides of things. I think Speech Recognition is an example that is traditionally approached with carefully crafted features such as Mel Cepstrum Components. I observe the same difference for carefully modeling such as Hidden Markov Models and more or less structureless models.
So for me it seems fitting an existing algorithm to new domains under heavy pre processing can lead to good results faster than the search for new models. Maybe I should call it Machine Learning Prototyping. While developing learning methods to solve new problems is the higher goal, whenever one needs a learning "prototype" for a specific domain, hacking the data and some existing model might be sufficient.

Sunday, July 14, 2013

Baum Welch Initialisation: Flat Start vs Randomisation vs Viterbi Training

As most of the readers will know Baum Welch is a parameter estimation Algorithm for Hidden Markov Models. Given a initial configuration of the HMM it will converge to a locally optimal parameter configuration. The important bit is that the initial configuration will determine "how good" the final parameter set will be. To my knowledge there are three initialisation methods two are implemented in the HTK3 toolkit:

Viterbi Training
Viterbi training is implemented in HTK3 in the tool HInit. First the data is uniformly segmented
with the number of segments equal to the number of states. Then we get the optimal path through the HMM using Viterbi + Backtracking. Using the path we reestimate the model.

Flat Start
Flat start simply initialises the model with the observations uniform. For example we estimate a uniform Gaussian from all data available and use it for all initial distributions. In both cases, Flat Start and Viterbi Training we set the initial transition probabilities ourselves. Flat Start is implemented in HTK3 in the HCompV tool. The problem of Viterbi training is that it requires labeled data, while Flat Start does not.

Random Initialisation with Random Restarts 
Viterbi Training needs labeled data and Flat Start is a simple initialization but can give a bad initial estimate. Since Baum Welch is dependent on the initial starting condition, we can also use random initializations with random restarts. In that way we can initialize the transition matrix by setting all possible transitions to random values. The Gaussians can be initialized as in Flat Start and then we add some small random number to the mean. We construct multiple random initializations and run Baum Welch. In the end we keep the model with the best likelihood. We can also use a random subset for the initial Gaussians. Unfortunately there is no implementation for this method in HTK. In my experiments this initialization works best in the scenario where no labels are available and Flat Start gives a bad estimates.

Monday, July 8, 2013

On different depictions of Hidden Markov Models

During my Diploma thesis I read most literature about Hidden Markov Models from the speech community. Since I arrived at Tech and started reading more about the Machine Learning perspective on these models I noticed a difference in depiction of Hidden Markov Models. In this post I want to argue why I think the "old" speech depiction is a bit more expressive.

The old depiction is a state machine. Each state of the HMM is depicted as a circle and the arrows
show how one can "travel" through the model. I saw this depiction in the classic Rabiner papers.

In the second representation is a Hidden Markov Model as a Bayes Network. There are two kinds of nodes. The one at the top show a discrete variable, only dependent on it's successor. Encoded in this node are the transition probabilities. Or in other words the state machine is encoded in each of these nodes. There are as many transition nodes as there are samples in the time series. The nodes at the bottom are the observation probabilities which are not seen in the model from the speech community. 

I think if one only publishes the second depiction it does not tell much about the model itself. One already assumes that there are observation distributions in an HMM so why depict them. Second just drawing as many nodes as there are samples is weird for me for two reasons. The first reason is that we do not depict anything about the transition model, which is the only thing to model in an HMM. The second reason is that the number of samples or time slices does not matter at all for an HMM. So I think  drawing the state machine is a way better way of showing your model then making the ML depiction that does not tell much of your specific model except that it is an HMM. Just to make my point more clear. The second model, even when specifying in the text that we use 3 states could be the top model or the one at the bottom, without the reader knowing.

However, explaining inference algorithms such as variable elimination is easier in the middle model, but for specific models I prefer the state machine depiction.

Tuesday, July 2, 2013

log-math: Dirichlet Distribution

A long time since my last post again. But I am working on one for Gaussian Processes but I am procrastinate there. So I decided to write another one.

As my series on Expectation Maximisation ("Thoughts on EM") evolved from research, implementations and readings, I will start a new series posting probability functions in their log likelihood form. So in the first post I will post my "log calculations" for the Dirichlet Distribution.
The Dirichlet Distribution is often used as a prior on a multinomial distribution. Using that prior one can  estimate Multinomial Distributions more stable, meaning your probability estimates do not go wild. In simple words the Dirichlet Distribution is a distribution over a Multinomial Distribution with parameters
alpha. The alpha parameter is a vector of the same size as the Multinomial Distribution. The probability
density function is:

The scaler can be expressed using the beta function:

The gamma is well the gamma function,  a numerical stable log implementation is available in the submission: "Algorithm 291: logarithm of gamma function". So let us start by converting the actual distribution to log:

Converting the scaler is also straight forwards:

And that is actually all already :D. The log - gamma function can be implemented from the paper :D

Thursday, March 14, 2013

Thoughts on EM - Clustering: kmeans++

In this post I will talk a bit about initialization of the EM algorithm for Gaussian Mixtures and numerical  stability. It might be the last post on clustering since I have a stable enough version for me now.

The initialization I want to talk about is kmeans++, an algorithm for initializing classic kmeans, which can also be used for Gaussian Mixtures. The high level bit is instead of starting with random samples, we prefer samples that are far from the already chosen centers.

However if we us the furthest point algorithm, we are more sensible to outliers since the furthest points might not belong to any cluster. Kmeans++ is combining ideas for classic kmeans initialization (pick k points at random) and furthest point. We do so by defining a probability distribution that assigns a probability to each sample being a good initial mean. The distribution is based on the distance to the closest center that is already assigned.

So the algorithm for initialization picks the first center at random and then samples the next center based on the probabilities. As one can see if the distance of a sample to the closest distance is large the probability will be large, too and vice versa. We stop drawing samples when we have k centers. Based on these centers we can run classic kmeans or Gaussian Mixture EM.

In my experiments kmeans++ performed great as an initialization for EM while being way faster then hierarchical clustering.

Thursday, February 14, 2013

A nice linear algebra library for C++

Normally when coding in C++ you will use the interface of LAPACK or the boost library uBlas. However these are not really nice to use. I found an alternative and after spending an afternoon with it I must say I am impressed. The library is called Armadillo. The nice things are the initialization methods provided, the element access as well as the general operators. For example if I was to create
a 2 x 2 matrix with some values I can simply do this:

matrix << 1 << 2 << endr
           << 0 << 1 << endr;

Accessing an element is simply:

matrix(i, j)

And operations can be performed as:

X =  A * B
X = A + B

The functionality seems to include everything I needed so far:


In the end there is a speed comparison in which armadillo is always the fastest.

Monday, February 11, 2013

Thoughts on EM - Clustering: Growing A Mixture

Since I am still working a lot on my clustering algorithms, especially Expectation Maximization for Gaussian Mixture Models, I will share another thought on how to improve EM's performance. In general the problem is that EM will converge on a local maximum. That menas when we start with initial bad gaussians, we will end up on a local maximum close to where we started. Solutions are run k-means 10 x and use the clusters as the inital means (like weka) or cluster hierarchical first (see older blog post). The problem with k-means is that it is localized, too. The problem with the hierarchical version is that it is slow. In the end the initialization was slower for me then the actual clustering. Something I found today is "growing" the mixture incrementally [1].

Just to remind you EM is a two stage algorithm. First we calculate the probabilities
for each data point using the current model parameters (E -step, expectation), and then use these probability estimates to refine the parameters (M-step, maximization). Now the trick is to split the gaussian with the highest rank (highest weight or prior) after each M-step and delete all components
with a too small prior or a too small variance. We repeat that process until we reach the desired k.

[1] Kevin Murphy: Machine Learning: a Probabilistic Perspective, MIT Press, 2012

Monday, February 4, 2013

Conditional Random Fields: I finally got it!

As a combination of trying again and again and the classes I take I was able to wrap my head around why CRFs are great and how they work. At first I was a bit sceptic since I love my HMMs but maybe in the future I give it a try. I think the main reason I got what these models do is that there is a very deep discussion on Markov Random Fields in our Probabilistic Graphical Models lecture (by Jim Rehg) CS 8803 at Georgia Tech. Since I will go a bit deeper in the model itself in a second here is the important bit. The first difference is that Random Fields are a distribution over potentials which are just any positive function defined over a clique of variables in the model. The joint probability is then defined as a scaled multiplication of these potentials. In Hidden Markov Models we have a graph structure over probability distributions instead. The difference of a Conditional Random Field to a Markov Random Field is that it models the conditional probability instead of the joint probability. The only difference however seems to be the definition of the scaler.

Now to the details. In contrast to a Bayes Network or a Hidden Markov Model a Random field is a undirected model. So instead of talking about a parent child relationship we talk about interaction between potentials. A potential is a function defined over variables of a clique. Just as a reminder,
in Computer Science a clique is a fully connected subgraph. For example if we were to label each frame in a sequence we could use the model below. The Y are the labels over time and X is the whole time series.

The potential function is defined as a three- clique over the last label, the current label and the whole time series.

While the transition probability (label sequence) is the same as in Hidden Markov Models as it models a  
first order Markov Chain, the observations are fully accessible. In a Hidden Markov Model the observations have to be independent by definition. So now can define more powerful observations by designing potential functions that take previous values into account. Furthermore, we can use the labels in order to behave differently from label to label. The joint probability is then defined over these potentials. However for labeling the sequence we are not interested in the joint probability but in the conditional as P(Y | X) instead of P(Y, X). The conditional probability is then defined as:

And this is a Conditional Random Field. In my opinion and from what I read, this model is more powerful then Hidden Markov Models but has some drawbacks. The first one is that in order to train it, we need fully labeled data. While in a Hidden Markov Model, we can automatically distribute the data over the states by inferring the label sequence, this is not possible here. The reason is that instead of a Expectation Maximisation training, most people seem to use a Gradient Based optimization on the log-likelihood. Acquiring labels on that level might not be possible in all cases. The other thing I read is that these models need a lot more training data for training and the training is much slower. In the end that means use cases where only a few training instances can be collected this model won't be able to converge.  

Magic Summon: Building false positive free Gestures from just negative examples

Thad and I have a new publication in the Journal Of Machine Learning Research. The nugget for me in this paper is that we were able to build a tool that automatically suggests gestures that won't false trigger in everyday life. The paper can be found in Volume 14. I attached the abstract for everyone interested:

"Gestures for interfaces should be short, pleasing, intuitive, and easily recognized by a computer. However, it is a challenge for interface designers to create gestures easily distinguishable from users' normal movements. Our tool MAGIC Summoning addresses this problem. Given a specific platform and task, we gather a large database of unlabeled sensor data captured in the environments in which the system will be used (an "Everyday Gesture Library" or EGL). The EGL is quantized and indexed via multi-dimensional Symbolic Aggregate approXimation (SAX) to enable quick searching. MAGIC exploits the SAX representation of the EGL to suggest gestures with a low likelihood of false triggering. Suggested gestures are ordered according to brevity and simplicity, freeing the interface designer to focus on the user experience. Once a gesture is selected, MAGIC can output synthetic examples of the gesture to train a chosen classifier (for example, with a hidden Markov model). If the interface designer suggests his own gesture and provides several examples, MAGIC estimates how accurately that gesture can be recognized and estimates its false positive rate by comparing it against the natural movements in the EGL. We demonstrate MAGIC's effectiveness in gesture selection and helpfulness in creating accurate gesture recognizers.

Thursday, January 31, 2013

First Coding Dojo on Artificial Intelligence

Last week on Wednesday I kickstarted the first AI coding dojo at our lab with a group of interested friends. The goal is to organize a group of interested students with different backgrounds and train to implement Artificial Intelligence algorithms. In a coding dojo the goal is not to build working software but to improve our coding and style. Why especially on Artificial Intelligence? Well in my opinion implementing AI algorithms is much different from implementing regular software. While doing so we also try to improve our coding and style and implement everything in a test driven manner. The setup is as follows. One person brings an interesting algorithm that is the topic of the dojo. It should not be too hard since we restricted a session to 2 hours. In the first session we implemented Expectation Maximization for a Mixture of Gaussians. The participants of the dojo gather around a projector or large screen connected to a single computer. Two participants sit on the computer and implement part of the problem using pair programming. The rest of the group watches what happens on the screen and gives helpful comments when needed. The two programmers switch after 10 minutes. 

The first dojo went slow but I think people learned from each other and enjoyed it. The next dojos will include advanced search algorithms (D*), collaborative filtering (Matrix Factorization) and the cognitive architecture SOAR.

The status of this week is in:

Tuesday, January 29, 2013

Granger Causality

Granger causality allows to test if a variable X is causally linked to another variable Y. We say Y G-causes X if predicting X with only past information of X gives worse results then using past values of X and Y. Clearly X and Y have to be defined over time. Which means X and Y are time series. If X is causally linked with Y, then we can can model X as the following linear Auto Regressive model:

The prediction is the sum of the AR model for X, the AR model for Y and an error term. The idea is if the error for the prediction is minimal using the above predictor, then Y G-causes X if the weight for y is significantly different from 0. As one can see if the weight is 0 then Y has no effect on the predictor at all, so
it is not causing X. So if we want to test if Y G-causes X given some data, we estimate the weights and perform a F-Test of the weight vector of Y and 0. So our null-hypothesis is:

In conclusion we can test if two variables X and Y are causally linked by performing a statistical test on the weights (or influence) of one variable to another.

Tuesday, January 22, 2013

Thoughts on EM - Clustering: Hierarchical Agglomerative Clustering

A common clustering algorithm for continues data is using Expectation-Maximisation of Gaussian Mixture Models. In short a Gaussian Mixture Model is a joint distribution of k- Gaussians. Each of them has a prior probability of picking that gaussian. The joint probability of that model is given as:

Given some data the task is to set the parameters of the model (priors, means and covariances) such that  the above likelihood is maximised. In my opinion the hardest problem is to find "good" initial parameters. EM will settle on some global maximum. So if the initial parameters are close to one of the very bad local optima the algorithm will converge on a bad model. The prior probability is not really a problem since it can be set to 1 / k for all priors. The initial normal distribution is more difficult. One solution is to estimate global mean and variance of the data first and then add some random noise to it. However this does not give a good initial estimate, since everything will be close to the global mean. Another option is to run k-means first and use the clusters as an initial estimate that is later refined by Expectation-Maximisation. However the problem remains. Now we have to come up with some initial estimate for k- means. Today I found a cool paper describing a method called Hierarchical Agglomerative Clustering [1]. For a data set of size N it starts by defining a single cluster for each instance. In each iteration we merge the two clusters that are closest, given some distance measure. Using that method the numbers of clusters will decrease by one in each iteration. We simply stop when the number of clusters is equal to the number of mixture components for our mixture gaussian. In my opinion that method is nice since defining the initial set of clusters is always possible and deterministic. That means for all data sets we can expect the same clusters. 


I implemented the algorithm on some demo data I made. Before I got the right result most of the time
while using the K-means initialization or the random one. Now I get a very good result in all runs.


[1] Marina Meila and David Heckerman: "An Experimental Comparison of Several Clustering and Initialization Methods", Machine Learning 42, Jab 2001

Wednesday, January 2, 2013

Forward Recursion of Variable Duration Hidden Markov Models

Somewhen last October / November I started wrapping my head around variable duration Hidden Markov Models. Especially the forward recursion. Again the formulas and most of my Hidden Markov Model knowledge are from the Rabiner paper [1]. In regular Hidden Markov Models the duration of staying in a state is modeled by a geometric distribution, imposed by the structure of HMMs:

However, in practice the duration is not geometrically distributed. Variable duration Hidden Markov Models fix this weakness by allowing each state to generated multiple observations at once. The length of this generated sequence is modeled by a non parametric distribution: p(d). Furthermore these models forbid the self transition: aii. Now the forward recursion becomes the following:

The first difference is that we have to sum out the duration on top of the states, since we don't know
how long the observed sequence for that state is. The recursion is then not only looking at the last state but looks back d steps. So the model is not truly "markovian" but "semi markovian". The other difference is that the observation probability is not only the current frame with the current state's observation probability, but the sequence from d steps back to the current one.

The plus side of that model is that it is able to model the duration of each state better. However the downside is that the inference time is much higher since we have to sum out the duration, too.

[1] Lawrence Rabiner: "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE Vol 77, No 2, 1989

Building a Gesture Recognizer - Part 2 Hidden Markov Models

In the last post on gesture recognition, we used a simple instance based recognizer. The distance measure we used is Dynamic Time Warping. This distance accounts for warping in the gesture.

A more prominent and realistic recognizer is using Hidden Markov Models.
Many Gesture and Speech or Handwritten recognition systems use these models.

A Hidden Markov Model is a probabilistic model. As Dynamic Time Warping, it models
time series while accounting for warping in the gesture.
In order to get an intuition on what a Hidden Markov Model is and how it works I will give
a brief introduction. I will not go over the math behind it since other people did a very good job at it [1].
A Hidden Markov Model is a probabilistic state machine. Each state has two probability distributions.
A continues one that models the data and a discrete one modeling the probability of transitioning
from a state to another. The continues one is called observation probability since it models the observed data. The probability distribution is a gaussian centered around a mean that discribes the data under that state best. For example in the image below we can see a three state Hidden Markov Model.
The time series has clearly three distinct parts. We model each part as one state, while the mean of the
state is the mean estimated from that part. The goal is to find the maximum likelihood state sequence
given the data and the model. We can infer which frame of the time series belongs to which
state using the observation probabilities and the transition probabilities.

How to perform that inference is described in the Rabiner paper on Hidden Markov Models on page 263 "Solution to Problem 2". The procedure's name is "Viterbi decoding". Once we have the state sequence we can calculate the probability of the time series being generated by a particular model.

For a gesture recognizer we build multiple of these models, one for each gesture. Then we collect a training set and use it to estimate the parameters of that model. During recognition we simply pick the model describing the data best.

In most cases Hidden Markov Models outperform Dynamic Time Warping solutions in accuracy and
speed. If you want to build a gesture recognizer using Hidden Markov Models you can use GART or
GT2K built by the research group I am currently working in. Both are build on the Hidden Markov Model Toolkit 3 (HTK3) so you have to install it first. I also recommend reading the HTK3 Book if you are interested in Hidden Markov Models in general. GT2K is a collection of shell scripts wrapping
HTK and it can be downloaded from the Contextual Computing Group. GART is a Java wrapper which is downloadable at the Contextual Computing Group, too.

[1] Lawrence Rabiner: "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE Vol 77, No 2, 1989