Thursday, January 31, 2013

First Coding Dojo on Artificial Intelligence

Last Wednesday I kickstarted the first AI coding dojo at our lab with a group of interested friends. The goal is to bring together students with different backgrounds and practice implementing Artificial Intelligence algorithms. In a coding dojo the goal is not to build working software but to improve our coding skills and style. Why specifically Artificial Intelligence? In my opinion, implementing AI algorithms is quite different from implementing regular software. While doing so we also try to improve our coding style and implement everything in a test-driven manner.

The setup is as follows. One person brings an interesting algorithm that becomes the topic of the dojo. It should not be too hard, since we restricted a session to two hours. In the first session we implemented Expectation Maximization for a Mixture of Gaussians. The participants of the dojo gather around a projector or large screen connected to a single computer. Two participants sit at the computer and implement part of the problem using pair programming, while the rest of the group watches what happens on the screen and gives helpful comments when needed. The two programmers switch after 10 minutes.

The first dojo went slow but I think people learned from each other and enjoyed it. The next dojos will include advanced search algorithms (D*), collaborative filtering (Matrix Factorization) and the cognitive architecture SOAR.


Tuesday, January 29, 2013

Granger Causality

Granger causality allows us to test whether a variable X is causally linked to another variable Y. We say Y G-causes X if predicting X using only past values of X gives worse results than using past values of both X and Y. Clearly X and Y have to be defined over time; in other words, X and Y are time series. If X is causally linked with Y, then we can model X with the following linear autoregressive model:
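The equation image did not survive here; a standard form consistent with the surrounding text, with lag order p, weights a_i on X's past and b_i on Y's past, is:

```latex
x_t = \sum_{i=1}^{p} a_i \, x_{t-i} + \sum_{i=1}^{p} b_i \, y_{t-i} + \epsilon_t
```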

The prediction is the sum of an AR model for X, an AR model for Y, and an error term. The idea is this: if we fit the above predictor by minimizing the prediction error, then Y G-causes X if the weights on Y's past values are significantly different from 0. As one can see, if those weights are 0 then Y has no effect on the predictor at all, so it is not causing X. So if we want to test whether Y G-causes X given some data, we estimate the weights and perform an F-test comparing the weight vector of Y against 0. Our null hypothesis is:
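The hypothesis image is missing; written out for the AR model above, the null hypothesis is that all weights on Y's past values vanish:

```latex
H_0: \; b_1 = b_2 = \dots = b_p = 0
```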

In conclusion we can test if two variables X and Y are causally linked by performing a statistical test on the weights (or influence) of one variable to another.
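To make this concrete, here is a minimal sketch in Python (numpy only) on synthetic data. The coupling weights and lag order are made up for illustration, and for a proper test the F statistic below would still have to be compared against the critical value of the F distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 2  # series length and lag order

# Synthetic data where y clearly drives x.
y = rng.normal(size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.5 * x[t - 1] + 0.8 * y[t - 1] + 0.1 * rng.normal()

def rss(target, predictors):
    """Residual sum of squares of an ordinary least squares fit (with intercept)."""
    X = np.column_stack(predictors + [np.ones(len(target))])
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    return float(resid @ resid)

# Lagged copies: entry k holds the series shifted by k + 1 steps.
x_lags = [x[p - k - 1 : n - k - 1] for k in range(p)]
y_lags = [y[p - k - 1 : n - k - 1] for k in range(p)]
target = x[p:]

rss_restricted = rss(target, x_lags)      # past of x only
rss_full = rss(target, x_lags + y_lags)   # past of x and y

# F statistic for the null hypothesis "all weights on y's past are zero".
df_full = len(target) - (2 * p + 1)
f_stat = ((rss_restricted - rss_full) / p) / (rss_full / df_full)
print(f_stat)  # a large value lets us reject the null: y G-causes x
```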

Tuesday, January 22, 2013

Thoughts on EM - Clustering: Hierarchical Agglomerative Clustering

A common clustering algorithm for continuous data is Expectation-Maximisation on Gaussian Mixture Models. In short, a Gaussian Mixture Model is a joint distribution of k Gaussians, each with a prior probability of being picked. The likelihood under that model is given as:
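The formula image is missing; the standard mixture density, with priors pi_i, means mu_i and covariances Sigma_i, is:

```latex
p(x) = \sum_{i=1}^{k} \pi_i \, \mathcal{N}(x \mid \mu_i, \Sigma_i)
```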

Given some data, the task is to set the parameters of the model (priors, means and covariances) such that the above likelihood is maximised. In my opinion the hardest problem is to find "good" initial parameters. EM will only settle on some local maximum, so if the initial parameters are close to one of the very bad local optima, the algorithm will converge to a bad model. The prior probabilities are not really a problem, since they can all be set to 1 / k. The initial normal distributions are more difficult. One solution is to first estimate the global mean and variance of the data and then add some random noise. However, this does not give a good initial estimate, since everything will be close to the global mean. Another option is to run k-means first and use the resulting clusters as an initial estimate that is later refined by Expectation-Maximisation. However, the problem remains: now we have to come up with an initial estimate for k-means.

Today I found a cool paper describing a method called Hierarchical Agglomerative Clustering [1]. For a data set of size N it starts by defining a single cluster for each instance. In each iteration we merge the two clusters that are closest under some distance measure, so the number of clusters decreases by one per iteration. We simply stop when the number of clusters equals the number of mixture components of our Gaussian mixture. In my opinion this method is nice because defining the initial set of clusters is always possible and deterministic: for any given data set we can expect the same clusters.
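A minimal sketch of this initialization, assuming centroid linkage as the distance between clusters; the paper compares several linkage criteria, so treat this as illustrative rather than the exact method from [1]:

```python
import numpy as np

def hac_init(data, k):
    """Merge the two closest clusters (by centroid distance) until k remain."""
    clusters = [[i] for i in range(len(data))]  # one cluster per instance
    while len(clusters) > k:
        centroids = [data[c].mean(axis=0) for c in clusters]
        best, best_dist = None, np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.linalg.norm(centroids[a] - centroids[b])
                if d < best_dist:
                    best, best_dist = (a, b), d
        a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge closest pair
        del clusters[b]
    return clusters

# Toy data with two obvious groups; the surviving clusters seed the EM means.
data = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
clusters = hac_init(data, 2)
means = [data[c].mean(axis=0) for c in clusters]
```

The naive pairwise scan is O(N^3) overall, which is fine for seeding EM on small data but would need a priority queue for large N.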


I implemented the algorithm on some demo data I made. Before, using the k-means or the random initialization, I got the right result only most of the time. Now I get a very good result in all runs.


[1] Marina Meila and David Heckerman: "An Experimental Comparison of Several Clustering and Initialization Methods", Machine Learning 42, Jan 2001

Wednesday, January 2, 2013

Forward Recursion of Variable Duration Hidden Markov Models

Sometime last October / November I started wrapping my head around variable duration Hidden Markov Models, especially the forward recursion. Again, the formulas and most of my Hidden Markov Model knowledge come from the Rabiner paper [1]. In regular Hidden Markov Models the duration of staying in a state follows a geometric distribution, imposed by the structure of HMMs:
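The equation image is missing; with self-transition probability a_ii, the implied duration distribution of state i is:

```latex
p_i(d) = a_{ii}^{\,d-1} \, (1 - a_{ii})
```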

However, in practice durations are not geometrically distributed. Variable duration Hidden Markov Models fix this weakness by allowing each state to generate multiple observations at once. The length of this generated sequence is modeled by a non-parametric distribution p(d). Furthermore, these models forbid self-transitions (a_ii = 0). The forward recursion then becomes the following:
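The recursion image did not survive; a form consistent with the description below (sum out the duration d on top of the previous state i, and score the whole d-step observation segment under state j) is:

```latex
\alpha_t(j) = \sum_{d=1}^{D} \sum_{i \neq j} \alpha_{t-d}(i)\, a_{ij}\, p_j(d) \prod_{s=t-d+1}^{t} b_j(o_s)
```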

The first difference is that we have to sum out the duration in addition to the states, since we don't know how long the observed sequence for a state is. The recursion then looks not only at the last state but back d steps, so the model is not truly "Markovian" but "semi-Markovian". The other difference is that the observation probability covers not just the current frame under the current state, but the whole sequence from d steps back up to the current frame.

The plus side of this model is that it can represent the duration of each state much better. The downside is that inference is much more expensive, since we have to sum out the durations, too.
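A minimal sketch of the recursion, assuming discrete observations; the names (pi, A, B, p_dur) are my own rather than Rabiner's notation, and a real implementation would work in log space:

```python
import numpy as np

def hsmm_forward(pi, A, B, p_dur, obs):
    """alpha[t, j]: probability of o_0..o_t with a segment ending in state j at t."""
    N, T = len(pi), len(obs)
    D = p_dur.shape[1]  # maximum segment duration
    alpha = np.zeros((T, N))
    for t in range(T):
        for j in range(N):
            for d in range(1, min(D, t + 1) + 1):
                # Probability of the d-frame segment under state j.
                seg = np.prod([B[j][obs[s]] for s in range(t - d + 1, t + 1)])
                if t - d < 0:   # first segment starts at time 0
                    alpha[t, j] += pi[j] * p_dur[j, d - 1] * seg
                else:           # previous segment ended at t - d, i != j
                    alpha[t, j] += sum(
                        alpha[t - d, i] * A[i, j] for i in range(N) if i != j
                    ) * p_dur[j, d - 1] * seg
    return alpha[T - 1].sum()   # likelihood of the full sequence

# One-state sanity check: with self-transitions forbidden, a single state
# must emit the whole sequence as one segment of duration T.
lik = hsmm_forward(np.array([1.0]), np.array([[0.0]]),
                   np.array([[0.7, 0.3]]), np.array([[0.4, 0.6]]), [0, 1])
```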

[1] Lawrence Rabiner: "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE Vol 77, No 2, 1989

Building a Gesture Recognizer - Part 2 Hidden Markov Models

In the last post on gesture recognition, we used a simple instance-based recognizer. The distance measure we used was Dynamic Time Warping, a distance that accounts for warping in the gesture.

A more prominent and realistic approach uses Hidden Markov Models; many gesture, speech, and handwriting recognition systems are built on them.

A Hidden Markov Model is a probabilistic model. Like Dynamic Time Warping, it models time series while accounting for warping in the gesture. In order to give an intuition for what a Hidden Markov Model is and how it works, I will give a brief introduction; I will not go over the math behind it, since other people have done a very good job at that [1]. A Hidden Markov Model is a probabilistic state machine. Each state has two probability distributions: a continuous one that models the data and a discrete one modeling the probability of transitioning from one state to another. The continuous one is called the observation probability, since it models the observed data. It is a Gaussian centered around a mean that describes the data under that state best. For example, in the image below we can see a three-state Hidden Markov Model. The time series has clearly three distinct parts. We model each part as one state, with the state's mean estimated from that part. The goal is to find the maximum likelihood state sequence given the data and the model. We can infer which frame of the time series belongs to which state using the observation probabilities and the transition probabilities.

How to perform that inference is described in the Rabiner paper on Hidden Markov Models on page 263 "Solution to Problem 2". The procedure's name is "Viterbi decoding". Once we have the state sequence we can calculate the probability of the time series being generated by a particular model.
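A small sketch of Viterbi decoding, assuming discrete observations to keep it short; the gesture models described above use Gaussian observations, but the recursion is identical with b_j(o_t) swapped for a Gaussian density:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Return the maximum-likelihood state sequence for a discrete-observation HMM."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))            # best path probability ending in state j
    psi = np.zeros((T, N), dtype=int)   # back-pointers to the previous state
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for j in range(N):
            scores = delta[t - 1] * A[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] * B[j, obs[t]]
    # Backtrack from the most likely final state.
    path = [int(np.argmax(delta[T - 1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# Toy two-state model with a binary observation alphabet.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
path = viterbi(pi, A, B, [0, 1, 1])
```

In practice the products are replaced by sums of log probabilities to avoid underflow on long sequences.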

For a gesture recognizer we build multiple of these models, one for each gesture. Then we collect a training set and use it to estimate the parameters of each model. During recognition we simply pick the model that describes the data best.

In most cases Hidden Markov Models outperform Dynamic Time Warping solutions in accuracy and speed. If you want to build a gesture recognizer using Hidden Markov Models you can use GART or GT2K, built by the research group I am currently working in. Both are built on the Hidden Markov Model Toolkit 3 (HTK3), so you have to install that first. I also recommend reading the HTK3 Book if you are interested in Hidden Markov Models in general. GT2K is a collection of shell scripts wrapping HTK and can be downloaded from the Contextual Computing Group. GART is a Java wrapper, also available from the Contextual Computing Group.

[1] Lawrence Rabiner: "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE Vol 77, No 2, 1989