Sunday, July 14, 2013

Baum Welch Initialisation: Flat Start vs Randomisation vs Viterbi Training

As most of the readers will know Baum Welch is a parameter estimation Algorithm for Hidden Markov Models. Given a initial configuration of the HMM it will converge to a locally optimal parameter configuration. The important bit is that the initial configuration will determine "how good" the final parameter set will be. To my knowledge there are three initialisation methods two are implemented in the HTK3 toolkit:

Viterbi Training
Viterbi training is implemented in HTK3 in the tool HInit. First the data is uniformly segmented
with the number of segments equal to the number of states. Then we get the optimal path through the HMM using Viterbi + Backtracking. Using the path we reestimate the model.

Flat Start
Flat start simply initialises the model with the observations uniform. For example we estimate a uniform Gaussian from all data available and use it for all initial distributions. In both cases, Flat Start and Viterbi Training we set the initial transition probabilities ourselves. Flat Start is implemented in HTK3 in the HCompV tool. The problem of Viterbi training is that it requires labeled data, while Flat Start does not.

Random Initialisation with Random Restarts 
Viterbi Training needs labeled data and Flat Start is a simple initialization but can give a bad initial estimate. Since Baum Welch is dependent on the initial starting condition, we can also use random initializations with random restarts. In that way we can initialize the transition matrix by setting all possible transitions to random values. The Gaussians can be initialized as in Flat Start and then we add some small random number to the mean. We construct multiple random initializations and run Baum Welch. In the end we keep the model with the best likelihood. We can also use a random subset for the initial Gaussians. Unfortunately there is no implementation for this method in HTK. In my experiments this initialization works best in the scenario where no labels are available and Flat Start gives a bad estimates.

Monday, July 8, 2013

On different depictions of Hidden Markov Models

During my Diploma thesis I read most literature about Hidden Markov Models from the speech community. Since I arrived at Tech and started reading more about the Machine Learning perspective on these models I noticed a difference in depiction of Hidden Markov Models. In this post I want to argue why I think the "old" speech depiction is a bit more expressive.

The old depiction is a state machine. Each state of the HMM is depicted as a circle and the arrows
show how one can "travel" through the model. I saw this depiction in the classic Rabiner papers.

In the second representation is a Hidden Markov Model as a Bayes Network. There are two kinds of nodes. The one at the top show a discrete variable, only dependent on it's successor. Encoded in this node are the transition probabilities. Or in other words the state machine is encoded in each of these nodes. There are as many transition nodes as there are samples in the time series. The nodes at the bottom are the observation probabilities which are not seen in the model from the speech community. 

I think if one only publishes the second depiction it does not tell much about the model itself. One already assumes that there are observation distributions in an HMM so why depict them. Second just drawing as many nodes as there are samples is weird for me for two reasons. The first reason is that we do not depict anything about the transition model, which is the only thing to model in an HMM. The second reason is that the number of samples or time slices does not matter at all for an HMM. So I think  drawing the state machine is a way better way of showing your model then making the ML depiction that does not tell much of your specific model except that it is an HMM. Just to make my point more clear. The second model, even when specifying in the text that we use 3 states could be the top model or the one at the bottom, without the reader knowing.

However, explaining inference algorithms such as variable elimination is easier in the middle model, but for specific models I prefer the state machine depiction.

Tuesday, July 2, 2013

log-math: Dirichlet Distribution

A long time since my last post again. But I am working on one for Gaussian Processes but I am procrastinate there. So I decided to write another one.

As my series on Expectation Maximisation ("Thoughts on EM") evolved from research, implementations and readings, I will start a new series posting probability functions in their log likelihood form. So in the first post I will post my "log calculations" for the Dirichlet Distribution.
The Dirichlet Distribution is often used as a prior on a multinomial distribution. Using that prior one can  estimate Multinomial Distributions more stable, meaning your probability estimates do not go wild. In simple words the Dirichlet Distribution is a distribution over a Multinomial Distribution with parameters
alpha. The alpha parameter is a vector of the same size as the Multinomial Distribution. The probability
density function is:

The scaler can be expressed using the beta function:

The gamma is well the gamma function,  a numerical stable log implementation is available in the submission: "Algorithm 291: logarithm of gamma function". So let us start by converting the actual distribution to log:

Converting the scaler is also straight forwards:

And that is actually all already :D. The log - gamma function can be implemented from the paper :D