Tuesday, January 22, 2013

Thoughts on EM - Clustering: Hierarchical Agglomerative Clustering

A common clustering algorithm for continues data is using Expectation-Maximisation of Gaussian Mixture Models. In short a Gaussian Mixture Model is a joint distribution of k- Gaussians. Each of them has a prior probability of picking that gaussian. The joint probability of that model is given as:

Given some data the task is to set the parameters of the model (priors, means and covariances) such that  the above likelihood is maximised. In my opinion the hardest problem is to find "good" initial parameters. EM will settle on some global maximum. So if the initial parameters are close to one of the very bad local optima the algorithm will converge on a bad model. The prior probability is not really a problem since it can be set to 1 / k for all priors. The initial normal distribution is more difficult. One solution is to estimate global mean and variance of the data first and then add some random noise to it. However this does not give a good initial estimate, since everything will be close to the global mean. Another option is to run k-means first and use the clusters as an initial estimate that is later refined by Expectation-Maximisation. However the problem remains. Now we have to come up with some initial estimate for k- means. Today I found a cool paper describing a method called Hierarchical Agglomerative Clustering [1]. For a data set of size N it starts by defining a single cluster for each instance. In each iteration we merge the two clusters that are closest, given some distance measure. Using that method the numbers of clusters will decrease by one in each iteration. We simply stop when the number of clusters is equal to the number of mixture components for our mixture gaussian. In my opinion that method is nice since defining the initial set of clusters is always possible and deterministic. That means for all data sets we can expect the same clusters. 


I implemented the algorithm on some demo data I made. Before I got the right result most of the time
while using the K-means initialization or the random one. Now I get a very good result in all runs.


[1] Marina Meila and David Heckerman: "An Experimental Comparison of Several Clustering and Initialization Methods", Machine Learning 42, Jab 2001

No comments:

Post a Comment