Thursday, March 14, 2013

Thoughts on EM - Clustering: kmeans++

In this post I will talk a bit about initialization of the EM algorithm for Gaussian Mixtures and numerical stability. It might be the last post on clustering, since I now have a version that is stable enough for my purposes.

The initialization I want to talk about is kmeans++, an algorithm for initializing classic kmeans that can also be used for Gaussian Mixtures. The high-level idea is that instead of picking centers uniformly at random, we prefer samples that are far from the centers chosen so far.

However, if we use the furthest-point algorithm, we become sensitive to outliers, since the furthest points might not belong to any cluster at all. Kmeans++ combines the ideas of classic kmeans initialization (pick k points at random) and furthest-point selection. It does so by defining a probability distribution that assigns each sample a probability of being a good initial mean, based on the sample's distance to the closest center that is already chosen.

So the initialization algorithm picks the first center at random and then samples each subsequent center from this distribution. As one can see, if a sample's distance to its closest center is large, its probability will be large too, and vice versa. We stop drawing samples once we have k centers. From these centers we can then run classic kmeans or Gaussian Mixture EM.
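A minimal sketch of this initialization in Python with numpy (the function name and the D² weighting, which is what the original kmeans++ paper uses, are my own framing of the steps above):

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """kmeans++ initialization sketch: pick the first center uniformly at
    random, then sample each further center with probability proportional
    to the squared distance to its closest already-chosen center."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    centers = [X[rng.integers(n)]]  # first center: uniform at random
    for _ in range(k - 1):
        # squared distance of every sample to its closest chosen center
        diffs = X[:, None, :] - np.asarray(centers)[None, :, :]
        d2 = np.min((diffs ** 2).sum(axis=-1), axis=1)
        probs = d2 / d2.sum()       # far-away samples are more likely
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)
```

The returned centers can then be fed directly into a kmeans or EM loop as the initial means.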

In my experiments kmeans++ performed very well as an initialization for EM, while being much faster than hierarchical clustering.