Daniel@World: clustering

Showing posts with label clustering. Show all posts

Sunday, February 28, 2016

Speed Bake-Off Tensorflow / (Numpy | SciPy | Matplotlib) / OpenCV

I recently got interested into scaling some of the algorithms from my PhD thesis. Especially the feature learning part. I used a K-Means to learn a convolutional codebook over audio spectrogram.
Why you can use K-Means to learn something like a convolutional neural net is described in: Coates, Lee and NG as well as my thesis. Anyways in my thesis experiments the bottle neck was the feature learning. So I need a fast clustering implementation as well as a fast convolution in order to resolve the speed issues. The three main implementations I look at are:

Google Tensor Flow
SciKit Learn
OpenCV

All three provide a python interface but use a low level, speedy C or Fortran implementation under the hood. My goal is to evaluate these three libraries for their performance (speed) under the following conditions:

The library is fastest when testing on a single Macbook. It can use the available parallel devices such as a GPU and multi threading. Furthermore, the library should be fastest on medium sized problems.

I choose these conditions since I plan to design programs that animal communication researchers can use in the field. Meaning, if you are to analyse dolphin communication on a boat in the Bahamas or

you observe Birds in the rainforest, the only thing you might have is your laptop with no internet connection.

Convolution

The convolution test is based on the three functions:

Tensorflow: tensorflow.nn.conv2d
SciPy: scipy.signal.convolve2d
OpenCV: cv2.filter2d

My test is to compute the edge image using a sobel filter in x-direction and on in y-direction:

I am using the following image and expect the following edge image:

The results in seconds:

Tensorflow: 0.125 [s]
SKLearn: 0.049 [s]
OpenCV: 0.019 [s]

Here it looks like the OpenCV implementation is the fastest. So openCV it is. For the clustering there are a lot of libraries and even the sklearn implementation is very fast.

Update:

After Himanshu's comment I chose to check Tensorflow not including the variable assignment
and then it took 0.021 seconds. However, the image copying into the variable and the session setup
matter in my use case, too. And it is still lower than Open CV. It is also interesting that for these problems, the speed between the libraries is not that different. However, I belief that for larger problems, the tensor flow version that can run on a cluster will show way better performance. Also I don't know right now if the current tensor flow version works with opencl.

Sunday, December 28, 2014

K-Means works just fine

It is interesting how powerful vector quantization can be. Since I like the quantization idea a lot, I think the following insight from a 2011 paper was notable: K-Means outperforms single layer neural nets for self taught learning. The task is to learn features of an image from small patches extracted in a sliding window. Features are extracted over these patches by:

1) Learning an auto encoder over these patches
We learn a neural net with the input layer
connected to a smaller hidden layer that connects
to an output layer. The input and output are the same
so we try to reconstruct the input using feature detectors
in the hidden layer.

2) Learning a restricted Boltzmann machine
Probabilistic version using a Markov random field of the above method.
No output layer is needed since the model is undirected. Learning can be
performed using Gibbs sampling.

3) K-Means:Vector quantization. Features are soft assignments to cluster centers.

4) Gaussian Mixtures: Fitting a gaussian mixture using Expectation Maximization. The features are
the posterior.

These features are used for classification using a Support Vector Machine. Interestingly enough, k-means outperforms all other methods by at least three to four percent accuracy on the CIFAR and NORB data set.

However, for a lot of the bigger task more recent results suggest that deep convolutional neural nets outperform everyone else.

Thursday, May 1, 2014

Normalized Graph Cuts and Spectral Algorithms

I recently read multiple papers about normalized graph cuts for segmentation and spectral algorithms. In this post I will highlight some of the similarities between the two. Both algorithms are graph based algorithms for clustering data. I will first show how a data set can used in such a graph environment using gram matrices and then talk more about graph cuts and spectral algorithms.

For N data points X = {x1 ... xN} we can construct a graph G=(V, E) by using every example xi as a vertex and connect every vertex to every other vertex. The weight of this edge eij is defined by a similarity function between xi and xj. Often we use a gaussian kernel as the similarity function:

If we represent this graph as a connectivity matrix, we end with a dense matrix called the gram matrix. In the gram matrix every entry Wij is set to the edge weight between vertex i and vertex j.

A cut in a graph partitions a graph G into two disjoint subsets A, B of vertices that share no edges.
This can be achieved by removing edges from the graph. So a cut is a set of edges when removed partitions the graph into A and B. The weight of the cut is the sum of the weights of this edge set:

For classification we could search the cut in the graph with minimum weight. This means we remove the edges from the graph where the similarity is lowest. In that way, the two remaining sets A and B are the clusters. However, now the minimum cut might just separate only a few nodes from the rest of the graph. This can be avoided using normalized graph cuts which include the connectivity or degree of each vertex:

We can capture the connectivity in the denominator as the degree matrix. The degree matrix is a matrix with all zeros except the diagonal. Along the diagonal we capture the connectivity of each node:

Going towards the solution, the graph Laplacian is defined as:

With some math, we can find the solution by extracting the top k eigenvectors from the matrix:

If we have N examples and k vectors, the result is a k x N matrix. If we transpose it we have N vectors

of size k. Now we can cluster these new instances using k-means. This method is known as k-way approximate graph cut. In my opinion this algorithm is closely related to spectral clustering, which uses a very similar formulation [1]. Most of the post follows[2].

REFERENCES

[1] Andrew Y. Ng, Michael I. Jordan, Yair Weiss: On Spectral Clustering: Analysis and an algorithm. NIPS, 2001
[2] Jianbo Shi, Jitendra Malik: Normalized Cuts and Image Segmentation, PAMI, 2000

Thursday, March 14, 2013

Thoughts on EM - Clustering: kmeans++

In this post I will talk a bit about initialization of the EM algorithm for Gaussian Mixtures and numerical stability. It might be the last post on clustering since I have a stable enough version for me now.

The initialization I want to talk about is kmeans++, an algorithm for initializing classic kmeans, which can also be used for Gaussian Mixtures. The high level bit is instead of starting with random samples, we prefer samples that are far from the already chosen centers.

However if we us the furthest point algorithm, we are more sensible to outliers since the furthest points might not belong to any cluster. Kmeans++ is combining ideas for classic kmeans initialization (pick k points at random) and furthest point. We do so by defining a probability distribution that assigns a probability to each sample being a good initial mean. The distribution is based on the distance to the closest center that is already assigned.

So the algorithm for initialization picks the first center at random and then samples the next center based on the probabilities. As one can see if the distance of a sample to the closest distance is large the probability will be large, too and vice versa. We stop drawing samples when we have k centers. Based on these centers we can run classic kmeans or Gaussian Mixture EM.

In my experiments kmeans++ performed great as an initialization for EM while being way faster then hierarchical clustering.