Sunday, December 28, 2014

K-Means works just fine



It is interesting how powerful vector quantization can be. Since I like the quantization idea a lot, I think the following insight from a 2011 paper was notable: K-Means outperforms single layer neural nets for self taught learning. The task is to learn features of an image from small patches extracted in a sliding window. Features are extracted over these patches by:

1) Learning an auto encoder over these patches
    We learn a neural net with the input layer
     connected to a smaller hidden layer that connects
     to an output layer. The input and output are the same
     so we try to reconstruct the input using feature detectors
     in the hidden layer.

2) Learning a restricted Boltzmann machine
    Probabilistic version  using a Markov random field of the above method.
    No output layer is needed since the model is undirected. Learning can be
    performed using Gibbs sampling.

3) K-Means:Vector quantization. Features are soft assignments to cluster centers.

4) Gaussian Mixtures: Fitting a gaussian mixture using Expectation Maximization. The features are
     the posterior.

These features are used for classification using a Support Vector Machine. Interestingly enough, k-means outperforms all other methods by at least three to four percent accuracy on the CIFAR and NORB data set.

However, for a lot of the bigger task more recent results suggest that deep convolutional neural nets outperform everyone else.



Tuesday, November 11, 2014

Spectrogram Interest Points: Shazaam

After some time, I decided to write another blog post. This time I want to talk about what one can do with local interest points in a spectrogram.

An interest point in a spectrogram is a point of high magnitude in time and frequency. Normally we use points that are a local maximum in a small region.
In that way these points group around interesting audio events in the spectrogram. One use of these features is audio indexing and retrieval. For example, the Shazaam app records audio using your phone. It proceeds to extract local features from the audio and compares them against a data base
of songs, indexed in advance. The app is capable of naming a large variety of songs from noisy recordings.



The algorithm for indexing is extracting such interest points from the spectrogram and continues to
hash combinations of these points. Therefore, the algorithm uses one of the points in a region as an anchor point and measures the offset from the anchor point to other points in a neighborhood. These combinations are to index the audio file. For an unseen song, the hashes are extracted using a sliding window and the app searches for matches with the pre recorded hashes. The image above shows local interest points in red and some combinatorial hashes in yellow grouping around a dolphin whistle. We
used the same features to build a dolphin whistle detector running on an underwater wearable computer.

The algorithm is quite robust to noise and shows good indexing performance on music data.

Friday, August 15, 2014

ICASSP: Pattern Discovery in Dolphin Whistles

Abstract of my ICASSP 2014 paper on Dolphin Communication Mining:




"The study of dolphin cognition involves intensive research of animal vocalizations. Marine mammalogists commonly study a specific sound type known as the whistle found in dolphin communication. However, one of the main problems arises from noisy underwater environments. Often waves and splash noises will partially distort the whistle making analysis or extraction difficult. Another problem is discovering fundamental units that allow research of the composition of whistles. We propose a method for whistle extraction from noisy underwater recordings using a probabilistic approach. Furthermore, we investigate discovery algorithms for fundamental units using a mixture of hidden Markov models. We evaluate our findings with a marine mammalogist on data collected in the field. Furthermore, we have evidence that our algorithms enable researchers to form hypotheses about the composition of whistles."


Symbolic Aggregate approXimation: A symbolic time series representation

I used this time series representation some years ago for a lot for my research. I still think it is an elegant way of representing time series. You can use this easy to use algorithm to convert a one dimensional time series into a string. Given a time series you split it into w equally sized segments and estimate the sample mean in each segment. So we end up with a time series of length w. We then divide the Y axis into k regions using split points or thresholds. Assigning a unique symbol to each of the regions we can
check in which region each sample mean falls into and read of the symbol. So we end with a string of size w.



You can see the performance on multiple time series data sets in the original paper. Furthermore, there is a very efficient way on how to index massive data sets using this representation.

Wednesday, July 16, 2014

SLIC: Simple super pixels

An images segmented into super pixels. Example from the paper



After searching for an easy algorithm to compute super pixels I found a cool method segment images based on color using a modified K-Means version called Simple Linear Iterative Clustering or SLIC. The input to the algorithm is a set of vectors describing each pixel in the images as a vector:

v = [l a b x y] 

The first three entries represent the color in LAB space. The last two the position in the image. The goal of the algorithm is to segment the image into k more or less equally sized regions of same color, called super pixels. The algorithm is initialized by sampling the image k times uniformly. Based on these initial centers one runs a local K-Means for each center. So a clusters influence is in a limited region.
The other novelty of the algorithm is the distance measure that scales the color and position.



Tuesday, June 17, 2014

ICASSP 2014


I was still sitting in Atlanta, when I decided to write a little post about interesting topics, papers and talks from ICASSP 2014, a speech and signal processing conference. So here are my notes on the keynotes and my impression of the conference :D.


Plenary Talk: Signal Processing in Computational Art History
Speaker: Richard Johnson

This talk described a novel field for signal processing, or computer vision called "Computational Art History". Apparently, art historians are often involved in some sort of detective work gathering information towards the authenticity of paintings. The speaker described his work so far with art historians in which he applied simple signal processing algorithms in order to gather quantifiable evidence. His three examples were authenticity of canvas paintings, photographs and laid paper. In all these examples, the research team analyzed image material from the material. In the canvas example, the authors count the number of threads on the canvas. Based on these distribution they can find evidence for two paintings being made on material from the same piece of canvas. Similar, in the photographic paper work, his team investigates if two pieces of photographic paper are from the same batch. Such indicators can help art historians to classify paintings and other pieces of art. It is interesting to see such a collaboration between art historians and signal processing folks. I hope to see more of this work in the future.

Plenary Talk: Model-Based Signal Processing
Speaker: Chris Bishop

The talk on model based signal processing was about the use of the probabilistic graphical modeling (PGM) framework or probabilistic programming for signal processing. While the talk felt a bit more like an introduction to PGMs, I like the message a lot. The main point is that instead of writing lots of code for every model, one could use PGMs as a way of specifying the problem abstractly and use general purpose inference engines (such as Microsofts infer.net or BUGS). The main example in the talk was player skill classification for the XBox system. While the talk was good, I would have wished for more insight into the actual modeling of signal processing problems. But maybe it is too much material for a keynote.

My overall conference experience was a good one. There were a lot of interesting talks and posters that were related to my signal processing interest in audio and video mining. However, since the field is pretty big there were also time slots that I found completely boring, since there was a lot of hardware research presented.


Thursday, May 1, 2014

Normalized Graph Cuts and Spectral Algorithms

I recently read multiple papers about normalized graph cuts for segmentation and spectral algorithms. In this post I will highlight some of the similarities between the two. Both algorithms are graph based algorithms for clustering data. I will first show how a data set can used in such a graph environment using gram matrices and then talk more about graph cuts and spectral algorithms.

For N data points X = {x1 ... xN} we can construct a graph G=(V, E) by using every example xi as a vertex and connect every vertex to every other vertex. The weight of this edge eij is defined by a similarity function between xi and xj. Often we use a gaussian kernel as the similarity function:



If we represent this graph as a connectivity matrix, we end with a dense matrix called the gram matrix. In the gram matrix every entry Wij is set to the edge weight between vertex i and vertex j.




A cut in a graph partitions a graph G into two disjoint subsets A, B of vertices that share no edges.
This can be achieved by removing edges from the graph. So a cut is a set of edges when removed partitions the graph into A and B. The weight of the cut is the sum of the weights of this edge set:


For classification we could search the cut in the graph with minimum weight.  This means we remove the edges from the graph where the similarity is lowest. In that way, the two remaining sets A and B are the clusters. However, now the minimum cut might just separate only a few nodes from the rest of the graph. This can be avoided using normalized graph cuts which include the connectivity or degree of each vertex:



We can capture the connectivity in the denominator as the degree matrix. The degree matrix is a matrix with all zeros except the diagonal. Along the diagonal we capture the connectivity of each node:

 

Going towards the solution, the graph Laplacian is defined as:


With some math, we can find the solution by extracting the top k eigenvectors from the matrix:


If we have N examples and k vectors, the result is a k x N matrix. If we transpose it we have N vectors
of size k. Now we can cluster these new instances using k-means. This method is known as k-way approximate graph cut. In my opinion this algorithm is closely related to spectral clustering, which uses a very similar formulation [1]. Most of the post follows[2].


REFERENCES

  • [1] Andrew Y. Ng, Michael I. Jordan, Yair Weiss: On Spectral Clustering: Analysis and an algorithm. NIPS, 2001
    [2] Jianbo Shi, Jitendra Malik: Normalized Cuts and Image Segmentation, PAMI, 2000

Thursday, January 30, 2014

Convolutional Restricted Boltzmann Machines and Restricted Boltzmann Machines

During Christmas I started reading about Convolutional Restricted Boltzmann Machines. While I was  reading and implementing Restricted Boltzmann Machines and Deep Belief Networks last semester I got curious about this other type of neural network. I will give a brief overview about both models and how they can be used for feature learning.

Restricted Boltzmann Machines
A Restricted Boltzmann Machine (RBM) is a two layer generative stochastic neural network. Each node of the bottom layer is connected to all nodes in the top layer. The restriction is that there are no connections in a layer (see Figure 1).

Figure 1: A Restricted Boltzmann Machine with four 
visible nodes (bottom layer) and two hidden nodes (top layer).

In the easiest case all nodes are binary stochastic units, meaning the value of a node or variable can only be 0 and 1. A joint configuration (meaning we have values for every hi and every vj) has the following energy:




This equation says we have a bias on every visible input, a bias on every hidden unit and a weight on all connections between visible and hidden units. Then (as common in energy based probabilistic graphical models [1]) the probability of such joint configuration is:


While the scaler is summing over all possible joint configurations. However getting an unbiased sample for a hidden unit given all input units set is easy since there are no connections between them.
And involves rejection sampling from the following probability distribution:



Where sigma is the sigmoid function. The same is true for a sample of a visible unit, given all hidden units set:



Now learning can be performed using an alternating sampling approach. We first put a data point (vector of length of the input layer) as the configuration of the visible layer, compute the probability for P(h = 1|v) and then sample a hidden layer configuration from that. Now we fix the hidden layer, 
compute the probability P(v = 1 | h) and sample from that. We can repeat this process, for some iterations. However, Hinton [2] showed performing this up and down step one and a half times (up - down - up) is sufficient to compute a gradient to improve the model. This algorithm is called contrastive divergence.

Now the idea is to stack these models into a deep network by training each layer in isolation in a greedy manner. So we start with the first RBM as described above. We use the activations for each data point at the hidden units as our new data set, and repeat this process.  Intuitively, every layer
tries to have a small reconstruction error. So by stacking these models in such a way, we learn a hierarchical feature representation. Our top layer can then be used as a new feature space for clustering or classification.

Convolutional Restricted Boltzmann Machines (CRBM)
If we think about images and we want to learn features, we could collect a large number of local patches from images and then train a deep network as our feature extractor. However, the number of parameters to learn is huge and for computational reasons this approach might be impractical. Another (more holistic) approach to image features is to learn a set of k- filters with different responses. So instead of one weight matrix as in the RBM case, we have k. However, these filters are much smaller. I saw  7 x 7 pixel filters used. 


So the input to a Convolutional Restricted Boltzmann Machine is a complete image (not patches) 
and the hidden activations are k images, the result of convolving the input image with every filter. 
These responses are compressed using "max pooling", meaning we extract the max in non overlapping regions from the responses. The probability calculations are similar to the ones for regular RBMs and are explained in Lee et al[3]. Furthermore, we can use the same inference and estimation procedure as in RBMs. The difference is, that we do not update a weight matrix, but the k filters. Now we can stack CRBMs, too. Using the compressed responses as the input to the next layer, 
we can explain, larger and larger regions per layer. For example, the first layer might learn a set of k edge detectors for different edge directions, while the highest layer might learn actual object detectors such as wheels. 


REFERENCES:
[1] Christopher M. Bishop: "Pattern Recognition and Machine Learning", Springer, 2007
[2] Geoffrey Hinton:"A Practical Guide to Training Restricted Boltzmann Machines"
 UTML TR 2010–003, Department of Computer Science, University of Toronto 
[3] Honglak Lee Roger Grosse Rajesh Ranganath Andrew Y. Ng: "Convolutional Deep Belief Networks
for Scalable Unsupervised Learning of Hierarchical Representations", ICML '09 , 2009