Showing posts with label research. Show all posts
Showing posts with label research. Show all posts

Sunday, September 10, 2017

Variational Fashion Encoder

Variational Fashion Encoder


In this experiment I want to try out Variational Auto Encoders, an unsupervised feature learning algorithm on a new fashion classification dataset from Zalando. Since I wanted to play around with both more, I will build a generic Variational Auto Encoder and then learn it on the new fashion-mnist dataset.
Code can be found on github


Variational Auto Encoder

The bottom part of the model is embedding the input X into a mean and variance vector.
The mean and variance represent the parameters of a gaussian that is trained to be close to a standard normal distribution N(0, I).The decoder network is trying to reconstruct the input from a sample from said distribution. The encoder as well as the decoder network can have multiple layers in my case 3 each.


The loss function is a combination of the Kullback-Leibler divergence between the distribution from the encoder network and a standard normal distribution, basically centering the latent space around 0 with standard deviation 1. This can be seen as a form of regularisation. Furthermore, the decoder
does not reconstruct directly from the encoder output but from a sample of the distribution in the latent space.

Fashion Reconstruction

The images in the fashion dataset are 28 pixels by 28 pixels, in the same way as the MNIST dataset.
The fashion data consists of 10 - classes of cloths: 
  1. tshirt
  2. trouser
  3. pullover
  4. dress
  5. coat
  6. sandal
  7. shirt
  8. sneaker
  9. bag
  10. ankle boo
After training for 5 epochs on the training set, I plotted the reconstructions from the Variational Auto Encoder.

As one can see the reconstructions work quite well indicating the latent space learned is useful.

Saturday, April 15, 2017

Three Algorithm Tricks in Data Mining

I really enjoy it when concepts can be reused or applied in several contexts. Three of the algorithms
that seem to be everywhere are quick sort, merge sort and binary search. In this post I will
just show some tricks I learned during coding. Using a similar algorithm than the merge method
of merge sort for fast sparse vector multiplication, using binary search to insert into a histogram in O(log(n)) time, and using the partition method of quick sort during C.45 decision tree training.


Fast Sparse Vector Multiplication

A sparse vector is a vector where most if the entries are zeros. A common data structure is an array saving all non zero entries. Each entry is a tuple of the position and the corresponding value. Also the array is sorted increasingly by the position. Given two such vectors, we can multiply two such vectors very efficiently. We use two variables i and j, indicating a position in one of the vectors' array x and y. The variables move at the same speed as long as the positions of x(i) and y(j) are the same. In this case, we can also aggregate the multiplication of the vector elements. If i points to a position larger than j, we only move j until i and j are equal and vice versa. Since in the previous case, x(i) or y(j) is a zero entry, we don't have to compute anything. The movement of i and j is the same in the merge method of merge sort.



object DotProduct {

  case class SparseNode(pos: Int, value: Double)

  type SVector = Array[SparseNode]  

  def dot(x: SVector, y: SVector): Double = {
    var i = 0
    var j = 0
    var product = 0.0
    while(i < x.length && j < y.length) {
      if(x(i).pos == y(j).pos) {
        product += x(i).value * y(j).value
        i += 1
        j += 1
      } else if(x(i).pos < y(j).pos) i += 1
      else j += 1
    }
    product
  }

}

Histogram Insertion

A histogram can be represented as an array that holds the counts of values that fall into 
a certain region. The range is another array containing the boundaries, which should be sorted
in increasing order. When we want to estimate the histogram, we need to find the region the value falls in. Naively that can be done by iterating the boundary array and checking if the value is in the range. However, we can perform this search using binary search in the boundary array. 
Instead of searching an element though, we are checking if the value is larger than boundary[middle - 1] and is smaller than boundary[middle], if so we can increase the count for the middle, else we recursively search the left or right side, depending if the value is smaller or larger than the middle.


class BinarySearchHistogram(val binBorders: Array[Double]) {

  final val bins = Array.fill(binBorders.length)(0.0)

  def findBin(x: Double, start: Int, stop: Int): Int = {   
    val center = ( (stop - start) / 2 ) + start
    if (center == 0 || center == binBorders.length - 1) center
    else if(x > binBorders(center - 1) && x <= binBorders(center)) center
    else if(x > binBorders(center - 1)) findBin(x, center, stop)
    else findBin(x, start, center)
  }

  def insert(x: Double): Unit = bins(findBin(x, 0, binBorders.length)) += 1.0

}

Decision Tree Learning

A dataset for decision tree learning can be represented as a 2d array X[N][D] of N vectors each of D 
dimensions. At each node in the decision tree, a node partitions the data points into two subsets, one contains all vectors whose dimension D is smaller than a threshold and the other subset contains the vectors where dimension D is larger. We can introduce an ordering array. The left side of the array from a pivot element contains all instances where the split decision is true and the right side contains all elements where the decision is false. We can swap the elements from the left to right whenever
a split decision is violated (the element should be left but is right). We can then call the partition
function recursively on the left and the right side. If we apply the function recursively, we partition the dataset into the split decisions as a decision tree does. Of course, we still have to select the correct
spit decision for every level but this is the material for another post. In the end, this function is very similar to the quick sorts partition method. Another note: The method only works well for binary decision trees, which is not a huge problem in practice.


class DataSet(val ordering: Array[Int], val data: Array[Array[Double]]) {

  def partition(dec: Decision, hi: Int, lo: Int): Int = {
    var i = lo
    var j = lo
    while(j < hi) {
      if( dec.f( data(ordering(j)) ) ) {
        val tmp = ordering(j)
        ordering(j) = ordering(i)
        ordering(i) = tmp
        i += 1
      }
      j += 1
    }
    i
  }

}

case class Decision(attr: Int, th: Double) {

  def f(x: Array[Double]): Boolean = x(attr) <= th

}

Sunday, February 28, 2016

Speed Bake-Off Tensorflow / (Numpy | SciPy | Matplotlib) / OpenCV

I recently got interested into scaling some of the algorithms from my PhD thesis. Especially the feature learning part. I used a K-Means to learn a convolutional codebook over audio spectrogram.
Why you can use K-Means to learn something like a convolutional neural net is described in: Coates, Lee and NG as well as my thesis. Anyways in my thesis experiments the bottle neck was the feature learning. So I need a fast clustering implementation as well as a fast convolution in order to resolve the speed issues. The three main implementations I look at are:
  • Google Tensor Flow
  • SciKit Learn
  • OpenCV
All three provide a python interface but use a low level, speedy C or Fortran implementation under the hood. My goal is to evaluate these three libraries for their performance (speed) under the following conditions:

The library is fastest when testing on a single Macbook. It can use the available parallel devices such as a GPU and multi threading. Furthermore, the library should be fastest on medium sized problems.

I choose these conditions since I plan to design programs that animal communication researchers can use in the field. Meaning, if you are to analyse dolphin communication on a boat in the Bahamas or  
you observe Birds in the rainforest, the only thing you might have is your laptop with no internet connection.

Convolution


The convolution test is based on the three functions:


  • Tensorflow: tensorflow.nn.conv2d
  • SciPy: scipy.signal.convolve2d
  • OpenCV: cv2.filter2d
My test is to compute the edge image using a sobel filter in x-direction and on in y-direction:




I am using the following image and expect the following edge image:





The results in seconds:
  • Tensorflow: 0.125 [s]
  • SKLearn:     0.049 [s]
  • OpenCV:     0.019 [s] 
Here it looks like the OpenCV implementation is the fastest. So openCV it is. For the clustering there are a lot of libraries and even the sklearn implementation is very fast.

Update:

After Himanshu's comment I chose to check Tensorflow not including the variable assignment
and then it took 0.021 seconds. However, the image copying into the variable and the session setup
matter in my use case, too. And it is still lower than Open CV. It is also interesting that for these problems, the speed between the libraries is not that different. However, I belief that for larger problems, the tensor flow version that can run on a cluster will show way better performance. Also I don't know right now if the current tensor flow version works with opencl.

Wednesday, September 23, 2015

Thesis Published: DATA MINING IN LARGE AUDIO COLLECTIONS OF DOLPHIN SIGNALS



The study of dolphin cognition involves intensive research of animal vocalisations recorded in the field.
In this dissertation I address the automated analysis of audible dolphin communication.
I propose a system called the signal imager that automatically discovers patterns in dolphin signals.
These patterns are invariant to frequency shifts and time warping transformations.
The discovery algorithm is based on feature learning and unsupervised time series segmentation
using hidden Markov models. Researchers can inspect the patterns visually and interactively
run comparative statistics between the distribution of dolphin signals in different behavioral contexts.
The required statistics for the comparison describe dolphin communication as a combination of
the following models: a bag-of-words model, an n-gram model and an algorithm to learn
a set of regular expressions. Furthermore, the system can use the patterns to automatically
tag dolphin signals with behavior annotations. My results indicate that the signal imager provides
meaningful patterns to the marine biologist and that the comparative statistics are aligned with
the biologists’ domain knowledge.
[PDF]

Monday, July 13, 2015

Social Network Sampling

I will start working for a social network in August. My new job is data scientist at the business network Xing. I decided to play with their open API a little. The Xing API gives access to users in the network as well as their contact list. So an external app can analyze the social graph. Well actually there are restrictions on the number of queries so it is possible to
analyze the graph partly. My goal is to extract an unbiased sample from the complete graph by performing a random walk from a start node. That means at every node the algorithm expands all
the node's successors and then picks a random one. The chosen successor is the next node. With
a probability of 0.15 the algorithm returns to the start node. This will ensure that we explore the graph around the start node. The going back is ensuring a trade off between breadth of the exploration and the depth of the exploration. I use the ruby version of the API.
A code snippet for a random walk step in the Xing API
is shown below:

##
# Random walk step: 
#   1) With probability 0.15 go back to start
#   2) Randomly choose a node from the successors
def random_walk_step(user_id, start)
  reset = false
  if Random.rand < 0.15
    user_id = start    
    reset = true
  end
  num_contacts = 5 # control branching
  contacts     = XingApi::Contact.list(user_id, limit: num_contacts)[:contacts][:users]
  next_user    = Random.rand(num_contacts)    
  contacts.map! {|x| x[:id]}
  if reset
    return random_walk_step(contacts[next_user], start)
  end
  return [contacts[next_user], contacts, user_id]
end
The random step function takes a user id as the current node and a node we started the random walk at. First we decide if we restart the search at the start node, then it samples a neighbor. Repeating the process and updating the user id with the sampled neighbor results in a graph. The function below, explores the graph for 25 nodes and returns the graph as a dot file:
##
# Print graph as graphviz file
def dot(path)
  expended = []
  File.open(path, 'w') do |file|
    file.write "digraph my_neighbors {\n"
    start_id = XingApi::User.me[:users].first[:id]
    ids = random_walk_step(start_id, start_id)  
    45.times do
      id = ids[2]
      username = XingApi::User.find(id)[:users].first[:display_name]
      username.gsub! /\W/, ""      
      if not expended.include? username
        expended = expended + [username]
        ids[1].each do |neighbor| 
          neighborname = XingApi::User.find(neighbor)[:users].first[:display_name]
          neighborname.gsub! /\W/, ""       
          file.write "#{username} -> #{neighborname};\n"        
        end
      end
      ids = random_walk_step(ids[0], start_id)      
    end
    file.write "}"
  end
end
A dot file is a graph description that can be converted into graph images. A sample graph of my network is shown below.


Wednesday, May 13, 2015

Integrating Weka's UI into your own code

Weka is a machine learning toolkit written in Java. Despite it's capability to run as a stand alone
user interface, you can also use Weka in your own Java code. That means, all algorithms you can find in the user interface are also trainable and usable in your Java application. The cool thing is, you can also integrate part of Weka's user interface into your own. So my goal was to integrate the option to choose a Weka classifier and configure it using Weka's native components.


Choosing and configuring a Weka classifier, here a Support Vector Machine.


After browsing a little through the Weka UI code I found two classes that can be used to
open the editor to choose a classifier and to configure it. Once the window closes, the classifier
with it's configuration is ready to be used by other functions. The first UI component is the GenericObjectEditor, which is responsible for choosing a classifier. The second is the PropertyDialog which is configuring the current classifier. Blow I showed some code that displays the dialog and prints out the classifier. I found the example in the Weka GenericObjectEditor class at the bottom. For the code to work you have to make sure the weka.jar is in your CLASSPATH.



package wekaui;

import java.awt.Frame;
import java.awt.event.WindowAdapter;
import java.awt.event.WindowEvent;
import java.beans.PropertyEditor;
import weka.classifiers.Classifier;
import weka.classifiers.rules.ZeroR;
import weka.gui.GenericObjectEditor;
import weka.gui.PropertyDialog;

/**
 * Choose a classifier and configure it in your own java code.
 *
 * @author Daniel Kohlsdorf
 */
public class WekaUI {

    public static void main(String[] args) {
        Object initial = new ZeroR();
        GenericObjectEditor.registerEditors();
        GenericObjectEditor editor = new GenericObjectEditor(true);
        ce.setClassType(weka.classifiers.Classifier.class);
        ce.setValue(initial);

        // Setup the property view
        PropertyDialog propertyDialog = new PropertyDialog((Frame) null, editor, 100, 100);

        // Get the classifier when the window is closed
        pd.addWindowListener(new WindowAdapter() {
            @Override
            public void windowDeactivated(WindowEvent e) {
                PropertyEditor propertyEditor = ((PropertyDialog) e.getSource()).getEditor();
                if (propertyEditor.getValue() != null) {
                    Classifier classifier = (Classifier) propertyEditor.getValue();
                    System.out.println("CLASSIFIER: " + classifier);
                }
            }
        });
        propertyDialog.setVisible(true);
    }

}

Sunday, December 28, 2014

K-Means works just fine




It is interesting how powerful vector quantization can be. Since I like the quantization idea a lot, I think the following insight from a 2011 paper was notable: K-Means outperforms single layer neural nets for self taught learning. The task is to learn features of an image from small patches extracted in a sliding window. Features are extracted over these patches by:

1) Learning an auto encoder over these patches
    We learn a neural net with the input layer
     connected to a smaller hidden layer that connects
     to an output layer. The input and output are the same
     so we try to reconstruct the input using feature detectors
     in the hidden layer.

2) Learning a restricted Boltzmann machine
    Probabilistic version  using a Markov random field of the above method.
    No output layer is needed since the model is undirected. Learning can be
    performed using Gibbs sampling.

3) K-Means:Vector quantization. Features are soft assignments to cluster centers.

4) Gaussian Mixtures: Fitting a gaussian mixture using Expectation Maximization. The features are
     the posterior.

These features are used for classification using a Support Vector Machine. Interestingly enough, k-means outperforms all other methods by at least three to four percent accuracy on the CIFAR and NORB data set.

However, for a lot of the bigger task more recent results suggest that deep convolutional neural nets outperform everyone else.



Tuesday, November 11, 2014

Spectrogram Interest Points: Shazaam

After some time, I decided to write another blog post. This time I want to talk about what one can do with local interest points in a spectrogram.

An interest point in a spectrogram is a point of high magnitude in time and frequency. Normally we use points that are a local maximum in a small region.
In that way these points group around interesting audio events in the spectrogram. One use of these features is audio indexing and retrieval. For example, the Shazaam app records audio using your phone. It proceeds to extract local features from the audio and compares them against a data base
of songs, indexed in advance. The app is capable of naming a large variety of songs from noisy recordings.



The algorithm for indexing is extracting such interest points from the spectrogram and continues to
hash combinations of these points. Therefore, the algorithm uses one of the points in a region as an anchor point and measures the offset from the anchor point to other points in a neighborhood. These combinations are to index the audio file. For an unseen song, the hashes are extracted using a sliding window and the app searches for matches with the pre recorded hashes. The image above shows local interest points in red and some combinatorial hashes in yellow grouping around a dolphin whistle. We
used the same features to build a dolphin whistle detector running on an underwater wearable computer.

The algorithm is quite robust to noise and shows good indexing performance on music data.

Friday, August 15, 2014

ICASSP: Pattern Discovery in Dolphin Whistles

Abstract of my ICASSP 2014 paper on Dolphin Communication Mining:




"The study of dolphin cognition involves intensive research of animal vocalizations. Marine mammalogists commonly study a specific sound type known as the whistle found in dolphin communication. However, one of the main problems arises from noisy underwater environments. Often waves and splash noises will partially distort the whistle making analysis or extraction difficult. Another problem is discovering fundamental units that allow research of the composition of whistles. We propose a method for whistle extraction from noisy underwater recordings using a probabilistic approach. Furthermore, we investigate discovery algorithms for fundamental units using a mixture of hidden Markov models. We evaluate our findings with a marine mammalogist on data collected in the field. Furthermore, we have evidence that our algorithms enable researchers to form hypotheses about the composition of whistles."


Symbolic Aggregate approXimation: A symbolic time series representation

I used this time series representation some years ago for a lot for my research. I still think it is an elegant way of representing time series. You can use this easy to use algorithm to convert a one dimensional time series into a string. Given a time series you split it into w equally sized segments and estimate the sample mean in each segment. So we end up with a time series of length w. We then divide the Y axis into k regions using split points or thresholds. Assigning a unique symbol to each of the regions we can
check in which region each sample mean falls into and read of the symbol. So we end with a string of size w.



You can see the performance on multiple time series data sets in the original paper. Furthermore, there is a very efficient way on how to index massive data sets using this representation.

Wednesday, July 16, 2014

SLIC: Simple super pixels




After searching for an easy algorithm to compute super pixels I found a cool method segment images based on color using a modified K-Means version called Simple Linear Iterative Clustering or SLIC. The input to the algorithm is a set of vectors describing each pixel in the images as a vector:

v = [l a b x y] 

The first three entries represent the color in LAB space. The last two the position in the image. The goal of the algorithm is to segment the image into k more or less equally sized regions of same color, called super pixels. The algorithm is initialized by sampling the image k times uniformly. Based on these initial centers one runs a local K-Means for each center. So a clusters influence is in a limited region.
The other novelty of the algorithm is the distance measure that scales the color and position.




Tuesday, June 17, 2014

ICASSP 2014


I was still sitting in Atlanta, when I decided to write a little post about interesting topics, papers and talks from ICASSP 2014, a speech and signal processing conference. So here are my notes on the keynotes and my impression of the conference :D.


Plenary Talk: Signal Processing in Computational Art History
Speaker: Richard Johnson

This talk described a novel field for signal processing, or computer vision called "Computational Art History". Apparently, art historians are often involved in some sort of detective work gathering information towards the authenticity of paintings. The speaker described his work so far with art historians in which he applied simple signal processing algorithms in order to gather quantifiable evidence. His three examples were authenticity of canvas paintings, photographs and laid paper. In all these examples, the research team analyzed image material from the material. In the canvas example, the authors count the number of threads on the canvas. Based on these distribution they can find evidence for two paintings being made on material from the same piece of canvas. Similar, in the photographic paper work, his team investigates if two pieces of photographic paper are from the same batch. Such indicators can help art historians to classify paintings and other pieces of art. It is interesting to see such a collaboration between art historians and signal processing folks. I hope to see more of this work in the future.

Plenary Talk: Model-Based Signal Processing
Speaker: Chris Bishop

The talk on model based signal processing was about the use of the probabilistic graphical modeling (PGM) framework or probabilistic programming for signal processing. While the talk felt a bit more like an introduction to PGMs, I like the message a lot. The main point is that instead of writing lots of code for every model, one could use PGMs as a way of specifying the problem abstractly and use general purpose inference engines (such as Microsofts infer.net or BUGS). The main example in the talk was player skill classification for the XBox system. While the talk was good, I would have wished for more insight into the actual modeling of signal processing problems. But maybe it is too much material for a keynote.

My overall conference experience was a good one. There were a lot of interesting talks and posters that were related to my signal processing interest in audio and video mining. However, since the field is pretty big there were also time slots that I found completely boring, since there was a lot of hardware research presented.


Thursday, May 1, 2014

Normalized Graph Cuts and Spectral Algorithms

I recently read multiple papers about normalized graph cuts for segmentation and spectral algorithms. In this post I will highlight some of the similarities between the two. Both algorithms are graph based algorithms for clustering data. I will first show how a data set can used in such a graph environment using gram matrices and then talk more about graph cuts and spectral algorithms.

For N data points X = {x1 ... xN} we can construct a graph G=(V, E) by using every example xi as a vertex and connect every vertex to every other vertex. The weight of this edge eij is defined by a similarity function between xi and xj. Often we use a gaussian kernel as the similarity function:



If we represent this graph as a connectivity matrix, we end with a dense matrix called the gram matrix. In the gram matrix every entry Wij is set to the edge weight between vertex i and vertex j.




A cut in a graph partitions a graph G into two disjoint subsets A, B of vertices that share no edges.
This can be achieved by removing edges from the graph. So a cut is a set of edges when removed partitions the graph into A and B. The weight of the cut is the sum of the weights of this edge set:


For classification we could search the cut in the graph with minimum weight.  This means we remove the edges from the graph where the similarity is lowest. In that way, the two remaining sets A and B are the clusters. However, now the minimum cut might just separate only a few nodes from the rest of the graph. This can be avoided using normalized graph cuts which include the connectivity or degree of each vertex:



We can capture the connectivity in the denominator as the degree matrix. The degree matrix is a matrix with all zeros except the diagonal. Along the diagonal we capture the connectivity of each node:

 

Going towards the solution, the graph Laplacian is defined as:


With some math, we can find the solution by extracting the top k eigenvectors from the matrix:


If we have N examples and k vectors, the result is a k x N matrix. If we transpose it we have N vectors
of size k. Now we can cluster these new instances using k-means. This method is known as k-way approximate graph cut. In my opinion this algorithm is closely related to spectral clustering, which uses a very similar formulation [1]. Most of the post follows[2].


REFERENCES

  • [1] Andrew Y. Ng, Michael I. Jordan, Yair Weiss: On Spectral Clustering: Analysis and an algorithm. NIPS, 2001
    [2] Jianbo Shi, Jitendra Malik: Normalized Cuts and Image Segmentation, PAMI, 2000

Thursday, January 30, 2014

Convolutional Restricted Boltzmann Machines and Restricted Boltzmann Machines

During Christmas I started reading about Convolutional Restricted Boltzmann Machines. While I was  reading and implementing Restricted Boltzmann Machines and Deep Belief Networks last semester I got curious about this other type of neural network. I will give a brief overview about both models and how they can be used for feature learning.

Restricted Boltzmann Machines
A Restricted Boltzmann Machine (RBM) is a two layer generative stochastic neural network. Each node of the bottom layer is connected to all nodes in the top layer. The restriction is that there are no connections in a layer (see Figure 1).

Figure 1: A Restricted Boltzmann Machine with four 
visible nodes (bottom layer) and two hidden nodes (top layer).

In the easiest case all nodes are binary stochastic units, meaning the value of a node or variable can only be 0 and 1. A joint configuration (meaning we have values for every hi and every vj) has the following energy:




This equation says we have a bias on every visible input, a bias on every hidden unit and a weight on all connections between visible and hidden units. Then (as common in energy based probabilistic graphical models [1]) the probability of such joint configuration is:


While the scaler is summing over all possible joint configurations. However getting an unbiased sample for a hidden unit given all input units set is easy since there are no connections between them.
And involves rejection sampling from the following probability distribution:



Where sigma is the sigmoid function. The same is true for a sample of a visible unit, given all hidden units set:



Now learning can be performed using an alternating sampling approach. We first put a data point (vector of length of the input layer) as the configuration of the visible layer, compute the probability for P(h = 1|v) and then sample a hidden layer configuration from that. Now we fix the hidden layer, 
compute the probability P(v = 1 | h) and sample from that. We can repeat this process, for some iterations. However, Hinton [2] showed performing this up and down step one and a half times (up - down - up) is sufficient to compute a gradient to improve the model. This algorithm is called contrastive divergence.

Now the idea is to stack these models into a deep network by training each layer in isolation in a greedy manner. So we start with the first RBM as described above. We use the activations for each data point at the hidden units as our new data set, and repeat this process.  Intuitively, every layer
tries to have a small reconstruction error. So by stacking these models in such a way, we learn a hierarchical feature representation. Our top layer can then be used as a new feature space for clustering or classification.

Convolutional Restricted Boltzmann Machines (CRBM)
If we think about images and we want to learn features, we could collect a large number of local patches from images and then train a deep network as our feature extractor. However, the number of parameters to learn is huge and for computational reasons this approach might be impractical. Another (more holistic) approach to image features is to learn a set of k- filters with different responses. So instead of one weight matrix as in the RBM case, we have k. However, these filters are much smaller. I saw  7 x 7 pixel filters used. 


So the input to a Convolutional Restricted Boltzmann Machine is a complete image (not patches) 
and the hidden activations are k images, the result of convolving the input image with every filter. 
These responses are compressed using "max pooling", meaning we extract the max in non overlapping regions from the responses. The probability calculations are similar to the ones for regular RBMs and are explained in Lee et al[3]. Furthermore, we can use the same inference and estimation procedure as in RBMs. The difference is, that we do not update a weight matrix, but the k filters. Now we can stack CRBMs, too. Using the compressed responses as the input to the next layer, 
we can explain, larger and larger regions per layer. For example, the first layer might learn a set of k edge detectors for different edge directions, while the highest layer might learn actual object detectors such as wheels. 


REFERENCES:
[1] Christopher M. Bishop: "Pattern Recognition and Machine Learning", Springer, 2007
[2] Geoffrey Hinton:"A Practical Guide to Training Restricted Boltzmann Machines"
 UTML TR 2010–003, Department of Computer Science, University of Toronto 
[3] Honglak Lee Roger Grosse Rajesh Ranganath Andrew Y. Ng: "Convolutional Deep Belief Networks
for Scalable Unsupervised Learning of Hierarchical Representations", ICML '09 , 2009

Thursday, October 10, 2013

My thoughts on the value of data processing in practical applications.

Just a quick post about a few thoughts about data engineering vs machine learning.
While I definitely like complex algorithms for Machine Learning, my recent results on applications such as Dolphin Communications or Typing Error Detection show me that most of the time the data preprocessing is doing a lot of the work while I get away with more or less simple models such as Gaussian Mixture Models, Hidden Markov Models, Decision Trees or Random Forests (okay the last one just works great in general). Now the question is where the tradeoff is between data engineering and actual learning. Deep Learning promises to extract features on their own and are on the learning heavy sides of things. I think Speech Recognition is an example that is traditionally approached with carefully crafted features such as Mel Cepstrum Components. I observe the same difference for carefully modeling such as Hidden Markov Models and more or less structureless models.
So for me it seems fitting an existing algorithm to new domains under heavy pre processing can lead to good results faster than the search for new models. Maybe I should call it Machine Learning Prototyping. While developing learning methods to solve new problems is the higher goal, whenever one needs a learning "prototype" for a specific domain, hacking the data and some existing model might be sufficient.

Sunday, July 14, 2013

Baum Welch Initialisation: Flat Start vs Randomisation vs Viterbi Training

As most of the readers will know Baum Welch is a parameter estimation Algorithm for Hidden Markov Models. Given a initial configuration of the HMM it will converge to a locally optimal parameter configuration. The important bit is that the initial configuration will determine "how good" the final parameter set will be. To my knowledge there are three initialisation methods two are implemented in the HTK3 toolkit:

Viterbi Training
Viterbi training is implemented in HTK3 in the tool HInit. First the data is uniformly segmented
with the number of segments equal to the number of states. Then we get the optimal path through the HMM using Viterbi + Backtracking. Using the path we reestimate the model.

Flat Start
Flat start simply initialises the model with the observations uniform. For example we estimate a uniform Gaussian from all data available and use it for all initial distributions. In both cases, Flat Start and Viterbi Training we set the initial transition probabilities ourselves. Flat Start is implemented in HTK3 in the HCompV tool. The problem of Viterbi training is that it requires labeled data, while Flat Start does not.

Random Initialisation with Random Restarts 
Viterbi Training needs labeled data and Flat Start is a simple initialization but can give a bad initial estimate. Since Baum Welch is dependent on the initial starting condition, we can also use random initializations with random restarts. In that way we can initialize the transition matrix by setting all possible transitions to random values. The Gaussians can be initialized as in Flat Start and then we add some small random number to the mean. We construct multiple random initializations and run Baum Welch. In the end we keep the model with the best likelihood. We can also use a random subset for the initial Gaussians. Unfortunately there is no implementation for this method in HTK. In my experiments this initialization works best in the scenario where no labels are available and Flat Start gives a bad estimates.




Monday, July 8, 2013

On different depictions of Hidden Markov Models

During my Diploma thesis I read most literature about Hidden Markov Models from the speech community. Since I arrived at Tech and started reading more about the Machine Learning perspective on these models I noticed a difference in depiction of Hidden Markov Models. In this post I want to argue why I think the "old" speech depiction is a bit more expressive.

The old depiction is a state machine. Each state of the HMM is depicted as a circle and the arrows
show how one can "travel" through the model. I saw this depiction in the classic Rabiner papers.


In the second representation is a Hidden Markov Model as a Bayes Network. There are two kinds of nodes. The one at the top show a discrete variable, only dependent on it's successor. Encoded in this node are the transition probabilities. Or in other words the state machine is encoded in each of these nodes. There are as many transition nodes as there are samples in the time series. The nodes at the bottom are the observation probabilities which are not seen in the model from the speech community. 


I think if one only publishes the second depiction it does not tell much about the model itself. One already assumes that there are observation distributions in an HMM so why depict them. Second just drawing as many nodes as there are samples is weird for me for two reasons. The first reason is that we do not depict anything about the transition model, which is the only thing to model in an HMM. The second reason is that the number of samples or time slices does not matter at all for an HMM. So I think  drawing the state machine is a way better way of showing your model then making the ML depiction that does not tell much of your specific model except that it is an HMM. Just to make my point more clear. The second model, even when specifying in the text that we use 3 states could be the top model or the one at the bottom, without the reader knowing.



However, explaining inference algorithms such as variable elimination is easier in the middle model, but for specific models I prefer the state machine depiction.

Tuesday, July 2, 2013

log-math: Dirichlet Distribution

A long time since my last post again. But I am working on one for Gaussian Processes but I am procrastinate there. So I decided to write another one.

As my series on Expectation Maximisation ("Thoughts on EM") evolved from research, implementations and readings, I will start a new series posting probability functions in their log likelihood form. So in the first post I will post my "log calculations" for the Dirichlet Distribution.
The Dirichlet Distribution is often used as a prior on a multinomial distribution. Using that prior one can  estimate Multinomial Distributions more stable, meaning your probability estimates do not go wild. In simple words the Dirichlet Distribution is a distribution over a Multinomial Distribution with parameters
alpha. The alpha parameter is a vector of the same size as the Multinomial Distribution. The probability
density function is:


The scaler can be expressed using the beta function:


The gamma is well the gamma function,  a numerical stable log implementation is available in the submission: "Algorithm 291: logarithm of gamma function". So let us start by converting the actual distribution to log:

Converting the scaler is also straight forwards:


And that is actually all already :D. The log - gamma function can be implemented from the paper :D



Thursday, March 14, 2013

Thoughts on EM - Clustering: kmeans++

In this post I will talk a bit about initialization of the EM algorithm for Gaussian Mixtures and numerical  stability. It might be the last post on clustering since I have a stable enough version for me now.

The initialization I want to talk about is kmeans++, an algorithm for initializing classic kmeans, which can also be used for Gaussian Mixtures. The high level bit is instead of starting with random samples, we prefer samples that are far from the already chosen centers.


However if we us the furthest point algorithm, we are more sensible to outliers since the furthest points might not belong to any cluster. Kmeans++ is combining ideas for classic kmeans initialization (pick k points at random) and furthest point. We do so by defining a probability distribution that assigns a probability to each sample being a good initial mean. The distribution is based on the distance to the closest center that is already assigned.


So the algorithm for initialization picks the first center at random and then samples the next center based on the probabilities. As one can see if the distance of a sample to the closest distance is large the probability will be large, too and vice versa. We stop drawing samples when we have k centers. Based on these centers we can run classic kmeans or Gaussian Mixture EM.

In my experiments kmeans++ performed great as an initialization for EM while being way faster then hierarchical clustering.

Monday, February 4, 2013

Magic Summon: Building false positive free Gestures from just negative examples

Thad and I have a new publication in the Journal Of Machine Learning Research. The nugget for me in this paper is that we were able to build a tool that automatically suggests gestures that won't false trigger in everyday life. The paper can be found in Volume 14. I attached the abstract for everyone interested:

"Gestures for interfaces should be short, pleasing, intuitive, and easily recognized by a computer. However, it is a challenge for interface designers to create gestures easily distinguishable from users' normal movements. Our tool MAGIC Summoning addresses this problem. Given a specific platform and task, we gather a large database of unlabeled sensor data captured in the environments in which the system will be used (an "Everyday Gesture Library" or EGL). The EGL is quantized and indexed via multi-dimensional Symbolic Aggregate approXimation (SAX) to enable quick searching. MAGIC exploits the SAX representation of the EGL to suggest gestures with a low likelihood of false triggering. Suggested gestures are ordered according to brevity and simplicity, freeing the interface designer to focus on the user experience. Once a gesture is selected, MAGIC can output synthetic examples of the gesture to train a chosen classifier (for example, with a hidden Markov model). If the interface designer suggests his own gesture and provides several examples, MAGIC estimates how accurately that gesture can be recognized and estimates its false positive rate by comparing it against the natural movements in the EGL. We demonstrate MAGIC's effectiveness in gesture selection and helpfulness in creating accurate gesture recognizers.