Thursday, January 30, 2014

Convolutional Restricted Boltzmann Machines and Restricted Boltzmann Machines

During Christmas I started reading about Convolutional Restricted Boltzmann Machines. While I was reading about and implementing Restricted Boltzmann Machines and Deep Belief Networks last semester, I got curious about this other type of neural network. I will give a brief overview of both models and how they can be used for feature learning.

Restricted Boltzmann Machines
A Restricted Boltzmann Machine (RBM) is a two-layer generative stochastic neural network. Each node of the bottom layer is connected to all nodes in the top layer. The restriction is that there are no connections within a layer (see Figure 1).

Figure 1: A Restricted Boltzmann Machine with four visible nodes (bottom layer) and two hidden nodes (top layer).

In the simplest case all nodes are binary stochastic units, meaning the value of a node or variable can only be 0 or 1. A joint configuration (meaning we have values for every h_i and every v_j) has the following energy:
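Written in the usual notation, and assuming the symbol names a_j for the visible biases, b_i for the hidden biases and w_ij for the weight between hidden unit h_i and visible unit v_j, the standard energy of a binary RBM is:

```latex
E(v, h) = -\sum_j a_j v_j - \sum_i b_i h_i - \sum_{i,j} h_i w_{ij} v_j
```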




This equation says we have a bias on every visible input, a bias on every hidden unit and a weight on every connection between a visible and a hidden unit. As is common in energy-based probabilistic graphical models [1], the probability of such a joint configuration is:
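In the standard Boltzmann form, with E(v, h) the energy of the joint configuration and Z an assumed name for the normalizer, this probability reads:

```latex
P(v, h) = \frac{1}{Z} e^{-E(v, h)}, \qquad Z = \sum_{v', h'} e^{-E(v', h')}
```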


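Here the normalizer (the partition function, usually written Z) sums over all possible joint configurations, so it is expensive to compute exactly. However, getting an unbiased sample of a hidden unit given all visible units is easy: since there are no connections between hidden units, they are conditionally independent given the visible layer. Sampling only requires a Bernoulli draw with the following probability: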
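For binary units this is the standard conditional, written here with the assumed symbol names b_i for the bias of hidden unit h_i and w_ij for its weight to visible unit v_j:

```latex
P(h_i = 1 \mid v) = \sigma\Big(b_i + \sum_j w_{ij} v_j\Big), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}
```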



Here sigma is the logistic sigmoid function. The same holds for sampling a visible unit given all hidden units:
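By symmetry, and assuming the symbol names a_j for the bias of visible unit v_j and w_ij for its weight to hidden unit h_i, the standard form is:

```latex
P(v_j = 1 \mid h) = \sigma\Big(a_j + \sum_i w_{ij} h_i\Big)
```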



Learning can now be performed using an alternating sampling approach. We first set the visible layer to a data point (a vector with the length of the input layer), compute the probability P(h = 1 | v) and then sample a hidden layer configuration from it. Now we fix the hidden layer, compute the probability P(v = 1 | h) and sample from that. We can repeat this process for several iterations. However, Hinton [2] showed that performing this up and down step one and a half times (up - down - up) is sufficient to compute a gradient estimate that improves the model. This algorithm is called contrastive divergence (here with a single reconstruction step, CD-1).
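The up - down - up procedure can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation from [2]: the function name cd1_update, the learning rate, the toy data and the array layout (W has one row per hidden unit) are all assumptions made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, rng, lr=0.1):
    """One CD-1 step for a binary RBM (hypothetical helper).

    v0: (batch, n_visible) binary data
    W:  (n_hidden, n_visible) weights
    a:  (n_visible,) visible biases, b: (n_hidden,) hidden biases
    """
    # Up: compute P(h = 1 | v0) and sample a hidden configuration.
    ph0 = sigmoid(v0 @ W.T + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(v0.dtype)
    # Down: compute P(v = 1 | h0) and sample a reconstruction.
    pv1 = sigmoid(h0 @ W + a)
    v1 = (rng.random(pv1.shape) < pv1).astype(v0.dtype)
    # Up again: hidden probabilities for the reconstruction.
    ph1 = sigmoid(v1 @ W.T + b)
    # Gradient estimate: data statistics minus reconstruction statistics.
    n = v0.shape[0]
    dW = (ph0.T @ v0 - ph1.T @ v1) / n
    da = (v0 - v1).mean(axis=0)
    db = (ph0 - ph1).mean(axis=0)
    return W + lr * dW, a + lr * da, b + lr * db

# Toy usage: four visible and two hidden units, as in Figure 1.
rng = np.random.default_rng(0)
data = np.array([[1, 1, 0, 0], [0, 0, 1, 1]] * 10, dtype=float)
W = 0.01 * rng.standard_normal((2, 4))
a = np.zeros(4)
b = np.zeros(2)
for _ in range(200):
    W, a, b = cd1_update(data, W, a, b, rng)
```

Using the hidden probabilities rather than the sampled states in the gradient statistics is a common choice to reduce sampling noise; other variants are possible.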

The idea is now to stack these models into a deep network by training each layer in isolation in a greedy manner. We start with the first RBM as described above. We then use the activations of its hidden units for each data point as our new data set, and repeat this process for the next layer. Intuitively, every layer tries to achieve a small reconstruction error. By stacking these models in such a way, we learn a hierarchical feature representation. The top layer can then be used as a new feature space for clustering or classification.
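The greedy layer-wise scheme might look as follows. This is only a sketch under assumptions: train_rbm and train_stack are hypothetical helper names, each layer is trained with plain CD-1 on the full batch, and the hidden activation probabilities (rather than sampled states) are passed up as the next layer's data.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, rng, epochs=50, lr=0.1):
    """Train one RBM with CD-1 and return (W, a, b). Hypothetical helper."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_hidden, n_visible))
    a = np.zeros(n_visible)
    b = np.zeros(n_hidden)
    for _ in range(epochs):
        ph0 = sigmoid(data @ W.T + b)                      # up
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigmoid(h0 @ W + a)                          # down
        v1 = (rng.random(pv1.shape) < pv1).astype(float)
        ph1 = sigmoid(v1 @ W.T + b)                        # up
        W += lr * (ph0.T @ data - ph1.T @ v1) / len(data)
        a += lr * (data - v1).mean(axis=0)
        b += lr * (ph0 - ph1).mean(axis=0)
    return W, a, b

def train_stack(data, layer_sizes, rng):
    """Greedily train one RBM per entry in layer_sizes."""
    layers = []
    x = data
    for n_hidden in layer_sizes:
        W, a, b = train_rbm(x, n_hidden, rng)
        layers.append((W, a, b))
        # The hidden activations become the data set for the next layer.
        x = sigmoid(x @ W.T + b)
    return layers, x

# Toy usage: 20 random binary vectors of length 6, two stacked layers.
rng = np.random.default_rng(0)
data = (rng.random((20, 6)) < 0.5).astype(float)
layers, features = train_stack(data, [4, 2], rng)
```

Here features holds the top-layer representation of every data point, which could then be handed to a clustering or classification method.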

Convolutional Restricted Boltzmann Machines (CRBM)
If we want to learn features for images, we could collect a large number of local patches from images and then train a deep network on them as our feature extractor. However, the number of parameters to learn is huge, and for computational reasons this approach might be impractical. Another (more holistic) approach to image features is to learn a set of k filters with different responses. So instead of one weight matrix as in the RBM case, we have k of them. However, these filters are much smaller; I have seen 7 x 7 pixel filters used.


So the input to a Convolutional Restricted Boltzmann Machine is a complete image (not patches), and the hidden activations are k images, each the result of convolving the input image with one of the filters. These responses are compressed using "max pooling", meaning we keep only the maximum within non-overlapping regions of each response (Lee et al. [3] use a probabilistic variant). The probability calculations are similar to the ones for regular RBMs and are explained in Lee et al. [3]. Furthermore, we can use the same inference and estimation procedure as in RBMs. The difference is that we do not update a single weight matrix, but the k filters. We can stack CRBMs, too: using the compressed responses as the input to the next layer, each layer can explain larger and larger regions of the image. For example, the first layer might learn a set of k edge detectors for different edge directions, while the highest layer might learn actual object detectors, such as detectors for wheels.
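The filter-then-pool step can be sketched as below. Several simplifications are assumed: the helper names are hypothetical, the filters are random rather than learned, the filter is applied as a cross-correlation (a convolution without flipping the filter), and plain deterministic max pooling is applied to the sigmoid activations instead of the probabilistic max pooling of Lee et al. [3].

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def correlate_valid(image, filt):
    """'Valid' 2-D cross-correlation of an image with one small filter."""
    fh, fw = filt.shape
    out_h = image.shape[0] - fh + 1
    out_w = image.shape[1] - fw + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + fh, c:c + fw] * filt)
    return out

def max_pool(response, size):
    """Maximum over non-overlapping size x size regions."""
    h, w = response.shape
    h, w = h - h % size, w - w % size  # drop incomplete border regions
    blocks = response[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

def crbm_hidden(image, filters, biases, pool=2):
    """Return k pooled response maps, one per filter."""
    maps = []
    for filt, bias in zip(filters, biases):
        activation = sigmoid(correlate_valid(image, filt) + bias)
        maps.append(max_pool(activation, pool))
    return np.stack(maps)

# Toy usage: one 20 x 20 image, k = 3 random 7 x 7 filters, 2 x 2 pooling.
rng = np.random.default_rng(0)
image = rng.random((20, 20))
filters = 0.01 * rng.standard_normal((3, 7, 7))
biases = np.zeros(3)
pooled = crbm_hidden(image, filters, biases)
```

With these sizes each filter yields a 14 x 14 response, which pooling compresses to 7 x 7, so pooled has shape (3, 7, 7) and would serve as the input to the next layer.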


REFERENCES:
[1] Christopher M. Bishop: "Pattern Recognition and Machine Learning", Springer, 2007
[2] Geoffrey Hinton: "A Practical Guide to Training Restricted Boltzmann Machines", UTML TR 2010-003, Department of Computer Science, University of Toronto
[3] Honglak Lee, Roger Grosse, Rajesh Ranganath, Andrew Y. Ng: "Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations", ICML '09, 2009
