Sunday, September 10, 2017

Variational Fashion Encoder

In this experiment I want to try out Variational Auto Encoders, an unsupervised feature learning algorithm, on a new fashion classification dataset from Zalando. Since I wanted to play around with both some more, I will build a generic Variational Auto Encoder and then train it on the new fashion-mnist dataset.
Code can be found on GitHub.

Variational Auto Encoder

The bottom part of the model embeds the input X into a mean and a variance vector.
The mean and variance represent the parameters of a Gaussian that is trained to be close to a standard normal distribution N(0, I). The decoder network then tries to reconstruct the input from a sample drawn from said distribution. The encoder as well as the decoder network can have multiple layers, in my case three each.

The loss function is a combination of the reconstruction error and the Kullback-Leibler divergence between the distribution produced by the encoder network and a standard normal distribution, basically centering the latent space around 0 with standard deviation 1. This can be seen as a form of regularisation. Furthermore, the decoder
does not reconstruct directly from the encoder output but from a sample of the distribution in the latent space.
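
Written out, the loss for a single example x has the standard VAE form (a sketch of the usual objective, not code from this post):

$$\mathcal{L}(x) = \mathrm{KL}\big(q(z \mid x)\,\|\,\mathcal{N}(0, I)\big) - \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big], \qquad z = \mu(x) + \sigma(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

where z is the reparameterised sample the decoder actually sees.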

Fashion Reconstruction

The images in the fashion dataset are 28 pixels by 28 pixels, in the same way as the MNIST dataset.
The fashion data consists of 10 classes of clothes:
  1. tshirt
  2. trouser
  3. pullover
  4. dress
  5. coat
  6. sandal
  7. shirt
  8. sneaker
  9. bag
  10. ankle boot
After training for 5 epochs on the training set, I plotted the reconstructions from the Variational Auto Encoder.

As one can see, the reconstructions work quite well, indicating that the learned latent space is useful.

Tuesday, August 1, 2017

Investigating Non Linear Optimization in Python

I recently skimmed through some class videos by Tucker Balch, Professor at Georgia Tech, called "Machine Learning For Trading". One interesting part was portfolio optimisation.
That is, if you want to invest some amount of money, how do you distribute it
over a selection of assets in a way that minimises risk while at the same time
earning money through your investment? A portfolio is basically an allocation of the money over different options, or in other words, the percentage of the total amount of money allocated to an option. Anyway, in the class as well as in the course book, the optimisation is performed using a
Python method for constrained non-linear optimisation. The algorithm used to crunch these problems is called Sequential Least Squares Programming, which is implemented as a wrapper around Fortran code.
The original algorithm, written by D. Kraft, can be found in the ACM Algorithms catalog (Algorithm 733),
and the Fortran code is automatically wrapped for Python using the Fortran-to-Python interface generator (f2py).
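
To make the objective concrete, here is a minimal Scala sketch (my own illustration, not code from the course) of the function such an optimiser would minimise: the standard deviation of daily portfolio returns for a given allocation, with the constraint that the allocation sums to one left to the optimiser.

object Portfolio {

  // dailyReturns(t)(k): hypothetical return of asset k on day t;
  // allocation(k): fraction of the money assigned to asset k (assumed to sum to one)
  def risk(allocation: Array[Double], dailyReturns: Array[Array[Double]]): Double = {
    val portfolio = dailyReturns.map { day =>
      day.zip(allocation).map { case (r, a) => r * a }.sum
    }
    val mean = portfolio.sum / portfolio.length
    math.sqrt(portfolio.map(r => math.pow(r - mean, 2)).sum / portfolio.length)
  }
}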

Sunday, July 23, 2017

Black Box Variational Inference and PyFlux

I found an interesting paper on black box variational inference. Variational inference is an
approximate inference algorithm for factored probabilistic models (such as Bayesian networks).

In probabilistic inference, we are normally interested in the posterior density of a latent variable
given all our data. The basic idea is to turn the inference problem into an optimisation problem.
That is, we build an approximate probability distribution that is differentiable with respect to its parameters. We then find the parameter set that minimises the difference between the true posterior and the approximate distribution.

The problem is that in order to do this for a new model, we still have to do a lot of manual derivations
and work through a lot of math before we can actually use this algorithm. This new paper,
however, shows an algorithm that is easier to use for a new model. In my opinion,
one of the successes in deep learning is auto differentiation, where we do not
have to develop all gradients on our own but let the computer do it (as described in earlier posts).
One library that implements this inference method for complex time series is called pyFlux [2].

The proposed algorithm [1] works as follows. First, we sample the latent variables in our model
from the approximate distribution. We then use the samples to compute a gradient
for each parameter that maximises the Evidence Lower Bound (ELBO), or equivalently minimises the Kullback-Leibler divergence to the true posterior.
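
Concretely, with S samples z[s] drawn from q(z | λ), the paper's gradient estimator of the ELBO is:

$$\nabla_{\lambda} \mathcal{L} \approx \frac{1}{S} \sum_{s=1}^{S} \nabla_{\lambda} \log q(z[s] \mid \lambda)\,\big( \log p(x, z[s]) - \log q(z[s] \mid \lambda) \big)$$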

Reproduced from original paper [1]
The z[s] are samples from the latent space; each z[s] includes a value for each of the latent variables. These samples are used to build the gradient.
The Robbins-Monro sequence models an adaptive learning-rate schedule
that guarantees convergence. The gradient has multiple components: the log joint probability of the original model, the log of the approximate distribution, and the gradient of the log approximate distribution. For a lot of cases, these values are easy to compute. First, note that when using the samples, all variables in the original joint distribution are observed. For a Bayesian network, that means the joint distribution is simply the product over all variables given their parents:
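
$$p(x, z) = \prod_{i} p(v_i \mid \mathrm{pa}(v_i))$$

where pa(v_i) denotes the parents of variable v_i in the network.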

In the paper and in the pyFlux library, the approximate distribution is implemented as a mean field,
which means that the variables in our model are all independent. This makes the sampling and evaluation of the model very easy: each variable can be sampled individually, and the joint approximate distribution is simply the product over all the variables. Finally, the pyFlux library implements each variable in the approximate distribution as a one-dimensional Gaussian. Putting all that together,
the final gradient becomes:
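
With $q(z \mid \lambda) = \prod_i q_i(z_i \mid \lambda_i)$, the estimator factorises per variable (my reconstruction, consistent with the paper):

$$\nabla_{\lambda_i} \mathcal{L} \approx \frac{1}{S} \sum_{s=1}^{S} \nabla_{\lambda_i} \log q_i(z_i[s] \mid \lambda_i) \Big( \log p(x, z[s]) - \sum_j \log q_j(z_j[s] \mid \lambda_j) \Big)$$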

where the lambda parameters are the mean and the standard deviation of each of the Gaussians used,
with the gradients:
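
For a one-dimensional Gaussian $q(z \mid \mu, \sigma)$ these score functions are the standard ones:

$$\nabla_{\mu} \log q(z \mid \mu, \sigma) = \frac{z - \mu}{\sigma^2}, \qquad \nabla_{\sigma} \log q(z \mid \mu, \sigma) = \frac{(z - \mu)^2}{\sigma^3} - \frac{1}{\sigma}$$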


[1] Ranganath, Gerrish, Blei: "Black Box Variational Inference", AISTATS 2014
[2] PyFlux

Friday, May 26, 2017

A Generic Sequence Alignment Algorithm

In this post I will describe how a general sequence alignment could be implemented in Scala.
For readers that don't know what sequence alignment is, I recommend reading my
post on building a gesture recogniser under Dynamic Time Warping distance (DTW).

Sequence alignment is often used in several contexts, including speech recognition,
gesture recognition, approximate string matching, typing correction, biological
sequence alignment, time series classification and even stereo vision.
All these algorithms are based on a technique called dynamic programming, and the algorithms
are all very similar. However, in the literature this alignment technique is often found under different names (Dynamic Time Warping, Levenshtein Distance, Needleman-Wunsch).

I will first formulate my idea of a general alignment algorithm with a Scala implementation and afterwards describe implementations of the three algorithms named above.

Given two sequences X = [x1 .. xi .. xN] and Y = [y1 .. yj .. yM], we use the following
recursion to find the alignment score S(x, y). At each position i in sequence X and position j in sequence Y, we look back at the matching score in X only, in Y only, and in X and Y simultaneously, resulting in the following recursion:
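
With S(i, j) denoting the score of aligning the first i elements of X with the first j elements of Y, the recursion reads (mirroring the code below):

$$S(i, j) = \mathrm{rank}\big(S(i-1, j),\; S(i-1, j-1),\; S(i, j-1),\; x_i,\; y_j\big)$$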

The rank function scores the alignment at positions i and j based on the previous alignment scores and
the current elements. It is the main component that changes between the three algorithms.
Looking back in the X direction is also called an insertion, in the Y direction a deletion, and looking
back in both is called a match, so the ranking function combines these three cases:
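
For a distance-minimising alignment such as DTW or the edit distance, a generic form is (with c the local cost of the respective operation):

$$\mathrm{rank}(s_{\mathrm{ins}}, s_{\mathrm{match}}, s_{\mathrm{del}}, x_i, y_j) = \min\big(s_{\mathrm{ins}} + c_{\mathrm{ins}},\; s_{\mathrm{match}} + c_{\mathrm{match}}(x_i, y_j),\; s_{\mathrm{del}} + c_{\mathrm{del}}\big)$$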

Furthermore, we will also implement a generic initialisation function that starts our recursion.
OK, now the Scala code:

class Alignment[A](
  init: (Int, Int) => Double,
  rank: (Double, Double, Double, A, A) => Double
) {

  def score(x: Array[A], y: Array[A]): Double = {  
    val n  = x.size
    val m  = y.size
    val dp = Array.ofDim[Double](n + 1, m + 1)
    for(i <- 0 until n + 1) dp(i)(0) = init(i, 0)
    for(j <- 0 until m + 1) dp(0)(j) = init(0, j)

    for(i <- 1 until n + 1) {
      for(j <- 1 until m + 1) {
        dp(i)(j) = rank(
          dp(i - 1)(j),     // insertion
          dp(i - 1)(j - 1), // match
          dp(i)(j - 1),     // deletion 
          x(i - 1),         // current sample in x 
          y(j - 1)          // current sample in y



The alignment class has a type parameter specifying the type of the elements in the sequences we want to align and takes the initialising function and the ranking function as input. Now DTW and the Edit distance are just specialisations with a different type parameter and scoring function:

  class DTW extends Alignment[Double](
    init = (i: Int, j: Int) => if(i == 0 && j == 0) 0.0 else Double.PositiveInfinity,
    rank = (insert: Double, matching: Double, delete: Double, x: Double, y: Double) =>
      // take the cheapest predecessor, then add the local squared distance
      math.min(math.min(insert, delete), matching) + math.pow(x - y, 2)
  )

DTW is defined over doubles, initialising the recursion with 0.0 at position (0,0) and all other border positions with infinity. The ranking adds the squared Euclidean distance to the minimum of insertion, deletion and match. It is easy to see how this can be generalised to multivariate sequences: the type parameter becomes Seq[Double] and the Euclidean distance is implemented for vectors. One could also use kernels or the cosine similarity instead. The next function is the edit distance for discrete sequence elements (here characters):

  class Levenshtein extends Alignment[Char](
    init = (i: Int, j: Int) => if(i == 0 && j == 0) 0.0 else if(i == 0) j else i,
    rank = (insert: Double, matching: Double, delete: Double, x: Char, y: Char) =>
      // insertions and deletions cost 1; a match is free, a substitution costs 1
      math.min(math.min(insert + 1, delete + 1), matching + (if(x == y) 0.0 else 1.0))
  )

The recursion is initialised with the index at the borders, and the ranking again takes the minimum of insertion, deletion and match. Insertions and deletions count as an error, so we add 1. A match can also be a substitution, so we add 1 to the error if the characters do not match. The last algorithm is the Needleman-Wunsch algorithm. Instead of minimising a distance, we are maximising a score. The similarity between two symbols can be defined freely, and the 'gap' penalty is not 1 as in the edit distance but given explicitly.

  final val Sim = Map(
    ('a', 'a') -> 10,
    ('a', 'g') -> -1,
    ('g', 'a') -> -1,
    ('a', 'c') -> -3,
    ('c', 'a') -> -3,
    ('a', 't') -> -4,
    ('t', 'a') -> -4,
    ('g', 'g') ->  7,
    ('g', 'c') -> -5,
    ('c', 'g') -> -5,
    ('g', 't') -> -3,
    ('t', 'g') -> -3,
    ('c', 'c') ->  9,
    ('c', 't') ->  0,
    ('t', 'c') ->  0,
    ('t', 't') ->  8
  )

  final val Gap = -5
  class NeedlemanWunsch extends Alignment[Char](
    init = (i: Int, j: Int) => if(i == 0 && j == 0) 0.0 else if(i == 0) j * Gap else i * Gap,
    rank = (insert: Double, matching: Double, delete: Double, x: Char, y: Char) =>
      // maximise the score: gaps add the Gap penalty, a match adds the symbol similarity
      math.max(math.max(insert + Gap, delete + Gap), matching + Sim(x, y))
  )

If one needs the actual alignment, we can backtrack the alignment process in all of the above cases. The code can be found here [GitHub].
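
As a quick sanity check, a hypothetical usage of the classes above (my own toy example):

  val dtw = new DTW
  dtw.score(Array(1.0, 2.0, 3.0), Array(1.0, 2.0, 2.0, 3.0)) // 0.0: the repeated 2 aligns at no cost

  val lev = new Levenshtein
  lev.score("kitten".toArray, "sitting".toArray) // 3.0: two substitutions and one insertion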

Saturday, April 15, 2017

Three Algorithm Tricks in Data Mining

I really enjoy it when concepts can be reused or applied in several contexts. Three of the algorithms
that seem to be everywhere are quick sort, merge sort and binary search. In this post I will
show some tricks I learned while coding: using an algorithm similar to the merge method
of merge sort for fast sparse vector multiplication, using binary search to insert into a histogram in O(log(n)) time, and using the partition method of quick sort during C4.5 decision tree training.

Fast Sparse Vector Multiplication

A sparse vector is a vector in which most of the entries are zeros. A common data structure is an array saving all non-zero entries. Each entry is a tuple of the position and the corresponding value, and the array is sorted in increasing order of position. Given two such vectors, we can multiply them very efficiently. We use two variables i and j, indicating a position in the arrays x and y. The variables move at the same speed as long as the positions of x(i) and y(j) are the same; in this case, we also accumulate the product of the vector elements. If i points to a larger position than j, we only move j until the positions are equal again, and vice versa. Since in that case x(i) or y(j) faces an implicit zero entry in the other vector, we don't have to compute anything. The movement of i and j is the same as in the merge method of merge sort.

object DotProduct {

  case class SparseNode(pos: Int, value: Double)

  type SVector = Array[SparseNode]  

  def dot(x: SVector, y: SVector): Double = {
    var i = 0
    var j = 0
    var product = 0.0
    while(i < x.length && j < y.length) {
      if(x(i).pos == y(j).pos) {
        product += x(i).value * y(j).value
        i += 1
        j += 1
      } else if(x(i).pos < y(j).pos) i += 1
      else j += 1
    }
    product
  }
}

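For example, two sparse vectors that only share the positions 3 and 7 (a made-up toy input) multiply as:

  import DotProduct._

  val x: SVector = Array(SparseNode(0, 1.0), SparseNode(3, 2.0), SparseNode(7, 0.5))
  val y: SVector = Array(SparseNode(3, 4.0), SparseNode(5, 1.0), SparseNode(7, 2.0))
  dot(x, y) // 2.0 * 4.0 + 0.5 * 2.0 = 9.0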

Histogram Insertion

A histogram can be represented as an array that holds the counts of values that fall into
a certain region. The regions are defined by another array containing the boundaries, which should be sorted
in increasing order. When we want to update the histogram, we need to find the region a value falls in. Naively, this can be done by iterating over the boundary array and checking if the value is in the range. However, we can perform this search using binary search on the boundary array.
Instead of searching for an element, we check whether the value is larger than boundary[middle - 1] and smaller than or equal to boundary[middle]; if so, we increase the count for middle, else we recursively search the left or right half, depending on whether the value is smaller or larger than the middle boundary.

class BinarySearchHistogram(val binBorders: Array[Double]) {

  final val bins = Array.fill(binBorders.length)(0.0)

  def findBin(x: Double, start: Int, stop: Int): Int = {   
    val center = ( (stop - start) / 2 ) + start
    if (center == 0 || center == binBorders.length - 1) center
    else if(x > binBorders(center - 1) && x <= binBorders(center)) center
    else if(x > binBorders(center - 1)) findBin(x, center, stop)
    else findBin(x, start, center)
  }

  def insert(x: Double): Unit = bins(findBin(x, 0, binBorders.length)) += 1.0
}

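A toy usage (my own example) with five boundaries:

  val hist = new BinarySearchHistogram(Array(0.0, 1.0, 2.0, 3.0, 4.0))
  hist.insert(2.5) // 2.0 < 2.5 <= 3.0, so bin 3 is incremented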

Decision Tree Learning

A dataset for decision tree learning can be represented as a 2d array X[N][D] of N vectors, each of D
dimensions. Each node in the decision tree partitions the data points into two subsets: one contains all vectors whose value in some dimension d is smaller than a threshold, the other subset contains the vectors where it is larger. We can introduce an ordering array: the left side of the array up to a pivot element contains all instances where the split decision is true, and the right side contains all elements where the decision is false. We swap elements from left to right whenever
a split decision is violated (the element should be on the left but is on the right). We can then call the partition
function recursively on the left and the right side. If we apply the function recursively, we partition the dataset according to the split decisions just as a decision tree does. Of course, we still have to select the correct
split decision at every level, but that is material for another post. In the end, this function is very similar to quick sort's partition method. Another note: the method only works for binary decision trees, which is not a huge problem in practice.

class DataSet(val ordering: Array[Int], val data: Array[Array[Double]]) {

  def partition(dec: Decision, hi: Int, lo: Int): Int = {
    var i = lo
    var j = lo
    while(j < hi) {
      if( dec.f( data(ordering(j)) ) ) {
        val tmp = ordering(j)
        ordering(j) = ordering(i)
        ordering(i) = tmp
        i += 1
      }
      j += 1
    }
    i // the pivot: everything left of it satisfies the decision
  }
}


case class Decision(attr: Int, th: Double) {

  def f(x: Array[Double]): Boolean = x(attr) <= th
}

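A small example (hypothetical data) of how a split on attribute 0 with threshold 2.0 rearranges the ordering array:

  val data  = Array(Array(1.0, 0.5), Array(4.0, 2.0), Array(2.0, 1.0))
  val ds    = new DataSet(Array(0, 1, 2), data)
  val pivot = ds.partition(Decision(attr = 0, th = 2.0), hi = 3, lo = 0)
  // pivot == 2; ordering is now (0, 2, 1): rows 0 and 2 satisfy x(0) <= 2.0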

Wednesday, March 1, 2017

Auto differentiation: Implementing the book version using Dynamic Programming

In the last two posts I talked about my interpretation of auto differentiation and how to implement it. This time, I will show the implementation as given by the Deep Learning book and show that the algorithm does the same as my version. The algorithm from the book saves the computation graph explicitly and independently from the computations, while my implementation constructed the graph implicitly. In my version, every node had its children as a field, and calling forward on a child would recursively calculate all values. The parents were saved in the node explicitly too, the computations for the forward and backward pass were implemented in the node, and all intermediate results were saved in the node as well. The algorithm given in the book separates all of this: the results so far are saved in a table (as in classic dynamic programming), the computation node is a separate entity holding the forward and backward computation, and the graph is coded as children and parent relationships. The code to implement this behaviour is given below:
import NeuralNetork._

import breeze.linalg._
import breeze.numerics.sigmoid

trait Computation {

  def fwd(requestID: Int, parents: Array[Int], results: ComputationTable): DenseMatrix[Double]

  def bwd(requestID: Int, parents: Array[Int], gradient: DenseMatrix[Double], inputs: ComputationTable): DenseMatrix[Double]
}

class NeuralNetork(val children: Graph, val parents: Graph, cmp: Computations) {

  def fwd(requestID: Int, results: ComputationTable): Unit = {
    if(results(requestID) == None) {
      for(p <- parents(requestID)) {
        fwd(p, results)
      }
      results(requestID) = Some(cmp(requestID).fwd(requestID, parents(requestID), results))
    }
  }

  def bwd(requestID: Int, gradients: ComputationTable, inputs: ComputationTable): DenseMatrix[Double] = {
    if (gradients(requestID) != None) gradients(requestID).get
    else {
      val g = for (c <- children(requestID)) yield {
        val gradient = bwd(c, gradients, inputs)
        cmp(c).bwd(requestID, parents(c), gradient, inputs)
      }
      gradients(requestID) = Some(g.reduce(_ + _))
      gradients(requestID).get
    }
  }
}

object NeuralNetork {

  type ComputationTable = Array[Option[DenseMatrix[Double]]]
  type Graph            = Array[Array[Int]]
  type Computations     = Map[Int, Computation]
}

In the end this computation does the same thing as the previous code; it is just another way of writing things down. The Computations can then be implemented in the same way as earlier. For more details, check out Chapter 6 of the book. Now we can actually implement our nodes. In this example I am only using Add and Mul:
  class Add extends Computation {
    def fwd(requestID: Int, parents: Array[Int], results: ComputationTable): DenseMatrix[Double] = {
      results(parents(0)).get + results(parents(1)).get
    }

    def bwd(requestID: Int, parents: Array[Int], gradient: DenseMatrix[Double], inputs: ComputationTable): DenseMatrix[Double] = gradient
  }

  class Mul extends Computation {

     def fwd(requestID: Int, parents: Array[Int], results: ComputationTable): DenseMatrix[Double] = {
      results(parents(0)).get :* results(parents(1)).get
    }

    def bwd(requestID: Int, parents: Array[Int], gradient: DenseMatrix[Double], inputs: ComputationTable): DenseMatrix[Double] = {
      if(parents(0) == requestID) inputs(parents(1)).get :* gradient
      else inputs(parents(0)).get :* gradient
    }
  }

And then we can set up a simple computation:
  • a := -2
  • b := 5
  • c := -4
  • d := 2
  • (a + b) * c
  • (a + b) * d
Which can be specified as:
    val in = Array.fill[Option[DenseMatrix[Double]]](7)(None)
    in(0)  = Some(DenseMatrix.ones[Double](1, 2) * -2.0)
    in(1)  = Some(DenseMatrix.ones[Double](1, 2) *  5.0)
    in(2)  = Some(DenseMatrix.ones[Double](1, 2) * -4.0)
    in(3)  = Some(DenseMatrix.ones[Double](1, 2) *  2.0)

    val cmp = Map(
      4 -> new Add,
      5 -> new Mul,
      6 -> new Mul
    )

    // graph layout (my reconstruction, consistent with the setup above):
    // nodes 0-3 are the inputs a, b, c, d; node 4 is a + b;
    // node 5 is (a + b) * c and node 6 is (a + b) * d
    val children = Array(
      Array(4),          // a feeds the Add node
      Array(4),          // b feeds the Add node
      Array(5),          // c feeds the first Mul
      Array(6),          // d feeds the second Mul
      Array(5, 6),       // (a + b) feeds both Mul nodes
      Array.empty[Int],  // the products are outputs
      Array.empty[Int]
    )

    val parents = Array(
      Array.empty[Int],  // inputs have no parents
      Array.empty[Int],
      Array.empty[Int],
      Array.empty[Int],
      Array(0, 1),       // a + b
      Array(2, 4),       // (a + b) * c
      Array(3, 4)        // (a + b) * d
    )

    val nn = new NeuralNetork(children, parents, cmp)

    val gradients = Array.fill[Option[DenseMatrix[Double]]](7)(None)
    gradients(5)  =  Some(DenseMatrix.ones[Double](1, 2) * 1.0)
    gradients(6)  =  Some(DenseMatrix.ones[Double](1, 2) * 1.0)

    nn.fwd(5, in)
    nn.fwd(6, in)
    nn.bwd(0, gradients, in)
    nn.bwd(1, gradients, in)
    nn.bwd(2, gradients, in)
    nn.bwd(3, gradients, in)
The computation for the forward pass results in:
  • Some(-12.0 -12.0 )
  • Some(6.0 6.0 )
and the gradients for each variable
  • a = (Some(-2.0 -2.0 ),a)
  • b = (Some(-2.0 -2.0 ),b)
  • c = (Some(3.0 3.0 ),c)
  • d = (Some(3.0 3.0 ),d)
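
A quick check by hand confirms these values: the gradient of a flows through both products, giving c + d = -4 + 2 = -2 (same for b), while the gradient of c is a + b = 3 (same for d).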
I also implemented a feed forward neural network with two layers:
   class Add extends Computation {

     def fwd(
       requestID: Int, parents: Array[Int], results: ComputationTable
     ): DenseMatrix[Double] = {
       results(parents(0)).get + results(parents(1)).get
     }

     def bwd(
       requestID: Int,
       parents: Array[Int],
       gradient: DenseMatrix[Double],
       inputs: ComputationTable
     ): DenseMatrix[Double] = gradient
   }

   class Mul extends Computation {

      def fwd(
        requestID: Int, parents: Array[Int], results: ComputationTable
      ): DenseMatrix[Double] = {
       results(parents(0)).get * results(parents(1)).get
     }

     def bwd(
       requestID: Int,
       parents: Array[Int],
       gradient: DenseMatrix[Double],
       inputs: ComputationTable
     ): DenseMatrix[Double] =
       if(parents(0) == requestID) gradient * inputs(parents(1)).get.t
       else inputs(parents(0)).get.t * gradient
   }

   class Sigmoid extends Computation {

     def fwd(
       requestID: Int, parents: Array[Int], results: ComputationTable
     ): DenseMatrix[Double] = sigmoid(results(parents(0)).get)

     def bwd(
       requestID: Int,
       parents: Array[Int],
       gradient: DenseMatrix[Double],
       inputs: ComputationTable
     ): DenseMatrix[Double] =
       (1.0 - fwd(requestID, parents, inputs)) :* fwd(requestID, parents, inputs) :* gradient
   }

   def twoLayerNet = {
     val update = Array(1, 2, 3, 4)
     val output = 10
     val cmp = Map(
       5 ->  new Mul,
       6 ->  new Add,
       7 ->  new Sigmoid,
       8 ->  new Mul,
       9 ->  new Add,
       10 -> new Sigmoid
     )

     // graph layout (my reconstruction, consistent with the parent rows above):
     // 0 = input, 1 = w1, 2 = b1, 3 = w2, 4 = b2,
     // 5 = x * w1, 6 = + b1, 7 = sigmoid, 8 = * w2, 9 = + b2, 10 = sigmoid
     val children = Array(
       Array(5), Array(5), Array(6), Array(8), Array(9),
       Array(6), Array(7), Array(8), Array(9), Array(10),
       Array.empty[Int]
     )

     val parents = Array(
       Array.empty[Int], Array.empty[Int], Array.empty[Int],
       Array.empty[Int], Array.empty[Int],
       Array(0, 1),
       Array(5, 2),
       Array(6),
       Array(7, 3),
       Array(8, 4),
       Array(9)
     )

     val nn = new NeuralNetork(children, parents, cmp)
     (nn, output, update)
   }

On the spiral dataset and on the MNIST dataset I got around 98% accuracy.

Tuesday, February 28, 2017

Implementation notes on my auto differentiation for weight sharing

In my last post I wrote about building a very crude homemade auto differentiation engine. As a side note I mentioned that if a node receives multiple incoming gradients, we have to sum them, and I provided an option to sum the gradients and push them downstream instantly. However, in order to implement proper backpropagation, we have to wait until all incoming gradients have arrived and are summed up before the total is propagated. This is especially important for the weight sharing needed to implement recurrent neural networks such as LSTMs [1] or more complex models such as the variational auto encoder [2].

If you are interested in the full code with the examples, I uploaded a Scala version to GitHub.

Implementing the base computation node

The forward pass is the same as in the previous posts: a node
simply computes its activations using its child nodes' forward computations.
In this case the abstract class already holds the input nodes.
The backward pass follows the same idea too. Based on the gradients
from higher nodes and the local gradient in each node, we push the error
backwards through our computation. The only difference is that I added a wait
list which saves all parent node names. Once all parents have sent their gradients,
we sum them and push down the error. Furthermore, I added
a recursive plotting function for the compute graph. The downside is that we now
have to register all parents at each node. During graph plotting, the parent
and wait relations are also checked.

import breeze.linalg.DenseMatrix
import scala.collection.mutable

abstract class Node(val name: String, inputs: Vector[Node]) {

  /** Names of parents who have to send a gradient before the backwards
    * pass continues */
  val WaitFor   = mutable.Set.empty[String]

  /** All parents' gradients */
  val Gradients = mutable.HashMap.empty[String, DenseMatrix[Double]]

  /** Value of the forwards computation */
  protected var Output: Option[DenseMatrix[Double]] = None

  /** Should attach the node type to the name */
  def typename: String

  /** Recursively plot parent-child relationships in dot format.
    * Also throws an error if the wait list does not include the expected parent */
  def graph(parent: String): List[String] = {
    if(!WaitFor.contains(parent) && parent.size > 1)
      throw new Exception(s"${parent} ${typename}_${name} connection not found")
    if(inputs.size == 0) List.empty[String]
    else {
      (for (in <- inputs)
        yield s"${in.typename}_${in.name} -> ${typename}_${name};" :: in.graph(name)
      ).flatten.toList
    }
  }

  def compute(x: Vector[DenseMatrix[Double]]): DenseMatrix[Double]

  def distribute(gradient: DenseMatrix[Double]): Unit

  def gradient: DenseMatrix[Double] = Gradients.values.reduce(_ + _)

  def register(parent: String): Unit = WaitFor += parent

  def reset: Unit = {
    Output = None
    Gradients.clear // assumption: collected gradients are cleared for the next pass
  }

  def fwd: DenseMatrix[Double] = Output match {
    case Some(x) => x
    case None => {
      val x = compute(inputs.map(_.fwd)) // forward the children first, then compute
      Output = Some(x)
      x
    }
  }

  def ready = (WaitFor & Gradients.keySet) == WaitFor

  def bwd(parent: String, gradient: DenseMatrix[Double]): Unit = {
    Gradients += parent -> gradient
    if(ready) distribute(this.gradient) // push the summed gradient, not just the last one
  }
}

The rest of the computations such as multiply, add and the dot product stay the same.
Now we are actually able to build models with weight sharing.

A simple LSTM

Now we can implement a simple LSTM node. An LSTM is a recurrent node
that holds its state over multiple time steps using a cell node. In the LSTM node,
this cell can be updated based on the current input, a hidden state and the last cell state.
An LSTM node is wired in the following way.

This wiring actually translates into the following code and the graph below.

class LSTM(val in: Node, val ht: Node, val ct: Node, wf: Tensor, bf: Tensor, wi: Tensor, bi: Tensor, wo: Tensor, bo: Tensor, wc: Tensor, bc: Tensor) {

  val x = new Concat(in, ht)
  val input  = new FeedForwardSigmoid(x, wi, bi)
  val output = new FeedForwardSigmoid(x, wo, bo)
  val forget = new FeedForwardSigmoid(x, wf, bf)
  val cell   = new FeedForwardTanH(x, wc, bc)

  val forgetCell = new Mul(forget.Layer, ct)
  val inCell     = new Mul(input.Layer, cell.Layer)
  val cellState  = new Add(inCell, forgetCell)
  val cellAct    = new Tanh(cellState)
  val hidden     = new Mul(output.Layer, cellAct)

  final val Layer = {
    (cellState, hidden)
  }

  def update(rate: Double): Unit = {
      wf := wf.fwd - ((rate * wf.gradient))
      bf := bf.fwd - ((rate * bf.gradient))
      wi := wi.fwd - ((rate * wi.gradient))
      bi := bi.fwd - ((rate * bi.gradient))
      wc := wc.fwd - ((rate * wc.gradient))
      bc := bc.fwd - ((rate * bc.gradient))
      wo := wo.fwd - ((rate * wo.gradient))
      bo := bo.fwd - ((rate * bo.gradient))
  }
}

Then for some given input we can unroll multiple cells and run regular stochastic gradient descent.
Given a list of inputs, representing a time series, we can unroll the LSTM. Notice that we reuse the same weights for each time step. In other words, if we unroll for 5 time steps, every weight will receive 5 gradient updates.

  def unroll(in: List[Node]): (LSTM,List[FeedForwardTanH]) =
    if(in.size == 1) {
      val lstm = new LSTM(in.head, h, c, wf, bf, wi, bi, wo, bo, wc, bc)
      val classify  = new FeedForwardTanH(lstm.Layer._2, wl, bl)
      (lstm, List(classify))
    } else {
      val (unrolled, out) = unroll(in.tail)
      val lstm = new LSTM(in.head, unrolled.Layer._2, unrolled.Layer._1, wf, bf, wi, bi, wo, bo, wc, bc)
      val classify  = new FeedForwardTanH(lstm.Layer._2, wl, bl)
      (lstm, classify :: out)
    }

I trained the network with step size 5 on a trigonometric function and then generated 100 points
from an initial 5 points. Here is the result.

This looks promising I guess. So now we can generate some functions.

Variational Auto Encoders

The next model is a variational auto encoder. Here we do not connect the encoder
directly to the decoder, but let the encoder generate a Gaussian distribution that,
when sampled, is pushed through the decoder to reconstruct the input. The noise sample is
drawn from a multivariate Gaussian with zero mean and unit variance.

val enc   = new FeedForwardTanH(in, wenc, benc)
val mean1 = new Inner(enc.Layer, wmean)
val mean  = new Add(mean1, bmean)
val sigm1  = new Inner(enc.Layer, wsig)
val sigm2  = new Add(sigm1, bsig)
val sigm   = new Mul(sigm2, half)
val d1 = new Exp(sigm)
val d2 = new Mul(sample, d1)
val dist = new Add(mean, d2)
val dec  = new FeedForwardSigmoid(dist, wdec, bdec)

Or, easier to read, as the final graph:

I ran only a very few iterations on 200 random MNIST images. If we push random samples
through the decoder, we get images like these, which also look promising.

Most of the results are not amazing. The speed of the tool is not at all comparable to Tensorflow,
so in turn I do not process a lot of data. Also, my hidden layers are very small. Furthermore, the only optimisation I use is simple stochastic gradient descent. But this post is about differentiation and weight sharing, not about learning awesome models, I guess. Again, all the code is on GitHub. In the next post I will go into developing a more performant variant of this algorithm.

Furthermore, the new Deep Learning book [3] defines the backpropagation algorithm in the general case as a dynamic programming problem filling a table. However, I think my method of directly coding the graph is a little easier to explain.


[1] Hochreiter, Schmidhuber: "Long Short-Term Memory", Neural Computation, 1997
[2] Variational Auto Encoder
[3] Goodfellow, Bengio, Courville: "Deep Learning", MIT Press 2016