## Tuesday, August 1, 2017

### Investigating Non Linear Optimization in Python

I recently skimmed through some class videos by Tucker Balch, Professor at Georgia Tech called "Machine Learning For Trading". One interesting part was portfolio optimisation.
That is, if you want to invest some amount of money, how do you distribute it
over a selection of assets in a way minimising risk while at the same time
being able to earn money through your investment. A portfolio is basically an allocation of the money into different options or in other words, the percentage of the total amount of money allocated to an option. Anyways, in the class as well as the course book the optimisation is performed using a
python method for constrained non-linear optimisation.  The algorithm used to crunch these problems is called Sequential Least Squares Programming, which is implement using a fortran wrapper.
The original algorithm can be found in the ACM Algorithms catalog (Algorithm 733)
written by D. Kraft. The code is automatically translated using the fortran to python (f2py)
library.

## Sunday, July 23, 2017

### Black Box Variational Inference and PyFlux

I found an interesting paper on black box variational inference. Variational inference is an
approximate inference algorithm for factored probabilistic models (such as bayesian networks).

In probabilistic inference, we are normally interested in the posterior density of a latent variable
given all our data. The basic idea is to turn the inference problem into an optimisation problem.
That is, we build an approximate probability distribution that is differentiable with respect to it 's parameters. We then find the parameter set that minimises the difference between the true posterior and the approximate distribution.

The problem is that in order to do this for a new model, we still have to do a lot of manual derivations
and work through a lot of math before we can actually use this algorithm. This new paper
however, shows an algorithm that is easier to use given a new model. In my opinion,
one of the successes in deep learning is auto differentiation, where we do not
have to develop all gradients on our own but let the computer do it (as described in earlier posts).
One library that implements this inference method for complex time series is called pyFlux [2].

The proposed algorithm [1] works as follows. First, we sample the latent variables in our model
from the approximate distribution. We then use the samples to compute a gradient
for each parameter that minimises the Kullback-Leibler divergence or more specifically the Expectational Lower Bound (ELBO).

 Reproduced from original paper [1]
The z[s] are samples from the latent space. Each z[s] includes a values for each of the latent variables. These samples are used to build the gradient.
The value of the Robbins - Monroe Sequence, is modelling an adaptive learning rate schedule
to guarantee convergence. The gradient has multiple components, including the joint probability of the original distribution, the log of the approximate distribution and the derivative of the approximate distribution. For a lot of cases, these values are easy to compute. First note that when using the samples, all variables in the original joint distribution are observed. For a Bayesian network that means that the joint distribution is simply the product of all variables:

In the paper and in the pyFlux library, the approximate distribution is implemented as a mean field.
Which means that the variables in our model are all independent. This makes the sampling and evaluation of the model very easy. Each variable can be sampled individually and the conditional distribution is simply the product of all the variables. Finally, the pyFlux library implements each variable in the approximate distribution as a one dimensional gaussian. Putting all that together,

Where the lambda parameters will be the mean and variance of each of the gaussians used

## References:

[1] Ranganath, Gerrish, Blei: "Black Box Variational Inference", AISTATS 2014, 2014
[2] PyFlux

## Friday, May 26, 2017

### A Generic Sequence Alignment Algorithm

In this post I will describe how a general sequence alignment could be implemented in Scala.
For readers that don't know what sequence alignment is, I recommend reading my
post on building a gesture recogniser under Dynamic Time Warping distance (DTW).

Sequence alignment is often used in several contexts including speech recognition,
gesture recognition, approximate string matching, typing correction, biological
sequence alignment and time series classification and even stereo vision.
All these algorithms are based on a technique called dynamic programming and the algorithms
are all very similar. However, in literature this alignment technique is often found under different names (Dynamic Time Warping, Levensthein Distance, Needleman-Wunsch).

I will first formulate my idea of a general alignment algorithm with a scala implementation and afterwards describe an implementation of the three above algorithms.

Given two sequences X = [x1 .. xi .. xN] and Y = [y .. yj .. yM], we use the following
recursion to find the alignment score S(x, y). At each position i in sequence X and position j in sequence Y we look backwards at the matching score in X only, in Y only and in X and Y simultaneously resulting in the following recursion:

The rank function scores the alignment at position i and j based on the previous alignment scores and
the current elements. The rank function is the main component that changes in all three algorithms.
Looking back in X direction is also called an insertion, in Y direction a deletion and for both it is
called a match so the ranking function becomes:

Furthermore we will also implement a generic initialisation function that starts our recursion.
Ok now the Scala code:

```class Alignment[A](
init: (Int, Int) => Double,
rank: (Double, Double, Double, A, A) => Double
) {

def score(x: Array[A], y: Array[A]): Double = {
val n  = x.size
val m  = y.size
val dp = Array.ofDim[Double](n + 1, m + 1)

for(i <- 0 until n + 1) dp(i)(0) = init(i, 0)
for(j <- 0 until m + 1) dp(0)(j) = init(0, j)

for(i <- 1 until n + 1) {
for(j <- 1 until m + 1) {
dp(i)(j) = rank(
dp(i - 1)(j),     // insertion
dp(i - 1)(j - 1), // match
dp(i)(j - 1),     // deletion
x(i - 1),         // current sample in x
y(j - 1)          // current sample in y
)
}
}

dp(n)(m)
}

}

```

The alignment class has a type parameter specifying the type of the elements in the sequences we want to align and takes the initialising function and the ranking function as input. Now DTW and the Edit distance are just specialisations with a different type parameter and scoring function:

```  class DTW extends Alignment[Double](
init = (i: Int, j: Int) => if(i == 0 && j == 0) 0.0 else Double.PositiveInfinity,
rank = (insert: Double, matching: Double, delete: Double, x: Double, y: Double) =>
if(insert < matching && insert < delete) insert + math.pow(x - y, 2)
else if (delete < insert && delete < matching) delete + math.pow(x - y, 2)
else matching + math.pow(x - y, 2)
)
```

DTW is defined over doubles and is initialising the recursion with 0.0 at position (0,0) and all others to infinity. The ranking adds the euclidian distance to the insertion, deletion and match and then takes the minimum. It is easy to see how this can be generalised to multivariate sequences. The type parameter becomes Seq[Double] and the euclidean distance is implemented for vectors. One can also use kernels instead or the cosine similarity. The next function is the edit distance for discrete sequence elements (here characters):

```  class Levensthein extends Alignment[Char](
init = (i: Int, j: Int) => if(i == 0 && j == 0) 0.0 else if(i == 0) j else i,
rank = (insert: Double, matching: Double, delete: Double, x: Char, y: Char) =>
if(insert < matching && insert < delete) insert + 1
else if (delete < insert && delete < matching) delete + 1
else matching + (if(x == y) 0.0 else 1.0)
)
```

The function is initialised as the index at the borders and the function again takes the minimum of insertion, deletion and match. Insertions and deletions count as an error so we add 1. A match can also be a substitution so we add 1 to the error if the characters do not match. The last algorithm is the Needleman-Wunsch algorithm. Instead of minimising the distance we are maximising a score. The similarity between two symbols can be defined and the 'gap' penalty is not 1 as in the edit distance but given.

```  final val Sim = Map(
('a', 'a') -> 10,
('a', 'g') -> -1,
('g', 'a') -> -1,
('a', 'c') -> -3,
('c', 'a') -> -3,
('a', 't') -> -4,
('t', 'a') -> -4,
('g', 'g') ->  7,
('g', 'c') -> -5,
('c', 'g') -> -5,
('g', 't') -> -3,
('t', 'g') -> -3,
('c', 'c') ->  9,
('c', 't') ->  0,
('t', 'c') ->  0,
('t', 't') ->  8
)

final val Gap = -5

class NeedlemanWunsch extends Alignment[Char](
init = (i: Int, j: Int) => if(i == 0 && j == 0) 0.0 else if(i == 0) j * Gap else i * Gap,
rank = (insert: Double, matching: Double, delete: Double, x: Char, y: Char) =>
if(insert > matching && insert > delete) insert + Gap
else if (delete > insert && delete > matching) delete + Gap
else matching + Sim(x, y)
)

```

If one needs the actual alignment, we can backtrack the alignment process in all above cases. The code can be found here [GitHub]

## Saturday, April 15, 2017

### Three Algorithm Tricks in Data Mining

I really enjoy it when concepts can be reused or applied in several contexts. Three of the algorithms
that seem to be everywhere are quick sort, merge sort and binary search. In this post I will
just show some tricks I learned during coding. Using a similar algorithm than the merge method
of merge sort for fast sparse vector multiplication, using binary search to insert into a histogram in O(log(n)) time, and using the partition method of quick sort during C.45 decision tree training.

### Fast Sparse Vector Multiplication

A sparse vector is a vector where most if the entries are zeros. A common data structure is an array saving all non zero entries. Each entry is a tuple of the position and the corresponding value. Also the array is sorted increasingly by the position. Given two such vectors, we can multiply two such vectors very efficiently. We use two variables i and j, indicating a position in one of the vectors' array x and y. The variables move at the same speed as long as the positions of x(i) and y(j) are the same. In this case, we can also aggregate the multiplication of the vector elements. If i points to a position larger than j, we only move j until i and j are equal and vice versa. Since in the previous case, x(i) or y(j) is a zero entry, we don't have to compute anything. The movement of i and j is the same in the merge method of merge sort.

```object DotProduct {

case class SparseNode(pos: Int, value: Double)

type SVector = Array[SparseNode]

def dot(x: SVector, y: SVector): Double = {
var i = 0
var j = 0
var product = 0.0
while(i < x.length && j < y.length) {
if(x(i).pos == y(j).pos) {
product += x(i).value * y(j).value
i += 1
j += 1
} else if(x(i).pos < y(j).pos) i += 1
else j += 1
}
product
}

}```

### Histogram Insertion

A histogram can be represented as an array that holds the counts of values that fall into
a certain region. The range is another array containing the boundaries, which should be sorted
in increasing order. When we want to estimate the histogram, we need to find the region the value falls in. Naively that can be done by iterating the boundary array and checking if the value is in the range. However, we can perform this search using binary search in the boundary array.
Instead of searching an element though, we are checking if the value is larger than boundary[middle - 1] and is smaller than boundary[middle], if so we can increase the count for the middle, else we recursively search the left or right side, depending if the value is smaller or larger than the middle.

```class BinarySearchHistogram(val binBorders: Array[Double]) {

final val bins = Array.fill(binBorders.length)(0.0)

def findBin(x: Double, start: Int, stop: Int): Int = {
val center = ( (stop - start) / 2 ) + start
if (center == 0 || center == binBorders.length - 1) center
else if(x > binBorders(center - 1) && x <= binBorders(center)) center
else if(x > binBorders(center - 1)) findBin(x, center, stop)
else findBin(x, start, center)
}

def insert(x: Double): Unit = bins(findBin(x, 0, binBorders.length)) += 1.0

}
```

### Decision Tree Learning

A dataset for decision tree learning can be represented as a 2d array X[N][D] of N vectors each of D
dimensions. At each node in the decision tree, a node partitions the data points into two subsets, one contains all vectors whose dimension D is smaller than a threshold and the other subset contains the vectors where dimension D is larger. We can introduce an ordering array. The left side of the array from a pivot element contains all instances where the split decision is true and the right side contains all elements where the decision is false. We can swap the elements from the left to right whenever
a split decision is violated (the element should be left but is right). We can then call the partition
function recursively on the left and the right side. If we apply the function recursively, we partition the dataset into the split decisions as a decision tree does. Of course, we still have to select the correct
spit decision for every level but this is the material for another post. In the end, this function is very similar to the quick sorts partition method. Another note: The method only works well for binary decision trees, which is not a huge problem in practice.

```class DataSet(val ordering: Array[Int], val data: Array[Array[Double]]) {

def partition(dec: Decision, hi: Int, lo: Int): Int = {
var i = lo
var j = lo
while(j < hi) {
if( dec.f( data(ordering(j)) ) ) {
val tmp = ordering(j)
ordering(j) = ordering(i)
ordering(i) = tmp
i += 1
}
j += 1
}
i
}

}

case class Decision(attr: Int, th: Double) {

def f(x: Array[Double]): Boolean = x(attr) <= th

}```

## Wednesday, March 1, 2017

### Auto differentiation: Implementing the book version using Dynamic Programming

In the last two posts I talked about my interpretation of Auto Differentiation and how to implement it. This time, I will show the implementation as given by this Deep Learning book and show that the algorithm does the same as my version. The algorithm from the book saves the computation graph explicitly and independent from the computations while my implementation constructed the graph implicitly. Every node had it's children as a field and calling forward on a child would recursively calculate all values. The parents were saved in the node explicitly, too. Furthermore, the computation for the forward pass and backward pass were implemented here. Also, I saved all result computations in the node. The algorithm given in the book separates all of this. The results so far are saved in a table (as in classic dynamic programming). The computation node is a separate entity holding the forward and backward computation and the graph is coded as children and parent relationships The code to implement this behaviour is given below:
```
import NeuralNetork._

import breeze.linalg._

trait Computation {

def fwd(requestID: Int, parents: Array[Int], results: ComputationTable): DenseMatrix[Double]

def bwd(requestID: Int, parents: Array[Int], gradient: DenseMatrix[Double], inputs: ComputationTable): DenseMatrix[Double]

}

class NeuralNetork(val children: Graph, val parents: Graph, cmp: Computations) {

def fwd(requestID: Int, results: ComputationTable): Unit = {
if(results(requestID) == None) {
for(p <- parents(requestID)) {
fwd(p, results)
}
results(requestID) = Some(cmp(requestID).fwd(requestID, parents(requestID), results))
}
}

def bwd(requestID: Int, gradients: ComputationTable, inputs: ComputationTable): DenseMatrix[Double] = {
else {
val g = for (c <- children(requestID)) yield {
}
}
}

}

object NeuralNetork {

type ComputationTable = Array[Option[DenseMatrix[Double]]]
type Graph            = Array[Array[Int]]
type Computations     = Map[Int, Computation]

}
```
In the end this computation does the same thing as the previous code, it is just another way of writing things down. The Computations can then be implemented in the same way as earlier. For more details check out Chapter 6 of the book. Now we can actually implement our nodes. In this example I am only using Add and Mul
```  class Add extends Computation {

def fwd(requestID: Int, parents: Array[Int], results: ComputationTable): DenseMatrix[Double] = {
results(parents(0)).get + results(parents(1)).get
}

def bwd(requestID: Int, parents: Array[Int], gradient: DenseMatrix[Double], inputs: ComputationTable): DenseMatrix[Double] = gradient

}

class Mul extends Computation {

def fwd(requestID: Int, parents: Array[Int], results: ComputationTable): DenseMatrix[Double] = {
results(parents(0)).get :* results(parents(1)).get
}

def bwd(requestID: Int, parents: Array[Int], gradient: DenseMatrix[Double], inputs: ComputationTable): DenseMatrix[Double] = {
if(parents(0) == requestID) inputs(parents(1)).get :* gradient
}

}
```
And then we can set up a simple computation:
• a := -2
• b := 5
• c := -4
• d := 2
• (a + b) * c
• (a + b) * d
Which can be specified as:
```    val in = Array.fill[Option[DenseMatrix[Double]]](7)(None)
in(0)  = Some(DenseMatrix.ones[Double](1, 2) * -2.0)
in(1)  = Some(DenseMatrix.ones[Double](1, 2) *  5.0)
in(2)  = Some(DenseMatrix.ones[Double](1, 2) * -4.0)
in(3)  = Some(DenseMatrix.ones[Double](1, 2) *  2.0)

val cmp = Map(
5 -> new Mul,
6 -> new Mul
)

val children = Array(
Array(4),
Array(4),
Array(5),
Array(6),
Array(5, 6),
Array.empty[Int],
Array.empty[Int]
)

val parents = Array(
Array.empty[Int],
Array.empty[Int],
Array.empty[Int],
Array.empty[Int],
Array(0, 1),
Array(2, 4),
Array(3, 4)
)

val nn = new NeuralNetork(children, parents, cmp)

gradients(5)  =  Some(DenseMatrix.ones[Double](1, 2) * 1.0)
gradients(6)  =  Some(DenseMatrix.ones[Double](1, 2) * 1.0)

nn.fwd(5, in)
nn.fwd(6, in)
println(in(5))
println(in(6))
println("---")
```
The computation for the forward pass results in:
• Some(-12.0 -12.0 )
• Some(6.0 6.0 )
and the gradients for each variable
• a = (Some(-2.0 -2.0 ),a)
• b = (Some(-2.0 -2.0 ),b)
• c = (Some(3.0 3.0 ),c)
• d = (Some(3.0 3.0 ),d)
I also implemented a feed forward neural network with two layers:
```   class Add extends Computation {

def fwd(
requestID: Int, parents: Array[Int], results: ComputationTable
): DenseMatrix[Double] = {
results(parents(0)).get + results(parents(1)).get
}

def bwd(
requestID: Int,
parents: Array[Int],
inputs: ComputationTable

}

class Mul extends Computation {

def fwd(
requestID: Int, parents: Array[Int], results: ComputationTable
): DenseMatrix[Double] = {
results(parents(0)).get * results(parents(1)).get
}

def bwd(
requestID: Int,
parents: Array[Int],
inputs: ComputationTable
): DenseMatrix[Double] =
if(parents(0) == requestID) gradient * inputs(parents(1)).get.t

}

class Sigmoid extends Computation {

def fwd(
requestID: Int, parents: Array[Int], results: ComputationTable
): DenseMatrix[Double] = sigmoid(results(parents(0)).get)

def bwd(
requestID: Int,
parents: Array[Int],
inputs: ComputationTable
): DenseMatrix[Double] =
(1.0 - fwd(requestID, parents, inputs)) :* fwd(requestID, parents, inputs) :* gradient
}

def twoLayerNet = {
val update = Array(1, 2, 3, 4)
val output = 10
val cmp = Map(
5 ->  new Mul,
7 ->  new Sigmoid,
8 ->  new Mul,
10 -> new Sigmoid
)

val children = Array(
Array(5),
Array(5),
Array(6),
Array(8),
Array(9),
Array(6),
Array(7),
Array(8),
Array(9),
Array(10),
Array.empty[Int]
)
val parents = Array(
Array.empty[Int],
Array.empty[Int],
Array.empty[Int],
Array.empty[Int],
Array.empty[Int],
Array(0, 1),
Array(5, 2),
Array(6),
Array(7, 3),
Array(8, 4),
Array(9)
)
val nn = new NeuralNetork(children, parents, cmp)
(nn, output, update)
}
```

On the spiral dataset and on the MNIST dataset I got around 98% accuracy.

## Tuesday, February 28, 2017

### Implementation notes on my auto differentiation for weight sharing

In my last post I wrote about building a very crude home made auto differentiation engine. As a side note I mentioned that if we get multiple incoming gradients we have to sum them. And also provided an option to sum the gradients and push them out instantly downstream. However in order to implement proper back propagation we have to wait until all incoming gradients have to be summed up before the gradient is propagated. This is especially important for weight sharing needed in order to implement recurrent neural networks such as LSTMs or more complex models such as the variational auto encoder.

If you are interested in the full code with the examples, I uploaded a Scala version

### Implementing the base computation node

The forward pass is the same as in the previous posts. The forward pass
simply computes it's activations using it's child nodes forward computations.
In this case the abstract class already holds the input nodes.
The backward pass follows the same idea too. Based on the gradients
from higher nodes and the local gradient in each node, we push the error
backwards through our computation. The only difference is that I added a wait
list which saves all parent node names. If all parents send their gradients,
we sum all of them and push down the error. Furthermore, I added
a recursive plotting function of the compute graph. The downside is now that we
have to register all parents at each node. During the graph plotting the parent
and wait relations are also checked.

```abstract class Node(val name: String, inputs: Vector[Node]) {
/**
* Names of parents who have to send a gradient before backwards
* pass continues
*/
val WaitFor   = mutable.Set.empty[String]

/**
*/

/**
* Set value of forwards computation
*/
protected var Output: Option[DenseMatrix[Double]] = None

/**
* Should attach node type to name
*/
def typename: String

/**
* Recursively plot parent child relationships in dot format.
* Also output error if the waitlist does not include the expected parent
*/
def graph(parent: String): List[String] = {
if(!WaitFor.contains(parent) && parent.size > 1) throw new Exception(s"\${parent} \${typename}_\${name} connection not found")
if(inputs.size == 0) List.empty[String]
else {
(for (in <- -="" in.name="" in.typename="" inputs="" s="" yield=""> \${typename}_\${name};" :: in.graph(name))).flatten.toList
}
}

def compute(x: Vector[DenseMatrix[Double]]): DenseMatrix[Double]

def register(parent: String): Unit = WaitFor += parent

def reset: Unit = {
Output = None
inputs.foreach(_.reset)
}

def fwd: DenseMatrix[Double] = Output match {
case Some(x) => x
case None => {
val x = compute(inputs.map(_.fwd))
Output = Some(x)
x
}
}

def bwd(parent: String, gradient: DenseMatrix[Double]): Unit = {
}

}```

The rest of the computations such as multiply, add and the dot product stay the same.
Now we are actually able to build models with weight sharing.

### A simple LSTM

Now we can implement a simple LSTM node. An LSTM is a recurrent node
that holds it's state through multiple time steps using a cell node. In the LSTM node
this cell can be updated based on the current input, a hidden state and the lasts cell state.
An LSTM node is wired in the following way.

This wiring actually translate into the following code and the graph below.

```class LSTM(val in: Node, val ht: Node, val ct: Node, wf: Tensor, bf: Tensor, wi: Tensor, bi: Tensor, wo: Tensor, bo: Tensor, wc: Tensor, bc: Tensor) {

val x = new Concat(in ,ht)
val input  = new FeedForwardSigmoid(x, wi, bi)
val output = new FeedForwardSigmoid(x, wo, bo)
val forget = new FeedForwardSigmoid(x, wf, bf)
val cell   = new FeedForwardTanH(x, wc, bc)

val forgetCell = new Mul(forget.Layer, ct)
val inCell     = new Mul(input.Layer, cell.Layer)
val cellState  = new Add(inCell, forgetCell)
val cellAct    = new Tanh(cellState)
val hidden     = new Mul(output.Layer, cellAct)

final val Layer = {
in.register(x.name)
ht.register(x.name)

forget.Layer.register(forgetCell.name)
ct.register(forgetCell.name)

input.Layer.register(inCell.name)
cell.Layer.register(inCell.name)

inCell.register(cellState.name)
forgetCell.register(cellState.name)

cellState.register(cellAct.name)

output.Layer.register(hidden.name)
cellAct.register(hidden.name)

(cellState, hidden)
}

def update(rate: Double): Unit = {
wf := wf.fwd - ((rate * wf.gradient))
bf := bf.fwd - ((rate * bf.gradient))
wi := wi.fwd - ((rate * wi.gradient))
bi := bi.fwd - ((rate * bi.gradient))
wc := wc.fwd - ((rate * wc.gradient))
bc := bc.fwd - ((rate * bc.gradient))
wo := wo.fwd - ((rate * wo.gradient))
bo := bo.fwd - ((rate * bo.gradient))
}
}
```

Then for some given input we can unroll multiple cells and run regular stochastic gradient descent.
Given a list of inputs, representing a time series, we can unroll the LSTM. Notice that we reuse the same weights for each time step. In other words, if we unroll for 5 time steps, every weight will receive 5 gradient updates.

```  def unroll(in: List[Node]): (LSTM,List[FeedForwardTanH]) =
if(in.size == 1) {
val lstm = new LSTM(in.head, h, c, wf, bf, wi, bi, wo, bo, wc, bc)
val classify  = new FeedForwardTanH(lstm.Layer._2, wl, bl)
(lstm, List(classify))
} else {
val (unrolled, out) = unroll(in.tail)
val lstm = new LSTM(in.head, unrolled.Layer._2, unrolled.Layer._1, wf, bf, wi, bi, wo, bo, wc, bc)
val classify  = new FeedForwardTanH(lstm.Layer._2, wl, bl)
(lstm, classify :: out)
}
```

I trained the network with step size 5 on a trigonometric function and then generated 100 points
from an initial 5 points. Here is the result.

This looks promising I guess. So now we can generate some functions.

### Variational Auto Encoders

The next model is a variational auto encoder. Here we do not connect the encoder
directly to the decoder but let the encoder generate a Gaussian distribution that when
sampled is used to generate the input when pushed through the decoder. The sample is
drawn uniformly from a multivariate gaussian with zero mean and variance one.

```val enc   = new FeedForwardTanH(in, wenc, benc)
val mean1 = new Inner(enc.Layer, wmean)
val mean  = new Add(mean1, bmean)
val sigm1  = new Inner(enc.Layer, wsig)
val sigm2  = new Add(sigm1, bsig)
val sigm   = new Mul(sigm2, half)
val d1 = new Exp(sigm)
val d2 = new Mul(sample, d1)
val dist = new Add(mean, d2)
val dec  = new FeedForwardSigmoid(dist, wdec, bdec)
```

Or better to read in the final graph

I ran a very very few iterations on 200 random MNIST images. If we push uniform samples
through the decoder, we get images like these. Which also looks promising.

Most of the results are not amazing. The speed of the tool is not comparable at all to Tensorflow
and in turn I do not process a lot of data. Also, my hidden layers are very small. Furthermore, the only optimization I do is simple stochastic gradient descent. But this post is about differentiation and weight sharing not learning awesome models ... i guess. Again all the stuff is on github. In the next post I will go into developing a more performant variant of this algorithm.

Furthermore the new Deep Learning Book [3] defines the backpropagation algorithm in the general case as a dynamic programming problem filling a table. However, I think my method of directly coding the graph is a little better to explain.

### References

[1] Hochreiter, Schmidhuber: "Long Short Term Memory", Journal of Neural Computation, 97
[2] Variational Auto Encoder, https://arxiv.org/abs/1606.05908
[3] Goodfellow, Bengio, Courville: "Deep Learning", MIT Press 2016

## Thursday, February 16, 2017

### AI: Implement your own Automatic Differentiation Algorithm For Neural Networks

Modern deep learning libraries such as Tensor Flow provide automatic differentiation in order to
compute the gradients in each neural network layer. While the algorithm for back propagation
in feed forward networks is well known, the gradients for new or more interesting architectures
is often hard to understand. This is where automatic differentiation can help. In this post
I will describe the algorithm and provide a simple Scala implementation that might help
to understand how automatic differentiation can be implemented. I also 'cheat' in the sense that
I use a linear algebra library that already interfaces to some native libraries that in turn use Blas and LaPack so my performance is dependent on this wrapper. I guess for playing around this is sufficient.
The code can be found on Github.

## Automatic Differentiation

In my opinion the easiest way to understand automatic differentiation is through a computation
graph. In a computation graph variables, operations and functions are represented as nodes in the graph while edges represent the input and output the said functions and operations.

Let us consider the example computation:

• W * x + b
• with W = 2
• with  x = 3
• with  b = 1

This results in the following computation tree. If we want to get the value this formula, we can
request the knowledge for the root node and then evaluate the expression recursively. The result
can then be achieved through backtracking (here 7). This is equivalent to finding all paths to the target variable and on the way instantiate all the results needed to compute the final value.
We can then propagate derivatives backwards from the variable to the inputs. For example,
if we measure an error of 2 at the top, we can push it down the tree to all children.
Since we apply the chain rule, a derivative of a function:  f(g(x))' = f'(g(x)) * g(x)',
we multiply the gradient flowing from the top, with the local gradient at the current node.
When adding two numbers, the local gradient is 1 so the gradient from the top flows through
uninterrupted to both children. Let the gradient from the top be called delta.
When multiplying to numbers, x and w, the gradient flowing towards x and w is:

• dx  = delta * w
• dw = delta * x

So the inputs get swapped and multiplied with the incoming gradient. The following image shows the forward pass in green and the backward pass in red.

Before we go on, if there is a node with two outgoing edges in the forward pass, we sum all in the incoming gradients during back propagation. While this does not happen in feed forward neural networks, it happens when using weight sharing such as in convolutional neural networks or recurrent neural networks.

Now we can turn to matrix computations needed for neural networks. Adding two matrices is actually equivalent to the scalar form from before. Since the addition is an element wise operation, the gradient (now a matrix) simply flows unchanged to the children. In general element wise operations behave like their scalar counterparts.

The gradient of the dot product between two matrices x and y becomes:

• dx  = delta * w.t
• dw = x.t * delta

The only thing left to build a neural network is to know the gradient of the activation function.
When using sigmoid units the forward and backward pass for this node given the input x is:

• fwd: y(x) = 1.0 / (1.0 + exp(-x))
• bwd: (1.0 - y(x)) * (y(x))
Again this is an element wise operation so we can simply apply the formula to each element in the matrix.

## A quick and dirty implementation

In order to implement this behaviour we first implement a general computation node
```
abstract class ComputationNode {

var valueOpt: Option[DMat]    = None

def value = valueOpt.get

def assignVal(x: DMat): DMat = {
if(valueOpt == None) {
valueOpt = Some(x)
}
value
}

def collectGradient(x: DMat): Unit = {
}

def reset: Unit

def fwd: DMat

def bwd(error: DMat): Unit

}
```

The node holds a value that is computed during the forward pass and the gradient.
I implemented this so the value does not have to be recomputed.
In other words, the values and gradients are represented as Scala options set to None by default.
When the value needs to be computed first, we initialise the option to its value. When the value is needed again, we do not have to recompute it in that way. For the gradients, when the first value comes in we graph we set the value and all later values are summed.

UPDATE: If you actually want to use weight sharing it is important to have a mechanism
to only send the backward message when all incoming gradients are received. This is not implemented here. I will discuss weight sharing in a later post.

The forward computation should compute the result up to that node and the backward function
should distribute the gradient. The reset function needs to delete all values (set back to None values)
in order to be able to reuse the graph.

## Some Computation Nodes For Tensors

Now we are ready to implement the actual computation nodes. All these nodes
extend the abstract class from above. The nodes we need for a basic neural network are addition of two matrices, the inner product and the sigmoid function since one layer in a feed forward neural network is given by: sigmoid(x*W + b). Using the linear algebra library this is actually quite simple.

During the forward pass in an addition node, we simply add the forward results (2 matrices) so far and pass back the error unchanged.

```case class Addition(x: ComputationNode, y: ComputationNode) extends ComputationNode {

def fwd = assignVal(x.fwd + y.fwd)

def bwd(error: DMat) = {
x.bwd(error)
y.bwd(error)
}

def reset = {
valueOpt    = None
x.reset
y.reset
}

}
```

During the forward pass in an inner product node, we compute the dot product (2 matrices) so far and pass back the error multiplied with the other child's output.

```case class InnerProduct(x: ComputationNode, y: ComputationNode) extends ComputationNode {

def fwd = {
assignVal(x.fwd * y.fwd)
}

def bwd(error: DMat) = {
x.bwd(error * y.fwd.t)
y.bwd(x.fwd.t * error)
}

def reset = {
valueOpt    = None
x.reset
y.reset
}

}
```

The sigmoid behaves like expected. We simply pass the error flowing from the top multiplied with the derivative of the function.

```case class Sigmoid(x: ComputationNode) extends ComputationNode {

def fwd = {
assignVal(sigmoid(x.fwd))
}

def bwd(error: DMat) = {
val ones = DenseMatrix.ones[Double](value.rows, value.cols)
val derivative = (ones - value) :* value
x.bwd(error :* derivative)
}

def reset = {
valueOpt    = None
x.reset
}

}
```

Last but not least we need a class representing a tensor such as a weight or an input.
During the forward pass we simply return the value of the tensor. And during the backward
pass we sum up all incoming gradients. There is another method called ':=' (using parameter overloading) to set a tensor.

```class Tensor extends ComputationNode {

def fwd = value

def :=(x: DMat) = valueOpt = Some(x)

def reset = {
}

}
```

For example, using this implementation we can build a simple logistic regressor
```val x = Tensor
val w = Tensor
val b = Tensor

w := DenseMatrix.rand[Double]( 100, 1)
b := DenseMatrix.zeros[Double](1,  1)

val logistic = sigmoid(Addition(InnerProduct(x, w), b))

x := input
val result = logistic.fwd
val error  = - (label - result)

logistic.bwd(error)

w := w.value - 0.01 * w.gradient
b := b.value - 0.01 * b.gradient
```
In the example, we first initialize some tensors and then define the computation graph. Setting the input we compute the result of the computation and then push the error back through the graph. In the end, we update the weights and biases. This can be seen as one step of stochastic gradient descent.

## Implementing a Generic Feed Forward Layer and a small Calculation Domain

In order to have an easier job of defining computations we now implement a small domain specific language in order to create the graph. We use parameter overloading. We now define a wrapper class around computation nodes that enables us to write expresssions such as: sigmoid(Computation(x) * w + b.) The class basically holds the computation so far. If we call the '+' function, we chain the compuation so far with the new compute node using an add node as a parent and return a new computation. In this way we can also implement a wrapper for a feed forward neural network.
```
class Computation(val x: ComputationNode) {

def +(y: ComputationNode): Computation = new Computation(Addition(x, y))

def *(y: ComputationNode): Computation = new Computation(InnerProduct(x, y))

}

object Computation {

def apply(x: ComputationNode): Computation = new Computation(x)

def sigmoid(x: Computation): Computation = new Computation(Sigmoid(x.x))

}

case class FeedForwardLayerSigmoid(x: ComputationNode, w: Tensor, b: Tensor) extends ComputationNode {

final val Logit = Computation(x) * w + b

final val Layer = Computation.sigmoid(Logit).x

def fwd = Layer.fwd

def bwd(error: DMat) = Layer.bwd(error)

def update(rate: Double): Unit = {
w := w.value - (rate * w.gradient)
b := b.value - (rate * b.gradient)
if(x.isInstanceOf[FeedForwardLayerSigmoid]) x.asInstanceOf[FeedForwardLayerSigmoid].update(rate)
}

def reset = Layer.reset

}
```
The feed forward layer is itself a computation node and specifies the computation for the layer. In the forward pass we compute the result of the complete layer and on the backward pass we also propagate the error through the whole layer. I also included an update function that recursively 'learns' the weights using stochastic gradient descent. Now we can learn the network using the following program.

```val nn = FeedForwardLayerSigmoid(FeedForwardLayerSigmoid(input, W1, B1), W2, B2)
// repeat for all examples
input := x
val prediction = nn.fwd
val error      = -(correct - prediction)
nn.bwd(error)
nn.update(0.01)
nn.reset
```

## Experiments

For the first experiment I use the synthetic spiral dataset. It is basically a non linear dataset with three classes.

I use a two layer neural network with the following initialisation

```input :=  DenseMatrix.rand[Double](1,   2)
W1    := (DenseMatrix.rand[Double](2, 100) - 0.5) * 0.01
B1    :=         DenseMatrix.zeros(1, 100)
W2    := (DenseMatrix.rand[Double](100, 3) - 0.5) * 0.01
B2    :=         DenseMatrix.zeros(1,   3)
```

And receive a 95% - 98% accuracy. So this seems to work. Next I tried to classify mnist. Again I used a two layer neural network. The hidden layer is 100 dimensions wide.

```x   := DenseMatrix.rand[Double](1,   784)
w1  := (DenseMatrix.rand[Double](784, 100) - 0.5) * 0.001
b1  := DenseMatrix.zeros[Double](1,   100)
w2  := (DenseMatrix.rand[Double](100, 10)  - 0.5) * 0.001
b2  := DenseMatrix.zeros[Double](1,   10)
```

I got a 97% accuracy. This is far away from the state of the art however, for a simple 2 layer neural network it is not bad. If you remove one layer I'll got an accuracy of about 92% which is the same reported in the tensorflow beginners tutorial. I also visualised the weights of the first layer.

One more thing. This library is really not optimized for speed. The big players such as theano, tensorflow, caffe and dl4j are way faster and more optimized. With this I just hope to clarify the concept.

References
```
[1]Stanford tutorial on unsupervised feature learning: http://ufldl.stanford.edu/wiki/index.php/Neural_Networks
[2]Stanford course on automatic differentiation: http://cs231n.github.io/optimization-2/
[3]Some notes on auto diff from Wisconsin: http://pages.cs.wisc.edu/~cs701-1/LectureNotes/trunk/cs701-lec-12-1-2015/cs701-lec-12-01-2015.pdf

```