Tuesday, December 12, 2017

Implementing Gradient Boosted Trees




In this post I will share my experience on implementing gradient boosted trees as seen in XGBoost in order to get a better understanding of the method. As in previous posts, the code won't be as efficient as the original. Furthermore, I will use Scala again.

The basic idea behind boosting is to construct an additive model of a collection of simpler models (base models) for classification or regression. If you are unfamiliar with the general boosting idea, please consider (Hastie et al.: "The Elements of Statistical Learning") for a great introduction. This post is mainly based on this blog post and these slides.

In boosting, the model is constructed iteratively, that is, in an iteration $t$, the new output given a datapoint $x$ is: $$ \hat{y}^t = \hat{y}^{(t - 1)} + f_t(x) $$ The idea of gradient boosting is that in each iteration the next model is determined so it minimizes a given loss function $\mathbb{L}$. Specifically, the minimization is performed using the first and second order derivatives of the loss function. Each added model can also be seen as a step in the direction of the gradient. In other words, the iterations can be seen as a form of gradient descent.

When extending this idea to regression tree base models and regularizing the objective function of the optimization, we end up with a powerful algorithm that won multiple Kaggle challenges as well as the Xing hosted recommender systems challenge two years in a row. In the next part, I will derive the optimization step used for decision trees with a cross entropy loss and then "massage" in the regularization.


Cross Entropy loss for Binary Classification

For binary classification, the go-to output function is the sigmoid function: $$P(x = 1) = \frac{1}{1 + e^{-x}}$$ Prominent usage examples include neural networks and logistic regression. Given that our output $P(x = 1)$ is $\hat{y}$ and the true label is $y$, we can measure the goodness of our classifier on our dataset using the cross entropy loss: $$ L(y, \hat{y}) = \sum_i -y_i * log(\hat{y_i}) - (1.0 - y_i) * log(1.0 - \hat{y_i})$$ with its first derivative $g_i = \hat{y_i} - y_i$ and its second derivative $h_i = \hat{y_i} * (1.0 - \hat{y_i})$.


Regularization

One of the ideas implemented in XGBoost is regularisation. Intuitively we want to penalize large parameters that dominate our result as well as model complexity. Given a parameter set $\theta$, the L1 regularization is $\Omega(\theta) = \sum_i ||\theta_i||_1$, where $||x||_1$ is the L1 norm or simply the absolute value. In L2 regularization, we switch to $||\theta_i||_2$ where $||x||_2$ is $x^2$. For a decision tree, we could measure the complexity as the number of leaf nodes $T$, and the parameters $\theta$ are the output scores of the leaf nodes (we will later use regression trees). For example, we could penalize large trees by $\gamma T$ and large output scores by $\frac{1}{2}\sum_i \lambda ||\theta_i||_2$. We can control the penalty amount by two parameters $\gamma$ and $\lambda$.
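As a small illustration (this is just a sketch of the idea, not part of the XGBoost code), the penalty for a single tree could then be computed from its leaf scores:

def treeRegularization(leafScores: Vector[Double], gamma: Double, lambda: Double): Double = {
  val T = leafScores.size                                    // number of leaf nodes
  gamma * T + 0.5 * lambda * leafScores.map(w => w * w).sum  // gamma * T + 1/2 * lambda * sum of squared leaf scores
}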


Gradient boosted trees

The combined objective using regularization and some loss function is: $$O = L(y, \hat{y}) + \Omega(\theta)$$ Since we are boosting, we define the objective in terms of the predictions in iteration $t$: $$O^t = L(y, \hat{y}^{t - 1} + f_t(x)) + \Omega(\theta)$$ where $\hat{y}^{t - 1} + f_t(x)$ follows directly from the boosting formula $\hat{y}^t = \hat{y}^{(t - 1)} + f_t(x)$. Since we want to optimize $f_t(x)$, we can apply the Taylor expansion to the loss. Remember that the Taylor expansion is defined as $f(x + \delta x) \approx f(x) + f'(x) * \delta x + \frac{1}{2} f''(x) * \delta x^2$, so the loss becomes: $$O^t = \sum_i [L(y_i, \hat{y_i}^{t - 1}) + g_i * f_t(x_i) + \frac{1}{2} * h_i * f_t(x_i)^2] + \Omega(\theta)$$ Assuming a decision tree base classifier, we can use the example regularization above and the objective becomes: $$O^t = \sum_i [L(y_i, \hat{y_i}^{t - 1}) + g_i * f_t(x_i) + \frac{1}{2} * h_i * f_t(x_i)^2] + \gamma T + \frac{1}{2} \lambda \sum_j w_j^2$$

With a few transformations, we can redefine the loss in a way we can directly optimize the function with the standard CART learning algorithm. First we define $\mathbb{I}(i \in j)$ as an indicator function that returns 1 if example $i$, when pushed through the decision tree, ends up in leaf node $j$. Then the result of $f_t(x_i)$ is $w_j$, the leaf node score at node $j$, if $x_i$ ends up in that leaf. Since the loss of the previous iteration is a constant with respect to $f_t$, we can drop it and rearrange the objective function towards the leaf nodes as: $$O^t = \sum_j [(\sum_i \mathbb{I}(i \in j) * g_i) * w_j + \frac{1}{2} (\sum_i \mathbb{I}(i \in j) * h_i + \lambda) * w_j^2] + \gamma T$$ Redefining the sums of the gradient statistics as $G_j$ and $H_j$ we get: $$O^t = \sum_j [G_j * w_j + \frac{1}{2} (H_j + \lambda) * w_j^2] + \gamma T$$ Now we can compute the optimal leaf score by setting the derivative of the objective with respect to $w_j$ to zero: $$G_j + (H_j + \lambda) * w_j = 0$$ and then $w_j = -\frac{G_j}{H_j + \lambda}$.

Intuitively, in each leaf node we model the gradient step to take given an input, and these gradients are normalized by their second order statistics. So each leaf prediction will push the total output of the ensemble more towards the wanted result. In order to construct a tree, we can use the same gradient statistics to define a gain function and then learn the tree like a normal decision tree. The relevant quantity is the gradient "amplitude" before the split, $(G_l + G_r)^2 / (H_l + H_r + \lambda)$, and after the split, $G_l^2 / (H_l + \lambda) + G_r^2 / (H_r + \lambda)$, where $G_l, H_l$ and $G_r, H_r$ are the sums of the gradient statistics in the left and the right node. Then the gain is simply: $$Gain = \frac{1}{2} [ \frac{G_l^2}{H_l + \lambda} + \frac{G_r^2}{H_r + \lambda} - \frac{(G_l + G_r)^2}{H_l + H_r + \lambda}]$$ Now we can implement these functions:
package gbt

import math._

trait Loss {

  def gradient(truth: Double, prediction: Double): Double

  def hessian(truth: Double, prediction: Double): Double

  def loss(truth: Double, prediction: Double): Double

  def output(x: Double): Double

}


class Logistic extends Loss {

  private def sigmoid(x: Double): Double = 1.0 / (1.0 + exp(-x))

  override def loss(truth: Double, prediction: Double): Double = {
    val isOne  = -truth * log(sigmoid(prediction))
    val isZero = (1 - truth) * log(1.0 - sigmoid(prediction))
    (isOne - isZero)
  }

  override def gradient(truth: Double, prediction: Double): Double = sigmoid(prediction) - truth

  override def hessian(truth: Double, prediction: Double): Double =
    sigmoid(prediction) * (1.0 - sigmoid(prediction))

  override def output(x: Double) = sigmoid(x)

}


class Squared extends Loss {

  override def loss(truth: Double, prediction: Double): Double = pow(truth - prediction, 2.0)

  override def gradient(truth: Double, prediction: Double): Double = 2.0 * (prediction - truth)

  override def hessian(truth: Double, prediction: Double): Double = 2.0

  override def output(x: Double): Double = x

}


object LossUtils {

  def score(gradients: Vector[Double], hessians: Vector[Double], lambda: Double): Double = {
    val G = gradients.sum
    val H = hessians.sum 
    - G / (H + lambda)
  }

  def gain(
    gradients: Vector[Double],
    hessians: Vector[Double],
    left: Vector[Int],
    right: Vector[Int],
    lambda: Double,
    gamma: Double
  ): Double = {
    val GL = left.map(i => gradients(i)).sum
    val HL = left.map(i => hessians(i)).sum
    val GR = right.map(i => gradients(i)).sum
    val HR = right.map(i => hessians(i)).sum 
    0.5 * (
      (pow(GL, 2) / (HL + lambda)) + (pow(GR, 2) / (HR + lambda)) - (pow(GL + GR, 2) / (HL + HR + lambda))
    ) - gamma
  }

}
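As a quick usage sketch with made up numbers, the leaf score and the gain of a candidate split that sends the first two of four examples to the left can be computed as:

val gradients = Vector(0.2, -0.1, 0.4, -0.3)
val hessians  = Vector(0.16, 0.09, 0.24, 0.21)

val leafScore = LossUtils.score(gradients, hessians, lambda = 1.0)
val splitGain = LossUtils.gain(gradients, hessians,
  left = Vector(0, 1), right = Vector(2, 3), lambda = 1.0, gamma = 0.0)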
The full code can be found on github.

Wednesday, November 29, 2017

Refactoring My Previous Deep Learning Library Ideas

Update: My Neural Network Posts

  • Basic Auto Differentiation: Post
  • Basic Auto Differentiation with Weight Sharing: Post
  • Auto Differentiation Code from Book: Post
  • Refactoring The Ideas: Post
In some previous posts I looked at how I could implement a simple deep learning library to gain a better understanding of the inner workings of auto differentiation, weight sharing and a good back propagation implementation from the deep learning book.

In this post I will present my ideas on a refactored version. The main point is that the graph representation I implemented in the first two posts made it easier to build the computation graph,
but the code to run back propagation, especially with weight sharing, was quite messy. So this refactored version first creates a computation tree as before, only that the actual computations are decoupled from the tree. The tree is later translated into the instructions and parent / child relationships used in the book version of the back propagation algorithm.

Now in the first step we write model setup code. For example, a neural network for mnist classification:



As in tensorflow I define placeholders, variables and computation nodes, along with their names.
Furthermore, the variables get a shape and an initialisation method. In the background this method constructs a computation tree rooted at a5 using the following DSL.
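As a rough sketch of what such a DSL could look like (all names below are mine and not taken from the actual code):

sealed trait TreeNode { def name: String }

case class Placeholder(name: String)                                      extends TreeNode
case class Variable(name: String, shape: (Int, Int), init: String)        extends TreeNode
case class Op(name: String, instruction: String, inputs: List[TreeNode])  extends TreeNode

// a small mnist network rooted at a5: sigmoid(x * w1 + b1) * w2 + b2
val x  = Placeholder("x")
val w1 = Variable("w1", (784, 100), "gaussian")
val b1 = Variable("b1", (1, 100),   "zeros")
val w2 = Variable("w2", (100, 10),  "gaussian")
val b2 = Variable("b2", (1, 10),    "zeros")

val a1 = Op("a1", "mul",     List(x, w1))
val a2 = Op("a2", "add",     List(a1, b1))
val a3 = Op("a3", "sigmoid", List(a2))
val a4 = Op("a4", "mul",     List(a3, w2))
val a5 = Op("a5", "add",     List(a4, b2))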



Each node keeps track of the name, an instruction name and a shape, if needed.
We can then convert this tree into a neural network ready to be used in the back propagation code from the original post. We build four maps keeping:
  1. Mapping from node names to computations (like add, multiply, ...)
  2. Mapping from node names to their parents node names
  3. Mapping from node names to their child node names
  4. Mapping from node names that are weights to matrices.
All computations such as add or multiply implement the forward pass through the variable 
as well as the backward pass.

Using a class I called TensorStore that maps from names to matrices, I implemented the forward and the backward pass as follows:
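A minimal sketch of how the forward pass over these maps could look, assuming the TensorStore is essentially a mutable map from names to matrices (the backward pass walks the child map analogously):

import breeze.linalg.DenseMatrix
import scala.collection.mutable

// name -> matrix store holding inputs, weights and cached intermediate results
class TensorStore(val tensors: mutable.Map[String, DenseMatrix[Double]]) {
  def apply(name: String): DenseMatrix[Double]           = tensors(name)
  def update(name: String, x: DenseMatrix[Double]): Unit = tensors(name) = x
  def contains(name: String): Boolean                    = tensors.contains(name)
}

trait Computation {
  def fwd(inputs: List[DenseMatrix[Double]]): DenseMatrix[Double]
}

// recursively evaluate a node: look up its parents, compute and cache the result
def forward(
  node: String,
  computations: Map[String, Computation],
  parents: Map[String, List[String]],
  store: TensorStore
): DenseMatrix[Double] =
  if (store.contains(node)) store(node)   // placeholders, weights or already computed nodes
  else {
    val inputs = parents(node).map(p => forward(p, computations, parents, store))
    val result = computations(node).fwd(inputs)
    store(node) = result
    result
  }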



The code can be found along the other versions on github. An example for the spiral dataset as well as the mnist data set is attached.

Of course, the code is not at all up to par with the popular libraries in terms of speed or functionality, so for me it is more an ongoing investigation into how these libraries are implemented and what I can reproduce :). On MNIST the given network achieves between 96% and 98% accuracy depending on the starting condition.


Sunday, September 10, 2017

Variational Fashion Encoder



In this experiment I want to try out Variational Auto Encoders, an unsupervised feature learning algorithm on a new fashion classification dataset from Zalando. Since I wanted to play around with both more, I will build a generic Variational Auto Encoder and then learn it on the new fashion-mnist dataset.
Code can be found on github


Variational Auto Encoder

The bottom part of the model is embedding the input X into a mean and a variance vector.
The mean and variance represent the parameters of a gaussian that is trained to be close to a standard normal distribution N(0, I). The decoder network is trying to reconstruct the input from a sample from said distribution. The encoder as well as the decoder network can have multiple layers, in my case 3 each.


The loss function is a combination of the Kullback-Leibler divergence between the distribution from the encoder network and a standard normal distribution, basically centering the latent space around 0 with standard deviation 1. This can be seen as a form of regularisation. Furthermore, the decoder
does not reconstruct directly from the encoder output but from a sample of the distribution in the latent space.
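To make these two ingredients concrete, here is a small sketch using breeze (the variable names are mine) of the reparameterised sample and the Kullback-Leibler term for a diagonal gaussian against N(0, I):

import breeze.linalg.{DenseVector, sum}
import breeze.numerics.exp

// z = mean + sigma * eps with eps ~ N(0, I): the decoder sees a sample from the
// encoder's distribution while gradients can still flow through mean and logVar
def reparameterize(mean: DenseVector[Double], logVar: DenseVector[Double],
                   eps: DenseVector[Double]): DenseVector[Double] =
  mean + (exp(logVar * 0.5) :* eps)

// KL( N(mean, diag(exp(logVar))) || N(0, I) ), the regularisation part of the loss
def klToStandardNormal(mean: DenseVector[Double], logVar: DenseVector[Double]): Double =
  0.5 * (sum(exp(logVar)) + sum(mean :* mean) - sum(logVar) - mean.length)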

Fashion Reconstruction

The images in the fashion dataset are 28 pixels by 28 pixels, in the same way as the MNIST dataset.
The fashion data consists of 10 classes of clothes:
  1. tshirt
  2. trouser
  3. pullover
  4. dress
  5. coat
  6. sandal
  7. shirt
  8. sneaker
  9. bag
  10. ankle boot
After training for 5 epochs on the training set, I plotted the reconstructions from the Variational Auto Encoder.

As one can see, the reconstructions work quite well, indicating that the learned latent space is useful.

Tuesday, August 1, 2017

Investigating Non Linear Optimization in Python

I recently skimmed through some class videos by Tucker Balch, a Professor at Georgia Tech, for a course called "Machine Learning For Trading". One interesting part was portfolio optimisation.
That is, if you want to invest some amount of money, how do you distribute it
over a selection of assets in a way minimising risk while at the same time
being able to earn money through your investment. A portfolio is basically an allocation of the money into different options or in other words, the percentage of the total amount of money allocated to an option. Anyways, in the class as well as the course book the optimisation is performed using a
python method for constrained non-linear optimisation. The algorithm used to crunch these problems is called Sequential Least Squares Programming, which is implemented using a Fortran wrapper.
The original algorithm can be found in the ACM Algorithms catalog (Algorithm 733)
written by D. Kraft. The code is automatically translated using the fortran to python (f2py)
library. 

Sunday, July 23, 2017

Black Box Variational Inference and PyFlux

I found an interesting paper on black box variational inference. Variational inference is an
approximate inference algorithm for factored probabilistic models (such as bayesian networks).

In probabilistic inference, we are normally interested in the posterior density of a latent variable
given all our data. The basic idea is to turn the inference problem into an optimisation problem.
That is, we build an approximate probability distribution that is differentiable with respect to its parameters. We then find the parameter set that minimises the difference between the true posterior and the approximate distribution.

The problem is that in order to do this for a new model, we still have to do a lot of manual derivations
and work through a lot of math before we can actually use this algorithm. This new paper
however, shows an algorithm that is easier to use given a new model. In my opinion,
one of the successes in deep learning is auto differentiation, where we do not
have to develop all gradients on our own but let the computer do it (as described in earlier posts).
One library that implements this inference method for complex time series is called pyFlux [2].

The proposed algorithm [1] works as follows. First, we sample the latent variables in our model
from the approximate distribution. We then use the samples to compute a gradient
for each parameter that minimises the Kullback-Leibler divergence, or equivalently maximises the Evidence Lower Bound (ELBO).


Reproduced from original paper [1]
The z[s] are samples from the latent space. Each z[s] includes a value for each of the latent variables. These samples are used to build the gradient.
The value of the Robbins-Monro sequence models an adaptive learning rate schedule
to guarantee convergence. The gradient has multiple components, including the joint probability of the original distribution, the log of the approximate distribution and the derivative of the approximate distribution. For a lot of cases, these values are easy to compute. First note that when using the samples, all variables in the original joint distribution are observed. For a Bayesian network that means that the joint distribution is simply the product of all the variables' conditional probabilities:


In the paper and in the pyFlux library, the approximate distribution is implemented as a mean field,
which means that the variables in our model are treated as independent. This makes the sampling and evaluation of the model very easy. Each variable can be sampled individually and the approximate joint distribution is simply the product over all the variables. Finally, the pyFlux library implements each variable in the approximate distribution as a one dimensional gaussian. Putting all that together,
the final gradient becomes:


Where the lambda parameters will be the mean and variance of each of the gaussians used
with the gradients:
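As a rough sketch (this is not the pyFlux implementation), the resulting estimate for a single one dimensional gaussian factor could look like the following, where logJoint stands in for the log of the model's joint density:

import breeze.stats.distributions.Gaussian

// Monte Carlo estimate of the ELBO gradient for one mean field factor q(z) = N(mu, sigma^2)
def elboGradient(mu: Double, sigma: Double,
                 logJoint: Double => Double, samples: Int = 100): (Double, Double) = {
  val q = Gaussian(mu, sigma)
  val grads = for (_ <- 1 to samples) yield {
    val z      = q.sample()                     // z[s] drawn from the approximation
    val weight = logJoint(z) - q.logPdf(z)      // log p(x, z[s]) - log q(z[s] | lambda)
    // score functions d/dmu log q(z) and d/dsigma log q(z) of a gaussian
    val dMu    = (z - mu) / (sigma * sigma)
    val dSigma = ((z - mu) * (z - mu) / (sigma * sigma) - 1.0) / sigma
    (dMu * weight, dSigma * weight)
  }
  (grads.map(_._1).sum / samples, grads.map(_._2).sum / samples)
}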





References:

[1] Ranganath, Gerrish, Blei: "Black Box Variational Inference", AISTATS 2014, 2014
[2] PyFlux

Friday, May 26, 2017

A Generic Sequence Alignment Algorithm

In this post I will describe how a general sequence alignment could be implemented in Scala.
For readers that don't know what sequence alignment is, I recommend reading my
post on building a gesture recogniser under Dynamic Time Warping distance (DTW).

Sequence alignment is often used in several contexts including speech recognition,
gesture recognition, approximate string matching, typing correction, biological
sequence alignment, time series classification and even stereo vision.
All these algorithms are based on a technique called dynamic programming and the algorithms
are all very similar. However, in the literature this alignment technique is often found under different names (Dynamic Time Warping, Levenshtein Distance, Needleman-Wunsch).

I will first formulate my idea of a general alignment algorithm with a scala implementation and afterwards describe an implementation of the three above algorithms.

Given two sequences X = [x1 .. xi .. xN] and Y = [y1 .. yj .. yM], we use the following
recursion to find the alignment score S(x, y). At each position i in sequence X and position j in sequence Y we look backwards at the matching score in X only, in Y only and in X and Y simultaneously, resulting in the following recursion:


The rank function scores the alignment at position i and j based on the previous alignment scores and
the current elements. The rank function is the main component that changes in all three algorithms.
Looking back in X direction is also called an insertion, in Y direction a deletion and for both it is
called a match so the ranking function becomes:



Furthermore we will also implement a generic initialisation function that starts our recursion.
Ok now the Scala code:

class Alignment[A](
  init: (Int, Int) => Double,
  rank: (Double, Double, Double, A, A) => Double
) {

  def score(x: Array[A], y: Array[A]): Double = {  
    val n  = x.size
    val m  = y.size
    val dp = Array.ofDim[Double](n + 1, m + 1)
    
    for(i <- 0 until n + 1) dp(i)(0) = init(i, 0)
    for(j <- 0 until m + 1) dp(0)(j) = init(0, j)

    for(i <- 1 until n + 1) {
      for(j <- 1 until m + 1) {
        dp(i)(j) = rank(
          dp(i - 1)(j),     // insertion
          dp(i - 1)(j - 1), // match
          dp(i)(j - 1),     // deletion 
          x(i - 1),         // current sample in x 
          y(j - 1)          // current sample in y
        )
      }
    } 

    dp(n)(m)
  }

}



The alignment class has a type parameter specifying the type of the elements in the sequences we want to align and takes the initialising function and the ranking function as input. Now DTW and the Edit distance are just specialisations with a different type parameter and scoring function:

  class DTW extends Alignment[Double](
    init = (i: Int, j: Int) => if(i == 0 && j == 0) 0.0 else Double.PositiveInfinity,
    rank = (insert: Double, matching: Double, delete: Double, x: Double, y: Double) =>
      if(insert < matching && insert < delete) insert + math.pow(x - y, 2)
      else if (delete < insert && delete < matching) delete + math.pow(x - y, 2)
      else matching + math.pow(x - y, 2)
  )


DTW is defined over doubles and initialises the recursion with 0.0 at position (0,0) and all other positions with infinity. The ranking adds the squared euclidean distance to the insertion, deletion and match scores and then takes the minimum. It is easy to see how this can be generalised to multivariate sequences: the type parameter becomes Seq[Double] and the euclidean distance is implemented for vectors. One can also use kernels or the cosine similarity instead. The next function is the edit distance for discrete sequence elements (here characters):

  class Levensthein extends Alignment[Char](
    init = (i: Int, j: Int) => if(i == 0 && j == 0) 0.0 else if(i == 0) j else i,
    rank = (insert: Double, matching: Double, delete: Double, x: Char, y: Char) =>
      if(insert < matching && insert < delete) insert + 1
      else if (delete < insert && delete < matching) delete + 1
      else matching + (if(x == y) 0.0 else 1.0)
  )


The function is initialised as the index at the borders and the function again takes the minimum of insertion, deletion and match. Insertions and deletions count as an error so we add 1. A match can also be a substitution so we add 1 to the error if the characters do not match. The last algorithm is the Needleman-Wunsch algorithm. Instead of minimising the distance we are maximising a score. The similarity between two symbols can be defined and the 'gap' penalty is not 1 as in the edit distance but given.

  final val Sim = Map(
    ('a', 'a') -> 10,
    ('a', 'g') -> -1,
    ('g', 'a') -> -1,
    ('a', 'c') -> -3,
    ('c', 'a') -> -3,
    ('a', 't') -> -4,
    ('t', 'a') -> -4,
    ('g', 'g') ->  7,
    ('g', 'c') -> -5,
    ('c', 'g') -> -5,
    ('g', 't') -> -3,
    ('t', 'g') -> -3,
    ('c', 'c') ->  9,
    ('c', 't') ->  0,
    ('t', 'c') ->  0,
    ('t', 't') ->  8
  )

  final val Gap = -5
  
  class NeedlemanWunsch extends Alignment[Char](
    init = (i: Int, j: Int) => if(i == 0 && j == 0) 0.0 else if(i == 0) j * Gap else i * Gap,
    rank = (insert: Double, matching: Double, delete: Double, x: Char, y: Char) =>
      if(insert > matching && insert > delete) insert + Gap
      else if (delete > insert && delete > matching) delete + Gap
      else matching + Sim(x, y)
  )
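Assuming the classes above are in scope, a small usage sketch with made up sequences could look like this:

  val dtw  = new DTW
  val edit = new Levensthein
  val nw   = new NeedlemanWunsch

  dtw.score(Array(1.0, 2.0, 3.0), Array(1.0, 2.0, 2.5, 3.0))  // summed squared distances along the warping path
  edit.score("kitten".toCharArray, "sitting".toCharArray)      // number of edit operations between the two words
  nw.score("gattaca".toCharArray, "gcatgcg".toCharArray)       // alignment score under Sim and Gap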



If one needs the actual alignment, we can backtrack the alignment process in all above cases. The code can be found here [GitHub]

Saturday, April 15, 2017

Three Algorithm Tricks in Data Mining

I really enjoy it when concepts can be reused or applied in several contexts. Three of the algorithms
that seem to be everywhere are quick sort, merge sort and binary search. In this post I will
just show some tricks I learned during coding. Using an algorithm similar to the merge method
of merge sort for fast sparse vector multiplication, using binary search to insert into a histogram in O(log(n)) time, and using the partition method of quick sort during C4.5 decision tree training.


Fast Sparse Vector Multiplication

A sparse vector is a vector where most of the entries are zeros. A common data structure is an array saving all non zero entries. Each entry is a tuple of the position and the corresponding value, and the array is sorted increasingly by the position. Given two such vectors, we can multiply them very efficiently. We use two variables i and j, indicating a position in the arrays of the two vectors x and y. The variables move at the same speed as long as the positions of x(i) and y(j) are the same. In this case, we also aggregate the multiplication of the vector elements. If i points to a larger position than j, we only move j until i and j point to equal positions again, and vice versa. In that case either x(i) or y(j) is a zero entry, so we don't have to compute anything. The movement of i and j is the same as in the merge method of merge sort.



object DotProduct {

  case class SparseNode(pos: Int, value: Double)

  type SVector = Array[SparseNode]  

  def dot(x: SVector, y: SVector): Double = {
    var i = 0
    var j = 0
    var product = 0.0
    while(i < x.length && j < y.length) {
      if(x(i).pos == y(j).pos) {
        product += x(i).value * y(j).value
        i += 1
        j += 1
      } else if(x(i).pos < y(j).pos) i += 1
      else j += 1
    }
    product
  }

}
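For example, two sparse vectors that only overlap at positions 3 and 7 can be multiplied like this:

import DotProduct._

val x: SVector = Array(SparseNode(0, 2.0), SparseNode(3, 1.0), SparseNode(7, 4.0))
val y: SVector = Array(SparseNode(3, 2.0), SparseNode(7, 0.5))

DotProduct.dot(x, y)   // 1.0 * 2.0 + 4.0 * 0.5 = 4.0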

Histogram Insertion

A histogram can be represented as an array that holds the counts of values that fall into 
a certain region. The range is another array containing the boundaries, which should be sorted
in increasing order. When we want to fill the histogram, we need to find the region a value falls in. Naively that can be done by iterating the boundary array and checking if the value is in the range. However, we can perform this search using binary search in the boundary array.
Instead of searching for an element though, we check if the value is larger than boundary[middle - 1] and smaller than or equal to boundary[middle]. If so, we increase the count for the middle bin; otherwise we recursively search the left or right side, depending on whether the value is smaller or larger than the middle.


class BinarySearchHistogram(val binBorders: Array[Double]) {

  final val bins = Array.fill(binBorders.length)(0.0)

  def findBin(x: Double, start: Int, stop: Int): Int = {   
    val center = ( (stop - start) / 2 ) + start
    if (center == 0 || center == binBorders.length - 1) center
    else if(x > binBorders(center - 1) && x <= binBorders(center)) center
    else if(x > binBorders(center - 1)) findBin(x, center, stop)
    else findBin(x, start, center)
  }

  def insert(x: Double): Unit = bins(findBin(x, 0, binBorders.length)) += 1.0

}
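For example, with bin borders at 0, 1, 2 and 3, three inserts are counted like this:

val hist = new BinarySearchHistogram(Array(0.0, 1.0, 2.0, 3.0))

hist.insert(1.5)   // falls into (1.0, 2.0], bin 2
hist.insert(1.7)   // falls into (1.0, 2.0], bin 2
hist.insert(2.5)   // larger than 2.0, ends up in the last bin
// hist.bins is now Array(0.0, 0.0, 2.0, 1.0)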

Decision Tree Learning

A dataset for decision tree learning can be represented as a 2d array X[N][D] of N vectors each of D
dimensions. Each node in the decision tree partitions the data points into two subsets: one contains all vectors whose value in some dimension d is smaller than a threshold and the other subset contains the vectors where it is larger. We can introduce an ordering array. The left side of the array up to a pivot element contains all instances where the split decision is true and the right side contains all elements where the decision is false. We can swap the elements from the left to right whenever
a split decision is violated (the element should be left but is right). If we apply the partition
function recursively on the left and the right side, we partition the dataset according to the split decisions as a decision tree does. Of course, we still have to select the correct
split decision for every level but this is material for another post. In the end, this function is very similar to quick sort's partition method. Another note: the method only works well for binary decision trees, which is not a huge problem in practice.


class DataSet(val ordering: Array[Int], val data: Array[Array[Double]]) {

  def partition(dec: Decision, hi: Int, lo: Int): Int = {
    var i = lo
    var j = lo
    while(j < hi) {
      if( dec.f( data(ordering(j)) ) ) {
        val tmp = ordering(j)
        ordering(j) = ordering(i)
        ordering(i) = tmp
        i += 1
      }
      j += 1
    }
    i
  }

}

case class Decision(attr: Int, th: Double) {

  def f(x: Array[Double]): Boolean = x(attr) <= th

}
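A small usage example: four one dimensional points and a split on dimension 0 at threshold 2.0 (the values are made up):

val data     = Array(Array(1.0), Array(5.0), Array(2.0), Array(7.0))
val ordering = Array(0, 1, 2, 3)
val dataset  = new DataSet(ordering, data)

val pivot = dataset.partition(Decision(attr = 0, th = 2.0), hi = data.length, lo = 0)
// ordering entries before the pivot now index the points with x(0) <= 2.0,
// here points 0 and 2, so pivot is 2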

Wednesday, March 1, 2017

Auto differentiation: Implementing the book version using Dynamic Programming

Update: My Neural Network Posts

  • Basic Auto Differentiation: Post
  • Basic Auto Differentiation with Weight Sharing: Post
  • Auto Differentiation Code from Book: Post
  • Refactoring The Ideas: Post
In the last two posts I talked about my interpretation of Auto Differentiation and how to implement it. This time, I will show the implementation as given by the Deep Learning book and show that the algorithm does the same as my version. The algorithm from the book saves the computation graph explicitly and independently from the computations, while my implementation constructed the graph implicitly. Every node had its children as a field, and calling forward on a child would recursively calculate all values. The parents were saved in the node explicitly, too. Furthermore, the computations for the forward pass and backward pass were implemented in the node, and I saved all computation results in the node as well. The algorithm given in the book separates all of this. The results computed so far are saved in a table (as in classic dynamic programming), the computation node is a separate entity holding the forward and backward computation, and the graph is coded as children and parent relationships. The code to implement this behaviour is given below:
 
import NeuralNetork._

import breeze.linalg._

trait Computation {

  def fwd(requestID: Int, parents: Array[Int], results: ComputationTable): DenseMatrix[Double]

  def bwd(requestID: Int, parents: Array[Int], gradient: DenseMatrix[Double], inputs: ComputationTable): DenseMatrix[Double]

}

class NeuralNetork(val children: Graph, val parents: Graph, cmp: Computations) {

  def fwd(requestID: Int, results: ComputationTable): Unit = {
    if(results(requestID) == None) {
      for(p <- parents(requestID)) {
        fwd(p, results)
      }
      results(requestID) = Some(cmp(requestID).fwd(requestID, parents(requestID), results))
    }
  }

  def bwd(requestID: Int, gradients: ComputationTable, inputs: ComputationTable): DenseMatrix[Double] = {
    if (gradients(requestID) != None) gradients(requestID).get
    else {
      val g = for (c <- children(requestID)) yield {
        val gradient = bwd(c, gradients, inputs)
        cmp(c).bwd(requestID, parents(c), gradient, inputs)
      }
      gradients(requestID) = Some(g.reduce(_ + _))
      gradients(requestID).get
    }
  }

}

object NeuralNetork {

  type ComputationTable = Array[Option[DenseMatrix[Double]]]
  type Graph            = Array[Array[Int]]
  type Computations     = Map[Int, Computation]

}
In the end this computation does the same thing as the previous code; it is just another way of writing things down. The Computations can then be implemented in the same way as earlier. For more details check out Chapter 6 of the book. Now we can actually implement our nodes. In this example I am only using Add and Mul:
  class Add extends Computation {
  
    def fwd(requestID: Int, parents: Array[Int], results: ComputationTable): DenseMatrix[Double] = {
      results(parents(0)).get + results(parents(1)).get
    }

    def bwd(requestID: Int, parents: Array[Int], gradient: DenseMatrix[Double], inputs: ComputationTable): DenseMatrix[Double] = gradient   

  }

  class Mul extends Computation {

     def fwd(requestID: Int, parents: Array[Int], results: ComputationTable): DenseMatrix[Double] = {
      results(parents(0)).get :* results(parents(1)).get
    }

    def bwd(requestID: Int, parents: Array[Int], gradient: DenseMatrix[Double], inputs: ComputationTable): DenseMatrix[Double] = {
      if(parents(0) == requestID) inputs(parents(1)).get :* gradient
      else inputs(parents(0)).get :* gradient
    }

  }
And then we can set up a simple computation:
  • a := -2
  • b := 5
  • c := -4
  • d := 2
  • (a + b) * c
  • (a + b) * d
Which can be specified as:
    val in = Array.fill[Option[DenseMatrix[Double]]](7)(None)
    in(0)  = Some(DenseMatrix.ones[Double](1, 2) * -2.0)
    in(1)  = Some(DenseMatrix.ones[Double](1, 2) *  5.0)
    in(2)  = Some(DenseMatrix.ones[Double](1, 2) * -4.0)
    in(3)  = Some(DenseMatrix.ones[Double](1, 2) *  2.0)

    val cmp = Map(
      4 -> new Add,
      5 -> new Mul,
      6 -> new Mul
    )    

    val children = Array(
      Array(4),
      Array(4),
      Array(5),
      Array(6),
      Array(5, 6),
      Array.empty[Int],
      Array.empty[Int]
    )

    val parents = Array(
      Array.empty[Int],
      Array.empty[Int],
      Array.empty[Int],
      Array.empty[Int],
      Array(0, 1),
      Array(2, 4),
      Array(3, 4)
    )

    val nn = new NeuralNetork(children, parents, cmp)

    val gradients = Array.fill[Option[DenseMatrix[Double]]](7)(None)
    gradients(5)  =  Some(DenseMatrix.ones[Double](1, 2) * 1.0)
    gradients(6)  =  Some(DenseMatrix.ones[Double](1, 2) * 1.0)

    nn.fwd(5, in)
    nn.fwd(6, in)
    println(in(5))
    println(in(6))
    println("---")
    nn.bwd(0, gradients, in)
    nn.bwd(1, gradients, in)
    nn.bwd(2, gradients, in)
    nn.bwd(3, gradients, in)
    gradients.zipWithIndex.foreach(println)
The computation for the forward pass results in:
  • Some(-12.0 -12.0 )
  • Some(6.0 6.0 )
and the gradients for each variable
  • a = (Some(-2.0 -2.0 ),a)
  • b = (Some(-2.0 -2.0 ),b)
  • c = (Some(3.0 3.0 ),c)
  • d = (Some(3.0 3.0 ),d)
I also implemented a feed forward neural network with two layers:
   class Add extends Computation {

     def fwd(
       requestID: Int, parents: Array[Int], results: ComputationTable
     ): DenseMatrix[Double] = {
       results(parents(0)).get + results(parents(1)).get
     }

     def bwd(
       requestID: Int,
       parents: Array[Int],
       gradient: DenseMatrix[Double],
       inputs: ComputationTable
     ): DenseMatrix[Double] = gradient

   }

   class Mul extends Computation {

      def fwd(
        requestID: Int, parents: Array[Int], results: ComputationTable
      ): DenseMatrix[Double] = {
       results(parents(0)).get * results(parents(1)).get
     }

     def bwd(
       requestID: Int,
       parents: Array[Int],
       gradient: DenseMatrix[Double],
       inputs: ComputationTable
     ): DenseMatrix[Double] =
       if(parents(0) == requestID) gradient * inputs(parents(1)).get.t
       else inputs(parents(0)).get.t * gradient

   }

   class Sigmoid extends Computation {

     def fwd(
       requestID: Int, parents: Array[Int], results: ComputationTable
     ): DenseMatrix[Double] = sigmoid(results(parents(0)).get)

     def bwd(
       requestID: Int,
       parents: Array[Int],
       gradient: DenseMatrix[Double],
       inputs: ComputationTable
     ): DenseMatrix[Double] =
       (1.0 - fwd(requestID, parents, inputs)) :* fwd(requestID, parents, inputs) :* gradient
   }

   def twoLayerNet = {
     val update = Array(1, 2, 3, 4)
     val output = 10
     val cmp = Map(
       5 ->  new Mul,
       6 ->  new Add,
       7 ->  new Sigmoid,
       8 ->  new Mul,
       9 ->  new Add,
       10 -> new Sigmoid
     )

     val children = Array(
       Array(5),
       Array(5),
       Array(6),
       Array(8),
       Array(9),
       Array(6),
       Array(7),
       Array(8),
       Array(9),
       Array(10),
       Array.empty[Int]
     )
     val parents = Array(
       Array.empty[Int],
       Array.empty[Int],
       Array.empty[Int],
       Array.empty[Int],
       Array.empty[Int],
       Array(0, 1),
       Array(5, 2),
       Array(6),
       Array(7, 3),
       Array(8, 4),
       Array(9)
     )
     val nn = new NeuralNetork(children, parents, cmp)
     (nn, output, update)
   }

On the spiral dataset and on the MNIST dataset I got around 98% accuracy.


Tuesday, February 28, 2017

Implementation notes on my auto differentiation for weight sharing

Update: My Neural Network Posts

  • Basic Auto Differentiation: Post
  • Basic Auto Differentiation with Weight Sharing: Post
  • Auto Differentiation Code from Book: Post
  • Refactoring The Ideas: Post
In my last post I wrote about building a very crude home made auto differentiation engine. As a side note I mentioned that if we get multiple incoming gradients we have to sum them, and I also provided an option to sum the gradients and push them out downstream instantly. However, in order to implement proper back propagation we have to wait until all incoming gradients have arrived and are summed up before the gradient is propagated. This is especially important for the weight sharing needed to implement recurrent neural networks such as LSTMs or more complex models such as the variational auto encoder.

If you are interested in the full code with the examples, I uploaded a Scala version to Github.

Implementing the base computation node

The forward pass is the same as in the previous posts. The forward pass
simply computes its activations using its child nodes' forward computations.
In this case the abstract class already holds the input nodes.
The backward pass follows the same idea too. Based on the gradients
from higher nodes and the local gradient in each node, we push the error
backwards through our computation. The only difference is that I added a wait
list which saves all parent node names. If all parents send their gradients,
we sum all of them and push down the error. Furthermore, I added
a recursive plotting function of the compute graph. The downside is now that we
have to register all parents at each node. During the graph plotting the parent
and wait relations are also checked.


abstract class Node(val name: String, inputs: Vector[Node]) {
  /**
    * Names of parents who have to send a gradient before backwards
    * pass continues
    */
  val WaitFor   = mutable.Set.empty[String]

  /**
    * All parents gradients
    */
  val Gradients = mutable.HashMap.empty[String, DenseMatrix[Double]]

  /**
    * Set value of forwards computation
    */
  protected var Output: Option[DenseMatrix[Double]] = None

  /**
    * Should attach node type to name
    */
  def typename: String

  /**
    * Recursively plot parent child relationships in dot format.
    * Also output error if the waitlist does not include the expected parent
    */
  def graph(parent: String): List[String] = {
    if(!WaitFor.contains(parent) && parent.size > 1) throw new Exception(s"${parent} ${typename}_${name} connection not found")
    if(inputs.size == 0) List.empty[String]
    else {
      (for (in <- inputs) yield (s"${in.typename}_${in.name} -> ${typename}_${name};" :: in.graph(name))).flatten.toList
    }
  }

  def compute(x: Vector[DenseMatrix[Double]]): DenseMatrix[Double]

  def distribute(gradient: DenseMatrix[Double]): Unit

  def gradient: DenseMatrix[Double] = Gradients.map(_._2).reduce(_ + _)

  def register(parent: String): Unit = WaitFor += parent

  def reset: Unit = {
    Gradients.clear()
    Output = None
    inputs.foreach(_.reset)
  }

  def fwd: DenseMatrix[Double] = Output match {
    case Some(x) => x
    case None => {
      val x = compute(inputs.map(_.fwd))
      Output = Some(x)
      x
    }
  }

  def ready = (WaitFor & Gradients.keySet) == WaitFor

  def bwd(parent: String, gradient: DenseMatrix[Double]): Unit = {
    Gradients += parent -> gradient
    // only push the summed error downstream once all registered parents have sent their gradient
    if(ready) distribute(this.gradient)
  }

}


The rest of the computations such as multiply, add and the dot product stay the same.
Now we are actually able to build models with weight sharing.

A simple LSTM

Now we can implement a simple LSTM node. An LSTM is a recurrent node
that holds its state through multiple time steps using a cell node. In the LSTM node
this cell can be updated based on the current input, a hidden state and the last cell state.
An LSTM node is wired in the following way.



This wiring actually translates into the following code and the graph below.

class LSTM(val in: Node, val ht: Node, val ct: Node, wf: Tensor, bf: Tensor, wi: Tensor, bi: Tensor, wo: Tensor, bo: Tensor, wc: Tensor, bc: Tensor) {

  val x = new Concat(in ,ht)
  val input  = new FeedForwardSigmoid(x, wi, bi)
  val output = new FeedForwardSigmoid(x, wo, bo)
  val forget = new FeedForwardSigmoid(x, wf, bf)
  val cell   = new FeedForwardTanH(x, wc, bc)

  val forgetCell = new Mul(forget.Layer, ct)
  val inCell     = new Mul(input.Layer, cell.Layer)
  val cellState  = new Add(inCell, forgetCell)
  val cellAct    = new Tanh(cellState)
  val hidden     = new Mul(output.Layer, cellAct)

  final val Layer = {
    in.register(x.name)
    ht.register(x.name)

    forget.Layer.register(forgetCell.name)
    ct.register(forgetCell.name)

    input.Layer.register(inCell.name)
    cell.Layer.register(inCell.name)

    inCell.register(cellState.name)
    forgetCell.register(cellState.name)

    cellState.register(cellAct.name)

    output.Layer.register(hidden.name)
    cellAct.register(hidden.name)

    (cellState, hidden)
  }

  def update(rate: Double): Unit = {
      wf := wf.fwd - ((rate * wf.gradient))
      bf := bf.fwd - ((rate * bf.gradient))
      wi := wi.fwd - ((rate * wi.gradient))
      bi := bi.fwd - ((rate * bi.gradient))
      wc := wc.fwd - ((rate * wc.gradient))
      bc := bc.fwd - ((rate * bc.gradient))
      wo := wo.fwd - ((rate * wo.gradient))
      bo := bo.fwd - ((rate * bo.gradient))
  }
}






Then for some given input we can unroll multiple cells and run regular stochastic gradient descent.
Given a list of inputs, representing a time series, we can unroll the LSTM. Notice that we reuse the same weights for each time step. In other words, if we unroll for 5 time steps, every weight will receive 5 gradient updates.


  def unroll(in: List[Node]): (LSTM,List[FeedForwardTanH]) =
    if(in.size == 1) {
      val lstm = new LSTM(in.head, h, c, wf, bf, wi, bi, wo, bo, wc, bc)
      val classify  = new FeedForwardTanH(lstm.Layer._2, wl, bl)
      (lstm, List(classify))
    } else {
      val (unrolled, out) = unroll(in.tail)
      val lstm = new LSTM(in.head, unrolled.Layer._2, unrolled.Layer._1, wf, bf, wi, bi, wo, bo, wc, bc)
      val classify  = new FeedForwardTanH(lstm.Layer._2, wl, bl)
      (lstm, classify :: out)
    }


I trained the network with step size 5 on a trigonometric function and then generated 100 points
from an initial 5 points. Here is the result.







This looks promising I guess. So now we can generate some functions.


Variational Auto Encoders

The next model is a variational auto encoder. Here we do not connect the encoder
directly to the decoder, but let the encoder generate a gaussian distribution; a sample from that
distribution is pushed through the decoder to reconstruct the input. The sample is
drawn from a multivariate gaussian with zero mean and variance one.

val enc   = new FeedForwardTanH(in, wenc, benc)
val mean1 = new Inner(enc.Layer, wmean)
val mean  = new Add(mean1, bmean)
val sigm1  = new Inner(enc.Layer, wsig)
val sigm2  = new Add(sigm1, bsig)
val sigm   = new Mul(sigm2, half)
val d1 = new Exp(sigm)
val d2 = new Mul(sample, d1)
val dist = new Add(mean, d2)
val dec  = new FeedForwardSigmoid(dist, wdec, bdec)


Or, easier to read, as the final graph:





I ran a very small number of iterations on 200 random MNIST images. If we push random samples
through the decoder, we get images like these, which also looks promising.



Most of the results are not amazing. The speed of the tool is not comparable at all to Tensorflow
and in turn I do not process a lot of data. Also, my hidden layers are very small. Furthermore, the only optimization I do is simple stochastic gradient descent. But this post is about differentiation and weight sharing, not learning awesome models ... I guess. Again, all the stuff is on github. In the next post I will go into developing a more performant variant of this algorithm.

Furthermore, the new Deep Learning Book [3] defines the backpropagation algorithm in the general case as a dynamic programming problem filling a table. However, I think my method of directly coding the graph is a little easier to explain.

References

[1] Hochreiter, Schmidhuber: "Long Short-Term Memory", Neural Computation, 1997
[2] Variational Auto Encoder, https://arxiv.org/abs/1606.05908
[3] Goodfellow, Bengio, Courville: "Deep Learning", MIT Press 2016



Thursday, February 16, 2017

AI: Implement your own Automatic Differentiation Algorithm For Neural Networks


Update: My Neural Network Posts

  • Basic Auto Differentiation: Post
  • Basic Auto Differentiation with Weight Sharing: Post
  • Auto Differentiation Code from Book: Post
  • Refactoring The Ideas: Post

Modern deep learning libraries such as TensorFlow provide automatic differentiation in order to
compute the gradients in each neural network layer. While the algorithm for back propagation
in feed forward networks is well known, the gradients for new or more interesting architectures
are often hard to derive by hand. This is where automatic differentiation can help. In this post
I will describe the algorithm and provide a simple Scala implementation that might help
to understand how automatic differentiation can be implemented. I also 'cheat' in the sense that
I use a linear algebra library that already interfaces to some native libraries that in turn use BLAS and LAPACK, so my performance is dependent on this wrapper. I guess for playing around this is sufficient.
The code can be found on Github.


Automatic Differentiation

In my opinion the easiest way to understand automatic differentiation is through a computation
graph. In a computation graph variables, operations and functions are represented as nodes in the graph while edges represent the inputs and outputs of said functions and operations.

Let us consider the example computation:


  • W * x + b 
  • with W = 2
  • with  x = 3
  • with  b = 1

This results in the following computation tree. If we want to get the value of this formula, we can
request the value of the root node and then evaluate the expression recursively. The result
(here 7) can then be obtained through backtracking. This is equivalent to finding all paths to the target variable and on the way instantiating all the results needed to compute the final value.
We can then propagate derivatives backwards from the variable to the inputs. For example,
if we measure an error of 2 at the top, we can push it down the tree to all children.
Since we apply the chain rule, the derivative of a composed function being f(g(x))' = f'(g(x)) * g'(x),
we multiply the gradient flowing from the top with the local gradient at the current node.
When adding two numbers, the local gradient is 1 so the gradient from the top flows through
uninterrupted to both children. Let the gradient from the top be called delta.
When multiplying two numbers, x and w, the gradient flowing towards x and w is:

  • dx  = delta * w
  • dw = delta * x


So the inputs get swapped and multiplied with the incoming gradient. The following image shows the forward pass in green and the backward pass in red.
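Plugging the example values into these rules gives:

  • forward pass: W * x = 2 * 3 = 6 and 6 + b = 7
  • backward pass with delta = 2 at the top: the addition passes the gradient through unchanged, so db = 2 and the multiplication node also receives 2
  • at the multiplication the inputs get swapped: dW = delta * x = 2 * 3 = 6 and dx = delta * W = 2 * 2 = 4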



Before we go on, note that if there is a node with two outgoing edges in the forward pass, we sum all the incoming gradients during back propagation. While this does not happen in feed forward neural networks, it happens when using weight sharing such as in convolutional neural networks or recurrent neural networks.

Now we can turn to matrix computations needed for neural networks. Adding two matrices is actually equivalent to the scalar form from before. Since the addition is an element wise operation, the gradient (now a matrix) simply flows unchanged to the children. In general element wise operations behave like their scalar counterparts.

The gradient of the dot product between two matrices x and w becomes:


  • dx  = delta * w.t
  • dw = x.t * delta

The only thing left to build a neural network is to know the gradient of the activation function.
When using sigmoid units the forward and backward pass for this node given the input x is:

  • fwd: y(x) = 1.0 / (1.0 + exp(-x))
  • bwd: (1.0 - y(x)) * (y(x))
Again this is an element wise operation so we can simply apply the formula to each element in the matrix.

A quick and dirty implementation

In order to implement this behaviour we first implement a general computation node
 
abstract class ComputationNode {

  var valueOpt: Option[DMat]    = None
  var gradientOpt: Option[DMat] = None

  def gradient = gradientOpt.get

  def value = valueOpt.get

  def assignVal(x: DMat): DMat = {
    if(valueOpt == None) {
      valueOpt = Some(x)
    }
    value
  }

  def collectGradient(x: DMat): Unit = {
    if(gradientOpt == None) gradientOpt = Some(x)
    else gradientOpt = Some(x + gradient)
  }

  def reset: Unit

  def fwd: DMat

  def bwd(error: DMat): Unit

}

The node holds a value that is computed during the forward pass and the gradient.
I implemented this so the value does not have to be recomputed.
In other words, the values and gradients are represented as Scala options set to None by default.
When the value is computed for the first time, we initialise the option to that value. When the value is needed again, we do not have to recompute it. For the gradients, when the first gradient comes in we set the value, and all later gradients are summed onto it.

UPDATE: If you actually want to use weight sharing it is important to have a mechanism 
to only send the backward message when all incoming gradients are received. This is not implemented here. I will discuss weight sharing in a later post.

The forward computation should compute the result up to that node and the backward function
should distribute the gradient. The reset function needs to delete all values (set back to None values)
in order to be able to reuse the graph.

Some Computation Nodes For Tensors

Now we are ready to implement the actual computation nodes. All these nodes
extend the abstract class from above. The nodes we need for a basic neural network are addition of two matrices, the inner product and the sigmoid function since one layer in a feed forward neural network is given by: sigmoid(x*W + b). Using the linear algebra library this is actually quite simple.

During the forward pass in an addition node, we simply add the forward results (2 matrices) so far and pass back the error unchanged.

case class Addition(x: ComputationNode, y: ComputationNode) extends ComputationNode {

  def fwd = assignVal(x.fwd + y.fwd)

  def bwd(error: DMat) = {
    collectGradient(error)
    x.bwd(error)
    y.bwd(error)
  }

  def reset = {
    gradientOpt = None
    valueOpt    = None
    x.reset
    y.reset
  }

}

During the forward pass in an inner product node, we compute the dot product (2 matrices) so far and pass back the error multiplied with the other child's output.

case class InnerProduct(x: ComputationNode, y: ComputationNode) extends ComputationNode {

  def fwd = {
    assignVal(x.fwd * y.fwd)
  }

  def bwd(error: DMat) = {
    collectGradient(error)
    x.bwd(error * y.fwd.t)
    y.bwd(x.fwd.t * error)
  }

  def reset = {
    gradientOpt = None
    valueOpt    = None
    x.reset
    y.reset
  }

}

The sigmoid behaves like expected. We simply pass the error flowing from the top multiplied with the derivative of the function.

case class Sigmoid(x: ComputationNode) extends ComputationNode {

  def fwd = {
    assignVal(sigmoid(x.fwd))
  }

  def bwd(error: DMat) = {
    collectGradient(error)
    val ones = DenseMatrix.ones[Double](value.rows, value.cols)
    val derivative = (ones - value) :* value
    x.bwd(error :* derivative)
  }

  def reset = {
    gradientOpt = None
    valueOpt    = None
    x.reset
  }

}

Last but not least we need a class representing a tensor such as a weight or an input.
During the forward pass we simply return the value of the tensor. And during the backward
pass we sum up all incoming gradients. There is another method called ':=' (using operator overloading) to set a tensor.

class Tensor extends ComputationNode {

  def fwd = value

  def bwd(error: DMat) = collectGradient(error)

  def :=(x: DMat) = valueOpt = Some(x)

  def reset = {
    gradientOpt = None
  }

}

For example, using this implementation we can build a simple logistic regressor
val x = new Tensor
val w = new Tensor
val b = new Tensor

w := DenseMatrix.rand[Double]( 100, 1)
b := DenseMatrix.zeros[Double](1,  1)

val logistic = Sigmoid(Addition(InnerProduct(x, w), b))

x := input
val result = logistic.fwd
val error  = - (label - result)

logistic.bwd(error)

w := w.value - 0.01 * w.gradient
b := b.value - 0.01 * b.gradient
In the example, we first initialize some tensors and then define the computation graph. Setting the input we compute the result of the computation and then push the error back through the graph. In the end, we update the weights and biases. This can be seen as one step of stochastic gradient descent.

Implementing a Generic Feed Forward Layer and a small Calculation Domain

In order to have an easier job of defining computations we now implement a small domain specific language to create the graph. We use operator overloading. We now define a wrapper class around computation nodes that enables us to write expressions such as: sigmoid(Computation(x) * w + b). The class basically holds the computation so far. If we call the '+' function, we chain the computation so far with the new compute node using an add node as a parent and return a new computation. In this way we can also implement a wrapper for a feed forward neural network.
 
class Computation(val x: ComputationNode) {

  def +(y: ComputationNode): Computation = new Computation(Addition(x, y))

  def *(y: ComputationNode): Computation = new Computation(InnerProduct(x, y))

}

object Computation {

  def apply(x: ComputationNode): Computation = new Computation(x)

  def sigmoid(x: Computation): Computation = new Computation(Sigmoid(x.x))

}


case class FeedForwardLayerSigmoid(x: ComputationNode, w: Tensor, b: Tensor) extends ComputationNode {

  final val Logit = Computation(x) * w + b

  final val Layer = Computation.sigmoid(Logit).x

  def fwd = Layer.fwd

  def bwd(error: DMat) = Layer.bwd(error)

  def update(rate: Double): Unit = {
    w := w.value - (rate * w.gradient)
    b := b.value - (rate * b.gradient)
    if(x.isInstanceOf[FeedForwardLayerSigmoid]) x.asInstanceOf[FeedForwardLayerSigmoid].update(rate)
  }

  def reset = Layer.reset

}
The feed forward layer is itself a computation node and specifies the computation for the layer. In the forward pass we compute the result of the complete layer and on the backward pass we also propagate the error through the whole layer. I also included an update function that recursively 'learns' the weights using stochastic gradient descent. Now we can learn the network using the following program.


val nn = FeedForwardLayerSigmoid(FeedForwardLayerSigmoid(input, W1, B1), W2, B2)
// repeat for all examples 
input := x
val prediction = nn.fwd
val error      = -(correct - prediction)
nn.bwd(error)
nn.update(0.01)
nn.reset

Experiments

For the first experiment I use the synthetic spiral dataset. It is basically a non linear dataset with three classes.


I use a two layer neural network with the following initialisation

input :=  DenseMatrix.rand[Double](1,   2)
W1    := (DenseMatrix.rand[Double](2, 100) - 0.5) * 0.01
B1    :=  DenseMatrix.zeros[Double](1, 100)
W2    := (DenseMatrix.rand[Double](100, 3) - 0.5) * 0.01
B2    :=  DenseMatrix.zeros[Double](1,   3)

With this I receive a 95% - 98% accuracy, so this seems to work. Next I tried to classify MNIST. Again I used a two layer neural network. The hidden layer is 100 dimensions wide.

x   := DenseMatrix.rand[Double](1,   784)
w1  := (DenseMatrix.rand[Double](784, 100) - 0.5) * 0.001
b1  := DenseMatrix.zeros[Double](1,   100)
w2  := (DenseMatrix.rand[Double](100, 10)  - 0.5) * 0.001
b2  := DenseMatrix.zeros[Double](1,   10)

I got a 97% accuracy. This is far away from the state of the art; however, for a simple 2 layer neural network it is not bad. If I remove one layer I get an accuracy of about 92%, which is the same as reported in the tensorflow beginners tutorial. I also visualised the weights of the first layer.



One more thing. This library is really not optimized for speed. The big players such as theano, tensorflow, caffe and dl4j are way faster and more optimized. With this I just hope to clarify the concept.

References
  • [1]Stanford tutorial on unsupervised feature learning: http://ufldl.stanford.edu/wiki/index.php/Neural_Networks
  • [2]Stanford course on automatic differentiation: http://cs231n.github.io/optimization-2/
  • [3]Some notes on auto diff from Wisconsin: http://pages.cs.wisc.edu/~cs701-1/LectureNotes/trunk/cs701-lec-12-1-2015/cs701-lec-12-01-2015.pdf

Sunday, February 12, 2017

AI: Logic Programming in Scala

I decided to write some weakly related posts about the application of AI algorithms. One of my all time favorites is unification + backward chaining. These two algorithms together form the basis of logic programming or declarative programming in Prolog. In this post I will talk about Prolog and actually implement my own version in Scala.


Prolog, Logic Programming and First Order Logic

Atoms, Variables, Rules and Clauses

When writing procedural code, we write down statements and assignments in order to compute a certain value. In contrast, Prolog is a declarative language. We only write down facts and certain rules, and the variable assignments are computed using "logical" inference. Then, given a certain goal or query, Prolog tries to verify that the goal is true. For example, the following definition defines an ancestor relationship:

// Some facts about family 
parent(sam, peter).
parent(sam, joana).
parent(peter, maria).
parent(maria, john).

In Prolog facts are written down as predicates of the form: name(value1, ..., valueN).
Here the parent(sam, peter) predicate states that sam has a parent called peter. The values
"sam" and "peter" are called atoms. Atoms can be strings or integer numbers. Now given
that definition we can start asking Prolog questions. For example:

? parent(sam, peter).

will return true.

? parent(sam, john).

will return false.

Prolog tries to find a matching predicate in the database and returns true if it does and false otherwise.
However, this is not really programming yet. One thing we are missing is variables.
Variables in Prolog start with an uppercase letter (atoms and predicates do not).
With variables we could ask Prolog to return all parents of sam:

? parent(sam, Parent).

Will return:

Parent = peter
Parent = joana
true

Both are valid options for Parent. Now maybe we want to find all ancestors of sam.
Besides facts, we can also write rules in Prolog:

ancestor(Person,Ancestor):-
   parent(Person, Ancestor).

ancestor(Person, Ancestor):-
   parent(Person,Parent),
   ancestor(Parent, Ancestor).

Rules in Prolog correspond to Horn clauses in first-order logic: h <- x AND y AND z.
In other words, h is true if x, y and z are true. In Prolog, h is called the head.
The head predicate is separated by a ":-" from the body of the clause.
The body consists of other predicates separated by commas. If all
statements in the body evaluate to true, the clause is true, too.
One detail that will make the implementation easier later is that facts such as parent are simply clauses with an empty body. Anyways, in the example we define the clause: ancestor(P, A) <- parent(P, A).
In other terms, A is an ancestor of P if A is a parent of P. This is simply the search we did earlier when searching for parents. The other ancestor clause defines an ancestor recursively:
ancestor(P, A) <- parent(P, Z), ancestor(Z, A). Meaning A is an ancestor of P if there is a person
Z that is a parent of P and that has A as an ancestor. Using the facts and the rules, we can ask Prolog again:

? ancestor(sam, john).

Will evaluate true and:

? ancestor(sam, X).

Will return

 X = peter
 X = joana
 X = maria
 X = john

Data Structures in Prolog

Data structures such as lists and trees can be implemented in Prolog as recursive predicates.
For example, a simple binary tree can be defined as:

node(
    leaf(a), 
    node(
       leaf(b),
       leaf(c)
    ) 
 ).

Lists are also represented as binary trees. Every left node is a leaf with the "head"
of the list and the rest (or "tail") is saved recursively in the right child:

node(
  leaf(1),
  node(
    leaf(2),
    node(
       leaf(3),
       nil
    )
  )
).

defines the list [1,2,3]. Once I have introduced how Prolog actually works, I will give some examples of how to program in Prolog. For now, let's start implementing a very basic Prolog.

Coding the logic model

    
sealed trait Term

case class Atom[T](val x: T) extends Term

case class Variable(name: String) extends Term

case class Predicate(functor: String, terms: List[Term]) extends Term {

  def arity: Int = terms.size

  def matches(other: Predicate): Boolean = other.functor == functor && other.arity == arity

}

case class Clause(head: Predicate, body: List[Predicate])

In this Scala implementation predicates, atoms and variables are all terms.
Atoms can be of any type (since we need to compare atoms, T should implement a proper equals method). Variables are simply names. Predicates have a name and are an ordered collection
of terms. These can be atoms (as in facts), variables (as in rules) or predicates again.
The number of terms in a predicate is also called the arity of the predicate. The name of the predicate (the functor) and the arity together identify a predicate: if both match, the predicates are considered the same.
As stated earlier, a clause has a head and a body. The head is a predicate that we try to prove true
and the body is a list of predicates that have to be true.
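
To make the data model concrete, the parent fact and the recursive ancestor rule from the Prolog example above could be encoded roughly like this (a small sketch; the val names are just for illustration):

// parent(sam, peter). -- a fact is a clause with an empty body
val fact = Clause(Predicate("parent", List(Atom("sam"), Atom("peter"))), Nil)

// ancestor(Person, Ancestor) :- parent(Person, Parent), ancestor(Parent, Ancestor).
val rule = Clause(
  Predicate("ancestor", List(Variable("Person"), Variable("Ancestor"))),
  List(
    Predicate("parent",   List(Variable("Person"), Variable("Parent"))),
    Predicate("ancestor", List(Variable("Parent"), Variable("Ancestor")))
  )
)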

Unification, Depth First Search and Backtracking

Unification

The first step in implementing our own Prolog is to define how terms are matched against each other.
This process is called unification.
  
  def unify(x: Term, y: Term, subs: mutable.Map[Variable, Term]): Boolean = (x, y) match {
    case (Atom(x), Atom(y)) => x == y
    case (x: Variable, y) => unifyVar(x, y, subs)
    case (x, y: Variable) => unifyVar(y, x, subs)
    case (x: Predicate, y: Predicate) if x matches y => {
      x.terms.zip(y.terms).map{ case (t1, t2) => unify(t1, t2, subs)}.reduce(_ && _)
    }
    case _ => false
  }


If we unify two atoms x and y, unification succeeds (returns true) if x and y are equal.
For predicates, unification succeeds if their functor and arity are the same and if
the recursive unification of their terms succeeds.

In order to unify variables, we keep a substitution map that assigns variables to terms.
When unifying a term with an already assigned variable, and that term is not equal
to the assigned value, unification fails. Otherwise, unification succeeds and the term
is assigned to the variable. This is implemented in the unifyVar function.

 
 def unifyVar(x: Variable, y: Term, subs: mutable.Map[Variable, Term]): Boolean = {
    if(subs.contains(x)) unify(subs(x), y, subs)
    else if(y.isInstanceOf[Variable] && subs.contains(y.asInstanceOf[Variable])) unify(x, subs(y.asInstanceOf[Variable]), subs)
    else {
      subs += (x -> y)
      true
    }
  }

In order to understand unification better consider the following examples:

unification(Atom("sam"), Atom("sam"), Map) == true, the map is empty
unification(Atom("sam"), Atom("peter")) == false, the map is empty

in other words two atoms match if there value is the same. Furthermore,
no substitutions are given since no variables are involved.

unify(
  Predicate("parent", List(Atom("sam"), Atom("peter"))),
  Predicate("parent", List(Atom("sam"), Atom("peter"))),
  subs
) == true, subs stays empty

The above expression shows that two predicates match if all their parts match, too.
Last but not least, consider the expression:

unify(
  Predicate("parent", List(Variable("X"), Atom("peter"))),
  Predicate("parent", List(Atom("sam"), Variable("Y"))),
  subs
) == true, subs is now (X -> sam, Y -> peter)

So the variables are assigned the matching atoms. Variables can also be bound to other variables.
In this case their values will end up being the same, as in the following map:

X -> Y
Y -> sam
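
As a quick sanity check, the predicate example can be run directly against the unify function above (a minimal sketch, assuming the definitions above are in scope):

import scala.collection.mutable

val subs = mutable.Map.empty[Variable, Term]
val ok = unify(
  Predicate("parent", List(Variable("X"), Atom("peter"))),
  Predicate("parent", List(Atom("sam"), Variable("Y"))),
  subs
)
// ok is true; subs now contains Variable(X) -> Atom(sam) and Variable(Y) -> Atom(peter)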

Searching in a knowledge database

In the section above we saw how to match terms such as atoms and predicates and
also how to bind variables. In Prolog, all statements are given in the form of Horn clauses,
so a Prolog program can be seen as a list of clauses. A query to that database is a list of goal predicates.
All goals have to be found true in the database in order to fulfill the query.
 
  protected def inferAll(goals: List[Predicate], solution: Substitution): Solutions =
    if(goals.isEmpty) List(solution)
    else {
      val goal   = goals.head
      val other  = goals.tail
      val answer = for (clause <- program(goal.functor)) yield {
        val Clause(standarizedHead, standarizedBody) = standarizer.standarizeAppart(clause)
        val substitution = mutable.Map.empty[Variable, Term] ++ solution
        val unifyable    = unify(goal, standarizedHead, substitution)
        if(unifyable) {
          val newGoals = substitute(standarizedBody ::: other, substitution.toMap)
          val x = inferAll(newGoals.collect{case x: Predicate => x}, substitution.toMap)
          x
        } else Nil
      }
      answer.flatten
    }


The search algorithm that finds all solutions is a depth first search with backtracking.
If the goal list is empty, the program returns the current substitution. In the implementation, a substitution is a map from variables to assigned terms.

The search algorithm checks the complete database for a match of a clause head with the first goal.
If found, all the variables in the clause head and body are renamed so that there are no collisions,
and the clause head is then unified with the current goal. If the unification succeeds, the body is added to the rest of the goal list (all elements except the first) and all applicable substitutions are performed. The search then continues recursively with the new goal list. The algorithm
returns a list of assignments to the variables in the query.
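
The listing also uses a few names that are defined elsewhere: Substitution, Solutions, program and standarizer. A plausible sketch of these definitions could look like the following (the exact shapes are my assumption, not the original code); program would then be the indexed clause database and standarizer an instance of the VariableStandarization class shown below.

// assumed type aliases used by inferAll
type Substitution = Map[Variable, Term]
type Solutions    = List[Substitution]

// the knowledge database: clauses indexed by the functor of their head, so that
// program(goal.functor) returns the candidate clauses for a goal
def index(clauses: List[Clause]): Map[String, List[Clause]] =
  clauses.groupBy(_.head.functor).withDefaultValue(Nil)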

The next listing shows the helper functions:
 

def substitute(x: List[Term], subs: Substitution): List[Term] = x.map {
  case term: Variable           => subs.get(term).getOrElse(term)
  case Predicate(functor, body) => Predicate(functor, substitute(body, subs))
  case term                     => term
}

  
class VariableStandarization(var idx: Int = 0) {

  def standarizeAppart(clause: Clause): Clause = {
    next
    clause match {
      case Clause(Predicate(functor, body), terms) => Clause(
        Predicate(functor, standarize(body)),
        standarize(terms).map(_.asInstanceOf[Predicate])
      )
    }
  }

  def standarize(terms: List[Term]): List[Term] =
    terms.map {
      case term: Atom[_] => term
      case Predicate(functor, body) => Predicate(functor, standarize(body))
      case x: Variable => Variable(x.name + s"_${idx}")
    }

  def next: Unit = idx += 1

}

Standarization basically appends an id to the variable name. The id is increased whenever we take a clause from the database.
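
For illustration, renaming the variables of the first ancestor clause with this class would look roughly like this (a sketch, assuming the definitions above):

val standarizer = new VariableStandarization()
val renamed = standarizer.standarizeAppart(
  Clause(
    Predicate("ancestor", List(Variable("Person"), Variable("Ancestor"))),
    List(Predicate("parent", List(Variable("Person"), Variable("Ancestor"))))
  )
)
// renamed head: ancestor(Person_1, Ancestor_1)
// renamed body: parent(Person_1, Ancestor_1)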

The last thing we need is to output all the assignments for the variables in the query from the result list. Since we renamed them multiple times, we have to find a path from the original variable name
to a term that is not a variable.
 
  def assign(predicate: List[Predicate], substitution: Substitution): Substitution = {
    val variables: List[Variable] = predicate.flatMap {
      case Predicate(_, body) => body.flatMap(getAllVariables)
    }.distinct
    variables.map(
      x => x -> findAssignment(x, substitution)
    ).toMap
  }

  def getAllVariables(term: Term): List[Variable] = term match {
    case x: Atom[_]         => List.empty
    case x: Variable        => List(x)
    case Predicate(_, body) => body.flatMap(getAllVariables)
  }

  def findAssignment(x: Term, substitution: Substitution): Term = x match {
    case x: Variable => findAssignment(substitution(x), substitution)
    case Predicate(functor, body) => Predicate(functor, body.map(x => findAssignment(x, substitution)))
    case x => x
  }


Syntactic sugar and an example


First, let us add an easy way to handle lists, especially to create lists in the binary tree form. We define a special predicate name '.' for all list elements; the same name is used in Prolog. Then we define an atom called nil indicating the end of a list. Given a Scala list, we can create a Prolog list by recursively building the binary tree: we save the next element of the list as the left child, and the right child is the rest of the list. Furthermore, we can create a split of a list. A 'split' splits the list into its head and the rest of the list; in Prolog this is written as '[A|B]', where A is the head and B is the tail. The split can actually be matched directly against a list predicate.
 
object ListBuilder {

  final val ListFunctor = "."

  final val End = Atom("nil")

  def split(head: String, tail: String) = Predicate(ListFunctor, List(Variable(head), Variable(tail)))

  def split(head: Term, tail: Term) = Predicate(ListFunctor, List(head, tail))

  def apply(x: List[Term]):Term = binary(x)

  def binary(x: List[Term]): Term =
    if(x.isEmpty) End
    else Predicate(ListFunctor, List(x.head, binary(x.tail)))

  def flatten(x: Predicate): List[Term] = x match {
    case Predicate(_, List(h, End)) => List(h)
    case Predicate(_, List(h, tail)) => h :: flatten(tail.asInstanceOf[Predicate])
   }

}
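
A quick usage example: building the Prolog list [1, 2, 3] and flattening it back (assuming the ListBuilder object above):

val prologList = ListBuilder(List(Atom(1), Atom(2), Atom(3)))
// Predicate(".", List(Atom(1), Predicate(".", List(Atom(2), Predicate(".", List(Atom(3), Atom("nil")))))))

val elements = ListBuilder.flatten(prologList.asInstanceOf[Predicate])
// List(Atom(1), Atom(2), Atom(3))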

Now we can write a small Prolog program to find out if an element is in a list. In our Scala encoding, member is defined by two clauses, a base case and a recursive case (using the case classes and the split helper from above):

val memberBase = Clause(
  Predicate("member", List(Variable("X"), ListBuilder.split("X", "T"))), Nil)

val memberRecursive = Clause(
  Predicate("member", List(Variable("X"), ListBuilder.split("H", "T"))),
  List(Predicate("member", List(Variable("X"), Variable("T")))))

So we found a member if we can unify the value of variable 'X' with the head of the list. If we cannot, we search in the tail recursively.

Now let's turn to our example: a quicksort in Prolog. Since our version of Prolog does not support numeric operations, we introduce the predicates greater(X, Y) and smaller(X, Y), meaning X is greater than Y and X is smaller than or equal to Y, respectively. The partition rule sorts the values into a sublist with all elements smaller than a pivot and another list with all elements bigger. Append merges two lists. Quicksort then sorts the values into sublists around a pivot element and recombines the resulting lists.


greater(1, 0).
greater(2, 0).
greater(2, 1).
greater(3, 0).
greater(3, 1).
greater(3, 2).
greater(4, 0).
greater(4, 1).
greater(4, 2).
greater(4, 3).

smaller(0, 0).
smaller(0, 1).
smaller(0, 2).
smaller(0, 3).
smaller(0, 4).
smaller(1, 1).
smaller(1, 2).
smaller(1, 3).
smaller(1, 4).
smaller(2, 2).
smaller(2, 3).
smaller(2, 4).
smaller(3, 3).
smaller(3, 4).
smaller(4, 4).

quicksort([X|Xs],Ys):-
    partition(Xs,X,Left,Right),
    quicksort(Left,Ls),
    quicksort(Right,Rs),
    append(Ls,[X|Rs],Ys).
quicksort([],[]).

partition([X|Xs],Y,[X|Ls],Rs):-
    smaller(X,Y),
    partition(Xs,Y,Ls,Rs).
partition([X|Xs],Y,Ls,[X|Rs]):-
    greater(X,Y),
    partition(Xs,Y,Ls,Rs).
partition([],Y,[],[]).

append([],Ys,Ys).
append([X|Xs],Ys,[X|Zs]):-
    append(Xs,Ys,Zs).

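To run this program with the Scala implementation above, the query quicksort([3,1,2], Sorted) could be built like this (a sketch; how the goal is fed to the solver depends on how inferAll is wrapped):

val goal = Predicate("quicksort", List(
  ListBuilder(List(Atom(3), Atom(1), Atom(2))),
  Variable("Sorted")
))
// feeding List(goal) to the search should bind Sorted to the Prolog list [1, 2, 3]
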

I also put a "full" implementation into github. Enjoy!
So if you like the full code click there [github].
As always I also attached some books to read :)

References

[1] Ivan Bratko: "Prolog Programming for Artificial Intelligence", 4th ed., Chapter 2, Addison-Wesley, 2011

[2] Stuart Russell, Peter Norvig: "Artificial Intelligence: A Modern Approach", Pearson, 2011