Tuesday, February 28, 2017

Implementation notes on my auto differentiation for weight sharing

Update: My Neural Network Posts

  • Basic Auto Differentiation: Post
  • Basic Auto Differentiation with Weight Sharing: Post
  • Auto Differentiation Code from Book: Post
  • Refactoring The Ideas: Post
In my last post I wrote about building a very crude, home-made auto differentiation engine. As a side note I mentioned that if a node receives multiple incoming gradients, we have to sum them; that implementation simply summed the gradients as they arrived and pushed them downstream instantly. However, in order to implement proper back propagation we have to wait until all incoming gradients have arrived and been summed before the gradient is propagated further. This is especially important for the weight sharing needed to implement recurrent neural networks such as LSTMs, or more complex models such as the variational auto encoder.

If you are interested in the full code with the examples, I uploaded a Scala version to Github.

Implementing the base computation node

The forward pass is the same as in the previous post: each node simply computes its activation
from the forward computations of its child nodes. In this case the abstract class already holds the input nodes.
The backward pass follows the same idea, too. Based on the gradients
from higher nodes and the local gradient in each node, we push the error
backwards through the computation. The only difference is that I added a wait
list which saves the names of all parent nodes. Only when all parents have sent their gradients
do we sum them up and push the error down. Furthermore, I added
a recursive plotting function for the compute graph in dot format. The downside is that we now
have to register all parents at each node. During graph plotting, the parent
and wait relations are also checked.


abstract class Node(val name: String, inputs: Vector[Node]) {
  /**
    * Names of parents who have to send a gradient before backwards
    * pass continues
    */
  val WaitFor   = mutable.Set.empty[String]

  /**
    * All parents gradients
    */
  val Gradients = mutable.HashMap.empty[String, DenseMatrix[Double]]

  /**
    * Set value of forwards computation
    */
  protected var Output: Option[DenseMatrix[Double]] = None

  /**
    * Should attach node type to name
    */
  def typename: String

  /**
    * Recursively plot parent child relationships in dot format.
    * Also output error if the waitlist does not include the expected parent
    */
  def graph(parent: String): List[String] = {
    if(!WaitFor.contains(parent) && parent.size > 1) throw new Exception(s"${parent} ${typename}_${name} connection not found")
    if(inputs.size == 0) List.empty[String]
    else {
      (for (in <- inputs) yield (s"${in.typename}_${in.name} -> ${typename}_${name};" :: in.graph(name))).flatten.toList
    }
  }

  def compute(x: Vector[DenseMatrix[Double]]): DenseMatrix[Double]

  def distribute(gradient: DenseMatrix[Double]): Unit

  def gradient: DenseMatrix[Double] = Gradients.map(_._2).reduce(_ + _)

  def register(parent: String): Unit = WaitFor += parent

  def reset: Unit = {
    Gradients.clear()
    Output = None
    inputs.foreach(_.reset)
  }

  def fwd: DenseMatrix[Double] = Output match {
    case Some(x) => x
    case None => {
      val x = compute(inputs.map(_.fwd))
      Output = Some(x)
      x
    }
  }

  def ready = (WaitFor & Gradients.keySet) == WaitFor

  def bwd(parent: String, gradient: DenseMatrix[Double]): Unit = {
    Gradients += parent -> gradient
    if(ready) distribute(gradient)
  }

}


The rest of the computations such as multiply, add and the dot product stay the same.
Now we are actually able to build models with weight sharing.
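
To make the mechanics concrete, here is a minimal sketch (my own, not the exact code from the repository) of what an element-wise addition and multiplication node can look like on top of the new base class. I pass the node name explicitly for simplicity; the repository presumably generates unique names so that, e.g., new Mul(a, b) works as in the LSTM code below. The gradient pushed to the children is always the sum over all registered parents.

import breeze.linalg.DenseMatrix

class Add(name: String, x: Node, y: Node) extends Node(name, Vector(x, y)) {

  def typename = "Add"

  // forward: element-wise addition of the two child outputs
  def compute(in: Vector[DenseMatrix[Double]]): DenseMatrix[Double] = in(0) + in(1)

  // backward: once all parents have reported, the summed gradient
  // flows through unchanged to both children
  def distribute(g: DenseMatrix[Double]): Unit = {
    val summed = gradient
    x.bwd(name, summed)
    y.bwd(name, summed)
  }
}

class Mul(name: String, x: Node, y: Node) extends Node(name, Vector(x, y)) {

  def typename = "Mul"

  // forward: element-wise multiplication
  def compute(in: Vector[DenseMatrix[Double]]): DenseMatrix[Double] = in(0) :* in(1)

  // backward: each child receives the summed gradient multiplied
  // (element-wise) with the other child's forward value
  def distribute(g: DenseMatrix[Double]): Unit = {
    val summed = gradient
    x.bwd(name, summed :* y.fwd)
    y.bwd(name, summed :* x.fwd)
  }
}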

A simple LSTM

Now we can implement a simple LSTM node. An LSTM is a recurrent node
that holds its state across multiple time steps in a cell. Inside the LSTM node
this cell is updated based on the current input, the hidden state and the last cell state.
An LSTM node is wired in the following way.



This wiring translates into the following code and the graph below.

class LSTM(val in: Node, val ht: Node, val ct: Node, wf: Tensor, bf: Tensor, wi: Tensor, bi: Tensor, wo: Tensor, bo: Tensor, wc: Tensor, bc: Tensor) {

  val x = new Concat(in ,ht)
  val input  = new FeedForwardSigmoid(x, wi, bi)
  val output = new FeedForwardSigmoid(x, wo, bo)
  val forget = new FeedForwardSigmoid(x, wf, bf)
  val cell   = new FeedForwardTanH(x, wc, bc)

  val forgetCell = new Mul(forget.Layer, ct)
  val inCell     = new Mul(input.Layer, cell.Layer)
  val cellState  = new Add(inCell, forgetCell)
  val cellAct    = new Tanh(cellState)
  val hidden     = new Mul(output.Layer, cellAct)

  final val Layer = {
    in.register(x.name)
    ht.register(x.name)

    forget.Layer.register(forgetCell.name)
    ct.register(forgetCell.name)

    input.Layer.register(inCell.name)
    cell.Layer.register(inCell.name)

    inCell.register(cellState.name)
    forgetCell.register(cellState.name)

    cellState.register(cellAct.name)

    output.Layer.register(hidden.name)
    cellAct.register(hidden.name)

    (cellState, hidden)
  }

  def update(rate: Double): Unit = {
      wf := wf.fwd - ((rate * wf.gradient))
      bf := bf.fwd - ((rate * bf.gradient))
      wi := wi.fwd - ((rate * wi.gradient))
      bi := bi.fwd - ((rate * bi.gradient))
      wc := wc.fwd - ((rate * wc.gradient))
      bc := bc.fwd - ((rate * bc.gradient))
      wo := wo.fwd - ((rate * wo.gradient))
      bo := bo.fwd - ((rate * bo.gradient))
  }
}






Then for some given input we can unroll multiple cells and run regular stochastic gradient descent.
Given a list of inputs representing a time series, we can unroll the LSTM. Notice that we reuse the same weights for each time step. In other words, if we unroll for 5 time steps, every weight collects 5 gradient contributions, which are summed before the weight is updated.


  def unroll(in: List[Node]): (LSTM,List[FeedForwardTanH]) =
    if(in.size == 1) {
      val lstm = new LSTM(in.head, h, c, wf, bf, wi, bi, wo, bo, wc, bc)
      val classify  = new FeedForwardTanH(lstm.Layer._2, wl, bl)
      (lstm, List(classify))
    } else {
      val (unrolled, out) = unroll(in.tail)
      val lstm = new LSTM(in.head, unrolled.Layer._2, unrolled.Layer._1, wf, bf, wi, bi, wo, bo, wc, bc)
      val classify  = new FeedForwardTanH(lstm.Layer._2, wl, bl)
      (lstm, classify :: out)
    }
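
For completeness, here is a rough sketch (mine, not the repository's exact training code) of what a single stochastic gradient descent step on the unrolled network could look like, assuming it lives in the same class as unroll so that the weights and the classification layer are in scope:

def trainStep(inputs: List[Node], targets: List[DenseMatrix[Double]], rate: Double): Unit = {
  val (lstm, outputs) = unroll(inputs)

  outputs.zip(targets).foreach { case (classify, target) =>
    val prediction = classify.Layer.fwd      // forward pass up to this time step
    val error      = prediction - target     // gradient of a simple squared error
    classify.Layer.bwd("loss", error)        // the top node has no registered parents,
                                             // so it pushes the error down immediately
  }

  lstm.update(rate)                          // one update of the shared LSTM weights
                                             // (wl and bl would be updated in the same way)
  outputs.foreach(_.Layer.reset)             // clear cached outputs and gradients
}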


I trained the network with step size 5 on a trigonometric function and then generated 100 points
from an initial 5 points. Here is the result.







This looks promising I guess. So now we can generate some functions.


Variational Auto Encoders

The next model is a variational auto encoder. Here we do not connect the encoder
directly to the decoder. Instead, the encoder produces the mean and (log) variance of a Gaussian
distribution, and a sample from that distribution is pushed through the decoder to reconstruct the input.
The sample itself is drawn from a multivariate Gaussian with zero mean and unit variance and then shifted and scaled by the encoder's outputs.

val enc   = new FeedForwardTanH(in, wenc, benc)
val mean1 = new Inner(enc.Layer, wmean)
val mean  = new Add(mean1, bmean)
val sigm1  = new Inner(enc.Layer, wsig)
val sigm2  = new Add(sigm1, bsig)
val sigm   = new Mul(sigm2, half)
val d1 = new Exp(sigm)
val d2 = new Mul(sample, d1)
val dist = new Add(mean, d2)
val dec  = new FeedForwardSigmoid(dist, wdec, bdec)
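
One detail that the snippet above leaves out (this is just my sketch; the actual wiring is in the repository): with the wait-list mechanism every child has to register its parents via child.register(parent.name), exactly like in the LSTM Layer block. The important case here is enc.Layer, which feeds both mean1 and sigm1 and therefore has to wait for, and sum, two gradients before it pushes the error further down into the encoder:

enc.Layer.register(mean1.name)
enc.Layer.register(sigm1.name)

// the rest of the chain follows the same pattern, for example:
mean1.register(mean.name)
mean.register(dist.name)
sigm1.register(sigm2.name)
sigm2.register(sigm.name)
sigm.register(d1.name)
d1.register(d2.name)
sample.register(d2.name)
d2.register(dist.name)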


Or, easier to read, as the final graph:





I ran only a few iterations on 200 random MNIST images. If we push random samples
through the decoder, we get images like these, which also look promising.



Most of the results are not amazing. The speed of the tool is not at all comparable to Tensorflow,
so I do not process a lot of data, my hidden layers are very small, and the only optimization I use is plain stochastic gradient descent. But this post is about differentiation and weight sharing, not about learning awesome models ... I guess. Again, all the code is on Github. In the next post I will go into developing a more performant variant of this algorithm.

Furthermore, the new Deep Learning book [3] defines the general backpropagation algorithm as a dynamic programming problem that fills a table. However, I think directly coding the graph is a little easier to explain.

References

[1] Hochreiter, Schmidhuber: "Long Short-Term Memory", Neural Computation, 1997
[2] Doersch: "Tutorial on Variational Autoencoders", https://arxiv.org/abs/1606.05908
[3] Goodfellow, Bengio, Courville: "Deep Learning", MIT Press, 2016



Thursday, February 16, 2017

AI: Implement your own Automatic Differentiation Algorithm For Neural Networks


Update: My Neural Network Posts

  • Basic Auto Differentiation: Post
  • Basic Auto Differentiation with Weight Sharing: Post
  • Auto Differentiation Code from Book: Post
  • Refactoring The Ideas: Post

Modern deep learning libraries such as TensorFlow provide automatic differentiation in order to
compute the gradients in each neural network layer. While the back propagation algorithm
for feed forward networks is well known, the gradients for new or more interesting architectures
are often hard to derive by hand. This is where automatic differentiation can help. In this post
I will describe the algorithm and provide a simple Scala implementation that might help
to understand how automatic differentiation can be implemented. I also 'cheat' in the sense that
I use a linear algebra library that interfaces to native BLAS and LAPACK libraries, so my performance depends on that wrapper. I guess for playing around this is sufficient.
The code can be found on Github.


Automatic Differentiation

In my opinion the easiest way to understand automatic differentiation is through a computation
graph. In a computation graph variables, operations and functions are represented as nodes, while edges represent the inputs and outputs of these functions and operations.

Let us consider the example computation:


  • W * x + b 
  • with W = 2
  • with  x = 3
  • with  b = 1

This results in the following computation tree. If we want to get the value of this formula, we can
ask the root node for its value and then evaluate the expression recursively; the result (here 7)
is obtained by backtracking. This is equivalent to finding all paths to the target variable and, along the way, instantiating all the results needed to compute the final value.
We can then propagate derivatives backwards from the output to the inputs. For example,
if we measure an error of 2 at the top, we can push it down the tree to all children.
Since we apply the chain rule, (f(g(x)))' = f'(g(x)) * g'(x),
we multiply the gradient flowing from the top with the local gradient at the current node.
When adding two numbers, the local gradient is 1, so the gradient from the top flows through
uninterrupted to both children. Let the gradient from the top be called delta.
When multiplying two numbers x and w, the gradients flowing towards x and w are:

  • dx  = delta * w
  • dw = delta * x


So the inputs get swapped and multiplied with the incoming gradient. The following image shows the forward pass in green and the backward pass in red.
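
As a tiny numeric sanity check, here is the same example in plain Scala (my own sketch, not code from the post's library):

val w = 2.0
val x = 3.0
val b = 1.0

val wx  = w * x             // forward: 6
val out = wx + b            // forward: 7

val delta = 2.0             // error measured at the top

// addition node: local gradient is 1, delta flows through unchanged
val dwx = delta             // 2
val db  = delta             // 2

// multiplication node: the inputs get swapped
val dx = dwx * w            // 2 * 2 = 4
val dw = dwx * x            // 2 * 3 = 6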



Before we go on: if a node has two or more outgoing edges in the forward pass, we sum all incoming gradients at that node during back propagation. While this does not happen in feed forward neural networks, it does happen when weights are shared, such as in convolutional or recurrent neural networks.

Now we can turn to matrix computations needed for neural networks. Adding two matrices is actually equivalent to the scalar form from before. Since the addition is an element wise operation, the gradient (now a matrix) simply flows unchanged to the children. In general element wise operations behave like their scalar counterparts.

The gradients of the dot product between two matrices x and w become (a quick shape check is sketched below):


  • dx  = delta * w.t
  • dw = x.t * delta
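
A quick shape check of these two rules, sketched with Breeze (the linear algebra library used in the implementation below); the variable names are mine:

import breeze.linalg.DenseMatrix

val x     = DenseMatrix.rand[Double](1, 4)   // input:  1 x 4
val w     = DenseMatrix.rand[Double](4, 3)   // weight: 4 x 3
val delta = DenseMatrix.rand[Double](1, 3)   // gradient from above, same shape as x * w

val dx = delta * w.t   // (1 x 3) * (3 x 4) = 1 x 4, same shape as x
val dw = x.t * delta   // (4 x 1) * (1 x 3) = 4 x 3, same shape as w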

The only thing left to build a neural network is the gradient of the activation function.
When using sigmoid units, the forward and backward pass for this node, given the input x, are:

  • fwd: y(x) = 1.0 / (1.0 + exp(-x))
  • bwd: (1.0 - y(x)) * (y(x))
Again this is an element wise operation so we can simply apply the formula to each element in the matrix.
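
Sketched with Breeze again, matching the element-wise operators used in the nodes below:

import breeze.linalg.DenseMatrix
import breeze.numerics.sigmoid

val x    = DenseMatrix.rand[Double](2, 3)
val y    = sigmoid(x)                                   // fwd: 1 / (1 + exp(-x)), element-wise
val ones = DenseMatrix.ones[Double](y.rows, y.cols)
val dy   = (ones - y) :* y                              // bwd: local derivative, element-wise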

A quick and dirty implementation

In order to implement this behaviour we first implement a general computation node
 
abstract class ComputationNode {

  var valueOpt: Option[DMat]    = None
  var gradientOpt: Option[DMat] = None

  def gradient = gradientOpt.get

  def value = valueOpt.get

  def assignVal(x: DMat): DMat = {
    if(valueOpt == None) {
      valueOpt = Some(x)
    }
    value
  }

  def collectGradient(x: DMat): Unit = {
    if(gradientOpt == None) gradientOpt = Some(x)
    else gradientOpt = Some(x + gradient)
  }

  def reset: Unit

  def fwd: DMat

  def bwd(error: DMat): Unit

}

The node holds a value that is computed during the forward pass, and the gradient.
I implemented it this way so the value does not have to be recomputed:
the values and gradients are represented as Scala options that default to None.
The first time the value is computed, we initialise the option; when the value is needed again, we do not have to recompute it. For the gradients, when the first gradient comes in we set the option and all later gradients are summed onto it.

UPDATE: If you actually want to use weight sharing it is important to have a mechanism 
to only send the backward message when all incoming gradients are received. This is not implemented here. I will discuss weight sharing in a later post.

The forward computation should compute the result up to that node and the backward function
should distribute the gradient. The reset function deletes all values (sets them back to None)
so that the graph can be reused.

Some Computation Nodes For Tensors

Now we are ready to implement the actual computation nodes. All these nodes
extend the abstract class from above. The nodes we need for a basic neural network are addition of two matrices, the inner product and the sigmoid function since one layer in a feed forward neural network is given by: sigmoid(x*W + b). Using the linear algebra library this is actually quite simple.

During the forward pass in an addition node, we simply add the forward results (2 matrices) so far and pass back the error unchanged.

case class Addition(x: ComputationNode, y: ComputationNode) extends ComputationNode {

  def fwd = assignVal(x.fwd + y.fwd)

  def bwd(error: DMat) = {
    collectGradient(error)
    x.bwd(error)
    y.bwd(error)
  }

  def reset = {
    gradientOpt = None
    valueOpt    = None
    x.reset
    y.reset
  }

}

During the forward pass in an inner product node, we compute the dot product of the two forward results (2 matrices), and during the backward pass each child receives the error multiplied with the transposed output of the other child.

case class InnerProduct(x: ComputationNode, y: ComputationNode) extends ComputationNode {

  def fwd = {
    assignVal(x.fwd * y.fwd)
  }

  def bwd(error: DMat) = {
    collectGradient(error)
    x.bwd(error * y.fwd.t)
    y.bwd(x.fwd.t * error)
  }

  def reset = {
    gradientOpt = None
    valueOpt    = None
    x.reset
    y.reset
  }

}

The sigmoid behaves as expected: we pass down the error flowing from the top multiplied with the derivative of the function.

case class Sigmoid(x: ComputationNode) extends ComputationNode {

  def fwd = {
    assignVal(sigmoid(x.fwd))
  }

  def bwd(error: DMat) = {
    collectGradient(error)
    val ones = DenseMatrix.ones[Double](value.rows, value.cols)
    val derivative = (ones - value) :* value
    x.bwd(error :* derivative)
  }

  def reset = {
    gradientOpt = None
    valueOpt    = None
    x.reset
  }

}

Last but not least we need a class representing a tensor such as a weight or an input.
During the forward pass we simply return the value of the tensor, and during the backward
pass we sum up all incoming gradients. There is another method called ':=' (using operator overloading) to set a tensor's value.

class Tensor extends ComputationNode {

  def fwd = value

  def bwd(error: DMat) = collectGradient(error)

  def :=(x: DMat) = valueOpt = Some(x)

  def reset = {
    gradientOpt = None
  }

}

For example, using this implementation we can build a simple logistic regressor.

val x = new Tensor
val w = new Tensor
val b = new Tensor

w := DenseMatrix.rand[Double]( 100, 1)
b := DenseMatrix.zeros[Double](1,  1)
 
val logistic = Sigmoid(Addition(InnerProduct(x, w), b))

x := input
val result = logistic.fwd
val error  = - (label - result)

logistic.bwd(error)

w := w.value - 0.01 * w.gradient
b := b.value - 0.01 * b.gradient
In the example, we first initialize some tensors and then define the computation graph. Setting the input we compute the result of the computation and then push the error back through the graph. In the end, we update the weights and biases. This can be seen as one step of stochastic gradient descent.

Implementing a Generic Feed Forward Layer and a small Calculation Domain

In order to make defining computations easier, we now implement a small domain specific language for creating the graph, using operator overloading. We define a wrapper class around computation nodes that enables us to write expressions such as: sigmoid(Computation(x) * w + b). The class basically holds the computation so far. If we call the '+' function, we chain the computation so far with the new compute node, using an addition node as the parent, and return a new computation. In this way we can also implement a wrapper for a feed forward neural network layer.
 
class Computation(val x: ComputationNode) {

  def +(y: ComputationNode): Computation = new Computation(Addition(x, y))

  def *(y: ComputationNode): Computation = new Computation(InnerProduct(x, y))

}

object Computation {

  def apply(x: ComputationNode): Computation = new Computation(x)

  def sigmoid(x: Computation): Computation = new Computation(Sigmoid(x.x))

}


case class FeedForwardLayerSigmoid(x: ComputationNode, w: Tensor, b: Tensor) extends ComputationNode {

  final val Logit = Computation(x) * w + b

  final val Layer = Computation.sigmoid(Logit).x

  def fwd = Layer.fwd

  def bwd(error: DMat) = Layer.bwd(error)

  def update(rate: Double): Unit = {
    w := w.value - (rate * w.gradient)
    b := b.value - (rate * b.gradient)
    if(x.isInstanceOf[FeedForwardLayerSigmoid]) x.asInstanceOf[FeedForwardLayerSigmoid].update(rate)
  }

  def reset = Layer.reset

}

The feed forward layer is itself a computation node and specifies the computation for one layer. In the forward pass we compute the result of the complete layer, and in the backward pass we propagate the error through the whole layer. I also included an update function that recursively 'learns' the weights using stochastic gradient descent. Now we can train a network with the following program.


val nn = FeedForwardLayerSigmoid(FeedForwardLayerSigmoid(input, W1, B1), W2, B2)
// repeat for all examples 
input := x
val prediction = nn.fwd
val error      = -(correct - prediction)
nn.bwd(error)
nn.update(0.01)
nn.reset

Experiments

For the first experiment I use the synthetic spiral dataset. It is basically a non linear dataset with three classes.


I use a two layer neural network with the following initialisation

input :=  DenseMatrix.rand[Double](1,   2)
W1    := (DenseMatrix.rand[Double](2, 100) - 0.5) * 0.01
B1    :=  DenseMatrix.zeros[Double](1, 100)
W2    := (DenseMatrix.rand[Double](100, 3) - 0.5) * 0.01
B2    :=  DenseMatrix.zeros[Double](1,   3)

And receive a 95% - 98% accuracy, so this seems to work. Next I tried to classify MNIST, again using a two layer neural network. The hidden layer is 100 dimensions wide.

x   := DenseMatrix.rand[Double](1,   784)
w1  := (DenseMatrix.rand[Double](784, 100) - 0.5) * 0.001
b1  := DenseMatrix.zeros[Double](1,   100)
w2  := (DenseMatrix.rand[Double](100, 10)  - 0.5) * 0.001
b2  := DenseMatrix.zeros[Double](1,   10)

I got a 97% accuracy. This is far from the state of the art; however, for a simple 2 layer neural network it is not bad. If you remove one layer, you get an accuracy of about 92%, which is the same as reported in the TensorFlow beginners tutorial. I also visualised the weights of the first layer.



One more thing: this library is really not optimized for speed. The big players such as Theano, TensorFlow, Caffe and DL4J are way faster and more optimized. With this I just hope to clarify the concept.

References
  • [1] Stanford tutorial on unsupervised feature learning: http://ufldl.stanford.edu/wiki/index.php/Neural_Networks
  • [2] Stanford course notes on automatic differentiation: http://cs231n.github.io/optimization-2/
  • [3] Some notes on auto diff from Wisconsin: http://pages.cs.wisc.edu/~cs701-1/LectureNotes/trunk/cs701-lec-12-1-2015/cs701-lec-12-01-2015.pdf

Sunday, February 12, 2017

AI: Logic Programming in Scala

I decided to write some weakly related posts about the application of AI algorithms. One of my all time favorites is unification + backward chaining. These two algorithms together form the basis of logic programming or declarative programming in Prolog. In this post I will talk about Prolog and actually implement my own version in Scala.


Prolog, Logic Programming and First Order Logic

Atoms, Variables, Rules and Clauses

When writing procedural code, we write down statements and assignments in order to compute a certain value. In contrast, Prolog is a declarative language. We only write down facts and certain rules, and the variable assignments are computed using "logical" inference. Given a certain goal or query, Prolog then tries to verify that the goal is true. For example, the following facts define some parent relationships:

// Some facts about family 
parent(sam, peter).
parent(sam, joana).
parent(peter, maria).
parent(maria, john).

In Prolog facts are written down as predicates of the form: name(value1, ..., valueN).
Here the parent(sam, peter) predicate states that sam has a parent called peter. The values
"sam" and "peter" are called atoms. Atoms can be strings or integer numbers. Now given
that definition we can start asking Prolog questions. For example:

? parent(sam, peter).

will return true.

? parent(sam, john).

will return false.

Prolog tries to find a matching predicate in the database and returns true if it does and false otherwise.
However, this is not really programming yet. One thing we are missing is variables.
Variables in Prolog start with an uppercase letter (atoms and predicates do not).
With variables we can ask Prolog to return all parents of sam:

? parent(sam, Parent).

Will return:

Parent = peter
Parent = joana
true

Both are valid bindings for Parent. Now maybe we want to find all ancestors of sam.
Besides facts we can also write rules in Prolog:

ancestor(Person,Ancestor):-
   parent(Person, Ancestor).

ancestor(Person, Ancestor):-
   parent(Person,Parent),
   ancestor(Parent, Ancestor).

Rules in Prolog correspond to Horn clauses in first order logic: y <- x1 AND x2 AND x3.
In other words, y is true if x1, x2 and x3 are true. In Prolog y is called the head.
The head predicate is separated by a ":-" from the body of the clause.
The body consists of other predicates separated by commas. If all
statements in the body evaluate to true, the clause is true, too.
One detail that will make the implementation easier in a second is that facts such as parent are simply clauses with an empty body. Anyway, in the example we define the clause: ancestor(P, A) <- parent(P, A).
In other words, A is an ancestor of P if A is a parent of P. This is simply the search we did earlier when searching for parents. The other ancestor clause defines an ancestor recursively:
ancestor(P, A) <- parent(P, Z), ancestor(Z, A), meaning A is an ancestor of P if there is a person
Z that is a parent of P and that has an ancestor A. Using the facts and the rules, we can ask Prolog again:

? ancestor(sam, john).

Will evaluate true and:

? ancestor(sam, X).

Will return

 X = peter
 X = joana
 X = maria
 X = john

Data Structures in Prolog.

Data structures such as lists and trees can be implemented in Prolog as recursive predicates.
For example, a simple binary tree can be defined as:

node(
    leaf(a), 
    node(
       leaf(b),
       leaf(c)
    ) 
 ).

Lists are also represented as binary trees. Every left node is a leaf with the "head"
of the list and the rest (or "tail") is saved recursively in the right child:

node(
  leaf(1),
  node(
    leaf(2),
    node(
       leaf(3),
       nil
    )
  )
).

defines the list [1,2,3]. Once I have introduced how Prolog actually works, I will give some examples of how to program in it. For now, let's start implementing a very basic Prolog.

Coding the logic model

    
sealed trait Term

case class Atom[T](val x: T) extends Term

case class Variable(name: String) extends Term

case class Predicate(functor: String, terms: List[Term]) extends Term {

  def arity: Int = terms.size

  def matches(other: Predicate): Boolean = other.functor == functor && other.arity == arity

}

case class Clause(head: Predicate, body: List[Predicate])

In this Scala implementation predicates, atoms and variables are all terms.
Atoms can be of any type (since we need to compare atoms, T should implement a proper equals method). Variables are simply names. Predicates have a name (the functor) and hold an ordered collection
of terms. These can be atoms (as in facts), variables (as in rules) or predicates again.
The number of terms in a predicate is called its arity. The functor and the arity together identify a predicate: two predicates match if both agree.
As stated earlier, a clause has a head and a body. The head is a predicate that we try to prove true
and the body is a list of predicates that have to be true.
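
For example, the fact parent(sam, peter) and the recursive ancestor rule from the introduction look as follows in this encoding (a sketch using the case classes above; a fact is just a clause with an empty body):

val parentFact = Clause(
  Predicate("parent", List(Atom("sam"), Atom("peter"))),
  List.empty)

val ancestorRule = Clause(
  Predicate("ancestor", List(Variable("Person"), Variable("Ancestor"))),
  List(
    Predicate("parent",   List(Variable("Person"), Variable("Parent"))),
    Predicate("ancestor", List(Variable("Parent"), Variable("Ancestor")))))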

Unification, Depth First Search and Backtracking

Unification

In order to implement a Prolog version the first step is to implement how to match terms.
This process is called unification in Prolog.
  
  def unify(x: Term, y: Term, subs: mutable.Map[Variable, Term]): Boolean = (x, y) match {
    case (Atom(x), Atom(y)) => x == y
    case (x: Variable, y) => unifyVar(x, y, subs)
    case (x, y: Variable) => unifyVar(y, x, subs)
    case (x: Predicate, y: Predicate) if x matches y => {
      x.terms.zip(y.terms).map{ case (t1, t2) => unify(t1, t2, subs)}.reduce(_ && _)
    }
    case _ => false
  }


If we unify two atoms x and y, unification succeeds (returns true) if x and y are equal.
For matching predicates unification succeeds if their functor and arity are the same and if
the recursive unification succeeds.

In order to unify variables, we keep a substitution map that assigns variables to terms.
When unifying a term with an already assigned variable, and that term is not equal
to the assigned value unification fails. Otherwise, unification succeeds and the term
is assigned to the variable. This is implemented in the unifyVar function.

 
 def unifyVar(x: Variable, y: Term, subs: mutable.Map[Variable, Term]): Boolean = {
    if(subs.contains(x)) unify(subs(x), y, subs)
    else if(y.isInstanceOf[Variable] && subs.contains(y.asInstanceOf[Variable])) unify(x, subs(y.asInstanceOf[Variable]), subs)
    else {
      subs += (x -> y)
      true
    }
  }

In order to understand unification better consider the following examples:

unification(Atom("sam"), Atom("sam"), Map) == true, the map is empty
unification(Atom("sam"), Atom("peter")) == false, the map is empty

In other words, two atoms match if their value is the same. Furthermore,
no substitutions are recorded since no variables are involved.

unification(
  Predicate(parent, List(Atom(sam), Atom(peter))),
  Predicate(parent, List(Atom(sam), Atom(peter)))
) == true, the map is empty

The above expression shows that two predicates match if their parts match, too.
Last but not least, the expression:

unification(
  Predicate(parent, List(Variable(X), Atom(peter))),
  Predicate(parent, List(Atom(sam), Variable(Y)))
) == true, the map is (X -> sam, Y -> peter)

So the variables are bound to the matching atoms. Variables can also be bound to other variables;
in this case their values will be the same, as in the following map:

X -> Y
Y -> sam
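
Written against the unify implementation above (a sketch; unify is assumed to be in scope, e.g. as a method of the solver class), the last example looks like this:

import scala.collection.mutable

val subs = mutable.Map.empty[Variable, Term]

val ok = unify(
  Predicate("parent", List(Variable("X"), Atom("peter"))),
  Predicate("parent", List(Atom("sam"), Variable("Y"))),
  subs)

// ok   == true
// subs == Map(Variable("X") -> Atom("sam"), Variable("Y") -> Atom("peter"))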

Searching in a knowledge database

In the section above we saw how to match terms such as atoms and predicates and
how to bind variables. In Prolog all statements are given in the form of Horn clauses,
so a Prolog program can be seen as a list of clauses. A query against that database is a list of goal predicates.
All goals have to be proven true against the database in order to fulfill the query.
 
  protected def inferAll(goals: List[Predicate], solution: Substitution): Solutions =
    if(goals.isEmpty) List(solution)
    else {
      val goal   = goals.head
      val other  = goals.tail
      val answer = for (clause <- program(goal.functor)) yield {
        val Clause(standarizedHead, standarizedBody) = standarizer.standarizeAppart(clause)
        val substitution = mutable.Map.empty[Variable, Term] ++ solution
        val unifyable    = unify(goal, standarizedHead, substitution)
        if(unifyable) {
          val newGoals = substitute(standarizedBody ::: other, substitution.toMap)
          val x = inferAll(newGoals.collect{case x: Predicate => x}, substitution.toMap)
          x
        } else Nil
      }
      answer.flatten
    }


The search algorithm that finds all solutions is a depth first search with backtracking.
If the goal list is empty, the program returns the current substitution. In the implementation, a substitution is a map from variables to assigned terms.

The search algorithm checks the complete database for a match of a clause head with the first goal.
If one is found, all the variables in the clause head and body are renamed so that there are no collisions,
and the clause head is then unified with the current goal. If the unification succeeds, the body is added to the rest of the goal list (all elements except the first) and all substitutions that can be applied are performed. The search then continues recursively with the new goal list. The algorithm
returns a list of assignments to the variables in the query.

The next listing shows the helper functions
 

def substitute(x: List[Term], subs: Substitution): List[Term] = x.map {
  case term: Variable           => subs.get(term).getOrElse(term)
  case Predicate(functor, body) => Predicate(functor, substitute(body, subs))
  case term                     => term
}

  
class VariableStandarization(var idx: Int = 0) {

  def standarizeAppart(clause: Clause): Clause = {
    next
    clause match {
      case Clause(Predicate(functor, body), terms) => Clause(
        Predicate(functor, standarize(body)),
        standarize(terms).map(_.asInstanceOf[Predicate])
      )
    }
  }

  def standarize(terms: List[Term]): List[Term] =
    terms.map {
      case term: Atom[_] => term
      case Predicate(functor, body) => Predicate(functor, standarize(body))
      case x: Variable => Variable(x.name + s"_${idx}")
    }

  def next: Unit = idx += 1

}

Standarization basically appends an id to the variable name. The id is increased whenever we take a clause from the database.
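
As a short usage sketch, standardising the recursive ancestor rule twice renames its variables with a fresh id each time:

val std = new VariableStandarization()

val ancestorRule = Clause(
  Predicate("ancestor", List(Variable("Person"), Variable("Ancestor"))),
  List(Predicate("parent", List(Variable("Person"), Variable("Ancestor")))))

std.standarizeAppart(ancestorRule)
// => ancestor(Person_1, Ancestor_1) :- parent(Person_1, Ancestor_1)

std.standarizeAppart(ancestorRule)
// => ancestor(Person_2, Ancestor_2) :- parent(Person_2, Ancestor_2)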

The last thing we need is to output all the assignments for the variables in the query from the result list. Since we renamed them multiple times, we have to find a path from the original variable name
to a term that is not a variable.
 
  def assign(predicate: List[Predicate], substitution: Substitution): Substitution = {
    val variables: List[Variable] = predicate.flatMap {
      case Predicate(_, body) => body.flatMap(getAllVariables)
    }.distinct
    variables.map(
      x => x -> findAssignment(x, substitution)
    ).toMap
  }

  def getAllVariables(term: Term): List[Variable] = term match {
    case x: Atom[_]         => List.empty
    case x: Variable        => List(x)
    case Predicate(_, body) => body.flatMap(getAllVariables)
  }

  def findAssignment(x: Term, substitution: Substitution): Term = x match {
    case x: Variable => findAssignment(substitution(x), substitution)
    case Predicate(functor, body) => Predicate(functor, body.map(x => findAssignment(x, substitution)))
    case x => x
  }


Syntactic sugar and an example


First let us add an easy way to handle lists, especially to create lists in the binary tree form. We define a special predicate name '.' for all list elements (the same name Prolog uses), and an atom called nil indicating the end of a list. Given a Scala list we can create a Prolog list by recursively building the binary tree: we save the next element of the list as the left child,
and the rest of the list is saved recursively in the right child. Furthermore, we can create a split of a list. A 'split' separates the list into its head and the rest of the list; in Prolog this is written as '[A|B]', where A is the head and B is the tail. The split can be matched directly against a list predicate.
 
object ListBuilder {

  final val ListFunctor = "."

  final val End = Atom("nil")

  def split(head: String, tail: String) = Predicate(ListFunctor, List(Variable(head), Variable(tail)))

  def split(head: Term, tail: Term) = Predicate(ListFunctor, List(head, tail))

  def apply(x: List[Term]):Term = binary(x)

  def binary(x: List[Term]): Term =
    if(x.isEmpty) End
    else Predicate(ListFunctor, List(x.head, binary(x.tail)))

  def flatten(x: Predicate): List[Term] = x match {
    case Predicate(_, List(h, End)) => List(h)
    case Predicate(_, List(h, tail)) => h :: flatten(tail.asInstanceOf[Predicate])
   }

}
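
A small usage sketch: building the Prolog list [1, 2, 3] in its binary tree form and flattening it back.

val asTree = ListBuilder(List(Atom(1), Atom(2), Atom(3)))
// => Predicate(".", List(Atom(1),
//      Predicate(".", List(Atom(2),
//        Predicate(".", List(Atom(3), Atom("nil")))))))

val asList = ListBuilder.flatten(asTree.asInstanceOf[Predicate])
// => List(Atom(1), Atom(2), Atom(3))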

Now we can write a small Prolog program that checks whether an element is a member of a list. In the Scala encoding these are two clauses (the first one is a fact with an empty body):

Clause(
  Predicate("member", List(Variable("X"), Predicate(ListFunctor, List(Variable("X"), Variable("T"))))),
  List.empty)

Clause(
  Predicate("member", List(Variable("X"), Predicate(ListFunctor, List(Variable("H"), Variable("T"))))),
  List(Predicate("member", List(Variable("X"), Variable("T")))))

So we have found a member if we can unify the value of variable 'X' with the head of the list. If we cannot, we search the tail recursively.

Now let's turn to our example: a quicksort in Prolog. Since our version of Prolog does not support numeric operations, we introduce the predicates greater(X, Y), meaning X is greater than Y, and smaller(X, Y), meaning X is smaller than or equal to Y, as facts. The partition predicate sorts the values into a sublist with all elements smaller than a pivot and another list with all elements bigger. Append merges two lists. Quicksort then sorts the values into sublists around a pivot element and recombines the resulting lists.


greater(1, 0).
greater(2, 0).
greater(2, 1).
greater(3, 0).
greater(3, 1).
greater(3, 2).
greater(4, 0).
greater(4, 1).
greater(4, 2).
greater(4, 3).

smaller(0, 0).
smaller(0, 1).
smaller(0, 2).
smaller(0, 3).
smaller(0, 4).
smaller(1, 1).
smaller(1, 2).
smaller(1, 3).
smaller(1, 4).
smaller(2, 2).
smaller(2, 3).
smaller(2, 4).
smaller(3, 3).
smaller(3, 4).
smaller(4, 4).

quicksort([X|Xs],Ys):-
    partition(Xs,X,Left,Right),
    quicksort(Left,Ls),
    quicksort(Right,Rs),
    append(Ls,[X|Rs],Ys).
quicksort([],[]).

partition([X|Xs],Y,[X|Ls],Rs):-
    smaller(X,Y),
    partition(Xs,Y,Ls,Rs).
partition([X|Xs],Y,Ls,[X|Rs]):-
    greater(X,Y),
    partition(Xs,Y,Ls,Rs).
partition([],Y,[],[]).

append([],Ys,Ys).
append([X|Xs],Ys,[X|Zs]):-
    append(Xs,Ys,Zs).


I also put a "full" implementation on Github, so if you like the full code click there [github]. Enjoy!
As always, I also attached some books to read :)

References

[1] Ivan Bratko: "Prolog Programming for Artificial Intelligence", 4th ed., Chapter 2, Addison-Wesley, 2011

[2] Stuart Russell, Peter Norvig: "Artificial Intelligence: A Modern Approach", Pearson, 2011