Saturday, March 16, 2019

Avro Rust: A first look at data serialisation in Rust using Avro

Avro is a data serialisation system: you define a schema in JSON (basically the fields in a record you want to serialise along with their types) and then send the data in a compact binary format.




In the Rust language, structs can be serialised to multiple formats using a library called serde. In the
last post we saved our word2vec model into a binary format using serde and bincode. In this post, we will look into serialising with serde and avro-rs, an Avro implementation in Rust, and implement sending serialised objects over a TCP stream using framing.

As usual, the code is on [:github:]


Data (De)Serialisation

Avro supports several data types for serialisation including more complex types such as arrays.


use avro_rs::{from_value, Codec, Reader, Schema, Writer};
use serde::{Deserialize, Serialize};

#[derive(Debug, Deserialize, Serialize)]
pub struct Test {
    a: i32,
    b: f64,
    c: String,
    d: Vec<i32>
}

fn schema() -> Schema {
    let raw_schema = r#"
            {
                "type": "record",
                "name": "test",
                "fields": [
                    {"name": "a", "type": "int",    "default": 0},
                    {"name": "b", "type": "double", "default": 1.0},
                    {"name": "c", "type": "string", "default": "dansen"},
                    {"name": "d", "type": {"type": "array", "items": "int"}}
                ]
            }
        "#;
    avro_rs::Schema::parse_str(raw_schema).unwrap()
}

In the code above, we define a struct that we want to serialise and define the serialisation schema in JSON. Nothing more is needed: with this information we can serialise and deserialise the `Test` struct, and even (de)serialise a collection of these structs. Serialisation produces a vector of bytes, and given such a vector we can deserialise back into a vector of our original struct.
fn write_data(data: &[Test], schema: Schema) -> Vec<u8> {
    let mut writer = Writer::with_codec(&schema, Vec::new(), Codec::Deflate);
    for x in data.iter() {
        writer.append_ser(x).unwrap();
    }
    writer.flush().unwrap();
    writer.into_inner()
}

fn read_data(data: &[u8], schema: Schema) -> Vec<Test> {
    let reader = Reader::with_schema(&schema, data).unwrap();
    reader
        .map(|record| from_value::<Test>(&record.unwrap()).unwrap())
        .collect()
}
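
As a quick sanity check, a round trip could look like the following. This is a minimal sketch; the record values are made up, while the functions and the schema are the ones defined above.

fn main() {
    let records = vec![
        Test { a: 27, b: 3.14, c: String::from("hello"), d: vec![1, 2, 3] },
        Test { a: 42, b: 0.5,  c: String::from("world"), d: vec![4, 5] },
    ];
    // serialise to Avro bytes and read them back
    let bytes = write_data(&records, schema());
    let decoded = read_data(&bytes, schema());
    println!("decoded {} records: {:?}", decoded.len(), decoded);
}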

Framed Writing and Reading over TCP/IP

In order to send Avro over the network, it is recommended to frame the message: we break our byte stream into chunks, each preceded by its length, and terminate with a zero-sized chunk. The Avro specification describes it like this:

"Avro messages are framed as a list of buffers. Framing is a layer between messages and the transport. It exists to optimize certain operations. The format of framed message data is: a series of buffers, where each buffer consists of: a four-byte, big-endian buffer length, followed by that many bytes of buffer data. A message is always terminated by a zero-length buffer. Framing is transparent to request and response message formats (described below). Any message may be presented as a single or multiple buffers. Framing can permit readers to more efficiently get different buffers from different sources and for writers to more efficiently store different buffers to different destinations. In particular, it can reduce the number of times large binary objects are copied. For example, if an RPC parameter consists of a megabyte of file data, that data can be copied directly to a socket from a file descriptor, and, on the other end, it could be written directly to a file descriptor, never entering user space. A simple, recommended, framing policy is for writers to create a new segment whenever a single binary object is written that is larger than a normal output buffer. Small objects are then appended in buffers, while larger objects are written as their own buffers. When a reader then tries to read a large object the runtime can hand it an entire buffer directly, without having to copy it." (from the Avro specification)

Writing over the network with a fixed buffer or chunk size can be implemented by sending slices of the given size, each preceded by its length (the length only differs from the chunk size for the last slice). As described above, we send the size of each buffer as a four-byte, big-endian integer.
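
For example, a hypothetical 10-byte payload sent with a chunk size of 4 would go over the wire as: the length prefix `00 00 00 04` followed by 4 payload bytes, the same again, the prefix `00 00 00 02` followed by the remaining 2 bytes, and finally the terminator `00 00 00 00`.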


use byteorder::{BigEndian, ByteOrder};
use std::io::{Read, Write};
use std::net::TcpStream;

pub fn send_framed(objects: &[Test], address: String, schema: Schema, buf_size: usize) {
    let data = write_data(objects, schema);
    let mut stream = TcpStream::connect(address).unwrap();
    let n = data.len();
    for i in (0..n).step_by(buf_size) {
        // determine the slice of bytes to write
        let start = i;
        let stop = usize::min(i + buf_size, n);
        // send the length of the buffer as a big-endian u32
        let mut buffer_length = [0; 4];
        BigEndian::write_u32(&mut buffer_length, (stop - start) as u32);
        stream.write_all(&buffer_length).unwrap();
        // write the actual data
        stream.write_all(&data[start..stop]).unwrap();
    }
    // terminate with a zero-sized buffer
    let mut buffer_length = [0; 4];
    BigEndian::write_u32(&mut buffer_length, 0);
    stream.write_all(&buffer_length).unwrap();
    stream.flush().unwrap();
}


When reading, we simply read the size of the next buffer, then read that many bytes and append them to a vector of bytes, until we hit the zero-sized buffer marking the end. Then we use the method above to deserialise into a vector of structs.
 
pub fn read_framed(stream: &mut TcpStream, schema: Schema) -> Vec<Test> {
    let mut message: Vec<u8> = Vec::new();
    let mut first_buf = [0; 4];
    stream.read_exact(&mut first_buf).unwrap();
    let mut next = BigEndian::read_u32(&first_buf);
    // read while we see non-empty buffers
    while next > 0 {
        // append all bytes to the final message
        let mut object = vec![0; next as usize];
        stream.read_exact(&mut object).unwrap();
        message.append(&mut object);
        // read the next buffer size
        let mut next_buf = [0; 4];
        stream.read_exact(&mut next_buf).unwrap();
        next = BigEndian::read_u32(&next_buf);
    }
    read_data(&message, schema)
}
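
To tie the two ends together, a hypothetical end-to-end run on localhost could look like the sketch below; the address, port and chunk size are made up, and error handling is omitted as in the rest of the post.

use std::net::TcpListener;
use std::thread;

fn main() {
    // spawn a receiver that accepts one connection and reads a framed message
    let listener = TcpListener::bind("127.0.0.1:4000").unwrap();
    let receiver = thread::spawn(move || {
        let (mut stream, _) = listener.accept().unwrap();
        read_framed(&mut stream, schema())
    });
    // send two records with a chunk size of 1024 bytes
    let records = vec![
        Test { a: 1, b: 2.0, c: String::from("dansen"), d: vec![1, 2] },
        Test { a: 3, b: 4.0, c: String::from("igen"),   d: vec![3] },
    ];
    send_framed(&records, String::from("127.0.0.1:4000"), schema(), 1024);
    let received = receiver.join().unwrap();
    println!("received {} records", received.len());
}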


That's it: now we can send framed Avro over the network and deserialise it on the other end.

Friday, March 8, 2019

Word2Vec Implementation in Rust: Continuous Bag of Words


I recently started to get serious about learning Rust and needed a small to medium project to train on.
One thing I have explained on this blog before, walking through the code, is word2vec, a method to learn word embeddings. In this post I will explain the method again, this time using my new Rust code on github.

Word Similarity - Basic Idea 

In the following, I will describe the basic idea behind continuous bag of words (CBOW), one way to estimate word vectors. Word vectors are a dense vectorised word representation capturing the similarity of words in Euclidean space. Basically, we have one vector for each word in our vocabulary, and the Euclidean distance between the vectors of two similar words is smaller than the distance between the vectors of two less similar words. The assumption in word2vec is that words that appear in a similar context are similar. We shift a sliding window over the text and try to predict the center word in the window from the word vectors of the surrounding words in the window.

pub struct Window<'a> {
    pub words:       &'a [usize], 
    pub predict_pos: usize
}

pub struct Document {
    pub words: Vec<usize>
}

impl<'a> Window<'a> {

    pub fn label(&self) -> usize {
        self.words[self.predict_pos]
    }

}

impl Document {

    pub fn window(&self, pos: usize, size: usize) -> Option<Window> {
        let start = sub(pos, size);
        let stop  = usize::min(pos + size, self.words.len() - 1); 
        if stop - start == 2 * size {
            Some(Window {words: &self.words[start as usize .. stop as usize], predict_pos: size})
        } else {
            None
        }
    }

}

The code above defines a document as a vector of word ids and lets us shift a sliding window over it. The return type is not a vector but a slice, which is convenient since nothing is copied. In the following, the center word in the window is called the label $l$ and the other words are called the context $c_1, \ldots, c_m$.
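
The helper `sub` is not shown in the snippet; presumably it is a saturating subtraction, so that windows near the start of a document clamp to position zero (and are then discarded by the length check). A minimal sketch of what it might look like:

// assumed helper: saturating subtraction, clamping at zero so a
// window near position 0 starts at the beginning of the document
fn sub(a: usize, b: usize) -> usize {
    if a > b { a - b } else { 0 }
}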

Word Vectors

Now that we have defined the window, we can start implementing a representation for the word vectors. There will be one vector of $d$ dimensions, or columns, per word in the dictionary.
In code we save the word vectors as one flat vector of doubles. We can then access the $i$-th vector as the slice $[i \cdot cols \,..\, (i + 1) \cdot cols]$.



#[derive(Serialize, Deserialize)]
pub struct ParameterStore {
    pub cols: usize,
    pub weights: Vec<f64>
}

impl ParameterStore {

    pub fn seeded(rows: usize, cols: usize) -> ParameterStore {
        let mut rng = rand::thread_rng();
        let mut weights = Vec::new();
        for _i in 0 .. (rows * cols) {
            let uniform = (rng.gen_range(0.0, 1.0) - 0.5) / cols as f64;
            weights.push(uniform);
        }
        ParameterStore {cols: cols, weights: weights}
    }

    pub fn at(&self, i: usize) -> &[f64] {
        let from = i * self.cols;
        let to = (i + 1) * self.cols;
        &self.weights[from .. to]
    }

    pub fn update(&mut self, i: usize, v: &Vec<f64>) {
        let from = i * self.cols;
        for i in 0 .. self.cols {
            self.weights[i + from] += v[i];
        }
    }

    pub fn avg(&self, result: &mut Vec<f64>, indices: Vec<usize>) {
        for col in 0 .. self.cols {
            result[col] = 0.0;
            for row in indices.iter() {
                let from = row * self.cols;
                result[col] += self.weights[col + from];
            }
            result[col] /= indices.len() as f64;
        }
    }

}


We initialise each component with a random number distributed as $\frac{\text{Uniform}([-0.5, 0.5])}{d}$, which is the standard way to initialise word2vec. The update method simply adds a gradient to a word vector. There is also a method to compute the average of multiple word vectors, which is needed later during learning.

Putting It All Together: Computing the Gradients

So now we can compute windows from the text and average the word vectors of the context words.
The averaged vector is our hidden layer's activation. From there we compute the prediction and then backpropagate the error. We therefore need two sets of parameters: one for the word vectors, initialised as above, and one for the predictions, which is initialised to zeros.

#[derive(Serialize, Deserialize)]
pub struct CBOW {
    pub embed:   ParameterStore,
    pub predict: ParameterStore
}

Given the average embedding $h$ for the context and the vector $w_i$ for word $i$ from the prediction parameters, the forward pass of the network is:

$a_i = \sigma(h \cdot w_i)$.

We are predicting the center word of the window; that means we model the output layer as a multinomial where the center word is one-hot encoded. So the output should be $1$ for the center word
and $0$ for all others. The error for the center word is then:

$e_i = -\log(a_i)$

and for all others it is

$e_i = -\log(1 - a_i)$.

The gradient of the loss is simply $\delta_i = (y_i - a_i) \cdot \lambda$, with $\lambda$ being the learning rate.
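
For completeness, this is the usual gradient of the sigmoid cross-entropy. With pre-activation $z_i = h \cdot w_i$, $a_i = \sigma(z_i)$ and target $y_i \in \{0, 1\}$, the two error cases above combine to $e_i = -y_i \log(a_i) - (1 - y_i) \log(1 - a_i)$, and using $\sigma'(z) = \sigma(z)(1 - \sigma(z))$:

$\frac{\partial e_i}{\partial z_i} = \left(-\frac{y_i}{a_i} + \frac{1 - y_i}{1 - a_i}\right) a_i (1 - a_i) = a_i - y_i$

so stepping against the gradient with learning rate $\lambda$ gives exactly the $\delta_i$ above.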

fn gradient(&self, label: usize, h: &Vec<f64>, truth: f64, rate: f64) -> (f64, f64) {
    let w = self.predict.at(label);
    let a = sigmoid(dot(&h, &w));
    // delta scaled by the learning rate, plus the error for monitoring
    let d = (truth - a) * rate;
    let e = -f64::ln(if truth == 1.0 { a } else { 1.0 - a });
    (d, e)
}
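
The helpers `sigmoid` and `dot` are not part of the snippet; minimal versions of what they presumably look like:

// assumed helpers: logistic function and dot product over slices
fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}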

The gradient applied to the output parameters is $\delta_i \cdot h$. The gradient for the embeddings is $\delta_i \cdot w_i$, which is applied to all context words. So all words in the same context are updated with the same gradient, which in turn makes them more similar; this was the main goal of word embeddings. There is one more detail: normally we would have to compute the losses over the complete dictionary, but for a million words that would take very long. Instead we use something called negative sampling: we compute the gradient for the center word as the positive example and then sample a few words as negative examples.
The final computation for learning is shown below:
pub fn negative_sampling(&mut self, window: &Window, rate: f64, n_samples: usize, unigrams: &Sampler) -> f64 {
    let mut gradient_embed = vec![0.0; self.embed.cols];
    // average embedding of the context words (hidden layer activation)
    let h = self.embedding(self.ids(window));
    // positive example: the center word of the window
    let (pos, pos_err) = self.gradient(window.label(), &h, 1.0, rate);
    let mut error = pos_err;
    let mut gradient_pos_predict = vec![0.0; self.predict.cols];
    for i in 0 .. self.embed.cols {
        gradient_embed[i] += pos * self.predict.at(window.label())[i];
        gradient_pos_predict[i] += pos * h[i];
    }
    self.predict.update(window.label(), &gradient_pos_predict);
    // negative examples: words sampled from the unigram distribution
    for _sample in 0 .. n_samples {
        let token = unigrams.multinomial();
        let (neg, neg_error) = self.gradient(token, &h, 0.0, rate);
        error += neg_error;
        let mut gradient_neg_predict = vec![0.0; self.predict.cols];
        for i in 0 .. self.embed.cols {
            gradient_embed[i]       += neg * self.predict.at(token)[i];
            gradient_neg_predict[i] += neg * h[i];
        }
        self.predict.update(token, &gradient_neg_predict);
    }
    // apply the accumulated embedding gradient to all context words
    let ids = self.ids(window);
    for i in 0 .. ids.len() {
        self.embed.update(ids[i], &gradient_embed);
    }
    error
}

Results

I trained the net on Quora questions; when searching for the nearest vectors to a query word, we find similar words:

Searching: India
kashmir: 11.636631273392045
singapore: 12.155846288647625
kerala: 12.31677547739736
dubai: 12.597659271137639
syria: 12.630908898646624
pradesh: 12.657158343619573
pok: 12.667834993874745
malaysia: 12.708228117809442
taiwan: 12.75190112716203
afghanistan: 12.772290465531222
africa: 12.772372266369985
bjp: 12.802207718902666
karnataka: 12.817502941769622
europe: 12.825095372204384

Thursday, January 10, 2019

Scala3D: Implementing A Wolfenstein Like Ray Casting Engine

Introduction

I recently got interested in 3D game engines for realistic images. I am currently reading through
a book describing a complete ray tracing algorithm [3]. However, since there is a lot of coding to do and a lot of complex geometry, I searched for a simpler side project. I then started to look into 2D
ray casting to generate 3D environments, as used in the Wolfenstein 3D engine. There are several
descriptions of how to build one [1,2]. However, the Wolfenstein black book [1] talks a lot about memory optimisations and execution speed: lots of the chapters include assembly and C snippets dealing with the old hardware. Implementing something like this is much easier now;
computers got much faster and programming languages much more convenient. So I decided to
build a clone in Scala. The result can be seen in the video above. The left side is the 3D view and the right side shows a 2D map view. The red lines are the rays from the ray casting algorithm, representing what the player sees, and the green line is the camera. The code can be found here [:github:]. If you want to move in the map, use "WASD".

Assumptions About The World

The world in Wolfenstein 3D is represented by a flat grid. Different heights in the map are not possible: all objects stand on the same plane and have to be boxes. This assumption makes a lot of things easier. In a 3D engine, the main task is to determine what is seen from a specific view (the player's camera). The nice thing is that we do not have to determine what is visible in every possible direction: since we are moving on a plane, this is restricted to a 2D problem. The second assumption, a world on a grid, will simplify the calculation of ray-box intersections later.

Implementing The Player And Camera

The basic idea is to describe a player in 2D by its position $p = (p_x, p_y)$. Furthermore, we save the direction the player is looking in: $d = (d_x, d_y)$. We also save the image plane $i = (i_x, i_y)$, which is perpendicular to the viewing direction.

In the image above the image plane is marked in green and the rays for collision are marked in red. In order to calculate the direction for each ray, we first calculate the x coordinate in camera space as $c_x = 2 \cdot x / w - 1$, where $x$ is the column in the image and $w$ is the width of the image, so that $c_x$ runs from $-1$ to $1$ across the screen. Then we compute the ray from the player through that position as:
  • $r_{dx} = d_x + i_x \cdot c_x$
  • $r_{dy} = d_y + i_y \cdot c_x$
Now we can follow each ray until we hit a wall.
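
As a quick check, assume a player looking along the x-axis with $d = (1, 0)$ and image plane $i = (0, 0.66)$ (these numbers are just an illustrative choice). At the center column $x = w/2$ we get $c_x = 0$ and the ray $r_d = (1, 0)$, straight ahead; at the left edge $x = 0$ we get $c_x = -1$ and $r_d = (1, -0.66)$, the edge of the field of view.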

Digital Differential Analysis Ray Casting



In order to find which wall we see at image column $c_x$, we follow the ray until it hits a wall. We initialise the search by calculating the distance until the ray first crosses a vertical grid line (in the image this is the segment $a$ to $b$) and until it first crosses a horizontal grid line (in the image this is the segment $a$ to $c$). From there, the slope $\delta = (\delta_x, \delta_y)$ tells us how far we have to travel along the ray to cross the next grid line in x and in y. We then step to the closest next grid position and check whether that position is a wall hit.
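
The snippet below receives these quantities precomputed. In the common formulation of this DDA setup (the textbook formula, not shown in the code), the per-axis step lengths depend only on the ray direction:

$\delta_x = \left|\frac{1}{r_{dx}}\right|, \qquad \delta_y = \left|\frac{1}{r_{dy}}\right|$

where the overall scale of the ray length is irrelevant, since only the comparison between the two accumulated side distances matters.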
    @tailrec
    private def traceRay(
        grid: Vector2D[Int], 
        sideDist: Vector2D[Double], 
        side: Int,
        deltaDist: Vector2D[Double],
        step: Vector2D[Int], 
        hit: Boolean,
    ): Hit = 
        if(hit) Hit(grid, side)
        else sideDist match {
            case Vector2D(x, y) if x < y => 
                traceRay(
                    Vector2D[Int](grid.x + step.x, grid.y),
                    Vector2D[Double](sideDist.x + deltaDist.x, sideDist.y),
                    0,
                    deltaDist,
                    step,
                    world.collids(grid.x + step.x, grid.y)
                )
            case Vector2D(x, y) if x >= y => 
                traceRay(
                    Vector2D[Int](grid.x, grid.y + step.y),
                    Vector2D[Double](sideDist.x, sideDist.y + deltaDist.y),
                    1,
                    deltaDist,
                    step,
                    world.collids(grid.x, grid.y + step.y)
                )
        }
In the implementation we move through the grid by the delta distance and track the current side distance (the distance travelled along the $x$ and $y$ component). We also save which side ($x$ or $y$) we hit the wall at. In the recursion we step to the closest next map square until we find a hit, and then return the position of the hit $h = (h_x, h_y)$ as well as the side.

Rendering A Scene


So for each image column, we run the ray caster until we hit a wall. We then compute the perpendicular distance $w_d$ to the wall from the player position, the hit position, the step size and the direction, in order to calculate how high the wall appears in the image:
    private def projectedDistance(
        hit: Hit,
        pos: Vector2D[Double], 
        step: Vector2D[Int], 
        dir: Vector2D[Double], 
    ): Double = 
            if (hit.side == 0) (hit.pos.x - pos.x + (1 - step.x) / 2.0) / dir.x
            else               (hit.pos.y - pos.y + (1 - step.y) / 2.0) / dir.y
The height of the wall in the image is then simply $l_h = \frac{h}{w_d}$. We then compress the slice of the texture we hit to that height and paint it into the image at the column we cast the ray for.

References

[1] Fabien Sanglard: "Game Engine Black Book: Wolfenstein 3D", 2nd edition, 2018

Tuesday, November 6, 2018

Lindenmayer system: Implementation


I recently got interested in Lindenmayer systems, or L-systems for short, and wrote some code. These are generative
systems that produce strings, which in turn can be interpreted as turtle graphics. For example,
you can model Koch curves and the famous dragon curve. This book describes how to generate the plants you can see below.


Implementation


For example, given an initial string $+A-Bf$, we have terminal symbols $\{+,-,f\}$ and non-terminals $\{A, B\}$ that can still be replaced by rules such as $A \rightarrow AfA$, which reads: every $A$ in the string is replaced by $AfA$. One derivation step applied to $+A-Bf$ therefore yields $+AfA-Bf$ (assuming, for this illustration, that $B$ is left unchanged). Furthermore, the string can be interpreted by a turtle: $+$ means turning right, $-$ means turning left, all capital letters mean moving forward while drawing a line, and all lower case letters mean moving silently to the next position. We start by defining a symbol class.
case class Symbol(id: Char) {

  def isSilent: Boolean = id.isLower

  def isTerminal: Boolean = !id.isLetter || id.isLower

}

Next we define a Lindenmayer system as a start string and a set of rules. Given the start string and a depth limit, we can recursively replace all non-terminal symbols using the associated rules. This enables us to generate strings.
case class LSystem(start: List[Symbol], rules: Map[Symbol, List[Symbol]]) {

  private def produce(symbol: Symbol): List[Symbol] =
    if(symbol.isTerminal) List(symbol)
    else rules(symbol)

  @tailrec
  private def produce(current: List[Symbol], result: List[Symbol]): List[Symbol] = current match {
    case Nil          => result
    case head :: tail => {
      produce(tail, result ::: produce(head))
    }
  }

  @tailrec
  private def produce(current: List[Symbol], stepsLeft: Int): List[Symbol] =
    if(stepsLeft == 0) current
    else produce(produce(current, Nil), stepsLeft - 1)

  def produce(depth: Int): List[Symbol] = produce(start, depth)

}
Now we are done with the definition of the grammar system. The only thing left is to define a turtle, and an interpreter that converts the strings from L-systems into a sequence of turtles. I define a turtle by its posture $(x, y, \alpha)$, where $(x, y)$ is the turtle's position and its orientation is given by the angle $\alpha$. When moving forward by a distance $\delta$, the next position is:
  • $x' = x + \delta \cdot \cos(\alpha)$
  • $y' = y + \delta \cdot \sin(\alpha)$
A rotation simply adds a given number of degrees to the angle.
case class Turtle(x: Int, y: Int, angle: Double, sil: Boolean) {

  def move(distance: Int): Turtle =
    Turtle(
        math.round(x + distance * math.cos(math.toRadians(angle))).toInt, 
        math.round(y + distance * math.sin(math.toRadians(angle))).toInt, angle, false)

  def moveSil(distance: Int): Turtle =
    Turtle(
        math.round(x + distance * math.cos(math.toRadians(angle))).toInt, 
        math.round(y + distance * math.sin(math.toRadians(angle))).toInt, angle, true)

  def turn(by: Double): Turtle = {
    Turtle(x, y, (angle + by) % 360, false)
  }

  def silence: Turtle = Turtle(x, y, angle, true)

}
Then, given a string, we can recursively build a sequence of turtle states by applying the action associated with each symbol.
class TurtleInterpreter(moveBy: Int, turnBy: Double) {

  import TurtleInterpreter._

  private def execute(cmd: List[Symbol], state: Turtle, stack: List[Turtle], drawable: List[Turtle]): List[Turtle] = cmd match {
    case Nil => drawable
    case RotateRight :: tail => execute(tail, state.turn(turnBy), stack, drawable)
    case RotateLeft  :: tail => execute(tail, state.turn(-turnBy), stack, drawable)
    case PushStack   :: tail => execute(tail, state, state :: stack, drawable)
    case PopStack    :: tail => execute(tail, stack.head.silence, stack.tail, stack.head.silence :: drawable)
    case head        :: tail => {
      val next = if (head.isSilent) state.moveSil(moveBy) else state.move(moveBy)
      execute(tail, next, stack, next :: drawable)
    }
  }

  def execute(cmd: List[Symbol], init: Turtle): List[Turtle] =
    execute(cmd, init, List.empty[Turtle], List(init))

}

object TurtleInterpreter {

  final val RotateRight = Symbol('+')
  final val RotateLeft  = Symbol('-')
  final val PushStack   = Symbol('[')
  final val PopStack    = Symbol(']')

}
With this, we can define grammars and paint pictures like the ones shown above. I wrote some code that plots all of the above images; it can be found on github.

Saturday, November 3, 2018

Exploring Super Resolution With Neural Networks: Blade Runner Style

In Blade Runner there is a scene in which Deckard (the protagonist) inspects one of the replicant Leon's photos. The scene features an insane zoom; this blog post estimates the final magnification at a factor of 667.9.


While this is science fiction (obviously), there were some breakthroughs in super resolution recently. Basically, we use machine learning models to upscale an image without introducing pixelated artefacts.
I used code that is publicly available for the Lasagne neural network framework to upscale the same photo from the movie.

The top image shows a reflection in a cabinet that is itself a reflection in the round mirror.
The bottom image shows the first zoom-in Deckard performs.

Below are the original image (top) and the result (bottom) I got when running the 4x upscaling neural net on the image from Blade Runner.




If you zoom into the mirror in the reconstructed image (open it in a new tab to see it at its original resolution) and use your imagination, I believe you can see a face in it. I do not believe
it is actually there; rather, it is a reconstruction artefact, but who knows ...

Monday, October 8, 2018

Fast Threshold Estimation In Decision Trees

In decision trees and related methods such as random forests and gradient boosting, we need
to find thresholds in order to split the continuous attributes. Finding the optimal threshold on a feature using information gain can be computationally expensive. However, there is a way to perform this operation fast: the histogram method described in [1].

Algorithm Sketch

Basically, we build a table where we bin each value of a certain attribute and save the number of occurrences of each label in said bin. Each bin represents a certain threshold we could apply. Once everything is indexed, we compute the total number of entries for each class; in the example, class one occurs 33 times in total and class two 96 times. We then scan the thresholds from lowest to highest and keep a running count of the label occurrences below the current threshold. From these counts we can easily compute the number of instances of each class below and
above the threshold, and use these numbers to compute the information gain.
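
Concretely, if $n_L$ and $n_R$ instances fall below and above a candidate threshold, with class-count vectors $left$ and $right$ and $n = n_L + n_R$, the information gain is the standard

$\text{gain} = H(\text{parent}) - \left( \frac{n_L}{n} H(left) + \frac{n_R}{n} H(right) \right)$

where $H$ is the entropy. This is exactly what the `gain` method below computes.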

Implementation

First we need to be able to find a bin, given a sorted list of thresholds and an attribute value. One way is an adjusted binary search: we check the center bin of a range; if this is our bin, we are done; otherwise, we continue searching to the left or the right.
def bisection(val, th, start, stop):
    center = int((stop - start) / 2 + start)
    n = len(th) - 1
    if center <= 0:
        return 0
    elif center >= n:
        return n
    elif val > th[center - 1] and val <= th[center]:
        return center
    elif val < th[center]:
        return bisection(val, th, start, center)
    else:
        return bisection(val, th, center, stop)
The other thing we need is an entropy calculation. Since we are dealing with a classification problem, we implement the discrete entropy over class counts.
import numpy as np

def entropy(x):
    scaler = x.sum()
    if scaler == 0:
        return 0.0
    else:
        log_scaler = np.log2(scaler)
        h = 0
        for i in range(0, len(x)):
            if x[i] > 0:
                p     = x[i] / scaler
                log_p = np.log2(x[i]) - log_scaler
                h += p * log_p
        return -h
Finally, we can implement the algorithm sketch from above.
class LabelHistogram:

    def __init__(self, thresholds, n_classes):
        self.thresholds = thresholds
        self.n_bins     = len(thresholds)
        self.n_classes  = n_classes
        self.bins       = np.zeros((self.n_bins, n_classes))

    def insert(self, attr, label):
        i = bisection(attr, self.thresholds, 0, self.n_bins)
        self.bins[i, label] += 1

    def gain(self, parent_entropy):
        total = np.sum(self.bins, axis=0)
        n     = total.sum()
        cur = np.zeros(self.n_classes)
        max_gain = float('-inf')
        max_th = None
        for i in range(0, self.n_bins):
            cur += self.bins[i]
            left   = cur
            right  = total - cur
            cleft  = left.sum()
            cright = right.sum()
            # weighted entropy of the partitions below and above the threshold
            gain = parent_entropy - ((cleft / n) * entropy(left) + (cright / n) * entropy(right))
            if gain > max_gain:
                max_gain = gain
                max_th   = self.thresholds[i]
        return (max_gain, max_th)


References

[1] Criminisi, Shotton: "Decision Forests for Computer Vision and Medical Image Analysis", Springer, 2013

Saturday, July 28, 2018

Simple Gaussian Processes For Regression

In this post I am implementing a simple Gaussian process regressor. A Gaussian process is a
probabilistic kernel model. Basically, I am following the book "Machine Learning: A Probabilistic Perspective" by Kevin Murphy.

The Basics

For an input $x_i$ and output $y_i = f(x_i)$, we infer a distribution over functions given the data:
$p(f | X, y)$. In order to make a prediction $y_*$ for a new data point $x_*$ given all the previous observations, we define:

$p(y_* | x_*, X, y) = \int p(y_* | f, x_*) \, p(f | X, y) \, df$

For our data $X = x_1, \ldots, x_n$, the Gaussian process assumption is that the distribution over function values $p(f_1, \ldots, f_N)$ is jointly Gaussian with mean $\mu(x)$ and covariance $\Sigma_{ij} = k(x_i, x_j)$ given by a kernel $k$.

Covariance structure

It is quite common to leave the mean function at zero. Given a kernel, we need to compute the covariance function between all training and test instances, and we structure the kernel matrix in the following way.



The sub-matrix $K$ describes the affinity between all training instances, the sub-matrix $K_*$ the affinity between training and test instances, and the sub-matrix $K_{**}$ the affinity among the test instances themselves. With that, our function values can be inferred as $(f, f_*) \sim \mathcal{N}(0, \Sigma)$.
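
Written as a block matrix, this is the standard joint covariance over training outputs $f$ and test outputs $f_*$:

$\Sigma = \begin{pmatrix} K & K_* \\ K_*^T & K_{**} \end{pmatrix}$

with $K$ of size $n \times n$ for $n$ training points, $K_*$ of size $n \times m$ between training and test points, and $K_{**}$ of size $m \times m$ for $m$ test points.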

In order to obtain the prediction we calculate: $f_* = K_*^T \bullet K^{-1} \bullet f$.
If we want the variance of the predictions, we take the diagonal of $K_{**} - K_*^T \bullet K^{-1} \bullet K_*$, which is what the code below does.
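
Both expressions are instances of the standard formula for conditioning a joint Gaussian:

$p(f_* \mid X_*, X, f) = \mathcal{N}\left(K_*^T K^{-1} f, \; K_{**} - K_*^T K^{-1} K_* \right)$

The mean gives the prediction, and the diagonal of the covariance gives the per-point predictive variance.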

In my opinion the method is very similar to support vector machines and to instance-based learning methods in general. The difference is that an SVM uses only a small set of training examples called support vectors, while the Gaussian process compares every example to all others in order to make a prediction.


A Naive Gaussian Process

Below I implemented a naive Gaussian process using numpy. Basically, I am building the three components of the covariance structure and then solving the prediction equations above. I call my method naive since computing the inverse has a worst-case complexity of $O(n^3)$,
so it is not advisable to use this method for larger datasets.
import numpy as np

class GaussianProcess:
    
    def __init__(self, kernel):
        self.kernel = kernel
        
    def predict(self, train_x, train_y, test_x):
        (n, _) = train_x.shape
        (m, _) = test_x.shape
        covar_train      = np.zeros((n, n))
        covar_train_test = np.zeros((n, m))
        covar_test       = np.zeros((m, m))
        for i in range(0, n):
            for j in range(0, n):
                covar_train[i, j] = self.kernel.k(train_x[i], train_x[j], i == j)
        for i in range(0, m):
            for j in range(0, m):
                covar_test[i, j] = self.kernel.k(test_x[i], test_x[j], i == j)
        for i in range(0, n):
            for j in range(0, m):
                covar_train_test[i, j] = self.kernel.k(train_x[i], test_x[j], False)
        covar_inv  = np.linalg.inv(covar_train)
        prediction = np.dot(np.dot(covar_train_test.T, covar_inv), train_y)
        confidence = covar_test - np.dot(covar_train_test.T, np.dot(covar_inv, covar_train_test))
        confidence = np.sqrt(np.array([confidence[i, i] for i in range(0, m)]))
        return prediction, confidence 


Squared Exponential Kernel

For my experiments I use a squared exponential kernel and also add a variance parameter for noisy observations:

$k(x_i, x_j) = \sigma_f^2 e^{-\frac{1}{2l^2} || x_i - x_j ||_2^2} + \sigma_y^2 \mathbb{I}(i == j)$

The parameters are as follows:
  • $\sigma_f^2$: variance giving the scale of the output
  • $\sigma_y^2 \mathbb{I}(i == j)$: noise variance applied to the diagonal
  • $l$: length scale on the input (called alpha in the code below)
Again, the implementation is quite simple in numpy.
class SquaredExponential:

    def __init__(self, variance, alpha, noise):
        self.alpha = alpha
        self.variance = variance
        self.noise = noise
        self.scaler = -1.0 / (2 * alpha**2)
        
    def k(self, a, b, on_diag):
        error  = np.sum(np.square(a - b))
        f      = np.exp(self.scaler * error)
        scaled = self.variance * f
        if on_diag:
            scaled += self.noise
        return scaled   


Results on some simple example datasets

I tried my method on three datasets: twice I am interpolating stock market data, and once a sine wave. Each time the index is the input to the method, and for the stock market data the output is the closing price.
Sine Wave Interpolation:

Stock Interpolation Dax:

Stock Interpolation US: