tag:blogger.com,1999:blog-10405454173016461132024-03-18T20:36:14.279-07:00Daniel@World{Machine Learning, Data Science, Computer Vision} in the trenchesdkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.comBlogger65125tag:blogger.com,1999:blog-1040545417301646113.post-11816242488384646882020-09-27T09:49:00.006-07:002020-09-27T09:49:55.607-07:00Implementing Clustering Loss For Deep Learning in Tensorflow 2.0<h4 style="text-align: left;"><br /></h4><h4 style="text-align: left;"><span style="font-weight: normal;">I recently found an interesting new loss function that can be used for clustering [1]. The idea is to build a neural network with a softmax output and train it with the loss below. </span></h4><h4 style="text-align: left;"><br /><span style="font-weight: normal;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0kIaESjttuUr6Rmwy_FBrhyphenhyphentDctWgasjLsWaJ3BW-nIG6Tb3-1iqYignJEtZpssok8IHKAWKrfLnPk6Wk44XaDNGEHTrFklXLBe9dr126KpEBJrS3NN982kaKqM-pARpfRK_I50J8yPo/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="198" data-original-width="896" height="89" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0kIaESjttuUr6Rmwy_FBrhyphenhyphentDctWgasjLsWaJ3BW-nIG6Tb3-1iqYignJEtZpssok8IHKAWKrfLnPk6Wk44XaDNGEHTrFklXLBe9dr126KpEBJrS3NN982kaKqM-pARpfRK_I50J8yPo/w400-h89/Screen+Shot+2020-09-27+at+6.42.11+PM.png" width="400" /></a></div><br /></span></h4><div>Each example <i>x<sub>i</sub></i> in a batch is first embedded as <i>f(x<sub>i</sub>)</i> and then pushed through the softmax. The first term is the average entropy of these per-example softmax probabilities; minimizing it encourages a confident assignment of each example to a single cluster.</div><div>The second term is the entropy of the average probability of each cluster over the complete batch. Since it enters the loss with a negative sign, it is maximized, which ensures that all clusters are used and discourages the assignment of all instances to the same cluster.
Below is my implementation of the loss function in TensorFlow.</div><div><br /></div><div><br /></div><div><br /></div>
<pre class="brush:python">
import tensorflow as tf
from tensorflow.keras.losses import Loss


class ClusteringLoss(Loss):

    def __init__(self, gamma=1.0):
        super().__init__()
        self.gamma = gamma

    def call(self, true, pred):
        # Add a small constant to avoid log(0)
        y = pred + 1e-6
        # H1: average entropy of the per-example softmax outputs,
        # encourages confident assignment to a single cluster
        h1 = -1.0 * tf.reduce_sum(tf.math.log(y) * y, axis=-1)
        H1 = tf.math.reduce_mean(h1)
        # H2: entropy of the average assignment over the batch,
        # encourages usage of all clusters
        total_y = tf.math.reduce_mean(y, axis=0)
        H2 = -1.0 * tf.reduce_sum(tf.math.log(total_y) * total_y)
        return H1 - self.gamma * H2
</pre>
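To build some intuition for the two entropy terms, here is a small plain-Python sketch (independent of the TensorFlow class above; the toy batches and helper names are my own, not from the paper). A batch of confident assignments spread over both clusters gets a lower loss than a batch of uniform, undecided assignments:

```python
import math

def entropy(p):
    # Shannon entropy of a probability vector (natural log)
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def clustering_loss(batch, gamma=1.0):
    # H1: average entropy of the per-example softmax outputs
    h1 = sum(entropy(p) for p in batch) / len(batch)
    # H2: entropy of the average assignment over the batch
    avg = [sum(p[k] for p in batch) / len(batch) for k in range(len(batch[0]))]
    return h1 - gamma * entropy(avg)

# Confident assignments spread over both clusters: low H1, high H2 -> low loss
good = [[0.99, 0.01], [0.01, 0.99]]
# Undecided assignments: high H1 -> higher loss
bad = [[0.5, 0.5], [0.5, 0.5]]
print(clustering_loss(good) < clustering_loss(bad))  # True
```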
<h4 style="text-align: left;">REFERENCES</h4><div>[1] Jansen et al.: "Coincidence, Categorization, and Consolidation: Learning to Recognize Sounds with Minimal Supervision", ICASSP 2020.</div>dkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.com0Bremen, Germany53.079296199999987 8.8016936-23.939518639191633 -131.8233064 90 149.4266936tag:blogger.com,1999:blog-1040545417301646113.post-3504922668770016452020-03-12T04:49:00.002-07:002020-03-12T04:49:48.429-07:00Recursive Auto Encoders in Tensorflow 2.0In my <a href="https://daniel-at-world.blogspot.com/2019/09/recursive-neural-networks.html">last post</a>, I described recursive neural networks and how they can be used in PyTorch.<br />
<div>
In this post, I will describe the recursive autoencoder and my implementation in TensorFlow 2.0 using the new eager execution functionality and the gradient tape.</div>
<div>
</div>
<h3>
What is a recursive autoencoder? </h3>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXtQ-NNZsWiWls8kP6SVFZA_V9favwOnAFnE89SugKUfyeTvLs5V26gAalhsQ01xwlBRQHtfhIVsdrdpuCal7rFO6-1qn3XjM528SXOjj9CWStPWtBrmHLgITQA2YGitpbdvTPPxSzkt0/s1600/Screenshot+from+2020-03-12+12-28-20.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="310" data-original-width="324" height="306" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXtQ-NNZsWiWls8kP6SVFZA_V9favwOnAFnE89SugKUfyeTvLs5V26gAalhsQ01xwlBRQHtfhIVsdrdpuCal7rFO6-1qn3XjM528SXOjj9CWStPWtBrmHLgITQA2YGitpbdvTPPxSzkt0/s320/Screenshot+from+2020-03-12+12-28-20.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure from the paper [1] explaining recursive autoencoders</td></tr>
</tbody></table>
<br />
As with the recursive neural network from the last post, we build a neural network that decides whether two inputs should be merged. Instead of predicting a supervised target, in the autoencoder case we simply reconstruct the input and greedily merge the pair of nodes with the lowest reconstruction error.<br />
<div>
<div>
<br /></div>
<h3>
Implementation</h3>
<div>
First, we define the auto encoder:</div>
<pre class="brush:python">
import numpy as np
import tensorflow as tf


def merge_encoder(n_in):
    a = tf.keras.layers.Input(n_in)
    b = tf.keras.layers.Input(n_in)
    c = tf.keras.layers.Concatenate()([a, b])
    # h: hidden representation of the merged pair
    h = tf.keras.layers.Dense(n_in)(c)
    # o: reconstruction of the concatenated inputs
    o = tf.keras.layers.Dense(n_in * 2)(h)
    merge = tf.keras.models.Model(inputs=[a, b], outputs=[h, c, o])
    return merge
</pre>
The model takes the two inputs that should be merged, concatenates them, and feeds them through a dense layer resulting in the hidden representation.
A second dense layer reconstructs the concatenated input from the hidden representation.
In order to keep track of the merges, I define a node class.
<br />
<pre class="brush:python">
class Node:

    def __init__(self, i, embedding, score, l=None, r=None):
        self.i = i
        self.score = score
        self.embedding = embedding
        self.left = l
        self.right = r

    def print(self, offset=""):
        print("{} {} {} {}".format(offset, self.i, self.score, np.mean(self.embedding)))
        if self.left is not None and self.right is not None:
            self.left.print(offset + "\t")
            self.right.print(offset + "\t")

    def merge(self, other, merger):
        # Push both embeddings through the autoencoder
        merged = merger([self.embedding, other.embedding])
        h = merged[0]  # hidden representation of the parent
        c = merged[1]  # concatenated inputs
        y = merged[2]  # reconstruction of the concatenated inputs
        # Reconstruction error plus the accumulated error of the children
        score = tf.nn.l2_loss(y - c) + self.score + other.score
        return Node(-1, h, score, self, other)


def ts2leafs(ts):
    return [Node(i, ts[i], tf.constant(0.0)) for i in range(0, len(ts))]
</pre>
A node holds its embedding, an id (for leaf nodes) as well as the merge score.
Merging two nodes involves computing the hidden representation for the new node and
its score from the reconstruction error. At the beginning, every vector in a sequence
is simply a leaf node.
Now we can greedily merge the sequence by finding the minimum possible reconstruction error.
We keep merging until only one node is left.
<br />
<pre class="brush:python">
def merge(x, m):
    if len(x) > 1:
        min_loss = float('inf')
        min_node = None
        min_i = 0
        min_j = 0
        # Try all pairs and keep the merge with the lowest score
        for i in range(len(x)):
            for j in range(len(x)):
                if i < j:
                    node = x[i].merge(x[j], m)
                    if node.score < min_loss:
                        min_node = node
                        min_loss = node.score
                        min_i = i
                        min_j = j
        # Replace the two merged nodes by their parent
        next_x = x.copy()
        next_x[min_i] = min_node
        next_x = [next_x[idx] for idx in range(0, len(x)) if idx != min_j]
        return merge(next_x, m)
    else:
        return x[0]
</pre>
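To see the greedy strategy in isolation, here is a plain-Python sketch on scalars (my own toy example, independent of the TensorFlow model): the "reconstruction error" is just the squared difference, and the merged "embedding" is the mean of the pair. The two closest values are always merged first:

```python
def greedy_merge(values):
    # Repeatedly merge the pair with the smallest squared difference,
    # replacing it by its mean (a stand-in for the learned hidden vector)
    order = []
    while len(values) > 1:
        i, j = min(((i, j) for i in range(len(values))
                    for j in range(i + 1, len(values))),
                   key=lambda ij: (values[ij[0]] - values[ij[1]]) ** 2)
        order.append((values[i], values[j]))
        merged = (values[i] + values[j]) / 2.0
        values = [v for k, v in enumerate(values) if k not in (i, j)] + [merged]
    return values[0], order

root, order = greedy_merge([1.0, 1.1, 5.0, 5.2])
print(order[0])  # (1.0, 1.1) -- the two closest values are merged first
```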
Now we can learn the model using TensorFlow's gradient tape, which records all
computations during our merges so that we can compute the gradient and update the model.
<br />
<pre class="brush:python">
m = merge_encoder(10)
x = ts2leafs([np.ones((1, 10)) * i for i in range(0, 10)]
             + [np.ones((1, 10)) * i for i in range(0, 10)])
optimizer = tf.keras.optimizers.Adam()
for i in range(0, 100):
    with tf.GradientTape(watch_accessed_variables=True) as tape:
        tape.watch(m.variables)
        node = merge(x, m)
        print(node.score)
    g = tape.gradient(node.score, m.variables)
    optimizer.apply_gradients(zip(g, m.variables))
m.save('merger.h5')
</pre>
<div>
<br /></div>
<div>
<br /></div>
<span style="font-weight: normal;">My code can be found in this <a href="https://gist.github.com/dkohlsdorf/64cf9c3915010e00ec2a475a09a4220f">gist</a></span><br />
<h3>
REFERENCES</h3>
<div>
<a href="https://www.socher.org/uploads/Main/SocherPenningtonHuangNgManning_EMNLP2011.pdf">[1] Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, Christopher D. Manning: "Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions", EMNLP 2011</a></div>
</div>
dkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.com0Bremen, Germany53.079296199999987 8.801693652.774353199999986 8.1562466 53.384239199999989 9.4471406tag:blogger.com,1999:blog-1040545417301646113.post-9130150963480623302019-09-28T08:34:00.001-07:002019-09-28T08:37:08.487-07:00Recursive Neural Networks<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgc0uNSP-i6nX8Aj54__kIlX4gxqU97y2DYhTS5ETyARfOzc1DiVozEhUpHNEXZmpNmUgscfA9qkBtz78OlUxqdDDXVJiEVWZY0LhlmG84UmpxrdPHaW6fB1lwexcr7YipJeB2hmTsr91k/s1600/Screen+Shot+2019-09-25+at+5.56.26+PM.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="372" data-original-width="545" height="272" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgc0uNSP-i6nX8Aj54__kIlX4gxqU97y2DYhTS5ETyARfOzc1DiVozEhUpHNEXZmpNmUgscfA9qkBtz78OlUxqdDDXVJiEVWZY0LhlmG84UmpxrdPHaW6fB1lwexcr7YipJeB2hmTsr91k/s400/Screen+Shot+2019-09-25+at+5.56.26+PM.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A Recursive Neural Network</td></tr>
</tbody></table>
<br />
Recursive Neural Networks can learn to structure data such as text or images hierarchically. The network works similarly to a context-free grammar and its output is a parse tree.<br />
The figure below (from this <a href="https://ai.stanford.edu/~ang/papers/icml11-ParsingWithRecursiveNeuralNetworks.pdf">paper</a>) shows examples of parse trees for an image (top) and a sentence (bottom).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhr-YSA2lYhKoB6F_zY7FMTArZnhKBu-18L8H0Jk8GUibg_gAXyht41G7oLhlcBG-wAvTkA86OSxxdYmNfRDumMKNnPBbD8Sp6u3MxdDbgGyF7uMIMvBlqf4DLOsLPxevE5MFdbZSPuLmU/s1600/Illustration-of-our-recursive-neural-network-architecture-which-parses-images-and-natural.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="870" data-original-width="850" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhr-YSA2lYhKoB6F_zY7FMTArZnhKBu-18L8H0Jk8GUibg_gAXyht41G7oLhlcBG-wAvTkA86OSxxdYmNfRDumMKNnPBbD8Sp6u3MxdDbgGyF7uMIMvBlqf4DLOsLPxevE5MFdbZSPuLmU/s320/Illustration-of-our-recursive-neural-network-architecture-which-parses-images-and-natural.png" width="312" /></a></div>
<br />
A recursive neural network for sequences always merges two consecutive input vectors and predicts if these two vectors can be merged. If so, we replace the pair with the best merging score by the hidden vector responsible for the prediction. Applied recursively, this constructs a parse tree.<br />
<br />
Formally we can construct the network as follows. Given an input sequence $x = x_1 ... x_T, x_i \in \mathbb{R}^D$ then for two neighboring inputs $x_t, x_{t + 1}$ we predict a hidden representation:<br />
<br />
$h_t = \sigma(x_t W_l + b_l)$<br />
$h_{t + 1} = \sigma(x_{t + 1} W_r + b_r)$<br />
<br />
The hidden layer used for replacement is computed from the concatenation of the two previous hidden layers: $h=\sigma([h_t, h_{t + 1}] W_h + b_h)$. Finally, the prediction is simply another layer predicting the merging score.<br />
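These equations can be sketched directly in NumPy with toy dimensions (my own sketch; all weights are random and the dimensions are chosen arbitrarily for illustration, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 4, 3  # toy input and hidden dimensions

# Parameters: W_l, b_l, W_r, b_r, W_h, b_h and the scoring layer W_s, b_s
Wl, bl = rng.normal(size=(D, H)), np.zeros(H)
Wr, br = rng.normal(size=(D, H)), np.zeros(H)
Wh, bh = rng.normal(size=(2 * H, H)), np.zeros(H)
Ws, bs = rng.normal(size=(H, 1)), np.zeros(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def merge_score(xt, xt1):
    ht = sigmoid(xt @ Wl + bl)     # h_t = sigma(x_t W_l + b_l)
    ht1 = sigmoid(xt1 @ Wr + br)   # h_{t+1} = sigma(x_{t+1} W_r + b_r)
    h = sigmoid(np.concatenate([ht, ht1]) @ Wh + bh)
    score = h @ Ws + bs            # the merging score
    return h, score

h, score = merge_score(rng.normal(size=D), rng.normal(size=D))
print(h.shape, score.shape)  # (3,) (1,)
```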
<br />
For the implementation, we define the neural network with the merging decision first.
Given a node with its two child representations (for leaf nodes these are the input vectors),
we build the hidden representations and then compute the merging score.
<br />
<pre class="brush:python">import torch
import torch.nn as nn

# IN: input dimension, H: hidden dimension (defined globally)

class NeuralGrammar(nn.Module):
    '''
    Given a tree node we project the left child and right child into
    a feature space cl and cr. The concatenated [cl, cr] is projected
    into the representation of that node. Furthermore, we predict
    the label from the hidden representation.
    '''

    def __init__(self):
        super(NeuralGrammar, self).__init__()
        self.left = nn.Linear(IN, H, bias=True)
        self.right = nn.Linear(IN, H, bias=True)
        self.parent = nn.Linear(2 * H, IN, bias=True)
        self.projection = nn.Linear(IN, 1, bias=True)

    def forward(self, node):
        l = node.left_child.representation
        r = node.right_child.representation
        x = torch.cat([torch.relu(self.left(l)), torch.relu(self.right(r))], 0)
        p = torch.tanh(self.parent(x))
        score = self.projection(p)
        return (p, score)
</pre>
Now we can implement our node class, which also handles the parsing by greedily merging up to a tree.
A node in the tree holds its representation as well as a left and a right child. In other words,
each node represents a merging decision, with the two children being the merged nodes and the parent
representation being the hidden layer in the merger grammar. Parsing a tree involves merging the pair
with the highest score and replacing the two nodes with their parent node.
<br />
<pre class="brush:python">
class TreeNode:
    '''
    A tree node in a neural grammar.
    Represented as a binary tree. Each node is also represented by
    an embedding vector.
    '''

    def __init__(self, representation=None):
        self.representation = representation
        self.left_child = None
        self.right_child = None

    @classmethod
    def greedy_tree(cls, x, grammar):
        '''
        Greedily merges bottom up using a feature extractor and grammar
        '''
        T, D = x.shape
        leafs = [cls(x[i, :]) for i in range(T)]
        while len(leafs) >= 2:
            max_score = float('-inf')
            max_hypothesis = None
            max_idx = None
            for i in range(1, len(leafs)):
                # Propose merging the two neighboring nodes
                hypothesis = cls()
                hypothesis.left_child = leafs[i - 1]
                hypothesis.right_child = leafs[i]
                representation, score = grammar(hypothesis)
                hypothesis.representation = representation
                if score > max_score:
                    max_score = score
                    max_hypothesis = hypothesis
                    max_idx = i
            # Replace the two merged nodes by their parent
            leafs[max_idx - 1:max_idx + 1] = [max_hypothesis]
        return leafs[0]
</pre>
During learning, we collect the scores of the subtrees and use labels to decide if a merge was correct or not.
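One simple way to turn the collected scores into a training signal is a hinge loss over the merge decisions. This is my own sketch under the assumption that each merge is labeled +1 (agrees with the ground-truth tree) or -1 (does not); the helper name and margin are illustrative, not from the paper:

```python
def merge_hinge_loss(scores, labels, margin=1.0):
    # scores: merging scores collected while building the tree
    # labels: +1 for merges that agree with the ground truth, -1 otherwise
    # Penalizes correct merges with low scores and incorrect merges with high scores
    return sum(max(0.0, margin - y * s) for s, y in zip(scores, labels))

# A confident correct merge (2.0, +1) and a low-scoring incorrect merge (-1.5, -1)
# cost nothing; the incorrect merge with a positive score (0.2, -1) is penalized.
loss = merge_hinge_loss([2.0, -1.5, 0.2], [1, -1, -1])
print(loss)  # 1.2
```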
dkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.com0Bremen, Germany53.079296199999987 8.801693600000021452.774353199999986 8.15624660000002 53.384239199999989 9.4471406000000222tag:blogger.com,1999:blog-1040545417301646113.post-26249006959017297562019-08-01T06:26:00.000-07:002019-08-01T06:29:27.091-07:00Code Along Books: Bitcoin, Go Game, Ray Tracers<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgcn7sdFXZiJGAPMZ0eEsqmmQdArI7S0gGD0Q9Dd_P137qeVASkMBA4ywNTaWql6-sheFSHQTeH9aaWVsxPaQAHkSZ4HPHl80W-Arp17524HwP-V_mwMFlh7AIu4raScDDeyVWsPjM4gck/s1600/scene.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="256" data-original-width="512" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgcn7sdFXZiJGAPMZ0eEsqmmQdArI7S0gGD0Q9Dd_P137qeVASkMBA4ywNTaWql6-sheFSHQTeH9aaWVsxPaQAHkSZ4HPHl80W-Arp17524HwP-V_mwMFlh7AIu4raScDDeyVWsPjM4gck/s400/scene.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A scene rendered with your own ray tracer</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br />
I like learning by doing. In my opinion, the only real self-test of whether you understood something is a working prototype. For example, when learning more about the details of deep learning or XGBoost, I implemented my own <a href="http://daniel-at-world.blogspot.com/2017/03/auto-differentiation-implementing-book.html">auto differentiation</a> and <a href="http://daniel-at-world.blogspot.com/2017/12/implementing-gradient-boosted-trees.html">gradient boosted trees</a>.<br />
<div>
<div>
<br /></div>
<div>
Recently I found several books that guide you through implementations of fairly complex pieces of software. Finishing a chapter feels like making progress on a project, it is easy to follow along, and you don't have to hunt down all the references yourself. You still get the benefit of learning by doing.<br />
<div>
<br /></div>
<div>
The book-based projects I found are:</div>
<div>
<ol>
<li>A ray tracer [1]</li>
<li>A go AI [2]</li>
<li>A bitcoin implementation [3]</li>
<li>A Prolog interpreter [4]</li>
</ol>
<div>
The Prolog interpreter needed a little more adjustment, since the book covers classic<br />
AI in Lisp and I wanted to implement my Prolog in Scala.<br />
<br />
My earlier projects on neural networks were based on Goodfellow's Deep Learning book [5] and many other references such as online classes, so it was not as easy to code along. The same is true for the gradient boosting implementation, which was mostly based on papers and books.<br />
<br />
The other projects, however, simply involved writing code, and the book was the only thing required to get into the topic. What all of these books share is that they are not collections of separate code pieces explaining a topic; instead, the topics are explained through a coherent code base that you develop yourself from scratch.<br />
<br />
<h3>
1. A Ray Tracer</h3>
The book "The Ray Tracer Challenge: A Test-Driven Guide to Your First 3D Renderer" [1] gives you no actual code, only behavior specifications (Cucumber); the code has to be written by yourself.<br />
Harder parts include pseudo code. The book covers the basic math library first and then uses the defined primitives to construct more and more complex features.<br />
The end result is a full ray tracer with shadows, reflections, and multiple primitives.<br />
<br />
<h3>
2. A Go AI</h3>
<div>
In the next book, "Deep Learning And The Game Of Go" [2], you build a Go AI using the code provided by the authors. The book walks you through the implementation of the Go board and rules first. Then the authors walk through classic game-AI algorithms such as Monte Carlo tree search. In the end, the authors introduce game-state evaluation using convolutional neural networks in Keras.</div>
<div>
The resulting system resembles an early version of AlphaGo.</div>
<br />
<br />
<h3>
3. Bitcoin</h3>
<div>
The book "Programming Bitcoin" [3] is a mix between the first two. Lots of code is provided in the form of skeletons, but the reader still has to fill in the blanks for several assignments. The book covers everything from basic cryptography to scripting Bitcoin. The end result is a full Bitcoin implementation.</div>
<div>
<br /></div>
<div>
<br /></div>
<h3>
Wrap up</h3>
<div>
The great thing about all the books mentioned is that they also cover the primitives needed. You don't need to rely on a library; the authors walk you through front to back.</div>
<br />
My code bases are all on GitHub. Some are work in progress [WIP] since<br />
I did not finish the books yet. For example, in books 2 and 3 I only went through the first chapters:<br />
<br />
<div style="-webkit-text-stroke-width: 0px; color: black; font-family: Times; font-size: medium; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
</div>
<br />
<ol style="-webkit-text-stroke-width: 0px; color: black; font-family: Times; font-size: medium; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
<li><a href="https://github.com/dkohlsdorf/RayTracer">The ray tracer</a></li>
<li><a href="https://github.com/dkohlsdorf/gameplay">The go AI</a></li>
<li><a href="https://github.com/dkohlsdorf/pycoin">The bitcoin implementation</a></li>
<li><a href="https://github.com/dkohlsdorf/Scalog">The Prolog interpreter</a></li>
</ol>
</div>
<div>
<br />
<h3>
REFERENCES:</h3>
</div>
<div>
<br /></div>
<div>
[1] Buck, "The Ray Tracer Challenge: A Test-Driven Guide to Your First 3D Renderer", Pragmatic Bookshelf, 2019</div>
<div>
[2] Pumperla, Ferguson, "Deep Learning And The Game Of Go", Manning, 2019</div>
<div>
[3] Song, "Programming Bitcoin: Learn How To Program Bitcoin From Scratch", O'Reilly, 2019</div>
<div>
[4] Norvig: "Paradigms of Artificial Intelligence Programming: Case Studies in Common Lisp", Morgan Kaufmann, 1992<br />
[5] Goodfellow, Bengio, Courville, "Deep Learning", MIT Press, 2016</div>
</div>
</div>
</div>
dkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.com0Bremen, Germany53.079296199999987 8.801693600000021452.774353199999986 8.15624660000002 53.384239199999989 9.4471406000000222tag:blogger.com,1999:blog-1040545417301646113.post-16789143647320129192019-07-09T07:23:00.002-07:002019-07-09T07:23:57.984-07:00WebRTC and Tensorflow.js<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0Quc8izFlr33Aa9BZ5FRAOEePcKNpeBRe4SQB5q1dwKT1wYyPThbX2bt9eTJKMyHt0u7bs7pqW6zCHVhQushPIBTocHaoZEB8dQvB-cVK7lAkAHf5bCsIcsU8Nn1EZRamPLi0jlByFR0/s1600/Screen+Shot+2019-07-09+at+4.21.04+PM.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="468" data-original-width="558" height="167" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0Quc8izFlr33Aa9BZ5FRAOEePcKNpeBRe4SQB5q1dwKT1wYyPThbX2bt9eTJKMyHt0u7bs7pqW6zCHVhQushPIBTocHaoZEB8dQvB-cVK7lAkAHf5bCsIcsU8Nn1EZRamPLi0jlByFR0/s200/Screen+Shot+2019-07-09+at+4.21.04+PM.png" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">WebCam image classification on web page</td></tr>
</tbody></table>
<br /><br />
I will give a quick tutorial on how to connect a webcam in <i><a href="https://en.wikipedia.org/wiki/WebRTC">HTML5/WebRTC</a> </i>to <a href="https://www.tensorflow.org/js">Tensorflow.js</a> for image classification. We will load a pre-trained MobileNet and then pass it frames from the camera. First, we define our HTML page and specify a video element in which we later stream the video from the camera,<br />
a label field, and also start the WebRTC + neural network code. The code can be found at this <a href="https://gist.github.com/dkohlsdorf/b52861b2ef2319683fd0af19ee288e06">gist</a>.<br />
<br />
<br />
<pre class="brush:xhtml"><!DOCTYPE html>
<html>
<head>
<title> Hello WebRTC </title>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@1.0.0/dist/tf.min.js"></script>
</head>
<body>
<video id="cam" width="224" height="224" autoplay playsinline></video> <br/>
<script src="camnet.js"></script><br/>
<font color="red" size="5"> LABEL: </font><br/>
<div id="label"> </div>
</body>
</html>
</pre>
<br />
<br />
Then we download the pretrained mobile net and initialize the webcam. <br />
<pre class="brush:javascript">async function init() {
    try {
        const net = await tf.loadLayersModel(MOBILENET_MODEL_PATH);
        const constraints = window.constraints = {audio: false, video: true};
        const stream = await navigator.mediaDevices.getUserMedia(constraints);
        onSuccess(stream, net);
    } catch (e) {
        onError(e);
    }
}
</pre>
<br />
<br />
The camera stream is retrieved with the getUserMedia function; the onError method simply writes an error to the console.
If we are successful, we get the video element from the DOM and bind the stream to it. We then start the detection
loop with a method called onFrame.
<br />
<pre class="brush:javascript">function onSuccess(stream, net) {
    const video = document.querySelector('video');
    const videoTracks = stream.getVideoTracks();
    console.log('Got stream with constraints:', constraints);
    console.log(`Using video device: ${videoTracks[0].label}`);
    window.stream = stream;
    video.srcObject = window.stream;
    onFrame(video, net);
}
</pre>
<br />
<br />
onFrame's inner function processFrame loops indefinitely by re-scheduling itself with
requestAnimationFrame. We grab the video frame by frame and push each one into a classify
method along with the neural network, the video element as well as the label element.
<br />
<pre class="brush:javascript">function onFrame(video, net) {
    var label_element = document.getElementById('label');
    console.log(net.summary());
    async function processFrame() {
        classify(video, label_element, net);
        requestAnimationFrame(processFrame);
    }
    processFrame();
}
</pre>
The last method transforms the camera image into a tensor, normalizes the colors and then
constructs a batch with only one example. Based on the prediction from the MobileNet,
we extract the best class and write it into the label element.
<br />
<pre class="brush:javascript">async function classify(img_element, label_element, net) {
    const img = tf.browser.fromPixels(img_element).toFloat();
    const offset = tf.scalar(127.5);
    const normalized = img.sub(offset).div(offset);
    const batched = normalized.reshape([1, IMAGE_SIZE, IMAGE_SIZE, 3]);
    const prediction = await net.predict(batched).data();
    var max_i = 0;
    var max_v = prediction[0];
    for (let i = 0; i < prediction.length; i++) {
        if (prediction[i] > max_v) {
            max_v = prediction[i];
            max_i = i;
        }
    }
    const label = IMAGENET_CLASSES[max_i];
    if (max_v > 0.5) {
        label_element.innerHTML = label + " [" + max_v + "]";
    }
}
</pre>
<br />
<br />
<br />dkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.com0Bremen, Germany53.079296199999987 8.801693600000021452.774353199999986 8.15624660000002 53.384239199999989 9.4471406000000222tag:blogger.com,1999:blog-1040545417301646113.post-7561501718786789702019-07-07T04:50:00.003-07:002019-07-07T04:50:53.552-07:00Implementing Triplet Loss Function in Tensorflow 2.0In this post I will go through an implementation of the triplet loss for siamese neural network architectures in keras (tensorflow 2.0).<br />
<br />
The goal of training a neural network with a triplet loss is to learn a metric embedding: examples that are conceptually close are also close in Euclidean space, and examples that are conceptually distant are far apart in Euclidean space, too. The triplet loss was introduced in the <a href="https://arxiv.org/abs/1503.03832">FaceNet</a> paper from Google. When learning with the triplet loss, we choose three examples. The first example is called the anchor, which can be any example from our dataset. The positive example is one that is conceptually close. For example, a conceptually close example for face recognition might be a picture of the same person as shown in the anchor image. The negative example might simply be a picture of any other person. The embedding network can have any architecture. During training, we push all three examples through the network, compute the loss function with the three embeddings, and adjust the weights, which in each epoch pushes negative examples further from their anchors and positive examples closer.<br />
<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtsR2g9M0XBY4qGN4GjYQFYZo9JnD9Eb-Z9Keg3zCqYYDcSGdd8iW-ohH3timyC4EVuXzgNQUSO1aafI3xNzLXLdfEPXNooy8zOolYBwliQD-TqKi_56q7BpuV8MMYBmRA_z9Z541PHOc/s1600/TripletLoss.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="360" data-original-width="480" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtsR2g9M0XBY4qGN4GjYQFYZo9JnD9Eb-Z9Keg3zCqYYDcSGdd8iW-ohH3timyC4EVuXzgNQUSO1aafI3xNzLXLdfEPXNooy8zOolYBwliQD-TqKi_56q7BpuV8MMYBmRA_z9Z541PHOc/s400/TripletLoss.gif" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 1: The Triplet loss forces negative examples further from the anchor and positive examples closer. </td></tr>
</tbody></table>
An example is shown below. The loss consists of three parts. The first is the Euclidean distance between the anchor and its positive example. The second is the Euclidean distance between the anchor and its negative example. Since we want to minimize the loss, the first term should be small and the second term should be large. Furthermore, we introduce a margin between the examples as a regularizer.<br />
<div>
<br />
<div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><img border="0" data-original-height="842" data-original-width="1600" height="336" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhI3ktcsqPgnlbfSQdd5iVK5SyqNDp_kNqovB4LAT_RShB5glIFSOZZzvQiDed-WSn19YSfoKK6h8sMNNsPGQ7drtmqvr7N0Izq5CcE2m9K8VlV4Hn31pLwudiAR7DYjlLtRqIxWu16vys/s640/Screen+Shot+2019-07-07+at+1.24.36+PM.png" style="margin-left: auto; margin-right: auto;" width="640" /></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 2: We embed the anchor, positive and negative examples with the same network and apply the triplet loss. During backpropagation we achieve higher Euclidean distances to the negative examples and lower distances to the positive examples. </td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhI3ktcsqPgnlbfSQdd5iVK5SyqNDp_kNqovB4LAT_RShB5glIFSOZZzvQiDed-WSn19YSfoKK6h8sMNNsPGQ7drtmqvr7N0Izq5CcE2m9K8VlV4Hn31pLwudiAR7DYjlLtRqIxWu16vys/s1600/Screen+Shot+2019-07-07+at+1.24.36+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"></a><br /></div>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhI3ktcsqPgnlbfSQdd5iVK5SyqNDp_kNqovB4LAT_RShB5glIFSOZZzvQiDed-WSn19YSfoKK6h8sMNNsPGQ7drtmqvr7N0Izq5CcE2m9K8VlV4Hn31pLwudiAR7DYjlLtRqIxWu16vys/s1600/Screen+Shot+2019-07-07+at+1.24.36+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"></a><br />
For <a href="https://arxiv.org/abs/1711.02209">unsupervised audio embedding</a>, one way to sample triplets is to pick a window from the audio as the anchor and a close window in time to the anchor as positive (since audio does not change that rapidly). The negative example is simply a sample from another file.<br />
<br />
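A sketch of this sampling scheme in plain Python (my own illustration; the helper name, window length, and maximum offset are arbitrary choices, not from the paper):

```python
import random

def sample_triplet(files, window=100, max_offset=50):
    # files: list of 1-D audio signals (lists of samples)
    i = random.randrange(len(files))
    j = random.choice([k for k in range(len(files)) if k != i])
    # Anchor: a random window from file i
    a_start = random.randrange(len(files[i]) - window - max_offset)
    # Positive: a window close in time to the anchor in the same file
    p_start = a_start + random.randrange(1, max_offset)
    # Negative: a window from a different file
    n_start = random.randrange(len(files[j]) - window)
    anchor = files[i][a_start:a_start + window]
    positive = files[i][p_start:p_start + window]
    negative = files[j][n_start:n_start + window]
    return anchor, positive, negative

files = [[float(t) for t in range(1000)], [float(-t) for t in range(1000)]]
a, p, n = sample_triplet(files)
print(len(a), len(p), len(n))  # 100 100 100
```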
In order to implement such a loss in Tensorflow 2.0 / Keras, we can implement the <a href="https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/losses/Loss">Loss base class</a>.<br />
<br />
<pre class="brush:python">import tensorflow as tf
from tensorflow.keras.losses import Loss


class TripletLoss(Loss):

    def __init__(self, margin):
        super().__init__()
        self.margin = margin

    def call(self, y_true, y_pred):
        anchor = y_pred[0]
        pos = y_pred[1]
        neg = y_pred[2]
        # Squared euclidean distances between the anchor and its examples
        pos_dist = tf.reduce_sum(tf.square(tf.subtract(anchor, pos)), axis=-1)
        neg_dist = tf.reduce_sum(tf.square(tf.subtract(anchor, neg)), axis=-1)
        # Hinge: only penalize triplets where the negative is not
        # at least `margin` further away than the positive
        basic_loss = tf.add(tf.subtract(pos_dist, neg_dist), self.margin)
        loss = tf.reduce_sum(tf.maximum(basic_loss, 0.0))
        return loss
</pre>
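To check the behavior, we can evaluate the same formula on toy embeddings in plain Python (my own toy numbers, independent of the class above):

```python
def triplet_loss(anchor, pos, neg, margin=0.2):
    # Squared euclidean distances, then the hinged margin loss
    d_pos = sum((a - p) ** 2 for a, p in zip(anchor, pos))
    d_neg = sum((a - n) ** 2 for a, n in zip(anchor, neg))
    return max(d_pos - d_neg + margin, 0.0)

# Negative already further away than the positive plus the margin: no loss
print(triplet_loss([0.0, 0.0], [0.1, 0.0], [2.0, 0.0]))  # 0.0
# Negative as close as the positive: the margin is violated, positive loss
print(triplet_loss([0.0, 0.0], [1.0, 0.0], [1.0, 0.0]))  # 0.2
```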
<br />
As you can see from the code, we do not need the ground truth, so we simply pass dummy values for `y_true` during training (Keras will still check its dimensions). In order to construct a model using the triplet loss, we can build an embedder model and then apply that model three times in the triplet loss model, once per input. During backpropagation, the three gradients are summed and then passed through the embedder model (<a href="https://www.deeplearningbook.org/contents/mlp.html">deep learning book chapter 6</a>, Algorithm 6.6). A full example on<br />
audio data can be found in this <a href="https://gist.github.com/dkohlsdorf/90df4721cd70c6f4420b7e049796280b">gist</a>.<br />
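The arithmetic inside the loss is easy to check in isolation. Below is the same hinge computation for a single triplet, sketched outside of Tensorflow in plain Rust (helper names are mine, not from the gist):

```rust
// Squared euclidean distance between two embedding vectors.
fn squared_dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

// Triplet hinge loss: max(0, ||a - p||^2 - ||a - n||^2 + margin).
fn triplet_loss(anchor: &[f32], pos: &[f32], neg: &[f32], margin: f32) -> f32 {
    (squared_dist(anchor, pos) - squared_dist(anchor, neg) + margin).max(0.0)
}

fn main() {
    let anchor = [0.0, 0.0];
    let pos = [0.1, 0.0]; // close to the anchor
    let neg = [1.0, 1.0]; // far from the anchor
    // pos_dist is small, neg_dist is large, so the hinge clips the loss to zero
    println!("{}", triplet_loss(&anchor, &pos, &neg, 0.5));
}
```

If the negative sits closer to the anchor than the positive, the loss becomes positive and the gradient pushes the embeddings apart.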
<br />
<br />
<br />
<br />
<br /></div>
</div>
dkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.com0Bremen, Germany53.079296199999987 8.801693600000021452.774353199999986 8.15624660000002 53.384239199999989 9.4471406000000222tag:blogger.com,1999:blog-1040545417301646113.post-67618457361814419802019-04-01T06:10:00.003-07:002019-07-09T07:33:17.080-07:00Navigable Small World Graphs for Approximate Nearest Neighbors: Implementation in Rust and Time Series Experiments.<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5iiZnz9Rjlod3g6C-k9Q9ymmkqKsZVwrbkpiok9GhLi6u3V05HdobVmpGu9-IDAUxFkx7QZjEtp8vKiMG0nf9pVRKzoBDWXGLC7S5bsXq-D6ODKhXIMwr0DSrAqG924jD7TrJ7ZFZvPo/s1600/Screenshot+2019-03-31+at+23.20.56.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="602" data-original-width="1238" height="311" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5iiZnz9Rjlod3g6C-k9Q9ymmkqKsZVwrbkpiok9GhLi6u3V05HdobVmpGu9-IDAUxFkx7QZjEtp8vKiMG0nf9pVRKzoBDWXGLC7S5bsXq-D6ODKhXIMwr0DSrAqG924jD7TrJ7ZFZvPo/s640/Screenshot+2019-03-31+at+23.20.56.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A navigable small world graph. Image from <a href="https://publications.hse.ru/mirror/pubs/share/folder/x5p6h7thif/direct/128296059" style="font-size: medium; text-align: start;">NSWG</a></td></tr>
</tbody></table>
<br />
A small world graph, or network, is a mathematical graph with a particular property: most nodes are not neighbors of one another, yet the neighbors of any node are likely to be neighbors of each other, and any node can be reached from any other in a small number of edge transitions.<br />
The idea is to use such a structure to support fast approximate nearest neighbor queries.<br />
In a navigable small world graph (<a href="https://publications.hse.ru/mirror/pubs/share/folder/x5p6h7thif/direct/128296059">NSWG</a>) each node or vertex is associated with a vector in $\mathbb{R}^d$. During search, we check the distance to all neighbors of the current node and transition to the closest one, repeating until we reach a local minimum. In this post we discuss how to compute the $k$ closest neighbors and explore the method using the <a href="https://www.cs.ucr.edu/~eamonn/time_series_data_2018/">UCR time series dataset</a>. The code can be found on <a href="https://github.com/dkohlsdorf/NSWG">github</a>.<br />
<br />
<h3>
Modelling the Graph and Search Results</h3>
<div>
The edges in the graph are implemented using a map. For each vertex, we map to the
neighboring vertices. The struct also holds the flattened vectors associated with each vertex.
<br />
<pre class="language-rust"><code class="language-rust">/// A navigable small world graph defined
/// by a sparse edge set and a flat vector set, one for each vertex.
pub struct NavigableSmallWorldGraph {
pub edges: HashMap<usize, Vec<usize>>,
pub instances: Vec<f32>,
pub dim: usize
}
</code></pre>
A search result is simply implemented as a struct with a node id as well as a distance
to the query.
<br />
<pre class="language-rust"><code class="language-rust">pub struct SearchResult {
pub node: usize,
pub distance: f32
}
</code></pre>
<br /></div>
Furthermore, we implement comparison of search results by their distance. In
order to manage the set of candidates during knn search, we implement a
result set as a binary tree set. This way, extracting the top neighbors is
fast.
<br />
<pre class="language-rust"><code class="language-rust">pub struct ResultStore {
ordered: BTreeSet<SearchResult>
}
impl ResultStore {
pub fn new() -> ResultStore {
ResultStore { ordered: BTreeSet::new() }
}
pub fn len(&self) -> usize {
self.ordered.len()
}
pub fn take(&self, k: usize) -> Vec<SearchResult> {
self.ordered.iter().take(k).map(|x| x.clone()).collect()
}
pub fn best(&self, at: usize) -> SearchResult {
let k_best = self.ordered.iter().take(at);
if k_best.len() < at {
SearchResult::none()
} else {
k_best.last().unwrap().clone()
}
}
pub fn insert(&mut self, result: SearchResult) {
self.ordered.insert(result);
}
pub fn remove(&mut self, result: &SearchResult) {
self.ordered.remove(result);
}
}
</code></pre>
Insertions into this data structure automatically keep it ordered, and
removals likewise preserve the order. Iterating the elements therefore
yields the candidates sorted by distance.
<br />
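One detail the snippet above glosses over: `BTreeSet` requires a total order, and `f32` only implements `PartialOrd`. A minimal sketch of how `SearchResult` could implement `Ord` over its distance (using `f32::total_cmp` with the node id as tie breaker is my choice here, not necessarily what the repository does):

```rust
use std::cmp::Ordering;
use std::collections::BTreeSet;

#[derive(Clone, PartialEq)]
struct SearchResult {
    node: usize,
    distance: f32,
}

impl Eq for SearchResult {}

impl PartialOrd for SearchResult {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl Ord for SearchResult {
    fn cmp(&self, other: &Self) -> Ordering {
        // order by distance, break ties by node id so distinct nodes are kept
        self.distance
            .total_cmp(&other.distance)
            .then(self.node.cmp(&other.node))
    }
}

fn main() {
    let mut set = BTreeSet::new();
    set.insert(SearchResult { node: 1, distance: 3.0 });
    set.insert(SearchResult { node: 2, distance: 1.0 });
    set.insert(SearchResult { node: 3, distance: 2.0 });
    let order: Vec<usize> = set.iter().map(|r| r.node).collect();
    println!("{:?}", order); // iteration comes back sorted by distance
}
```

Without such a total order the set could not sort the results, which is exactly the property the candidate and result stores rely on.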
<h3>
KNN-Search and Graph Construction</h3>
First we define the distances we support. For this post, I implement the euclidean distance between
a query vector and a node in the graph, as well as the dynamic time warping (DTW) distance, which treats the
vectors as time series. If we pass a warping band to the distance function, we compute the DTW distance; otherwise we use the euclidean distance.
<br />
<pre class="language-rust"><code class="language-rust"> fn distance(&self, query: &[f32], node: usize, align_band: Option<usize>) -> f32 {
match align_band {
Some(w) => self.dtw(query, node, w),
None => self.euclidean(query, node)
}
}
fn euclidean(&self, query: &[f32], node: usize) -> f32 {
assert!(query.len() == self.dim);
let y = &self.instances[node * self.dim .. (node + 1) * self.dim];
let mut distance = 0.0;
for i in 0 .. query.len() {
distance += f32::powf(query[i] - y[i], 2.0);
}
f32::sqrt(distance)
}
fn dtw(&self, query: &[f32], node: usize, w: usize) -> f32 {
    let y = &self.instances[node * self.dim .. (node + 1) * self.dim];
    let n = query.len();
    let m = y.len();
    // allocating here is the main bottleneck
    let mut dp = vec![std::f32::INFINITY; (n + 1) * (m + 1)];
    dp[0] = 0.0;
    for i in 1 .. n + 1 {
        // sub is a saturating subtraction helper; together with the min below
        // we only fill cells inside the Sakoe Chiba band of width w
        for j in usize::max(NavigableSmallWorldGraph::sub(i, w), 1) .. usize::min(i + w, m + 1) {
            let distance = f32::powf(query[i - 1] - y[j - 1], 2.0);
            // the dp matrix has m + 1 columns, so the row stride is m + 1
            dp[i * (m + 1) + j] = distance + NavigableSmallWorldGraph::min3(
                dp[(i - 1) * (m + 1) + j],
                dp[i * (m + 1) + (j - 1)],
                dp[(i - 1) * (m + 1) + (j - 1)]
            )
        }
    }
    dp[dp.len() - 1]
}
</code></pre>
We can now turn to the search algorithm. During the search, we
maintain an ordered candidate list, an ordered result list and an unordered
visited set. A search run starts at a random entry node, which we add
as the first candidate. In each following step, we pop the best node from
the candidate list and add all of its unvisited neighbors as candidates.
If the best candidate is further away than the $k$-th best neighbor found so far, we stop.
We restart the search $n$ times from different random entry points.
<br />
<pre class="language-rust"><code class="language-rust"> /// Given a query find the k-nearest neighbors using the nswg graph
/// * `query` The query vector
/// * `n_searches` number of restarts
/// * `k_neighbors` number of results
/// * `align_band` Sakoe Chiba Band for Dynamic Time Warping, if not set we use Euclidean Distance
pub fn search(
&self,
query: &[f32],
n_searches: usize,
k_neighbors: usize,
align_band: Option<usize>,
) -> (Vec<SearchResult>, usize) {
let mut candidates = ResultStore::new();
let mut results = ResultStore::new();
let mut visited = HashSet::new();
let mut rng = rand::thread_rng();
let mut n_steps = 0;
for _attempt in 0..n_searches {
let mut bsf = ResultStore::new();
// stop if we visited all of the nodes
if self.n_instances() == visited.len() {
break;
}
// sample entry point, make sure we never used it before
// and insert the point as a candidate
let mut entry_point = rng.gen_range(0, self.n_instances());
while visited.contains(&entry_point) {
entry_point = rng.gen_range(0, self.n_instances());
}
let distance = self.distance(query, entry_point, align_band);
candidates.insert(SearchResult::new(entry_point, distance));
while candidates.len() > 0 {
// check if we can stop: abandon the search if the candidate is worse than the k-th best result so far
let candidate = candidates.best(1);
let kth = f32::min(
results.best(k_neighbors).distance,
bsf.best(k_neighbors).distance,
);
candidates.remove(&candidate);
if candidate.distance > kth {
break;
}
// expand all edges of the candidate
for edge in self.edges[&candidate.node].iter() {
if !visited.contains(edge) {
n_steps += 1;
visited.insert(edge);
let edge = *edge;
let distance = self.distance(query, edge, align_band);
candidates.insert(SearchResult::new(edge, distance));
bsf.insert(SearchResult::new(edge, distance));
}
}
}
for res in bsf.take(k_neighbors) {
results.insert(res);
}
}
(results.take(k_neighbors), n_steps)
}
</code></pre>
<div>
<br /></div>
<h3>
Testing the Code on UCR Time Series</h3>
In order to test the code, we run it against multiple datasets from the UCR time series archive.
Our accuracy is similar to the baselines reported for the UCR datasets, while the number of distance computations is far lower.
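The construction step itself is not shown above. In an NSWG the graph is typically built incrementally: each new point is inserted and connected bidirectionally to its (approximately) nearest existing points. A brute force sketch of that idea, simplified by me (the repository instead searches the partial graph to find the neighbors):

```rust
use std::collections::HashMap;

// Connect each newly inserted point to its k nearest existing points
// (bidirectionally). Brute force over all previous points for clarity.
fn build_graph(points: &[Vec<f32>], k: usize) -> HashMap<usize, Vec<usize>> {
    let mut edges: HashMap<usize, Vec<usize>> = HashMap::new();
    for i in 0..points.len() {
        edges.entry(i).or_default();
        let mut dists: Vec<(f32, usize)> = (0..i)
            .map(|j| (euclidean(&points[i], &points[j]), j))
            .collect();
        dists.sort_by(|a, b| a.0.total_cmp(&b.0));
        for &(_, j) in dists.iter().take(k) {
            edges.get_mut(&i).unwrap().push(j);
            edges.get_mut(&j).unwrap().push(i);
        }
    }
    edges
}

fn euclidean(x: &[f32], y: &[f32]) -> f32 {
    x.iter().zip(y).map(|(a, b)| (a - b) * (a - b)).sum::<f32>().sqrt()
}

fn main() {
    let points = vec![vec![0.0], vec![0.1], vec![5.0], vec![5.1]];
    let edges = build_graph(&points, 1);
    println!("{:?}", edges[&3]); // point 3 connects to its nearest point, 2
}
```

Because early points are inserted when few candidates exist, they pick up long-range edges, which is what gives the graph its small world shortcuts.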
dkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.com0Bremen, Germany53.079296199999987 8.801693600000021452.774353199999986 8.15624660000002 53.384239199999989 9.4471406000000222tag:blogger.com,1999:blog-1040545417301646113.post-35305223865054758812019-03-16T10:00:00.001-07:002019-03-16T10:03:02.344-07:00Avro Rust: A first look at data serialisation in rust using avro<a href="https://avro.apache.org/docs/current/">Avro</a> is a data serialisation system in which you can define a schema (basically the fields in a record you want to serialise along with their type) in json and then send the data binary.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5zFL-uLaur4DDWuewcwaKFYdtSWmVUfWxBnjs3PQBLBZZPOzAZVR5c1NplrNtVoBKchrRkk3njb975Naq94PSyr3iUZ8eLc9wHEhkBfpC42oX2kS6pllKgHrMqYhZG5lN6ucdgvlNM7c/s1600/Screenshot+2019-03-16+at+18.02.15.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="610" data-original-width="654" height="298" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5zFL-uLaur4DDWuewcwaKFYdtSWmVUfWxBnjs3PQBLBZZPOzAZVR5c1NplrNtVoBKchrRkk3njb975Naq94PSyr3iUZ8eLc9wHEhkBfpC42oX2kS6pllKgHrMqYhZG5lN6ucdgvlNM7c/s320/Screenshot+2019-03-16+at+18.02.15.png" width="320" /></a></div>
<br />
<br />
<br />
In the Rust language, structs can be serialised to multiple formats using a library called <a href="https://github.com/serde-rs/serde">serde</a>. In the<br />
<a href="http://daniel-at-world.blogspot.com/2019/03/word2vec-implementation-in-rust.html">last post </a> we saved our word2vec model into a binary format using serde and bincode. In this post, we will look into serialising with serde and <a href="https://github.com/flavray/avro-rs">avro-rs</a>, an avro implementation in rust and implement sending serialised objects over a tcp stream using <a href="https://avro.apache.org/docs/1.8.1/spec.html#Message+Framing">framing</a>.<br />
<br />
As usual the code is on <a href="https://gist.github.com/dkohlsdorf/a17039267ebeba3da183a04049b6ebfd#file-avro_tcp_ip_frame-rs">[:github:]</a><br />
<br />
<br />
<h3>
Data (De) Serialisation </h3>
<div>
Avro supports several data types for serialisation including more complex types such as arrays.<br />
<br />
<br />
<pre class="language-rust"><code class="language-rust"> #[derive(Debug, Deserialize, Serialize)]
pub struct Test {
a: i32,
b: f64,
c: String,
d: Vec<i32>
}
fn schema() -> Schema {
let raw_schema = r#"
{
"type": "record",
"name": "test",
"fields": [
{"name": "a", "type": "int", "default": 0},
{"name": "b", "type": "double", "default": 1.0},
{"name": "c", "type": "string", "default": "dansen"},
{"name": "d", "type": {"type": "array", "items": "int"}}
]
}
"#;
avro_rs::Schema::parse_str(raw_schema).unwrap()
}
</code></pre>
In the code above, we define the struct that we want to serialise and specify the serialisation schema in JSON;
nothing more is needed. With this information we can serialise and deserialise the `Test` struct, and we
can even (de)serialise a collection of these structs. Serialisation produces a vector of bytes, and
given this vector of bytes we can deserialise back into a vector of our original struct.
<br />
<pre class="language-rust"><code class="language-rust">
fn write_data(data: &[Test], schema: Schema) -> Vec<u8> {
let mut writer = Writer::with_codec(&schema, Vec::new(), Codec::Deflate);
for x in data.iter() {
writer.append_ser(x).unwrap();
}
writer.flush().unwrap();
writer.into_inner()
}
fn read_data(data: &[u8], schema: Schema) -> Vec<Test> {
let reader = Reader::with_schema(&schema, data).unwrap();
reader
.map(|record| from_value::<Test>(&record.unwrap()).unwrap())
.collect()
}
</code></pre>
<br />
<h3>
Framed Writing and Reading over TCP/IP</h3>
</div>
<div>
In order to send avro over the network, it is recommended to frame the message: we
break the byte stream into chunks, send each chunk preceded by its length, and terminate
the sequence with a zero sized chunk.
<q>Avro messages are framed as a list of buffers.
Framing is a layer between messages and the transport. It exists to optimize certain operations.
The format of framed message data is:
a series of buffers, where each buffer consists of:
a four-byte, big-endian buffer length, followed by
that many bytes of buffer data.
A message is always terminated by a zero-length buffer.
Framing is transparent to request and response message formats (described below). Any message may be presented as a single or multiple buffers.
Framing can permit readers to more efficiently get different buffers from different sources and for writers to more efficiently store different buffers to different destinations. In particular, it can reduce the number of times large binary objects are copied. For example, if an RPC parameter consists of a megabyte of file data, that data can be copied directly to a socket from a file descriptor, and, on the other end, it could be written directly to a file descriptor, never entering user space.
A simple, recommended, framing policy is for writers to create a new segment whenever a single binary object is written that is larger than a normal output buffer. Small objects are then appended in buffers, while larger objects are written as their own buffers. When a reader then tries to read a large object the runtime can hand it an entire buffer directly, without having to copy it. </q> from <a href="https://avro.apache.org/docs/1.8.1/spec.html#Message+Framing"> Avro Page</a> .<br />
<br />
Writing over the network with a fixed buffer or chunk size can be implemented by sending slices of the given size
over the network, each preceded by its length (which only differs from the buffer size for the last slice). As described above,
we send the size of each buffer as a four-byte, big-endian integer.
<br />
<br />
<br />
<pre class="language-rust"><code class="language-rust"> pub fn send_framed(objects: &[Test], address: String, schema: Schema, buf_size: usize) {
let data = write_data(objects, schema);
let mut stream = TcpStream::connect(address).unwrap();
let n = data.len();
for i in (0..n).step_by(buf_size) {
// determine size of bytes to write
let start = i;
let stop = usize::min(i + buf_size, n);
// send length of buffer
let mut buffer_length = [0; 4];
BigEndian::write_u32(&mut buffer_length, (stop - start) as u32);
stream.write_all(&buffer_length).unwrap();
// write actual data
stream.write_all(&data[start..stop]).unwrap();
}
// terminate by 0 sized
let mut buffer_length = [0; 4];
BigEndian::write_u32(&mut buffer_length, 0);
stream.write_all(&buffer_length).unwrap();
stream.flush().unwrap();
}
</code></pre>
<br />
<br />
When reading, we simply read the size of the next buffer, then read that many bytes and append them to
a vector of bytes, until we hit the zero sized buffer marking the end. Then we use the method above to deserialise into a vector of structs.
<br />
<pre class="language-rust"><code class="language-rust">
pub fn read_framed(stream: &mut TcpStream, schema: Schema) -> Vec<Test> {
let mut message: Vec<u8> = Vec::new();
let mut first_buf = [0; 4];
stream.read_exact(&mut first_buf).unwrap();
let mut next = BigEndian::read_u32(&first_buf);
// read while we see non empty buffers
while next > 0 {
// append all bytes to final object
let mut object = vec![0; next as usize];
stream.read_exact(&mut object).unwrap();
message.append(&mut object);
// read next buffer size
let mut next_buf = [0; 4];
stream.read_exact(&mut next_buf).unwrap();
next = BigEndian::read_u32(&next_buf);
}
read_data(&message, schema)
}
</code></pre>
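The framing format itself can be checked without opening a socket. Here is a minimal in-memory round trip of the length-prefixed buffers, using only the standard library (no byteorder crate):

```rust
use std::convert::TryInto;

// Split `data` into length-prefixed chunks (big-endian u32 lengths),
// terminated by a zero-length buffer.
fn frame(data: &[u8], buf_size: usize) -> Vec<u8> {
    let mut out = Vec::new();
    for chunk in data.chunks(buf_size) {
        out.extend_from_slice(&(chunk.len() as u32).to_be_bytes());
        out.extend_from_slice(chunk);
    }
    out.extend_from_slice(&0u32.to_be_bytes()); // terminator
    out
}

// Parse the length-prefixed chunks back into one contiguous byte vector.
fn deframe(framed: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut pos = 0;
    loop {
        let len = u32::from_be_bytes(framed[pos..pos + 4].try_into().unwrap()) as usize;
        pos += 4;
        if len == 0 {
            break;
        }
        out.extend_from_slice(&framed[pos..pos + len]);
        pos += len;
    }
    out
}

fn main() {
    let payload: Vec<u8> = (0..10).collect();
    let framed = frame(&payload, 4);
    assert_eq!(deframe(&framed), payload);
    println!("round trip ok, framed {} bytes", framed.len());
}
```

Ten payload bytes in chunks of four become three buffers plus the terminator: each prefix costs four bytes, so the framed message is 26 bytes in total.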
</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
That's it: now we can send framed avro over the network and deserialise it on the other end.</div>
dkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.com1Bremen, Germany53.079296199999987 8.801693600000021452.774353199999986 8.15624660000002 53.384239199999989 9.4471406000000222tag:blogger.com,1999:blog-1040545417301646113.post-11633455757080060552019-03-08T02:59:00.001-08:002019-03-10T07:28:43.513-07:00Word2Vec Implementation in Rust: Continuous Bag of Words<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmfAp5instwzu93kvHlL0FttD26GMpGpoD7Ymqub01GyodcwDLGh4dKbaOMKFXS3550y4VI_fry48s83GvTRXd7LTuL0kJEN0YguyP0LTgGGHVbMxohwsi5grfX7AdZJCV1Z1uvdgh5sM/s1600/Screenshot+2019-03-08+at+11.38.55.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="852" data-original-width="1600" height="340" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmfAp5instwzu93kvHlL0FttD26GMpGpoD7Ymqub01GyodcwDLGh4dKbaOMKFXS3550y4VI_fry48s83GvTRXd7LTuL0kJEN0YguyP0LTgGGHVbMxohwsi5grfX7AdZJCV1Z1uvdgh5sM/s640/Screenshot+2019-03-08+at+11.38.55.png" width="640" /></a></div>
<br />
I recently started to get serious about learning rust and I needed a small to medium project to train on.<br />
One method I explained in this <a href="https://daniel-at-world.blogspot.com/2016/10/reading-through-word2vec-code.html">blog</a> before, walking through the original code, is Word2Vec, a method to learn word embeddings. In this post I will explain the method again using my new rust code on <a href="https://github.com/dkohlsdorf/RustWord2Vec">github</a>.<br />
<br />
<h3>
Word Similarity - Basic Idea </h3>
<div>
In the following, I will describe the basic idea behind continuous bag of words (CBOW), one way to estimate word vectors. Word vectors are a dense vectorised word representation that captures the similarity of words in euclidean space. We have one vector for each word in our vocabulary, and the euclidean distance between the vectors of two similar words is smaller than the distance between the vectors
of two less similar words. The assumption in Word2Vec is that words that appear in similar
contexts are similar. We shift a sliding window over the text and try to predict the center word
of the window from the word vectors of the surrounding context words. </div>
<br />
<pre class="language-rust"><code class="language-rust">pub struct Window<'a> {
pub words: &'a [usize],
pub predict_pos: usize
}
pub struct Document {
pub words: Vec<usize>
}
impl<'a> Window<'a> {
pub fn label(&self) -> usize {
self.words[self.predict_pos]
}
}
impl Document {
pub fn window(&self, pos: usize, size: usize) -> Option<Window> {
let start = sub(pos, size);
let stop = usize::min(pos + size, self.words.len() - 1);
if stop - start == 2 * size {
Some(Window {words: &self.words[start as usize .. stop as usize], predict_pos: size})
} else {
None
}
}
}</code></pre>
<br />
<div>
The code above defines a document as a vector of word ids and allows shifting a sliding window over it. The return type is not a vector but a slice, which is convenient since nothing is copied. In the following, the center word in the window is called the label $l$ and the other words are called the context: $c_1 .. c_m$.<br />
<br /></div>
<h3>
Word Vectors</h3>
<div>
Now that we have defined the window, we can start implementing a representation for the word vectors. There will be one vector of $d$ dimensions or columns per word in the dictionary.<br />
In code we save the word vectors as a flat vector of doubles. We can then access the $i$-th vector as the slice $[i * cols .. (i + 1) * cols]$.<br />
<br />
<br />
<br /></div>
<div>
<pre class="language-rust"><code class="language-rust">#[derive(Serialize, Deserialize)]
pub struct ParameterStore {
pub cols: usize,
pub weights: Vec<f64>
}
impl ParameterStore {
pub fn seeded(rows: usize, cols: usize) -> ParameterStore {
let mut rng = rand::thread_rng();
let mut weights = Vec::new();
for _i in 0 .. (rows * cols) {
let uniform = (rng.gen_range(0.0, 1.0) - 0.5) / cols as f64;
weights.push(uniform);
}
ParameterStore {cols: cols, weights: weights}
}
pub fn at(&self, i: usize) -> &[f64] {
let from = i * self.cols;
let to = (i + 1) * self.cols;
&self.weights[from .. to]
}
pub fn update(&mut self, i: usize, v: &Vec<f64>) {
let from = i * self.cols;
for i in 0 .. self.cols {
self.weights[i + from] += v[i];
}
}
pub fn avg(&self, result: &mut Vec<f64>, indices: Vec<usize>) {
for col in 0 .. self.cols {
result[col] = 0.0;
for row in indices.iter() {
let from = row * self.cols;
result[col] += self.weights[col + from];
}
result[col] /= indices.len() as f64;
}
}
}
</code></pre>
We initialise each weight with a random number in the range $\frac{Uniform([-0.5;0.5])}{d}$, the standard initialisation for word2vec. The update method simply adds a gradient to a word vector. There is also a method to compute the average of multiple word vectors, which is needed later during learning.<br />
<br />
<h3>
Putting It All Together: Computing the Gradients</h3>
<div>
So now we can compute windows from the text and average the word vectors of the context words. </div>
<div>
The averaged vectors are our hidden layer's activations. From there we compute the prediction and then backpropagate the error. So we need two sets of parameters: one holds the word vectors, initialised as above, and one holds the prediction weights, initialised to zeros. </div>
<br />
<pre class="language-rust"><code class="language-rust">#[derive(Serialize, Deserialize)]
pub struct CBOW {
pub embed: ParameterStore,
pub predict: ParameterStore
}
</code></pre>
<br />
Given the average embedding for the context as $h$ and the vector for word $i$ from the predictions $w_i$, the forward pass on the neural network is:<br />
<br />
$a_i = \sigma(h * w_i)$.<br />
<br />
We are predicting the center word of the window, which means we model the output layer as a multinomial where the center word is one-hot encoded. So the output should be $1$ for the center word<br />
and $0$ for all others. The error for the center word is then:<br />
<br />
$e_i = -log(a_i)$<br />
<br />
and for all other words it is<br />
<br />
$e_i = -log(1 - a_i)$.<br />
<br />
The gradient on the loss is simply $\delta_i = (y - a_i) * \lambda$, with $\lambda$ being the learning rate.<br />
<br />
<pre class="language-rust"><code class="language-rust"> fn gradient(&self, label: usize, h: &Vec<f64>, truth: f64, rate: f64) -> (f64, f64) {
let w = self.predict.at(label);
let a = sigmoid(dot(&h, &w));
let d = (truth - a) * rate;
let e = -f64::ln(if truth == 1.0 {a} else {1.0 - a});
(d, e)
}
</code></pre>
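The `sigmoid` and `dot` helpers used above are not shown in the post; plausible std-only definitions would be:

```rust
// Logistic sigmoid, squashes any real number into (0, 1).
fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

// Inner product of two equally sized slices.
fn dot(x: &[f64], y: &[f64]) -> f64 {
    assert_eq!(x.len(), y.len());
    x.iter().zip(y).map(|(a, b)| a * b).sum()
}

fn main() {
    println!("{}", sigmoid(0.0));                  // 0.5
    println!("{}", dot(&[1.0, 2.0], &[3.0, 4.0])); // 11
}
```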
<br />
The gradient applied to the output parameters is: $\delta_i * h$.
The gradient for the embeddings is: $\delta_i * w_i$, which is applied to all context words.
So all words in the same context are updated with the same gradient, in turn making them more similar, which
was the main goal for word embeddings. There is one more detail:
normally, we would have to compute the losses over the complete dictionary, but for a million words that would take very long. Instead we use negative sampling: we compute the gradient for the center word as the positive example and sample a few other words as negative examples.<br />
The final computation for learning is shown below:
<br />
<pre class="language-rust"><code class="language-rust"> pub fn negative_sampeling(&mut self, window: &Window, rate: f64, n_samples: usize, unigrams: &Sampler) -> f64 {
let mut gradient_embed = vec![0.0; self.embed.cols];
let h = self.embedding(self.ids(window));
let (pos, pos_err) = self.gradient(window.label(), &h, 1.0, rate);
let mut error = pos_err;
let mut gradient_pos_predict = vec![0.0; self.predict.cols];
for i in 0 .. self.embed.cols {
gradient_embed[i] += pos * self.predict.at(window.label())[i];
gradient_pos_predict[i] += pos * h[i];
}
self.predict.update(window.label(), &gradient_pos_predict);
for _sample in 0 .. n_samples {
let token = unigrams.multinomial();
let (neg, neg_error) = self.gradient(token, &h, 0.0, rate);
error += neg_error;
let mut gradient_neg_predict = vec![0.0; self.predict.cols];
for i in 0 .. self.embed.cols {
gradient_embed[i] += neg * self.predict.at(token)[i];
gradient_neg_predict[i] += neg * h[i];
}
self.predict.update(token, &gradient_neg_predict);
}
let ids = self.ids(window);
for i in 0 .. ids.len() {
self.embed.update(ids[i], &gradient_embed);
}
error
}
</code></pre>
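The `Sampler` with its `multinomial()` method is also not shown above. Word2vec conventionally samples negatives from the unigram distribution raised to the power $0.75$. Below is a self-contained sketch using a cumulative table; the tiny linear congruential generator merely stands in for a real RNG (both choices are mine, not from the repository):

```rust
// Negative sampler over a unigram distribution, smoothed with the
// customary exponent 0.75 from the word2vec paper.
struct Sampler {
    cumulative: Vec<f64>, // normalised cumulative weights per word id
    state: u64,           // LCG state, stands in for a proper RNG
}

impl Sampler {
    fn new(counts: &[f64], seed: u64) -> Sampler {
        let weights: Vec<f64> = counts.iter().map(|c| c.powf(0.75)).collect();
        let total: f64 = weights.iter().sum();
        let mut acc = 0.0;
        let cumulative = weights.iter().map(|w| { acc += w / total; acc }).collect();
        Sampler { cumulative, state: seed }
    }

    // Draw one word id proportional to the smoothed unigram weights.
    fn multinomial(&mut self) -> usize {
        // advance the LCG and map the high bits to [0, 1)
        self.state = self.state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        let u = (self.state >> 11) as f64 / (1u64 << 53) as f64;
        self.cumulative
            .iter()
            .position(|&c| u < c)
            .unwrap_or(self.cumulative.len() - 1)
    }
}

fn main() {
    let mut sampler = Sampler::new(&[100.0, 10.0, 1.0], 42);
    let draws: Vec<usize> = (0..5).map(|_| sampler.multinomial()).collect();
    println!("{:?}", draws); // word ids; frequent words are drawn more often
}
```

The exponent flattens the distribution a little, so rare words are sampled as negatives more often than their raw frequency would suggest.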
</div>
<br />
<h3>
Results</h3>
<div>
I trained the net on Quora questions; searching for the vectors closest to a query word returns similar words:</div>
<div>
<br /></div>
<div>
<pre style="background-color: #f6f8fa; border-radius: 3px; box-sizing: border-box; color: #24292e; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 13.6px; line-height: 1.45; margin-bottom: 16px; overflow-wrap: normal; overflow: auto; padding: 16px;"><code style="background: transparent; border-radius: 3px; border: 0px; box-sizing: border-box; display: inline; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 13.6px; line-height: inherit; margin: 0px; overflow-wrap: normal; overflow: visible; padding: 0px; word-break: normal;">Searching: India
kashmir: 11.636631273392045
singapore: 12.155846288647625
kerala: 12.31677547739736
dubai: 12.597659271137639
syria: 12.630908898646624
pradesh: 12.657158343619573
pok: 12.667834993874745
malaysia: 12.708228117809442
taiwan: 12.75190112716203
afghanistan: 12.772290465531222
africa: 12.772372266369985
bjp: 12.802207718902666
karnataka: 12.817502941769622
europe: 12.825095372204384
</code></pre>
</div>
<div>
<code style="background: transparent; border-radius: 3px; border: 0px; box-sizing: border-box; display: inline; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 13.6px; line-height: inherit; margin: 0px; overflow-wrap: normal; overflow: visible; padding: 0px; word-break: normal;"><br /></code></div>
dkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.com0Bremen, Germany53.079296199999987 8.801693600000021452.774353199999986 8.15624660000002 53.384239199999989 9.4471406000000222tag:blogger.com,1999:blog-1040545417301646113.post-62946725397964777132019-01-10T15:58:00.000-08:002019-01-12T08:52:23.372-08:00Scala3D: Implementing A Wolfenstein Like Ray Casting Engine<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/9U5KTrdYfXI/0.jpg" frameborder="0" height="266" src="https://www.youtube.com/embed/9U5KTrdYfXI?feature=player_embedded" width="320"></iframe></div>
<h3>Introduction</h3>
<div class="separator" style="clear: both; text-align: left;">
I recently got interested in 3D game engines for realistic images. I am currently reading through<br />
a book describing a complete ray tracing algorithm [3]. However, since there is a lot of coding to do and a lot of complex geometry, I searched for a simpler side project. I then started to look into 2D<br />
ray casting to generate 3D environments, as used in the Wolfenstein 3D engine. There are several<br />
descriptions on how to build one [1,2]. However, the Wolfenstein black book [1] talks a lot about memory optimisations and execution speed; many chapters include assembly code and C snippets dealing with the old hardware. Implementing something like this is much easier now:<br />
computers are far faster and programming languages far more convenient. So I decided to<br />
build a clone in Scala. The result can be seen in the video above. The left side is the 3D view, and the right side shows a 2D
map view. The red lines are the rays from the ray casting algorithm which represent what the player sees and the
green line is the camera. The code can be found here <a href="https://github.com/dkohlsdorf/Scala3D"> [:github:]</a>. If you want to move in the map use "WASD".
</div><br/>
<h4>Assumptions About The World</h4>
<div class="separator" style="clear: both; text-align: left;">
The world in Wolfenstein 3D is represented by a flat grid. Different heights in the map are
not possible. All objects stand on the same plane and have to be boxes. This assumption makes
a lot of things easier. In a 3D engine, the main task is to determine what is seen from a specific
view (the player's camera). The nice thing is that we do not have to determine what is visible in any possible direction, because we are moving on a plane, which restricts us to a 2D problem. The second
assumption of a world on a grid will simplify the calculation of ray to box intersection later.
</div><br/>
<h4 style="clear: both;"> Implementing The Player And Camera</h4>
<div class="separator" style="clear: both; text-align: left;">
The basic idea is to describe the player in 2D by its position: $p = (p_x, p_y)$.
Furthermore, we save the direction the player is looking in: $d = (d_x, d_y)$.
We also save the image plane: $i = (i_x, i_y)$, which is perpendicular to the viewing direction.
<br/>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVApdPWX1p5egGfsqlxo0Rtmf7xdxzYoRLlb-WGFFJ9f3TB_m93lj2rX7ACqirHzTJZSWWvmAVex8yvMnR3aLf82a5sWCpTicW0nNeS8Udmnr94_Hgd3n1_SwU1ocpwBKOwKbHGwvqJ8g/s1600/Screenshot+2019-01-11+at+16.06.36.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVApdPWX1p5egGfsqlxo0Rtmf7xdxzYoRLlb-WGFFJ9f3TB_m93lj2rX7ACqirHzTJZSWWvmAVex8yvMnR3aLf82a5sWCpTicW0nNeS8Udmnr94_Hgd3n1_SwU1ocpwBKOwKbHGwvqJ8g/s400/Screenshot+2019-01-11+at+16.06.36.png" width="400" height="269" data-original-width="872" data-original-height="586" /></a></div>
<br/>
</div>
In the image above the image plane is marked in green, and the rays used for collision are marked in red. In order to
calculate the direction for each ray, we first calculate the x coordinate in camera space as $c_x = 2 * x / w - 1$, where $x$
is the column in the image and $w$ is the width of the image, so $c_x$ ranges over $[-1, 1]$. Then we compute the ray from the player that goes through that
position as:
<ul>
<li> $r_{dx} = d_x + i_x * c_x$ </li>
<li> $r_{dy} = d_y + i_y * c_x$ </li>
</ul>
Now we can follow each ray until we hit a wall.
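Those two formulas translate directly into code; sketched here in Rust rather than the engine's Scala, with a made-up direction and plane (note the conventional $-1$ shift that centers camera space on the view direction):

```rust
// Per-column ray direction for a raycaster: `dir` is where the player looks,
// `plane` is the camera plane perpendicular to `dir`. Numbers are made up.
fn ray_direction(x: usize, width: usize, dir: (f64, f64), plane: (f64, f64)) -> (f64, f64) {
    let camera_x = 2.0 * x as f64 / width as f64 - 1.0; // in [-1, 1]
    (dir.0 + plane.0 * camera_x, dir.1 + plane.1 * camera_x)
}

fn main() {
    let dir = (-1.0, 0.0);   // looking down negative x
    let plane = (0.0, 0.66); // roughly a 66 degree field of view
    // the center column's ray coincides with the view direction
    println!("{:?}", ray_direction(320, 640, dir, plane));
}
```

The ratio of plane length to direction length fixes the field of view, which is why $0.66$ is the customary choice for roughly 66 degrees.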
<br/>
<br/>
<h4> Digital Differential Analysis Ray Casting</h4>
<br/>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPzJBdkAz8KkmHYAcfc5T7XFnSpnWdu6FpT55R65Mmg-bLKPsbEhcwix6ITzhi7PdmRLYdusrd-z-0fml1s6ctq6hCIsToXKlQaQ6YhPdXvP4gp482BcLhvCHcVlOUkUn2VKqQobRn9IU/s1600/Screenshot+2019-01-12+at+17.17.31.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPzJBdkAz8KkmHYAcfc5T7XFnSpnWdu6FpT55R65Mmg-bLKPsbEhcwix6ITzhi7PdmRLYdusrd-z-0fml1s6ctq6hCIsToXKlQaQ6YhPdXvP4gp482BcLhvCHcVlOUkUn2VKqQobRn9IU/s400/Screenshot+2019-01-12+at+17.17.31.png" width="358" height="400" data-original-width="1330" data-original-height="1488" /></a></div>
<br/>
In order to find which wall we see at the image column $c_x$, we follow the ray until it hits a wall.
We initialize the search by calculating the distance until the ray first crosses a grid line along the x direction (in the image this is the segment $a$ to $b$)
and until it first crosses a grid line along the y direction (in the image this is the segment $a$ to $c$). From there we compute the step sizes $\delta = (\delta_x, \delta_y)$ in x and y, which tell
us how far we have to travel along the ray to cross each grid line again. We then step to the closest next grid position and check whether that position is a
wall hit.
<br/>
<pre class="brush:scala">
@tailrec
private def traceRay(
  grid: Vector2D[Int],
  sideDist: Vector2D[Double],
  side: Int,
  deltaDist: Vector2D[Double],
  step: Vector2D[Int],
  hit: Boolean,
): Hit =
  if (hit) Hit(grid, side)
  else sideDist match {
    case Vector2D(x, y) if x < y =>
      traceRay(
        Vector2D[Int](grid.x + step.x, grid.y),
        Vector2D[Double](sideDist.x + deltaDist.x, sideDist.y),
        0,
        deltaDist,
        step,
        world.collids(grid.x + step.x, grid.y)
      )
    case Vector2D(x, y) if x >= y =>
      traceRay(
        Vector2D[Int](grid.x, grid.y + step.y),
        Vector2D[Double](sideDist.x, sideDist.y + deltaDist.y),
        1,
        deltaDist,
        step,
        world.collids(grid.x, grid.y + step.y)
      )
  }
</pre>
In the implementation we move through the grid by the delta distance and the current side distance (the distance traveled along the $x$ and $y$ component). We also save which side ($x$ or $y$) of the wall we hit. In the recursion we follow the closest next map square until we
find a hit, and then return the position of the hit $h = (h_x, h_y)$ as well as the side.
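The same tail-recursive walk can be sketched iteratively in Python (illustrative only; plain tuples stand in for the `Vector2D`s and `is_wall` for `world.collids`):

```python
def trace_ray(grid, side_dist, delta_dist, step, is_wall):
    """Walk the map grid until is_wall(x, y) is true.

    grid:       (x, y) starting map square
    side_dist:  distance along the ray to the next x / y grid line
    delta_dist: distance along the ray between two x / y grid lines
    step:       (+1 or -1, +1 or -1) walking direction in the map
    Returns ((x, y), side), side = 0 for an x-side hit, 1 for a y-side hit.
    """
    (gx, gy), (sx, sy) = grid, side_dist
    while True:
        if sx < sy:            # the next x grid line is closer
            gx += step[0]
            sx += delta_dist[0]
            side = 0
        else:                  # the next y grid line is closer
            gy += step[1]
            sy += delta_dist[1]
            side = 1
        if is_wall(gx, gy):
            return (gx, gy), side
```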
<br/>
<br/>
<h4> Rendering A Scene </h4>
<br/>
So for each image column, we run the ray caster until we hit a wall. We then compute the projected distance to the wall $w_d$ in order to
calculate how high the wall appears in the image, using the player position, the hit position, the step size and the direction:
<br/>
<pre class="brush:scala">
private def projectedDistance(
  hit: Hit,
  pos: Vector2D[Double],
  step: Vector2D[Int],
  dir: Vector2D[Double],
): Double =
  if (hit.side == 0) (hit.pos.x - pos.x + (1 - step.x) / 2.0) / dir.x
  else (hit.pos.y - pos.y + (1 - step.y) / 2.0) / dir.y
</pre>
The height of the wall line is then simply $l_h = \frac{h}{w_d}$, where $h$ is the image height. We then compress the slice of the texture we hit to that height
and paint it into the image at the column position we sent the ray from.
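The projection step can be sketched as follows (a sketch; `screen_h` is the image height $h$, `dist` the projected distance $w_d$, and the exact clipping conventions vary):

```python
def wall_slice(screen_h, dist):
    """Pixel height of a wall column at projected distance dist, plus the
    start and end rows for drawing it centred on the horizon."""
    line_h = int(screen_h / dist) if dist > 0 else screen_h
    start = max(0, screen_h // 2 - line_h // 2)
    end = min(screen_h - 1, screen_h // 2 + line_h // 2)
    return line_h, start, end
```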
<h3>References</h3>
<div>[1] GAME ENGINE BLACK BOOK: WOLFENSTEIN 3D, 2ND EDITION, 2018</div>
<div>[2] <a href="https://lodev.org/cgtutor/raycasting.html">RayCasting Tutorial in CPP</a></div>
<div>[3] <a href="http://www.pbr-book.org/3ed-2018/contents.html">Physically Based Rendering: From Theory To Implementation</a></div>
<h3>Lindenmayer system: Implementation (2018-11-06)</h3><div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqmliwFWyU97hTDzQXppabjl6b1olOFRdJ0dR5EzVzxot86oppjgXTDY4sc9y1_KalireCS0P_S7UDOvafJr-MsgQYush6CxWKzDvhu1_uFRW9xP5840y48y8k0qsMWeFImm_6ulypVd4/s1600/Screen+Shot+2018-11-06+at+22.28.10.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="764" data-original-width="1600" height="304" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqmliwFWyU97hTDzQXppabjl6b1olOFRdJ0dR5EzVzxot86oppjgXTDY4sc9y1_KalireCS0P_S7UDOvafJr-MsgQYush6CxWKzDvhu1_uFRW9xP5840y48y8k0qsMWeFImm_6ulypVd4/s640/Screen+Shot+2018-11-06+at+22.28.10.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
I recently got interested in Lindenmayer systems (L-systems for short) and wrote some <a href="https://github.com/dkohlsdorf/LindenmayerSystem/">code</a>. These are generative systems that produce strings, which in turn can be interpreted as turtle graphics. For example, you can model Koch curves and the famous dragon curve. This <a href="http://algorithmicbotany.org/papers/abop/abop.pdf">book</a> describes how to generate plants like the ones you can see below.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVCBAngfBMxvLdbYBfDyBYSBdgCs1J-akz95k58T8MbqPJJ-SXSyvHIj3NUTauGFOxwtD49oDhgepQDw_jt9LakIhyiIwSf27dEU6AWYnn9drxfZ__DOImVi4XGJ7XP9ADzUDYEZEcMS0/s1600/Screen+Shot+2018-11-06+at+22.28.05.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1158" data-original-width="1506" height="246" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVCBAngfBMxvLdbYBfDyBYSBdgCs1J-akz95k58T8MbqPJJ-SXSyvHIj3NUTauGFOxwtD49oDhgepQDw_jt9LakIhyiIwSf27dEU6AWYnn9drxfZ__DOImVi4XGJ7XP9ADzUDYEZEcMS0/s320/Screen+Shot+2018-11-06+at+22.28.05.png" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<h3 style="clear: both; text-align: left;">
Implementation</h3>
<br />
For example, given an initial string $+A-Bf$, we have terminal symbols $\{+,-,f\}$ and non-terminal symbols $\{A, B\}$ that can still be replaced
by rules such as $A \rightarrow AfA$, which reads: every $A$ in the string can be replaced by $AfA$. Furthermore, the string $+A-Bf$ can be interpreted
by a turtle: $+$ means turning right, $-$ means turning left, every capital letter means moving forward while drawing a line, and every lower case letter means
moving silently to the next position. We start by defining a symbol class.
<br />
<pre class="brush:scala">case class Symbol(id: Char) {
  def isSilent: Boolean = id.isLower
  def isTerminal: Boolean = !id.isLetter || id.isLower
}
</pre>
</pre>
<br />
Next we define a Lindenmayer system as a start string and a set of rules.
Given the start string and a depth limit, we can recursively replace
all non-terminal symbols using the associated rules. This enables us
to generate strings.
<br />
<pre class="brush:scala">case class LSystem(start: List[Symbol], rules: Map[Symbol, List[Symbol]]) {

  private def produce(symbol: Symbol): List[Symbol] =
    if (symbol.isTerminal) List(symbol)
    else rules(symbol)

  @tailrec
  private def produce(current: List[Symbol], result: List[Symbol]): List[Symbol] = current match {
    case Nil => result
    case head :: tail =>
      produce(tail, result ::: produce(head))
  }

  @tailrec
  private def produce(current: List[Symbol], stepsLeft: Int): List[Symbol] =
    if (stepsLeft == 0) current
    else produce(produce(current, Nil), stepsLeft - 1)

  def produce(depth: Int): List[Symbol] = produce(start, depth)
}
</pre>
</pre>
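For illustration, the same production process can be sketched in Python on plain strings (a sketch; upper-case letters are treated as non-terminals and everything else as terminals, mirroring the `Symbol` class above):

```python
def produce(start, rules, depth):
    """Rewrite `start` `depth` times: every upper-case letter that has a
    rule is replaced by its right-hand side, everything else is kept."""
    current = start
    for _ in range(depth):
        current = ''.join(rules.get(c, c) if c.isupper() else c
                          for c in current)
    return current
```

With the rule $A \rightarrow AfA$ from above, two production steps on the start string $A$ give `AfAfAfA`.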
Now we are done with the definition of the grammar system. The only thing left is to define a turtle and
an interpreter that converts strings from L-systems into a sequence of turtles. I define a turtle
by its pose $[x, y, \alpha]$, where $x, y$ is the turtle's position and the orientation is defined
by the angle $\alpha$. For moving forward, the next position is defined as:
<br />
<ul>
<li> $x + \delta \cos(\alpha)$ </li>
<li> $y + \delta \sin(\alpha)$ </li>
</ul>
A rotation simply adds a fixed angle to $\alpha$.
<br />
<pre class="brush:scala">case class Turtle(x: Int, y: Int, angle: Double, sil: Boolean) {

  def move(distance: Int): Turtle =
    Turtle(
      math.round(x + distance * math.cos(math.toRadians(angle))).toInt,
      math.round(y + distance * math.sin(math.toRadians(angle))).toInt, angle, false)

  def moveSil(distance: Int): Turtle =
    Turtle(
      math.round(x + distance * math.cos(math.toRadians(angle))).toInt,
      math.round(y + distance * math.sin(math.toRadians(angle))).toInt, angle, true)

  def turn(by: Double): Turtle =
    Turtle(x, y, (angle + by) % 360, false)

  def silence: Turtle = Turtle(x, y, angle, true)
}
</pre>
Then given a string, we can recursively build a sequence of positions by
applying the action associated with each symbol.
<br />
<pre class="brush:scala">class TurtleInterpreter(moveBy: Int, turnBy: Double) {

  import TurtleInterpreter._

  private def execute(cmd: List[Symbol], state: Turtle, stack: List[Turtle], drawable: List[Turtle]): List[Turtle] = cmd match {
    case Nil => drawable
    case RotateRight :: tail => execute(tail, state.turn(turnBy), stack, drawable)
    case RotateLeft :: tail => execute(tail, state.turn(-turnBy), stack, drawable)
    case PushStack :: tail => execute(tail, state, state :: stack, drawable)
    case PopStack :: tail => execute(tail, stack.head.silence, stack.tail, stack.head.silence :: drawable)
    case head :: tail =>
      val next = if (head.isSilent) state.moveSil(moveBy) else state.move(moveBy)
      execute(tail, next, stack, next :: drawable)
  }

  def execute(cmd: List[Symbol], init: Turtle): List[Turtle] =
    execute(cmd, init, List.empty[Turtle], List(init))
}
</pre>
<pre class="brush:scala">object TurtleInterpreter {
  final val RotateRight = Symbol('+')
  final val RotateLeft = Symbol('-')
  final val PushStack = Symbol('[')
  final val PopStack = Symbol(']')
}
</pre>
With these definitions we can build grammars and paint pictures like the ones shown above. I wrote some code that
plots all the above images; it can be found on <a href="https://github.com/dkohlsdorf/LindenmayerSystem/">github</a>.
<h3>Exploring Super Resolution With Neural Networks: Blade Runner Style (2018-11-03)</h3>
In <a href="https://www.imdb.com/title/tt0083658/">Blade Runner</a> there is a scene in which Deckard (the protagonist) inspects one of the replicant Leon's photos.
The scene features an insane zoom, for which this <a href="https://typesetinthefuture.com/2016/06/19/bladerunner/">blog post</a> estimates a final magnification factor of <em style="border: 0px; box-sizing: border-box; font-family: inherit; font-size: 18px; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;">667.9</em>.<br />
<br />
While this is science fiction (obviously), there were some breakthroughs in super resolution recently.
Basically, we use machine learning models to upscale an image without introducing pixelated artefacts.<br />
I upscaled the same photo from the movie using publicly available <a href="https://raw.githubusercontent.com/alexjc/neural-enhance/master/enhance.py">code</a> built on the <a href="https://github.com/Lasagne/Lasagne">Lasagne</a> neural network framework.<br />
<br />
<div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjf2MXRx8iWsn16QV5TYL-ud6VAjKgVPAeDis9Rg9hdfYYJoCK7cA25BNK2ZCj4gBufQcPOWk0kBHle5emiXYDTDfiSz04A5wwoySM0TkaBKmnrSj_S6Ljs8IZ0nAybCupMPMYTJPxXDTo/s1600/Screen+Shot+2018-11-02+at+18.12.56.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1020" data-original-width="1600" height="255" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjf2MXRx8iWsn16QV5TYL-ud6VAjKgVPAeDis9Rg9hdfYYJoCK7cA25BNK2ZCj4gBufQcPOWk0kBHle5emiXYDTDfiSz04A5wwoySM0TkaBKmnrSj_S6Ljs8IZ0nAybCupMPMYTJPxXDTo/s400/Screen+Shot+2018-11-02+at+18.12.56.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The top image shows a reflection in a cabinet that is itself a reflection in the round mirror.<br />
The bottom image shows the first zoom in Deckard performs.</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Below are the original image (top) and the result (bottom) I got when running the 4x upscaling neural net on the image from Blade Runner. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDQu7MxOtxbjOkg9TgO8c0F11qgNXmymY4A7nINkuU0WknNSzMS9wyM0uuB057oSSVixakjtSse5ITxmdVzKofrs5bWY6iCwBTx3iQr9qbNR-aTgD9GpWGwgZVQ-QqVW-TrYIx_A1JtEg/s1600/leons_photo.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="380" data-original-width="500" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDQu7MxOtxbjOkg9TgO8c0F11qgNXmymY4A7nINkuU0WknNSzMS9wyM0uuB057oSSVixakjtSse5ITxmdVzKofrs5bWY6iCwBTx3iQr9qbNR-aTgD9GpWGwgZVQ-QqVW-TrYIx_A1JtEg/s1600/leons_photo.jpg" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMa3nciUdzYYAus7PdySjnYmY_LJEocPeC7wQehYQwUxzkI6wLoAoKzr3rxlZzD2wTm7Dw9LrefCSxif6PxXdp9amb6Vhy-HfqTjKYtwYLWvVZ2RtfM9LZyBXCrnk9r2x6Jb0_5u5JS6g/s1600/result4x.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1216" data-original-width="1600" height="484" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMa3nciUdzYYAus7PdySjnYmY_LJEocPeC7wQehYQwUxzkI6wLoAoKzr3rxlZzD2wTm7Dw9LrefCSxif6PxXdp9amb6Vhy-HfqTjKYtwYLWvVZ2RtfM9LZyBXCrnk9r2x6Jb0_5u5JS6g/s640/result4x.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
If you zoom into the mirror in the reconstructed image (open it in a new tab to see it at its original resolution) and use your imagination, I believe you can see a face in it. I do not believe</div>
<div class="separator" style="clear: both; text-align: left;">
it is actually there; rather, it is a reconstruction artefact. But who knows ... </div>
</div>
<div>
<br /></div>
<h3>Fast Threshold Estimation In Decision Trees (2018-10-08)</h3>
In decision trees and related methods such as random forests and gradient boosting, we need<br />
to find thresholds in order to split continuous attributes. Finding the optimal threshold on a feature using information gain can be computationally expensive, but the operation can be performed fast with the histogram method described in [1].<br />
<br />
<h3>
Algorithm Sketch</h3>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjEYqImaQ1ZpfnPlb8s4DjbQDNAzwUIsZYMvLraVvoRoTMkuVXOSMfQM6oNymXaV7yvEvhYGqCzjm5uJhzeUB64FQGJtasdFxX4hx4uFOsLeUkJx-COOrCcmEkEyq-ZsS93XHa7IhovkbA/s1600/Screen+Shot+2018-10-08+at+18.33.12.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="914" data-original-width="1552" height="235" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjEYqImaQ1ZpfnPlb8s4DjbQDNAzwUIsZYMvLraVvoRoTMkuVXOSMfQM6oNymXaV7yvEvhYGqCzjm5uJhzeUB64FQGJtasdFxX4hx4uFOsLeUkJx-COOrCcmEkEyq-ZsS93XHa7IhovkbA/s400/Screen+Shot+2018-10-08+at+18.33.12.png" width="400" /></a></div>
Basically, we build a table in which we bin each value of a certain attribute and store the number of occurrences of each label in said bin. Each bin boundary represents a threshold we could apply. Once everything is indexed, we compute the total number of entries for each class. In the example, class one occurs 33 times in total and class two 96 times. We then scan from the lowest threshold upwards, keeping a running count of the values below the current one. For each threshold we can then easily compute the number of instances of each class below and<br />
above it, and use these numbers to compute the information gain.<br />
<br />
<h3>
Implementation</h3>
</div>
<div>
First we need to be able to find a bin given a sorted list of thresholds and an attribute value.
One way is an adjusted binary search: we check the center bin of the current range; if it is
our bin, we are done, otherwise we continue the search to the left or the right.
</div>
<pre class="brush:python">
def bisection(val, th, start, stop):
    # Binary search for the bin of val in the sorted threshold list th,
    # restricted to the index range [start, stop).
    center = int((stop - start) / 2 + start)
    n = len(th) - 1
    if center <= 0:
        return 0
    elif center >= n:
        return n
    elif val > th[center - 1] and val <= th[center]:
        return center
    elif val < th[center]:
        return bisection(val, th, start, center)
    else:
        return bisection(val, th, center, stop)
</pre>
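For interior values, this search behaves like Python's standard library `bisect`; a compact equivalent can be sketched as follows (clamping to the outermost bins at the boundaries):

```python
from bisect import bisect_left

def find_bin(val, thresholds):
    """Index of the bin (thresholds[i-1], thresholds[i]] containing val,
    clamped to the first and last bin at the boundaries."""
    n = len(thresholds) - 1
    return max(0, min(n, bisect_left(thresholds, val)))
```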
<div>
The other thing we need is an entropy calculation. Since we are talking about a classification problem, we will implement
the discrete entropy calculation.
</div>
<pre class="brush:python">
import numpy as np
from math import log2

def entropy(x):
    # Discrete entropy (in bits) of a vector of class counts.
    scaler = x.sum()
    if scaler == 0:
        return 0.0
    else:
        logScaler = log2(scaler)
        h = 0
        for i in range(0, len(x)):
            if x[i] > 0:
                p = x[i] / scaler
                log_p = np.log2(x[i]) - logScaler
                h += p * log_p
        return -h
</pre>
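As a quick sanity check, a vectorized variant of the same computation (a standalone sketch) should give $\log_2 k$ for uniform counts over $k$ classes and $0$ for a pure node:

```python
import numpy as np

def entropy_bits(counts):
    """Discrete entropy in bits of a vector of class counts."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    if total == 0:
        return 0.0
    p = counts[counts > 0] / total
    return float(-(p * np.log2(p)).sum())
```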
<div>
Finally, we can implement the algorithm sketch from above.
</div>
<pre class="brush:python">
class LabelHistogram:

    def __init__(self, thresholds, n_classes):
        self.thresholds = thresholds
        self.n_bins = len(thresholds)
        self.n_classes = n_classes
        self.bins = np.zeros((self.n_bins, n_classes))

    def insert(self, attr, label):
        i = bisection(attr, self.thresholds, 0, self.n_bins)
        self.bins[i, label] += 1

    def gain(self, parent_gain):
        # Class counts over the complete histogram.
        total = np.sum(self.bins, axis=0)
        cur = np.zeros(self.n_classes)
        max_gain = float('-inf')
        max_th = None
        for i in range(0, self.n_bins):
            # Running class counts below (left of) the current threshold.
            cur += self.bins[i]
            left = cur
            right = total - cur
            cleft = left.sum()
            cright = right.sum()
            n = cleft + cright
            gain = parent_gain - ((cleft / n) * entropy(left) + (cright / n) * entropy(right))
            if gain > max_gain:
                max_gain = gain
                max_th = self.thresholds[i]
        return (max_gain, max_th)
</pre>
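To see the split score in isolation, here is a standalone sketch that computes the information gain of a single candidate threshold directly with numpy (illustrative, not the class above):

```python
import numpy as np

def entropy_bits(counts):
    # Discrete entropy in bits of a vector of class counts.
    total = counts.sum()
    if total == 0:
        return 0.0
    p = counts[counts > 0] / total
    return float(-(p * np.log2(p)).sum())

def split_gain(values, labels, threshold, n_classes=2):
    """Information gain of splitting the data at values <= threshold."""
    left = np.bincount(labels[values <= threshold], minlength=n_classes).astype(float)
    right = np.bincount(labels[values > threshold], minlength=n_classes).astype(float)
    n = left.sum() + right.sum()
    parent = entropy_bits(left + right)
    return parent - (left.sum() / n) * entropy_bits(left) \
                  - (right.sum() / n) * entropy_bits(right)
```

A perfect split yields the full parent entropy as gain; a split that separates nothing yields zero.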
<div>
<br /></div>
<div>
<br /></div>
<h3>
References</h3>
<div>
[1] Criminisi, Shotton: "Decision Forests for Computer Vision and Medical Image Analysis", Springer, 2013</div>
<h3>Simple Gaussian Processes For Regression (2018-07-28)</h3>
In this post I am implementing a simple Gaussian Process Regressor. A Gaussian Process is a<br />
probabilistic kernel model. Basically I am following the "<a href="https://mitpress.mit.edu/books/machine-learning-1">Machine Learning: A probabilistic Perspective</a>" book by Kevin Murphy.<br />
<br />
<h3>
The Basics</h3>
For an input $x_i$ and output $y_i = f(x_i)$, we infer a distribution over functions given the data:<br />
$p(f | X, y)$. In order to make a prediction $y_*$ for a new data point $x_*$ given all previous observations, we define:<br />
<br />
$p(y_*|x_*, X, y) = \int p(y_*|f, x_*) p(f|X, y) df$<br />
<br />
For our data $X = x_1 ... x_n$, the Gaussian process assumption is that the distribution over functions $p(f_1 ... f_N)$ is jointly Gaussian with mean function $\mu(x)$ and a covariance matrix $\Sigma_{ij} = k(x_i, x_j)$ given by a kernel $k$.<br />
<h3>
Covariance structure</h3>
<div>
It is quite common to leave the mean function at zero. Given a kernel, we need to compute the covariance function between all training and test instances. We structure the kernel matrix in the following way. </div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgLUF2jS1k7dAyv0ZI7L4qBakI6NT3XeJQ1mxq3g5raNMafa8UeM7Me7JBgoM63SV11X5SzExGHVpflFh_R_HPkSfF3aVcCc_rXCj5C4K6UGgYNPA2_l7zRuG8_qSm7hV8lACvXr33PlBU/s1600/kernel_matrix.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="852" data-original-width="924" height="295" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgLUF2jS1k7dAyv0ZI7L4qBakI6NT3XeJQ1mxq3g5raNMafa8UeM7Me7JBgoM63SV11X5SzExGHVpflFh_R_HPkSfF3aVcCc_rXCj5C4K6UGgYNPA2_l7zRuG8_qSm7hV8lACvXr33PlBU/s320/kernel_matrix.png" width="320" /></a></div>
<br />
<br />
The sub-matrix $K$ describes the affinity between all training instances, the sub-matrix $K_*$ describes the affinity between training and test instances, and the sub-matrix $K_{**}$ describes the affinity of the test instances to themselves. The image above shows the final covariance structure. With that, our function values $f_*$ can be inferred from $(f, f_*) \sim N(0, \Sigma)$.<br />
<br />
In order to obtain the predictions we calculate: $f_* = K_* \bullet K^{-1} \bullet f$.<br />
If we want the variance of the predictions, we take the diagonal of $K_{**} - K_* \bullet K^{-1} \bullet K_*^T$.<br />
<br />
In my opinion the method is very similar to support vector machines and to instance-based learning methods in general. The difference is that an SVM only uses a small set
of training examples called support vectors, while the Gaussian process compares every test example to all training examples in order to make a prediction.
<br />
<h3>
A Naive Gaussian Process</h3>
Below I implemented a naive Gaussian process using numpy.
Basically, I build the three components of the covariance
structure and then solve the prediction equations above.
I call my method naive since computing the inverse
has a worst-case complexity of $O(n^3)$,<br />
so it is not advisable to use this method for larger datasets.
<br />
<pre class="brush:python">class GaussianProcess:

    def __init__(self, kernel):
        self.kernel = kernel

    def predict(self, train_x, train_y, test_x):
        (n, _) = train_x.shape
        (m, _) = test_x.shape
        # Build the three blocks of the covariance structure.
        covar_train = np.zeros((n, n))
        covar_train_test = np.zeros((n, m))
        covar_test = np.zeros((m, m))
        for i in range(0, n):
            for j in range(0, n):
                covar_train[i, j] = self.kernel.k(train_x[i], train_x[j], i == j)
        for i in range(0, m):
            for j in range(0, m):
                covar_test[i, j] = self.kernel.k(test_x[i], test_x[j], i == j)
        for i in range(0, n):
            for j in range(0, m):
                covar_train_test[i, j] = self.kernel.k(train_x[i], test_x[j], False)
        # Posterior mean and per-point standard deviation.
        covar_inv = np.linalg.inv(covar_train)
        prediction = np.dot(np.dot(covar_train_test.T, covar_inv), train_y)
        confidence = covar_test - np.dot(covar_train_test.T, np.dot(covar_inv, covar_train_test))
        confidence = np.sqrt(np.array([confidence[i, i] for i in range(0, m)]))
        return prediction, confidence
</pre>
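One common refinement (a sketch, not part of the code above) is to replace the explicit inverse with `np.linalg.solve`, which avoids forming $K^{-1}$ and is numerically safer; a small jitter term on the diagonal keeps the system well conditioned:

```python
import numpy as np

def gp_predict(K, K_star, K_star_star, y, noise=1e-6):
    """GP posterior mean and standard deviation without an explicit inverse.

    K:           (n, n) train/train covariance
    K_star:      (n, m) train/test covariance
    K_star_star: (m, m) test/test covariance
    y:           (n,)   training targets
    """
    A = K + noise * np.eye(K.shape[0])   # jitter keeps A well conditioned
    alpha = np.linalg.solve(A, y)        # solves A alpha = y, i.e. K^-1 y
    mean = K_star.T @ alpha
    V = np.linalg.solve(A, K_star)       # K^-1 K_*
    var = np.diag(K_star_star - K_star.T @ V)
    return mean, np.sqrt(np.maximum(var, 0.0))
```

For even larger problems, a Cholesky factorization of $A$ can be reused for both solves.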
<br />
<br />
<h3>
Squared Exponential Kernel</h3>
For my experiments I use a squared exponential kernel, and I also add a variance parameter
for noisy observations:
<br />
<br />
$k(x_i, x_j) = \sigma_f^2 e^{-\frac{1}{2l^2} || x_i - x_j ||_2^2} + \sigma_y^2 \mathbb{I}(i == j)$
<br />
<br />
The parameters are as follows:
<br />
<ul>
<li> $\sigma_f^2$: Variance giving the scale of the output</li>
<li> $\sigma_y^2 \mathbb{I}(i == j)$: Noise variance applied to the diagonal</li>
<li> $l$: Scale on input </li>
</ul>
Again, the implementation is quite simple in numpy.
<br />
<pre class="brush:python">class SquaredExponential:

    def __init__(self, variance, alpha, noise):
        self.alpha = alpha        # length scale l
        self.variance = variance  # output scale sigma_f^2
        self.noise = noise        # noise variance sigma_y^2
        self.scaler = -1.0 / (2 * alpha**2)

    def k(self, a, b, on_diag):
        error = np.sum(np.square(a - b))
        f = np.exp(self.scaler * error)
        scaled = self.variance * f
        if on_diag:
            # Add observation noise on the diagonal.
            scaled += self.noise
        return scaled
</pre>
<br />
<br />
<h3>
Results on some simple example datasets</h3>
I tried my method on three example datasets: twice I am interpolating
stock market data, and once a sine wave. Each time, the index
is the input to the method; for the stock market datasets the output
is the closing price.
<br />
Sine Wave Interpolation:<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXRl0nYWbbOb3rZIG4IluBr8b9UIs4R84Hc6-TVFi_tR7qAojH51VdSliSqeuw598iXLe0QcwXeA6vm6E7lWRUiFTpVKNc2oLSOCOl9FodlSmBQmvUlcQOPhDFKIV23JkX0ENuveIWSeA/s1600/sine.png" imageanchor="1"><img border="0" data-original-height="288" data-original-width="432" height="213" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXRl0nYWbbOb3rZIG4IluBr8b9UIs4R84Hc6-TVFi_tR7qAojH51VdSliSqeuw598iXLe0QcwXeA6vm6E7lWRUiFTpVKNc2oLSOCOl9FodlSmBQmvUlcQOPhDFKIV23JkX0ENuveIWSeA/s320/sine.png" width="320" /></a> <br />
Stock Interpolation Dax:
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmeTHeJqo9r-NSSasfYDNY7G8_TfFzv-rGNgBdG68rNObVCAyW61Pkidi-32O6LjTG4KZMlY_K2gldbyOzGi7x0VO7w_LWZllAS1PzTdoBmy4YAg8H9KXvAhGSk5mTtBGEwHp2ocF2xGA/s1600/dax.png" imageanchor="1"><img border="0" data-original-height="288" data-original-width="432" height="213" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmeTHeJqo9r-NSSasfYDNY7G8_TfFzv-rGNgBdG68rNObVCAyW61Pkidi-32O6LjTG4KZMlY_K2gldbyOzGi7x0VO7w_LWZllAS1PzTdoBmy4YAg8H9KXvAhGSk5mTtBGEwHp2ocF2xGA/s320/dax.png" width="320" /></a> <br />
Stock Interpolation US:
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTRi2PoeRi_G9KtMtWQzFjz8S3QH3n5AELhvAIJNP6ozbpimwHSdbQRCPPMDwrq42SzmDieckvBVJB8QDUoCIZzV8ALGrLlElUIdP4NFuNuXwQgibXW-fR0rHQgvoBEYOfRyY2pdJfopw/s1600/us.png" imageanchor="1"><img border="0" data-original-height="288" data-original-width="432" height="213" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTRi2PoeRi_G9KtMtWQzFjz8S3QH3n5AELhvAIJNP6ozbpimwHSdbQRCPPMDwrq42SzmDieckvBVJB8QDUoCIZzV8ALGrLlElUIdP4NFuNuXwQgibXW-fR0rHQgvoBEYOfRyY2pdJfopw/s320/us.png" width="320" /></a> dkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.com0Bremen, Germany53.079296199999987 8.801693600000021452.774353199999986 8.15624660000002 53.384239199999989 9.4471406000000222tag:blogger.com,1999:blog-1040545417301646113.post-41559052384181117222018-07-21T03:40:00.000-07:002018-07-21T11:00:12.635-07:00Financial Machine Learning ExperiementAs a fun project, I tried to predict the stock market and implemented a small library todo so.<br />
Basically, I use a sliding window of historic prices of several companies or funds and<br />
predict the price of a single company. In the end, I achieve decent results with a 10-day window,<br />
predicting 5 days into the future. However, my error is still several euros large :)<br />
<br />
The code can be found here: <a href="https://github.com/dkohlsdorf/StockPrediction">[:Github:]</a><br />
<h3>
Learning Problem, Features and Modeling</h3>
<div>
First we define the return on investment (ROI) as:</div>
<div>
<br /></div>
<div>
$roi(t, t + 1) = \frac{x_{t + 1} - x_t}{x_t} $</div>
<div>
<br /></div>
<div>
This is the percentage of earnings at time $t + 1$ measured against a previous investment
at $t$. In order to extract the features, we compute sliding windows over our historic prices.
For each sample in a window, we compute the ROI relative to the start of the window, which represents
the earnings since the start of the window. For a window of $T$ steps and $n$ stocks, we flatten the sliding windows and get a feature vector of size $T \times n$ of ROI entries. The target variable we want
to predict is the stock price $d$ days after the last day of the window. The target is converted
to an ROI, too. So we try to predict the earnings $d$ days after the last day of the window,
which represents a potential investment.</div>
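The windowing described above can be sketched for a single stock as follows (illustrative; `win` and `horizon` correspond to $T$ and $d$):

```python
import numpy as np

def roi_windows(prices, win, horizon):
    """Sliding-window ROI features for a single price series.

    Each feature row holds the ROI of every day in a length-`win` window
    relative to the window's first day; the target is the ROI of the day
    `horizon` days after the window's last day.
    """
    X, y = [], []
    for t in range(len(prices) - win - horizon + 1):
        base = prices[t]
        X.append((prices[t:t + win] - base) / base)
        y.append((prices[t + win + horizon - 1] - base) / base)
    return np.array(X), np.array(y)
```

For several stocks, the per-stock feature rows are concatenated into one vector of size $T \times n$, as described above.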
<div>
<br /></div>
<div>
<br /></div>
<div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjp320r_Q7ynirazmGRvHexPrHX1iGUy5sl4Rw4Bu3neo_qojyZn3o9uKbXcC8JjW4h7OKmJn17csuhBDmr1RrTJaJynRKZRExmlJeyZr50eQQTFQiQibPKRbAIkmL26hykjV27WANUbC0/s1600/features.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="449" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjp320r_Q7ynirazmGRvHexPrHX1iGUy5sl4Rw4Bu3neo_qojyZn3o9uKbXcC8JjW4h7OKmJn17csuhBDmr1RrTJaJynRKZRExmlJeyZr50eQQTFQiQibPKRbAIkmL26hykjV27WANUbC0/s640/features.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Red: roi in the window, Blue: ROI for prediction</td></tr>
</tbody></table>
<br /></div>
<h3>
Some Experimental Results</h3>
<div>
First we download several historic price datasets from yahoo finance and load them using pandas.
We take the date column as the index and interpolate missing values:
<br />
<pre class="brush:python">
data = [
    ('euroStoxx50', pd.read_csv('data/stoxx50e.csv', index_col=0, na_values='null').interpolate('linear')),
    ('dax', pd.read_csv('data/EL4A.F.csv', index_col=0, na_values='null').interpolate('linear')),
    ('us', pd.read_csv('data/EL4Z.F.csv', index_col=0, na_values='null').interpolate('linear')),
    ('xing', pd.read_csv('data/O1BC.F.csv', index_col=0, na_values='null').interpolate('linear')),
    ('google', pd.read_csv('data/GOOGL.csv', index_col=0, na_values='null').interpolate('linear')),
    ('facebook', pd.read_csv('data/FB2A.DE.csv', index_col=0, na_values='null').interpolate('linear')),
    ('amazon', pd.read_csv('data/AMZN.csv', index_col=0, na_values='null').interpolate('linear'))
]
</pre>
<br />
We fit the following regressors on our data.
<br />
<pre class="brush:python">
predictors = [
    ('RF', RandomForestRegressor(n_estimators=250)),
    ('GP', GaussianProcessRegressor(kernel=RBF(length_scale=2.5))),
    ('NN', KNeighborsRegressor(n_neighbors=80)),
    ('NE', KerasPredictor(model, 10, 512, False)),
    ('GB', GradientBoostingRegressor(n_estimators=250))
]
</pre>
<br />
The neural network's architecture (named model above):
<br />
<pre class="brush:python">
hidden = [256, 128, 64, 32]
inp = (len(data.stoxx) * WIN,)
model = Sequential()
model.add(Dense(hidden[0], activation='relu', input_shape=inp))
for h in hidden[1:]:
    model.add(Dense(h, activation='relu'))
model.add(Dense(1, activation='linear'))
model.compile('adam', 'mse')
</pre>
<br />
Below we show some prediction results for our regressors.<br />
We also note the root mean square error in euros. </div>
<div>
<br /></div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMN3v_tlK-hscEPtejOABtTB_8aRW6uqCSW7jVeKUTPCEYZPiTTvTVbFyoHRzVziO6uw9WRK-yr9ds3B6-L-nUKQHoW1Tfux_vPrxKQS39TsDNW6Q1p0wQ9OybukEZnJ78F7ZiXTv9WTw/s1600/amazon.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMN3v_tlK-hscEPtejOABtTB_8aRW6uqCSW7jVeKUTPCEYZPiTTvTVbFyoHRzVziO6uw9WRK-yr9ds3B6-L-nUKQHoW1Tfux_vPrxKQS39TsDNW6Q1p0wQ9OybukEZnJ78F7ZiXTv9WTw/s640/amazon.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNZ9u8Vvxk54vc2m90oVBRtdSBJ_RuwZ3Vrx4uxb_MkMWd3SM6wBs-c8wLCHq9bZLtkUb81CpWnYNrGFV0kn7nQJzoZnax3sn5X-F13UPOyLshb10biL3eSd3Oqsa5yHgkYRQwetlYD7w/s1600/dax.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNZ9u8Vvxk54vc2m90oVBRtdSBJ_RuwZ3Vrx4uxb_MkMWd3SM6wBs-c8wLCHq9bZLtkUb81CpWnYNrGFV0kn7nQJzoZnax3sn5X-F13UPOyLshb10biL3eSd3Oqsa5yHgkYRQwetlYD7w/s640/dax.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhLXVhPLsjPmeTjDUlbjRptIposrbPkyNeTOBw11xfFp8MN5Q-EPZFTiVOtsGr_y1g80E3Ub43UPGt1xRGGFf2ccR3ihrje9Pkrc_Qot91zHM5dsOGLG_SWW_FY5VhaDJNT1B59WFV9Pfo/s1600/euroStoxx50.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhLXVhPLsjPmeTjDUlbjRptIposrbPkyNeTOBw11xfFp8MN5Q-EPZFTiVOtsGr_y1g80E3Ub43UPGt1xRGGFf2ccR3ihrje9Pkrc_Qot91zHM5dsOGLG_SWW_FY5VhaDJNT1B59WFV9Pfo/s640/euroStoxx50.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiG5M7688CqTNFeMlsJ7WDaAC7qWXRAAIl5I0P8h3mnNJaf_SP1-_L6iKEiyhvdv3XJx0EQffQ3h-4HUYsnWjEvetOACFLJH2tMq2JizBd6CzjOeGxfrdrOO0Zm0omuYeXEYeauX1mHWK0/s1600/facebook.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiG5M7688CqTNFeMlsJ7WDaAC7qWXRAAIl5I0P8h3mnNJaf_SP1-_L6iKEiyhvdv3XJx0EQffQ3h-4HUYsnWjEvetOACFLJH2tMq2JizBd6CzjOeGxfrdrOO0Zm0omuYeXEYeauX1mHWK0/s640/facebook.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPM7Xnp2ttVj8U4-tbV69uBP-X1TW8ZehBfHQ89Z-NpXLRmiHBeCP_ktVG596Xnxkqg4sIRVgvxojGf2E7xa1oI9Yt0NrpQ-MryW4Xuw5Q1_95K3R5ZlCN0cOGwPFvuDi9mR4xRVbe2U4/s1600/google.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPM7Xnp2ttVj8U4-tbV69uBP-X1TW8ZehBfHQ89Z-NpXLRmiHBeCP_ktVG596Xnxkqg4sIRVgvxojGf2E7xa1oI9Yt0NrpQ-MryW4Xuw5Q1_95K3R5ZlCN0cOGwPFvuDi9mR4xRVbe2U4/s640/google.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEUmxVrcukkH-NUQ4RQ8gNwd4dTgKcMAqCgAa4BOGo2ZRs6J9Lf_la_P3Q0X9jO9cJOlaHcqTFQjqJdsSx4VHMWUlABIB8ZBUvQSwNXDDp-mHhCpADpAVgWmnVu5mOI34UYO-XTdy7mwo/s1600/us.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEUmxVrcukkH-NUQ4RQ8gNwd4dTgKcMAqCgAa4BOGo2ZRs6J9Lf_la_P3Q0X9jO9cJOlaHcqTFQjqJdsSx4VHMWUlABIB8ZBUvQSwNXDDp-mHhCpADpAVgWmnVu5mOI34UYO-XTdy7mwo/s640/us.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgF_I1XHXKEhDLoyMiHpryHEKNs0i5sFypAq7d8HA6KpyKxpldBvcNLiWT69BnKK1ukGYComlhn-wpcPq6SKx-_VwqFiMy6FE9jYX8I98HpFeTmlUg4keQxgTur6qVh7k_haPJYnxoFiw/s1600/xing.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgF_I1XHXKEhDLoyMiHpryHEKNs0i5sFypAq7d8HA6KpyKxpldBvcNLiWT69BnKK1ukGYComlhn-wpcPq6SKx-_VwqFiMy6FE9jYX8I98HpFeTmlUg4keQxgTur6qVh7k_haPJYnxoFiw/s640/xing.png" width="640" /></a></div>
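As a hedged sketch (not the original evaluation code), combining the regressors' predictions by simple averaging and reporting the RMSE in euros could look like this; the prediction arrays below are made up for illustration:

```python
import numpy as np

def rmse(truth, prediction):
    # root mean square error, in the same unit as the target (here: euros)
    return np.sqrt(np.mean((truth - prediction) ** 2))

def ensemble_prediction(predictions):
    # simple unweighted average over the individual regressors
    return np.mean(np.stack(predictions), axis=0)

# toy example with made-up predictions from three regressors
truth = np.array([10.0, 12.0, 11.0])
preds = [np.array([9.0, 12.5, 11.0]),
         np.array([10.5, 11.5, 10.5]),
         np.array([10.0, 12.0, 12.0])]
combined = ensemble_prediction(preds)
```

Averaging often lowers the error compared to any single member, which is one motivation for using an ensemble here.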
dkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.com0Bremen, Germany53.079296199999987 8.801693600000021452.774353199999986 8.15624660000002 53.384239199999989 9.4471406000000222tag:blogger.com,1999:blog-1040545417301646113.post-63053145267258825082018-06-25T06:16:00.000-07:002018-06-25T06:16:42.140-07:00Keras: Convolutional LSTMStacking recurrent layers on top of convolutional layers can be used to generate sequential output (like text) from structured input (like images or audio) [<a href="https://arxiv.org/pdf/1411.4389.pdf">1</a>].<br />
<br />
If we want to stack an LSTM on top of convolutional layers, we can simply do so, but we need to<br />
reshape the output of the convolutions to the LSTM's expected input. The code below shows an implementation in Keras:
<br />
<pre class="brush:python">
from keras.layers import Input, Conv2D, MaxPooling2D, Reshape, LSTM
from keras.models import Model

T = 128
D = 64
N_FILTERS = 32
LSTM_T = int(T / 8)
LSTM_D = int(D / 2)
LSTM_STATE = 128
POOL = (8, 2)
i = Input(shape=(T, D, 1)) # (None, 128, 64, 1)
x = Conv2D(N_FILTERS, (3, 3), padding = 'same')(i) # (None, 128, 64, 32)
x = MaxPooling2D(pool_size=POOL)(x) # (None, 16, 32, 32)
x = Reshape((LSTM_T, LSTM_D * N_FILTERS))(x) # (None, 16, 1024)
x = LSTM(LSTM_STATE, return_sequences=True)(x) # (None, 16, 128)
model = Model(i, x)
model.summary()
</pre>
<br />
<br />
In this example we want to learn the convolutional LSTM on sequences of length 128 with 64-dimensional samples. The first layer is a convolutional layer with 32 filters. So the outputs are 32<br />
sequences, one for each filter. We pool the sequences with a (8, 2) window, so our 32 sequences are now of size (128 / 8 = 16, 64 / 2 = 32). Now we have to combine the feature dimensions and the filter responses into a single dimension of size (32 * 32 = 1024) so we can feed a sequence into the LSTM, which requires a rank-2 (or rank-3 with batch) tensor with the first dimension being the time step and the second each frame. Finally we add the LSTM layer.<br />
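The dimension arithmetic above can be verified with a quick sanity check, independent of Keras:

```python
T, D, N_FILTERS = 128, 64, 32
POOL = (8, 2)

# after pooling, the time and feature axes shrink by the pool factors
pooled_t = T // POOL[0]   # 128 / 8 = 16
pooled_d = D // POOL[1]   # 64 / 2 = 32

# the Reshape merges the feature axis and the filter axis into one
lstm_input_dim = pooled_d * N_FILTERS   # 32 * 32 = 1024
```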
<br />dkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.com0Bremen, Germany53.079296199999987 8.801693600000021452.774353199999986 8.15624660000002 53.384239199999989 9.4471406000000222tag:blogger.com,1999:blog-1040545417301646113.post-53355405376678468182018-06-13T07:10:00.002-07:002018-06-13T07:17:59.934-07:00Character Wise Generation From The Little Book Of CalmIf you saw the show <i>black books</i>, you saw the calming powers of the little book of calm.<br />
Surprisingly the <a href="https://www.amazon.de/Little-Book-Calm-Paul-Wilson-ebook/dp/B00APNQ114/ref=tmm_kin_swatch_0?_encoding=UTF8&qid=1528812260&sr=8-1">book</a> actually exists. I extracted the text from the Kindle version and will now train an LSTM with it in order to generate an infinite stream of calming advice. The actual code can be found as a <a href="https://gist.github.com/dkohlsdorf/992f63b11779d8647727377c436533c7">Gist</a>.<br />
For copyright reasons I cannot attach the text. So if you want to train this too, you need to buy the Kindle version.
<br />
<div style="text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTkhElQ3MIp1nJgR4m4JJ4ZxBmfeAQg4UIxhorlXsvDWhCq71J7tGbu1JKZkcGjvPIcgUUBJgX1X6Zuq0yHe032XoTMpYpjJFpLNJnIh1Vez_M7_jMyEUcN3J3uLg-fjM022o-m8SWHp4/s1600/calm.gif" imageanchor="1"><img border="0" data-original-height="289" data-original-width="500" height="185" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTkhElQ3MIp1nJgR4m4JJ4ZxBmfeAQg4UIxhorlXsvDWhCq71J7tGbu1JKZkcGjvPIcgUUBJgX1X6Zuq0yHe032XoTMpYpjJFpLNJnIh1Vez_M7_jMyEUcN3J3uLg-fjM022o-m8SWHp4/s320/calm.gif" width="320" /></a>
</div>
I split all the text into windows of 40 characters and train an LSTM to predict the following character.
I use a bidirectional LSTM with a 128-dimensional cell state. The output mode of the LSTM is many-to-many,
which means each input at each time step (the current character) also has an output (the next character).
The output is pushed through a dense softmax layer.
The LSTM is shown below:<br />
<pre class="brush:python">
from keras.models import Sequential
from keras.layers import LSTM, Dense, Bidirectional, TimeDistributed
from keras.optimizers import RMSprop

length = 40
step = 1
model = Sequential()
model.add(Bidirectional(LSTM(128, return_sequences=True), input_shape=(length, n_char)))
model.add(TimeDistributed(Dense(n_char, activation='softmax')))
optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
model.summary()
</pre>
<br />
The training log is interesting: one can see in the <a href="https://gist.github.com/dkohlsdorf/992f63b11779d8647727377c436533c7">notebook</a> how the LSTM gets better and better at generating text.
Below I listed some samples. For the sampling, we initialise the generation with a random sequence of 40 characters and let the LSTM generate the rest. The details of the sampling can be seen in the notebook.
The text is not great, but one can see that the LSTM aims to generate sentences similar to the ones in the book.
One might be able to improve its performance with more layers or more epochs.<br />
<br />
However, since the model works on a character level, I am not unhappy with the
results:
<br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><br /></span>
<br />
<ol>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">"thes the upset child, just as a kiss or handshake takes the pleasures eale them to poins of almous to be calming end it carting dive you’ll find calm phosting work is acyies all little forgsimutarly aboveruts a phopees will tell you, nothed (which your speech trangess there are tho defullame and eyebrows phon places of person, joyous every meal with the fores or nacciols, and discontrom to the hear postrous into them most powerm, when they gretophsints to calm. sho exercises when they laverds with the concenturies an much you think as"</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">"eel calm. recognise addictions for what your has a little choice. devers of a favour things which the worldor tak every moments worry habirothing – you half the way to feel. emplay of the lights – quiet – entoy not worrierate – beting worry king before you think about still, bow feelinggs there is enliet before yourself calm if you feel calm your earl goonicoly disary taskn take the time is efly. kink well assential astimugars asout peace – and have all tall you to can calm. the way you look at that your happenng, puts of orange blome"</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">"e stand straighter and taller than you beliefs, you will fool your hand, mixitates even the back sede tans bighing take and the ennable treates the beauty the add to think worty as major troutten if you combscents are among the humone situations: you treat at attating them. trank und harty thing an ised. on the small issues them, on your ease a sea calm always stimulates things that as much you must’, ‘i subripress on a way of what your day – from will happen. life for have to live a timens ael add the con rowhing wholly on tat hapte,"</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">"then you can be calm. sip warm water a glass than politens, put as you tonkelow yourself up. player. sel they is possible not to calm. then you connald ablity to live a cemplemants and englavoide. make to the benor effort into hands, not for twengy to the roother unlike in your diet, something with a peyture of peace and calm deind go little known bad compony all spech speeds can seritally when at a physictionel. your hearly, small issues them, soothes of a hand at also keolitas the moment your feet, then review then ptsic a fatticula"</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">"learning something you want to know. and you hueld’ go limp. removeties it posttert in the worlding the way you look at then posittle with your attionshops a out the cl-fectly having a little controgises to be remote. by massal favour bady them. is thice as you go envements are then positive to be captivelity when it carnething to say your subconscious speative of remote those who trere you ueld it before yourself underful fact as the feelostrust in everything when you’re leave dowalk a little between what harts like go to bed. whish "</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">"the moment when you concentrate your attention on a way attiony what it’s only of relaxing those calm, even up, by so fhols, behaw recognise them bach someone, alwings a calm people about beauty phor people als your thoughts, will take whice a groutts a calm person, a floathing down on a grass to help you know, by replyating widg a few chill, your speech, resubcents, you feel calm toom to your can most used the moments art teads. wear wearious from time to be a stimulations in whatever you grust – then reony something your faving to w"</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">"light one stop early if you consciously start is to relaxed with your speech speeds. by a coach even the most commot flowsh and exprect. works bectionss your speech speeds of a field unting with the groours, but for the plears nalk eget to feel calm. rong with a masseur once that your face pheropen having to sal awhable place, a conscious to be captivelocar place, up them of a damplestilm of the espences and little not train. every sayols around the back ood be sad, them do. then massage them ind tee… as you busly petson time to time "</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">"time in the world to do whatever you choose a feet and contribute to than wessed for yourself untethink, acchut for the recogail oilk, and the ritimeting snack being (iss than peritups, it can be pressude yourself underle – chrisical that suches of patting your efficienals at, pu… child, joyous pios from time to time little only a timend for hard-working postable full you work, uphilly to go know wholatge them. puss, pursom efficiently you achiely that happens interivation. its massage – then recom in your breet routine your typewrite"</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">"even more peaceful. break the pattern when you dwell twan fow times on explcomess with nich and feel negative for the rescom only a time. change specially, then ete someon, treards to your faith noat as muchically time sometimes works mication, but for they like gardesly how – you wouldn’t normally your hail, relaxed. wearing a stimulation. be computer you niel. be absold of time, whaner always a compliment possible future, with your speech spee, minutes to times out of standing, deciduerble when the concedints are atssetic eation wit"</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">"make an apapointment with yourself to deal with worries that as major stressed when you’re helaxed with a resalk post your breathing, when you’re lead to calm. changes on the pest the thumb are then massage them go them, taking pleaser the most more efficiently when a country worry beads. play , any time – hose negne a softet for deeps a chomples. take ba hothing them. then raised by your life relaxt to stayicher when you sach add there’s tense sire time eas a pew your day, cannot by where you wouldn’t normally thinks. peace will not b"</span></li>
</ol>
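The sampling step itself can be sketched as follows (a hedged sketch, not the notebook code): the softmax output for the last time step is reweighted by a temperature and the next character index is drawn from it. A call like `model.predict` on the current 40-character window would supply `probs`:

```python
import numpy as np

def reweight(probs, temperature=1.0):
    # temperature < 1 sharpens the distribution, > 1 flattens it
    logp = np.log(np.clip(probs, 1e-10, 1.0)) / temperature
    p = np.exp(logp - logp.max())
    return p / p.sum()

def sample_next(probs, temperature=1.0):
    # draw the index of the next character from the reweighted softmax output
    return np.random.choice(len(probs), p=reweight(probs, temperature))
```

In the generation loop one would repeatedly predict on the current window, call `sample_next` on the final time step's distribution, append the sampled character, and slide the window one step forward.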
dkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.com0Bremen, Germany53.079296199999987 8.801693600000021452.774353199999986 8.15624660000002 53.384239199999989 9.4471406000000222tag:blogger.com,1999:blog-1040545417301646113.post-19038026689946896202018-06-13T06:49:00.001-07:002019-03-12T07:44:17.476-07:00Downloading All NIPS papersRecently I wanted to download a large number of NIPS papers, so I decided to write a little web scraper in Python. Not much more to say, [<a href="https://gist.github.com/dkohlsdorf/3aa0d19c1afc7c3adc41e0eff0a104b0#file-nips-py">here</a>] it is. There are two scripts: one downloads the papers and one downloads the abstracts. The script basically lists all books from the NIPS page,<br />
then lists all the papers, and from each paper page it downloads the PDF or abstract.<br />
<div align="center">
</div>
I also put the script for the paper download below. In order to use it, you need Python 3 and the following libraries:
<br />
<ul>
<li> <a href="http://lxml.de/"> lxml </a></li>
<li> <a href="http://docs.python-requests.org/en/master/"> requests </a> </li>
</ul>
<br />
<pre class="brush:python">
from lxml import html
import requests
import urllib.request

BASE_URL = 'https://papers.nips.cc/'

page = requests.get(BASE_URL)
tree = html.fromstring(page.content)
books = [href.attrib['href'] for href in tree.xpath('//a') if 'book' in href.attrib['href']]
for book in books:
    book_page = requests.get(BASE_URL + book)
    tree = html.fromstring(book_page.content)
    papers = [href.attrib['href'] for href in tree.xpath('//a') if 'paper' in href.attrib['href']]
    for paper in papers:
        paper_page = requests.get(BASE_URL + paper)
        tree = html.fromstring(paper_page.content)
        links = [href.attrib['href'] for href in tree.xpath('//a') if 'pdf' in href.attrib['href']]
        for link in links:
            # save the pdf under its file name in the current directory
            local = link.split('/')[-1]
            urllib.request.urlretrieve(BASE_URL + link, local)
</pre>
<br />dkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.com0Bremen, Germany53.079296199999987 8.801693600000021452.774353199999986 8.15624660000002 53.384239199999989 9.4471406000000222tag:blogger.com,1999:blog-1040545417301646113.post-29047348512558704412018-03-04T08:29:00.000-08:002018-03-04T08:29:16.252-08:00Porting Keras Models<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbmTQRx76Bd797wsZMpf-PmYeL1aMq_-mIrxil0qtkR3r19IT4j-3p3hmxR_rvZeEbdVIpQ-lixRI23h_xJcaT6JxR73ayblAd5VqFwHkXi1NxBt8P-jNwLFBoncGhBi1f7EvF_EXzsl0/s1600/lstm.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="866" data-original-width="1600" height="173" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbmTQRx76Bd797wsZMpf-PmYeL1aMq_-mIrxil0qtkR3r19IT4j-3p3hmxR_rvZeEbdVIpQ-lixRI23h_xJcaT6JxR73ayblAd5VqFwHkXi1NxBt8P-jNwLFBoncGhBi1f7EvF_EXzsl0/s320/lstm.png" width="320" /></a></div>
<br />
I implemented an LSTM in Java that can read a trained Keras model in order<br />
to use it in other Java projects. You can check the repository at <a href="https://github.com/dkohlsdorf/KerasLSTM">[:Github:]</a><br />
<br />
The ipython notebook exports the weights of an LSTM<br />
and also exports some test for a random input sequence and the resulting output.<br />
The Java implementation reads the exported weights and then applies them. All<br />
matrix operations are implemented using jBlas.<br />
<br />
<ul>
<li>The numeric helper class implements matrix loading and slicing.</li>
<li>Tanh and the sigmoid functions are also in the helper class.</li>
<li>The LSTM state is a separate class holding the cell state and the hidden activation.</li>
<li>The LSTM holds all the weights and applies them to sequences.</li>
</ul>
<br />
The basic slicing of the weights is implemented using the same weight<br />
coding as Keras <a href="https://github.com/keras-team/keras/blob/2.1.4/keras/layers/recurrent.py#L1776">[:Github:]</a><br />
<br />
<pre class="brush:python">
self.kernel_i = self.kernel[:, :self.units]
self.kernel_f = self.kernel[:, self.units: self.units * 2]
self.kernel_c = self.kernel[:, self.units * 2: self.units * 3]
self.kernel_o = self.kernel[:, self.units * 3:]

self.recurrent_kernel_i = self.recurrent_kernel[:, :self.units]
self.recurrent_kernel_f = self.recurrent_kernel[:, self.units: self.units * 2]
self.recurrent_kernel_c = self.recurrent_kernel[:, self.units * 2: self.units * 3]
self.recurrent_kernel_o = self.recurrent_kernel[:, self.units * 3:]

if self.use_bias:
    self.bias_i = self.bias[:self.units]
    self.bias_f = self.bias[self.units: self.units * 2]
    self.bias_c = self.bias[self.units * 2: self.units * 3]
    self.bias_o = self.bias[self.units * 3:]
</pre>
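To make the gate arithmetic concrete, here is a hedged numpy sketch of a single LSTM step using the i/f/c/o slicing above. The weight layout mirrors the Keras convention (inputs in rows, the four gates concatenated along the columns); this is an illustration, not the Java code from the repository:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, kernel, recurrent_kernel, bias, units):
    # project input and previous hidden state, then slice into the four gates
    z = x @ kernel + h @ recurrent_kernel + bias
    i = sigmoid(z[:units])                    # input gate
    f = sigmoid(z[units: units * 2])          # forget gate
    g = np.tanh(z[units * 2: units * 3])      # candidate cell state
    o = sigmoid(z[units * 3:])                # output gate
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new
```

Running this over a sequence, carrying `(h, c)` from step to step, reproduces the forward pass the Java implementation performs with jBlas.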
<div>
<br /></div>
dkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.com0Bremen, Germany53.079296199999987 8.801693600000021452.774353199999986 8.15624660000002 53.384239199999989 9.4471406000000222tag:blogger.com,1999:blog-1040545417301646113.post-10039668340004713462017-12-12T07:06:00.002-08:002017-12-15T04:21:18.867-08:00Implementing Gradient Boosted Trees<h1>
Implementing Gradient Boosted Trees</h1>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTOE2IFihev7DsyM51HIaTp6rMBlngXPxK5OfzG6UaJQm3OZeoFCCEBTa3uP29CrkoKj1xPkaIq9RhCol0dVNUoCRcvkLZVJThpLQw-356Kyfu394iq5g7BAodLOeqFb-f6dbqoNzWREI/s1600/header.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="740" data-original-width="1356" height="174" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTOE2IFihev7DsyM51HIaTp6rMBlngXPxK5OfzG6UaJQm3OZeoFCCEBTa3uP29CrkoKj1xPkaIq9RhCol0dVNUoCRcvkLZVJThpLQw-356Kyfu394iq5g7BAodLOeqFb-f6dbqoNzWREI/s320/header.png" width="320" /></a></div>
<br />
<br />
In this post I will share my experience of implementing
gradient boosted trees as seen in XGBoost in order to
get a better understanding of the method. As in previous
posts, the code won't be as efficient as the original.
Furthermore, I will use Scala again.
The basic idea behind boosting is to construct an
additive model out of a collection of simpler models (base models)
for classification or regression. If you are unfamiliar with the
general boosting idea, please consider (Hastie: The Elements of Statistical Learning)
for a great introduction. This post is mainly based on this blog <a href="http://www.stokastik.in/programming-gradient-boosted-trees-from-scratch/">post </a>
and these <a href="https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf">slides</a>. In boosting, the model is constructed iteratively,
that is, in iteration $t$, the new output given a datapoint $x$ is:
$$
\hat{y}^t = \hat{y}^{(t - 1)} + f_t(x)
$$
The idea of gradient boosting is that in each iteration the next model
is determined so that it minimizes a given loss function $\mathbb{L}$.
Specifically, the minimization is performed using the first and second order derivatives
of the loss function. Each added model can also be seen as a step in the direction
of the gradient. In other words, the iterations can be seen as a form
of gradient descent. When extending this idea to regression tree base models and
regularizing the objective function of the optimization, we end up with
a powerful algorithm that won multiple Kaggle challenges as well as
the Xing-hosted recommender systems challenge two years in a row.
In the next part, I will derive the optimization step used for decision trees with a cross entropy
loss and then "massage" in the regularization.
<br />
<h2>
<br /></h2>
<h2>
Cross Entropy Loss for Binary Classification</h2>
For binary classification, the go-to output function is the sigmoid function:
$$P(x = 1) = \frac{1}{1 + e^{-x}}$$
Prominent usage examples include neural networks and logistic regression.
Given that our output $P(x = 1)$ is $\hat{y}$ and the true label is $y$,
we can measure the goodness of our classifier on our dataset using the
cross entropy loss:
$$ L(y, \hat{y}) = \sum_i -y_i * log(\hat{y_i}) - (1.0 - y_i) * log(1.0 - \hat{y_i})$$
with its first derivative $g_i = \hat{y_i} - y_i$ and its second derivative $h_i = \hat{y_i} * (1.0 - \hat{y_i})$.
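As a quick numerical sanity check of these derivatives (a sketch, not part of the original implementation), finite differences should reproduce $g_i$ and $h_i$:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def loss(y, x):
    # cross entropy of the sigmoid output for raw score x and label y
    p = sigmoid(x)
    return -y * math.log(p) - (1.0 - y) * math.log(1.0 - p)

def gradient(y, x):
    return sigmoid(x) - y          # g_i = y_hat - y

def hessian(y, x):
    p = sigmoid(x)
    return p * (1.0 - p)           # h_i = y_hat * (1 - y_hat)

# central finite differences at an arbitrary point
y, x, eps = 1.0, 0.3, 1e-5
num_grad = (loss(y, x + eps) - loss(y, x - eps)) / (2 * eps)
num_hess = (gradient(y, x + eps) - gradient(y, x - eps)) / (2 * eps)
```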
<br />
<h2>
<br /></h2>
<h2>
Regularization</h2>
One of the ideas implemented in XGBoost is regularization. Intuitively, we want to
penalize large parameters that dominate our result as well as model complexity.
Given a parameter set $\theta$, the L1 regularization is $\Omega(\theta) = \sum_i ||\theta_i||_1$,
where $||x||_1$ is the L1 norm or simply the absolute value. In L2 regularization, we switch to
$||\theta_i||_2$, where $||x||_2$ is defined as $x^2$.
For a decision tree, we could measure the complexity as the number of leaf nodes $T$,
and the parameters $\theta$ are the output scores of the leaf nodes (we will later use regression trees).
For example, we could penalize large trees by $\gamma T$ and large outputs by $\frac{1}{2}\sum_i \lambda ||\theta_i||_2$,
and we can control the penalty amount by the two parameters $\gamma$ and $\lambda$.
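In a hedged sketch (function name hypothetical), the resulting penalty for a tree with leaf scores $w$ combines both terms:

```python
def tree_penalty(leaf_scores, gamma, lam):
    # gamma * T penalizes the number of leaves,
    # 0.5 * lambda * sum(w^2) penalizes large leaf outputs
    T = len(leaf_scores)
    l2 = 0.5 * lam * sum(w * w for w in leaf_scores)
    return gamma * T + l2
```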
<br />
<h2>
<br /></h2>
<h2>
Gradient boosted trees</h2>
The combined objective using regularization and some loss function is:
$$O = L(y, \hat{y}) + \Omega(\theta)$$
Since we are boosting, we define the objective in terms
of the predictions in iteration $t$:
$$O^t = L(y, \hat{y}^{t - 1} + f(x)) + \Omega(\theta)$$
where $\hat{y}^{t - 1} + f(x)$ follows directly from the boosting formula $\hat{y}^t = \hat{y}^{(t - 1)} + f_t(x)$.
Since we want to optimize over $f(x)$, we can apply the Taylor expansion to the loss.
Remember that the Taylor expansion is defined as $f(x + \delta x) \approx f(x) + f'(x) * \delta x + \frac{1}{2} f''(x) \delta x^2$.
So the loss becomes
$$O^t = \sum_i [L(y_i, \hat{y_i}^{t - 1}) + g_i * f(x_i) + \frac{1}{2} * h_i * f(x_i)^2] + \Omega(\theta)$$
Assuming a decision tree base classifier, we can use the example regularization above and the objective becomes:
$$O^t = \sum_i [L(y_i, \hat{y_i}^{t - 1}) + g_i * f(x_i) + \frac{1}{2} * h_i * f(x_i)^2] + \gamma T + \frac{1}{2}\sum_j \lambda * ||\theta_j||_2$$
With a few transformations, we can redefine the loss in a way that lets us directly optimize the function
with the standard CART learning algorithm.
First we define $\mathbb{I}(j == i)$ as an indicator function that returns 1 if example $i$, when pushed through
the decision tree, ends up in leaf node $j$. Then the result of $f(x_i)$ is $w_j$, the leaf node score at node $j$,
if $x_i$ ends up in leaf $j$. Now we can rearrange the objective function towards the leaf nodes as:
$$O^t = \sum_j [(\sum_i \mathbb{I}(j == i) * g_i) * w_j + \frac{1}{2} (\sum_i \mathbb{I}(j == i) * h_i + \lambda) * w_j^2] + \gamma T$$
Redefining the sums of the gradients as $G_j$ and $H_j$ we get:
$$O^t = \sum_j [G_j * w_j + \frac{1}{2} (H_j + \lambda) * w_j^2] + \gamma T$$
Now we can update a leaf node using the gradient statistics by setting the derivative of the objective to zero:
$$G_j + (H_j + \lambda) * w_j = 0$$
and then $w_j = -\frac{G_j}{H_j + \lambda}$.
Intuitively, each leaf node models the gradient step to take for the examples that fall into it. And these gradients are normalized by their second
order statistics. So each leaf prediction will push the total output of the ensemble closer towards the wanted result.
In order to construct a tree, we can use the same gradients to define a gain function and then learn the trees like a
normal decision tree. The formula compares the gradient "amplitude" before the split, $(G_l + G_r)^2 / (H_l + H_r + \lambda)$,
with the amplitude after the split, $G_l^2 / (H_l + \lambda) + G_r^2 / (H_r + \lambda)$, where $G_l$, $H_l$, $G_r$ and $H_r$ are the sums of the gradients and Hessians in the left and
right node. Then the gain is simply:
$$Gain = \frac{1}{2} [ \frac{G_l^2}{H_l + \lambda} + \frac{G_r^2}{H_r + \lambda} - \frac{(G_l + G_r)^2}{H_l + H_r + \lambda}] - \gamma$$
Now we can implement these functions:
<pre class="brush:scala">
package gbt

import math._

trait Loss {
  def gradient(truth: Double, prediction: Double): Double
  def hessian(truth: Double, prediction: Double): Double
  def loss(truth: Double, prediction: Double): Double
  def output(x: Double): Double
}

class Logistic extends Loss {

  private def sigmoid(x: Double): Double = 1.0 / (1.0 + exp(-x))

  override def loss(truth: Double, prediction: Double): Double = {
    val isOne  = -truth * log(sigmoid(prediction))
    val isZero = (1 - truth) * log(1.0 - sigmoid(prediction))
    (isOne - isZero)
  }

  override def gradient(truth: Double, prediction: Double): Double = sigmoid(prediction) - truth

  override def hessian(truth: Double, prediction: Double): Double =
    sigmoid(prediction) * (1.0 - sigmoid(prediction))

  override def output(x: Double) = sigmoid(x)

}

class Squared extends Loss {
  override def loss(truth: Double, prediction: Double): Double = pow(truth - prediction, 2.0)
  override def gradient(truth: Double, prediction: Double): Double = 2.0 * (prediction - truth)
  override def hessian(truth: Double, prediction: Double): Double = 2.0
  override def output(x: Double): Double = x
}

object LossUtils {

  def score(gradients: Vector[Double], hessians: Vector[Double], lambda: Double): Double = {
    val G = gradients.sum
    val H = hessians.sum
    - G / (H + lambda)
  }

  def gain(
    gradients: Vector[Double],
    hessians: Vector[Double],
    left: Vector[Int],
    right: Vector[Int],
    lambda: Double,
    gamma: Double
  ): Double = {
    val GL = left.map(i => gradients(i)).sum
    val HL = left.map(i => hessians(i)).sum
    val GR = right.map(i => gradients(i)).sum
    val HR = right.map(i => hessians(i)).sum
    0.5 * (
      (pow(GL, 2) / (HL + lambda)) + (pow(GR, 2) / (HR + lambda)) - (pow(GL + GR, 2) / (HL + HR + lambda))
    ) - gamma
  }

}
</pre>
The full code can be found on <a href = "https://github.com/dkohlsdorf/GradientBoostedTrees">github</a>
dkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.com0Hamburg, Germany53.5510846 9.993681899999955952.9475551 8.7027883999999567 54.1546141 11.284575399999955tag:blogger.com,1999:blog-1040545417301646113.post-44143309604485251052017-11-29T07:47:00.000-08:002019-07-12T04:00:58.295-07:00Refactoring My Previous Deep Learning Library Ideas<h2>Update: My Neural Network Posts</h2>
<ul>
<li> Basic Auto Differentiation: <a href="http://daniel-at-world.blogspot.com/2017/02/ai-implement-your-own-automatic.html"> Post </a> </li>
<li> Basic Auto Differentiation with Weight Sharing: <a href="http://daniel-at-world.blogspot.com/2017/02/implementation-notes-on-my-auto.html"> Post </a></li>
<li> Auto Differentiation Code from Book: <a href="http://daniel-at-world.blogspot.com/2017/03/auto-differentiation-implementing-book.html"> Post </a></li>
<li> Refactoring The Ideas: <a href="http://daniel-at-world.blogspot.com/2017/11/refactoring-my-previous-deep-learning.html"> Post </a></li>
</ul>
<div>
In some previous posts I looked at how I could implement a simple deep learning library to gain a better understanding of the inner workings of: <b><a href="http://daniel-at-world.blogspot.de/2017/02/ai-implement-your-own-automatic.html">Auto Differentiation</a></b>, <b><a href="http://daniel-at-world.blogspot.de/2017/02/implementation-notes-on-my-auto.html">Weight Sharing</a></b> and a good <b><a href="http://daniel-at-world.blogspot.de/2017/03/auto-differentiation-implementing-book.html">Back Propagation</a></b> implementation from the <b><a href="http://www.deeplearningbook.org/">deep learning book</a></b>.</div>
<div>
<br /></div>
<div>
In this post I will present my ideas for a refactored version. The main point is that the graph representation I implemented in the first two posts made it easier to build the computation graph, </div>
<div>
but the code to run back propagation, especially with weight sharing, was quite messy. So this refactored version first creates a computation tree as before, only that the actual computations are decoupled from the tree. The tree is later translated into the instructions and parent / child relationships used in the book version of the back propagation algorithm. </div>
<div>
<br /></div>
<div>
Now in the first step we write model setup code. For example, a neural network for mnist classification:</div>
<div>
<br /></div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyK4WzXIMOM6eQbOKg0ZIwMmXphNJ5-_XiPPfiHSNURm-uNo9dJPYEAFXpjRFrYxDj6rXDg2wSYDsRpWliDli5WjXVU4sEUr3suJN-f40QIOdIwLd8BlEpha4KDVjmV9Cqchijs-u93GY/s1600/Screen+Shot+2017-11-29+at+16.51.17.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="646" data-original-width="1316" height="195" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyK4WzXIMOM6eQbOKg0ZIwMmXphNJ5-_XiPPfiHSNURm-uNo9dJPYEAFXpjRFrYxDj6rXDg2wSYDsRpWliDli5WjXVU4sEUr3suJN-f40QIOdIwLd8BlEpha4KDVjmV9Cqchijs-u93GY/s400/Screen+Shot+2017-11-29+at+16.51.17.png" width="400" /></a></div>
<br /></div>
<div>
<br /></div>
<div>
As in tensorflow I define placeholders, variables and computation nodes, along with their name.</div>
<div>
Furthermore, the variables get a shape and an initialisation method. In the background this method constructs a computation tree rooted at a5 using the following DSL.</div>
<div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6mmegN6rRkfhtFDC71atgPLgaQ8LMwalq_5iLhSqW-t5Uj8XARN1Hb40e4GshZRopzZSgrFrHmUlbJR73EUY_Kl75q2M-FPni_Jiy8_oHCn-TfsEzok_Tj5QiXdQ8qzsx7ov9xig1GX0/s1600/Screen+Shot+2017-11-29+at+16.51.07.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="643" data-original-width="1312" height="195" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6mmegN6rRkfhtFDC71atgPLgaQ8LMwalq_5iLhSqW-t5Uj8XARN1Hb40e4GshZRopzZSgrFrHmUlbJR73EUY_Kl75q2M-FPni_Jiy8_oHCn-TfsEzok_Tj5QiXdQ8qzsx7ov9xig1GX0/s400/Screen+Shot+2017-11-29+at+16.51.07.png" width="400" /></a></div>
<br />
<br /></div>
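Since the DSL itself is only shown as a screenshot, here is a rough sketch of what building such a computation tree could look like. All names and signatures below are my own assumptions for illustration, not the actual DSL:

```python
# Hypothetical sketch of a computation tree as described above; the node
# types, helper names and shapes are assumptions, not the actual DSL.

class Node:
    def __init__(self, name, instruction, children=(), shape=None):
        self.name = name                # unique node name, e.g. 'a5'
        self.instruction = instruction  # 'placeholder', 'variable', 'add', ...
        self.children = list(children)  # child nodes feeding into this one
        self.shape = shape              # only needed for variables

def placeholder(name):
    return Node(name, 'placeholder')

def variable(name, shape):
    return Node(name, 'variable', shape=shape)

def op(name, instruction, *children):
    return Node(name, instruction, children)

# sigmoid(x * w1 + b1), rooted at the node named 'a1'
x  = placeholder('x')
w1 = variable('w1', (784, 100))
b1 = variable('b1', (1, 100))
a1 = op('a1', 'sigmoid', op('z1', 'add', op('m1', 'mul', x, w1), b1))
```

Note that no computation happens here; the tree only records names, instructions and the parent/child structure, which is exactly what the translation step below needs.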
<div>
Each node keeps track of its name, an instruction name and, if needed, a shape.</div>
<div>
We can then convert this tree into a neural network ready to be used in the back propagation code from the original post. We build four maps keeping:</div>
<div>
<ol>
<li>Mapping from node names to computations (like add, multiply, ...)</li>
<li>Mapping from node names to their parents node names</li>
<li>Mapping from node names to their child node names</li>
<li>Mapping from node names that are weights to matrices.</li>
</ol>
<div>
All computations such as add or multiply implement the forward pass through the variable </div>
<div>
as well as the backward pass.</div>
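As an illustration of the four maps, here is a minimal sketch that derives them from a small tree of named nodes. The tree representation and the map names are my assumptions, not the actual code:

```python
from collections import namedtuple
import numpy as np

# Assumed tree representation: each node carries a name, an instruction,
# a list of child nodes and (for variables) a shape.
Node = namedtuple('Node', 'name instruction children shape')

def build_maps(root, rng):
    computations, parents, children, weights = {}, {}, {}, {}

    def visit(node):
        computations[node.name] = node.instruction
        children[node.name] = [c.name for c in node.children]
        for child in node.children:
            parents.setdefault(child.name, []).append(node.name)
            visit(child)
        if node.instruction == 'variable':
            # weight matrices get initialised here, e.g. with gaussian noise
            weights[node.name] = rng.normal(0.0, 0.01, node.shape)

    visit(root)
    return computations, parents, children, weights

# tiny example: a single multiplication node m = x * w
x = Node('x', 'placeholder', [], None)
w = Node('w', 'variable', [], (2, 2))
m = Node('m', 'mul', [x, w], None)
computations, parents, children, weights = build_maps(m, np.random.default_rng(0))
```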
<div>
<br /></div>
<div>
Using a class I called TensorStore that maps from names to matrices, I implemented the forward and the backward pass as follows:</div>
</div>
<div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVN7guiYphyphenhyphenBJnYVfL3EeB0_12MIhERS4NwfXVuAUC2nHIBZfPOyJDZ2r0dFtzkbNaBMfnNy8U_AhzOTYjrYQyV_Fpav3fitYFMBVFlanLaOcGM5Ywo6BnygLJSpp8XtMz4QDtLDhgtC0/s1600/Screen+Shot+2017-11-29+at+16.50.38.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="864" data-original-width="1054" height="326" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVN7guiYphyphenhyphenBJnYVfL3EeB0_12MIhERS4NwfXVuAUC2nHIBZfPOyJDZ2r0dFtzkbNaBMfnNy8U_AhzOTYjrYQyV_Fpav3fitYFMBVFlanLaOcGM5Ywo6BnygLJSpp8XtMz4QDtLDhgtC0/s400/Screen+Shot+2017-11-29+at+16.50.38.png" width="400" /></a></div>
<br />
<br /></div>
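To make the idea concrete, here is a hedged sketch of such a forward and backward pass driven by the maps, using a plain dict in place of the TensorStore. The two example instructions and the fixed topological order are simplifications for illustration, not the actual implementation:

```python
import numpy as np

# store: name -> matrix; grads: name -> gradient of the loss w.r.t. that node

def forward(order, instr, children, store):
    for name in order:                      # children come before parents
        if instr[name] == 'mul':
            a, b = children[name]
            store[name] = store[a] @ store[b]
        elif instr[name] == 'sigmoid':
            store[name] = 1.0 / (1.0 + np.exp(-store[children[name][0]]))

def backward(order, instr, children, store, grads):
    for name in reversed(order):            # parents come before children
        g = grads[name]
        if instr[name] == 'mul':
            a, b = children[name]
            grads[a] = grads.get(a, 0.0) + g @ store[b].T
            grads[b] = grads.get(b, 0.0) + store[a].T @ g  # accumulation also covers weight sharing
        elif instr[name] == 'sigmoid':
            c = children[name][0]
            grads[c] = grads.get(c, 0.0) + g * store[name] * (1.0 - store[name])

# tiny network: a = sigmoid(x * w)
instr    = {'x': 'input', 'w': 'variable', 'm': 'mul', 'a': 'sigmoid'}
children = {'x': [], 'w': [], 'm': ['x', 'w'], 'a': ['m']}
store    = {'x': np.ones((1, 3)), 'w': np.full((3, 2), 0.1)}
order    = ['x', 'w', 'm', 'a']

forward(order, instr, children, store)
grads = {'a': np.ones((1, 2))}              # gradient of the loss w.r.t. the output
backward(order, instr, children, store, grads)
```

Because gradients are accumulated with `grads.get(..., 0.0) +`, a weight appearing as the child of several nodes simply collects contributions from all of them, which is the weight-sharing case that was messy in the earlier version.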
<div>
The code can be found alongside the other versions on <a href="https://github.com/dkohlsdorf/RefactoredDeepLearningIdeas">github</a>. Examples for the spiral dataset as well as the MNIST dataset are attached.<br />
<br />
Of course, the code is not at all up to par with the popular libraries in terms of speed or functionality, so for me it is more an ongoing investigation into how these libraries are implemented and what I can reproduce. :) On MNIST the given network achieves between 96% and 98% accuracy depending on the starting conditions.<br />
<br /></div>
<div>
<br /></div>
dkohl, Bremen, Germany
2017-09-10<h2>
Variational Fashion Encoder</h2>
<br />
<div class="cell border-box-sizing code_cell rendered" style="-webkit-box-align: stretch; -webkit-box-orient: vertical; -webkit-text-stroke-width: 0px; align-items: stretch; background-color: white; border: 1px solid transparent; box-sizing: border-box; color: black; display: flex; flex-direction: column; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: normal; letter-spacing: normal; margin: 0px; orphans: 2; outline: none; padding: 5px 5px 5px 0px; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; width: 928px; word-spacing: 0px;">
<span style="text-align: justify;">In this experiment I want to try out <a href="https://arxiv.org/pdf/1606.05908.pdf" style="color: #0088cc; display: inline !important; margin: 0px; padding: 0px;">Variational Auto Encoders</a>, an unsupervised feature learning algorithm </span><span style="text-align: justify;">on a new fashion classification dataset from <a href="https://github.com/zalandoresearch/fashion-mnist" style="color: #0088cc; display: inline !important; margin: 0px; padding: 0px;">Zalando</a>. Since I wanted to play around with both more, </span><span style="text-align: justify;">I will build a generic Variational Auto Encoder and then learn it on the new fashion-mnist dataset.</span></div>
Code can be found on <a href="https://github.com/dkohlsdorf/Variational-Fashion-Encoder">github</a><br />
<br />
<br />
<h3>
Variational Auto Encoder</h3>
<div>
<span style="background-color: white; font-family: "helvetica neue" , "helvetica" , "arial" , sans-serif; font-size: 14px;">The bottom part of the model is embedding the input X into a mean and variance vector.</span></div>
<div>
<span style="background-color: white; font-family: "helvetica neue" , "helvetica" , "arial" , sans-serif; font-size: 14px;">The mean and variance represent the parameters of a Gaussian that is trained to be close to a standard normal distribution N(0, I). </span><span style="background-color: white; font-family: "helvetica neue" , "helvetica" , "arial" , sans-serif; font-size: 14px;">The decoder network tries to reconstruct the input from a sample of said distribution. The encoder as well as the decoder network can have multiple layers, in my case three each.</span></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirVja3i1xexmcTDOg9HRvnHJvbbgojUdKKThPKImqtJ6gBzI19zNK7oIdam1SuzPzuuVERlct5KTPEsKPRdW_wfDQfKrpamWi5sM5G3AWj_AMDDUBjeFeOZ6yrIu1ldgCA9kyUjBGfe4I/s1600/vae.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1140" data-original-width="1076" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirVja3i1xexmcTDOg9HRvnHJvbbgojUdKKThPKImqtJ6gBzI19zNK7oIdam1SuzPzuuVERlct5KTPEsKPRdW_wfDQfKrpamWi5sM5G3AWj_AMDDUBjeFeOZ6yrIu1ldgCA9kyUjBGfe4I/s320/vae.png" width="302" /></a></div>
<br />
The loss function is a combination of the reconstruction error and the Kullback-Leibler divergence between the distribution from the encoder network and a standard normal distribution, basically centering the latent space around 0 with standard deviation 1. This can be seen as a form of regularisation. Furthermore, the decoder<br />
does not reconstruct directly from the encoder output but from a sample of the distribution in the latent space. <br />
<br />
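The two ingredients just described, the closed-form KL term against N(0, I) and sampling from the latent distribution, can be sketched in NumPy as follows (a minimal illustration of the maths, not the actual model code; the function names are my own):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    # closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), one value per example
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)

def sample_latent(mu, log_var, rng):
    # reparameterisation: z = mu + sigma * eps with eps ~ N(0, I); the decoder
    # reconstructs from this sample, not directly from the encoder output
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# a batch of 4 examples with an 8-dimensional latent space
mu = np.zeros((4, 8))
log_var = np.zeros((4, 8))
z = sample_latent(mu, log_var, np.random.default_rng(0))
```

The training loss is then the reconstruction error of the decoder output plus the KL term, averaged over the batch.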
<h3>
Fashion Reconstruction</h3>
<div>
The images in the fashion dataset are 28 pixels by 28 pixels, in the same way as the MNIST dataset.</div>
<div>
The fashion data consists of 10 classes of clothes: </div>
<div>
<ol>
<li>tshirt</li>
<li>trouser</li>
<li>pullover</li>
<li>dress</li>
<li>coat</li>
<li>sandal</li>
<li>shirt</li>
<li>sneaker</li>
<li>bag</li>
<li>ankle boot</li>
</ol>
</div>
<div>
After training for 5 epochs on the training set, I plotted the reconstructions from the Variational Auto Encoder. <br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzLEW-OY4PRwb6uhAKirtqBQonhVdd8iU-gK21qu33CbSVn2QQKgFSUhN-kDn4VY69i7AifITCXSuYgEmONEqJ4bVjXWqOsr03rAzihtd96kvKTyQezgCrMmqfe0DnulJ9dm2ATTzMPDo/s1600/prediction.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1600" data-original-width="1600" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzLEW-OY4PRwb6uhAKirtqBQonhVdd8iU-gK21qu33CbSVn2QQKgFSUhN-kDn4VY69i7AifITCXSuYgEmONEqJ4bVjXWqOsr03rAzihtd96kvKTyQezgCrMmqfe0DnulJ9dm2ATTzMPDo/s640/prediction.png" width="640" /></a></div>
As one can see, the reconstructions work quite well, indicating that the learned latent space is useful.</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />dkohl, Hamburg, Germany
2017-08-01<h2>
Investigating Non Linear Optimization in Python</h2><div style="-webkit-text-stroke-width: 0px; color: black; font-family: Times; font-size: medium; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
I recently skimmed through some class videos from "<a href="https://www.udacity.com/course/machine-learning-for-trading--ud501">Machine Learning For Trading</a>", a course by Tucker Balch, Professor at Georgia Tech. One interesting part was portfolio optimisation.</div>
<div>
That is, if you want to invest some amount of money, how do you distribute it over a selection of assets in a way that minimises risk while at the same time letting you earn money through your investment? A portfolio is basically an allocation of the money into different options or, in other words, the percentage of the total amount of money allocated to an option. Anyway, in the class as well as the <a href="http://shop.oreilly.com/product/0636920032441.do">course book</a>, the optimisation is performed using a<br />
Python method for constrained non-linear optimisation. The algorithm used to crunch these problems is called Sequential Least Squares Programming, which is implemented using a Fortran wrapper.<br />
<div>
The original algorithm can be found in the <a href="http://calgo.acm.org/">ACM Algorithms catalog</a> (<a href="https://github.com/scipy/scipy/edit/master/scipy/optimize/slsqp/slsqp_optmz.f">Algorithm 733</a>), written by D. Kraft. The code is automatically translated to Python using the Fortran-to-Python (<a href="http://cens.ioc.ee/projects/f2py2e/">f2py</a>) library. </div>
</div>
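As a quick illustration of the kind of constrained problem SLSQP solves, here is a sketch of a minimum-variance portfolio using SciPy's `minimize`; the covariance numbers for the three hypothetical assets are made up:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up covariance matrix for three hypothetical assets; minimising the
# portfolio variance w' C w is just one possible risk objective.
cov = np.array([[0.10, 0.02, 0.04],
                [0.02, 0.08, 0.01],
                [0.04, 0.01, 0.09]])

def risk(w):
    return w @ cov @ w

n = cov.shape[0]
result = minimize(
    risk,
    x0=np.ones(n) / n,                            # start from an equal allocation
    method='SLSQP',
    bounds=[(0.0, 1.0)] * n,                      # no short selling
    constraints=[{'type': 'eq', 'fun': lambda w: w.sum() - 1.0}],
)
weights = result.x
```

The equality constraint forces the allocations to sum to one; maximising return for a given risk would simply swap in a different objective under the same constraints.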
dkohl, Hamburg, Germany
2017-07-23<h2>
Black Box Variational Inference and PyFlux</h2>I found an interesting paper on black box variational inference. Variational inference is an<br />
approximate inference algorithm for factored probabilistic models (such as Bayesian networks).<br />
<br />
In probabilistic inference, we are normally interested in the posterior density of a latent variable<br />
given all our data. The basic idea is to turn the inference problem into an optimisation problem.<br />
That is, we build an approximate probability distribution that is differentiable with respect to its parameters. We then find the parameter set that minimises the difference between the true posterior and the approximate distribution.<br />
<br />
The problem is that in order to do this for a new model, we still have to do a lot of manual derivations<br />
and work through a lot of math before we can actually use this algorithm. This new paper<br />
however, shows an algorithm that is easier to use given a new model. In my opinion,<br />
one of the successes in deep learning is auto differentiation, where we do not<br />
have to develop all gradients on our own but let the computer do it (as described in earlier posts).<br />
One library that implements this inference method for complex time series is called pyFlux [2].<br />
<br />
The proposed algorithm [1] works as follows. First, we sample the latent variables in our model<br />
from the approximate distribution. We then use the samples to compute a gradient<br />
for each parameter that minimises the Kullback-Leibler divergence or, more specifically, maximises the Evidence Lower Bound (ELBO).<br />
<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhKkTatKcSu4udgzar7-mQBExOz1G8F-0tQo_pK5Qsmp7R0-ojpycVh1O35rjABhUwoV64oxSvz9HiKEIZ6Pej2EnsJGWsbx5ROgalto04xw8rhhqRIp9nS6eWEKiLVKrck6VxFjJtFHts/s1600/Screen+Shot+2017-07-23+at+12.25.32.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="479" data-original-width="662" height="231" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhKkTatKcSu4udgzar7-mQBExOz1G8F-0tQo_pK5Qsmp7R0-ojpycVh1O35rjABhUwoV64oxSvz9HiKEIZ6Pej2EnsJGWsbx5ROgalto04xw8rhhqRIp9nS6eWEKiLVKrck6VxFjJtFHts/s320/Screen+Shot+2017-07-23+at+12.25.32.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Reproduced from original paper [1]</td></tr>
</tbody></table>
The z[s] are samples from the latent space. Each z[s] includes a value for each of the latent variables. These samples are used to build the gradient.<br />
The Robbins-Monro sequence models an adaptive learning rate schedule<br />
that guarantees convergence. The gradient has multiple components: the joint probability of the original distribution, the log of the approximate distribution and the derivative of the approximate distribution. In many cases, these values are easy to compute. First note that, when using the samples, all variables in the original joint distribution are observed. For a Bayesian network that means the joint distribution is simply the product of the variables' conditional distributions:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5UEDn9M3LuW2WoRhOjnnvjnsCEA7DXSlcVkzXu4B-FBuOW_1YAr9skPKcR_T_B0Yo29Uqi8vjK2xVCFhS-95FWttcVWbsIksYDpVXTmsTXa-Rsr7aDS5pGnWjtsHf8lLtxQvmxPjXFyw/s1600/CodeCogsEqn.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="27" data-original-width="259" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5UEDn9M3LuW2WoRhOjnnvjnsCEA7DXSlcVkzXu4B-FBuOW_1YAr9skPKcR_T_B0Yo29Uqi8vjK2xVCFhS-95FWttcVWbsIksYDpVXTmsTXa-Rsr7aDS5pGnWjtsHf8lLtxQvmxPjXFyw/s1600/CodeCogsEqn.gif" /></a></div>
<br />
In the paper and in the pyFlux library, the approximate distribution is implemented as a mean field,<br />
which means that the variables in our model are treated as independent. This makes the sampling and evaluation of the model very easy: each variable can be sampled individually and the joint density is simply the product over the individual variables. Finally, the pyFlux library implements each variable in the approximate distribution as a one-dimensional Gaussian. Putting all that together,<br />
the final gradient becomes:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtg8hCdZCvmjYWsvX8dslaCdXnkccBacwtc8z8-gWqbMuxDbGsg6D4Xj4N0NZtUpyBJ-Dnh3WYLORBbOvzGoEIT2T0apMHR4MHQq4fRmzaHXVogJw_Y_Wrzc67ZdKDDX7Tq1NpbMOHajU/s1600/Screen+Shot+2017-07-23+at+12.40.43.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="144" data-original-width="612" height="75" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtg8hCdZCvmjYWsvX8dslaCdXnkccBacwtc8z8-gWqbMuxDbGsg6D4Xj4N0NZtUpyBJ-Dnh3WYLORBbOvzGoEIT2T0apMHR4MHQq4fRmzaHXVogJw_Y_Wrzc67ZdKDDX7Tq1NpbMOHajU/s320/Screen+Shot+2017-07-23+at+12.40.43.png" width="320" /></a></div>
<br />
Where the lambda parameters will be the mean and variance of each of the gaussians used<br />
with the gradients:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKzLsuiqcyeLNrhXla4dOfEZXCfNzvZtXqgKOh7jJbShISvK_l8J6W42UHGawSQ8hztkbEwz3_f7fCi9zDQlHUrsFVWnsZMPBEX8z8aDBGIRrTaHhH6s7uf8wJJ-9Ks1ypIfpWQRxPhVk/s1600/bbvi_c94353c8104dd4ec7b9178a2ce3fb6915c7684a4.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="17" data-original-width="262" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKzLsuiqcyeLNrhXla4dOfEZXCfNzvZtXqgKOh7jJbShISvK_l8J6W42UHGawSQ8hztkbEwz3_f7fCi9zDQlHUrsFVWnsZMPBEX8z8aDBGIRrTaHhH6s7uf8wJJ-9Ks1ypIfpWQRxPhVk/s1600/bbvi_c94353c8104dd4ec7b9178a2ce3fb6915c7684a4.png" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhAy9Q2cA-wOenlvjOGvmUaprNyxUD8pPZspSl79Lxl1RvV4bGaJ8PPEWSPBXRDFrb1yNfnf7JWy5jraqs70EX-a38CCIcZ5_D1_NPazbJHieQ1b8sxy9gfSvxHcF53AR-C2ozhJJxXOQY/s1600/bbvi_f2414c5132312bfcba075cae71d0befa6dee885b.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="17" data-original-width="160" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhAy9Q2cA-wOenlvjOGvmUaprNyxUD8pPZspSl79Lxl1RvV4bGaJ8PPEWSPBXRDFrb1yNfnf7JWy5jraqs70EX-a38CCIcZ5_D1_NPazbJHieQ1b8sxy9gfSvxHcF53AR-C2ozhJJxXOQY/s1600/bbvi_f2414c5132312bfcba075cae71d0befa6dee885b.png" /></a></div>
<br />
<br />
<br />
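Putting the estimator together for a toy model, a sketch might look as follows. The model (a standard normal prior on a single latent z with a unit-variance Gaussian likelihood) and all function names are assumptions chosen purely for illustration:

```python
import numpy as np

def log_p(x, z):
    # log p(x, z) = log N(z; 0, 1) + sum_i log N(x_i; z, 1), up to constants
    return -0.5 * z**2 - 0.5 * np.sum((x - z)**2)

def log_q(z, mu, log_sigma):
    # log density of the one-dimensional Gaussian q(z; mu, sigma)
    sigma = np.exp(log_sigma)
    return -0.5 * ((z - mu) / sigma)**2 - log_sigma - 0.5 * np.log(2 * np.pi)

def grad_log_q(z, mu, log_sigma):
    # score function of q with respect to (mu, log_sigma)
    sigma = np.exp(log_sigma)
    return np.array([(z - mu) / sigma**2, ((z - mu) / sigma)**2 - 1.0])

def elbo_gradient(x, mu, log_sigma, num_samples=64, rng=None):
    # Monte Carlo estimate: average grad log q(z[s]) * (log p(x, z[s]) - log q(z[s]))
    if rng is None:
        rng = np.random.default_rng(0)
    grad = np.zeros(2)
    for _ in range(num_samples):
        z = mu + np.exp(log_sigma) * rng.standard_normal()   # z[s] ~ q
        grad += grad_log_q(z, mu, log_sigma) * (log_p(x, z) - log_q(z, mu, log_sigma))
    return grad / num_samples

g = elbo_gradient(np.array([1.0, 2.0]), mu=0.0, log_sigma=0.0)
```

Note that only `log_p` depends on the model; swapping in a different model requires no new derivations, which is exactly the "black box" property of the method.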
<h2>
References:</h2>
[1] <a href="http://www.cs.columbia.edu/~blei/papers/RanganathGerrishBlei2014.pdf">Ranganath, Gerrish, Blei: "Black Box Variational Inference", AISTATS 2014, 2014</a><br />
<a href="http://www.pyflux.com/black-box-variational-inference/">[2] PyFlux</a>dkohl, Hamburg, Germany
2017-05-26<h2>
A Generic Sequence Alignment Algorithm</h2>In this post I will describe how a general sequence alignment could be implemented in Scala.<br />
For readers that don't know what sequence alignment is, I recommend reading my<br />
<a href="http://daniel-at-world.blogspot.de/2012/11/building-gesture-recognizer-part-1.html">post</a> on building a gesture recogniser under Dynamic Time Warping distance (DTW).<br />
<br />
Sequence alignment is often used in several contexts including speech recognition,<br />
gesture recognition, approximate string matching, typing correction, biological<br />
sequence alignment and time series classification and even stereo vision.<br />
All these algorithms are based on a technique called dynamic programming and the algorithms<br />
are all very similar. However, in the literature this alignment technique is often found under different names (Dynamic Time Warping, Levenshtein Distance, Needleman-Wunsch).<br />
<br />
I will first formulate my idea of a general alignment algorithm with a Scala implementation and afterwards describe implementations of the three algorithms above.<br />
<br />
Given two sequences X = [x1 .. xi .. xN] and Y = [y1 .. yj .. yM], we use the following<br />
recursion to find the alignment score S(x, y). At each position i in sequence X and position j in sequence Y we look backwards at the matching score in X only, in Y only, and in X and Y simultaneously, resulting in the following recursion:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlmtcNqXj2biCN-xYcTKdqUzF-so0H6A4k4FpqPgncnYxL1vdJKgIOvqX2rkZHCAnhkTxnS7ZXs93WOSyEVcY6jyrnl_ccPoXEFJbbzRYcEneN1p1RWLFNrOgCzAZggQT61bmrRcWefTM/s1600/png.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="20" data-original-width="458" height="16" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlmtcNqXj2biCN-xYcTKdqUzF-so0H6A4k4FpqPgncnYxL1vdJKgIOvqX2rkZHCAnhkTxnS7ZXs93WOSyEVcY6jyrnl_ccPoXEFJbbzRYcEneN1p1RWLFNrOgCzAZggQT61bmrRcWefTM/s400/png.png" width="400" /></a></div>
<br />
The rank function scores the alignment at position i and j based on the previous alignment scores and<br />
the current elements. The rank function is the main component that changes in all three algorithms.<br />
Looking back in X direction is also called an insertion, in Y direction a deletion and for both it is<br />
called a match so the ranking function becomes:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBpE9Dn0je0lgCEV-AoCHKsJ5b22zikSblI_2wvuWWrXg4RP1aLLHpkWtbiJvCA4_anE_D0VhoNN1jVfmUFIx9dTAYfwnU9YThsOZMi4FlYwgDzOTYYQWQnHEaoOc5DEcJRX69oIlw72w/s1600/png+%25281%2529.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="18" data-original-width="290" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBpE9Dn0je0lgCEV-AoCHKsJ5b22zikSblI_2wvuWWrXg4RP1aLLHpkWtbiJvCA4_anE_D0VhoNN1jVfmUFIx9dTAYfwnU9YThsOZMi4FlYwgDzOTYYQWQnHEaoOc5DEcJRX69oIlw72w/s1600/png+%25281%2529.png" /></a></div>
<br />
<br />
Furthermore we will also implement a generic initialisation function that starts our recursion.<br />
Ok now the Scala code:
<br />
<br />
<pre class="brush:scala">class Alignment[A](
init: (Int, Int) => Double,
rank: (Double, Double, Double, A, A) => Double
) {
def score(x: Array[A], y: Array[A]): Double = {
val n = x.size
val m = y.size
val dp = Array.ofDim[Double](n + 1, m + 1)
for(i <- 0 until n + 1) dp(i)(0) = init(i, 0)
for(j <- 0 until m + 1) dp(0)(j) = init(0, j)
for(i <- 1 until n + 1) {
for(j <- 1 until m + 1) {
dp(i)(j) = rank(
dp(i - 1)(j), // insertion
dp(i - 1)(j - 1), // match
dp(i)(j - 1), // deletion
x(i - 1), // current sample in x
y(j - 1) // current sample in y
)
}
}
dp(n)(m)
}
}
</pre>
<br />
<br />
The alignment class has a type parameter specifying the type of the elements in the sequences we want
to align and takes the initialising function and the ranking function as input. Now DTW and the Edit distance are just specialisations with a different type parameter and scoring function:
<br />
<br />
<pre class="brush:scala"> class DTW extends Alignment[Double](
init = (i: Int, j: Int) => if(i == 0 && j == 0) 0.0 else Double.PositiveInfinity,
rank = (insert: Double, matching: Double, delete: Double, x: Double, y: Double) =>
if(insert < matching && insert < delete) insert + math.pow(x - y, 2)
else if (delete < insert && delete < matching) delete + math.pow(x - y, 2)
else matching + math.pow(x - y, 2)
)
</pre>
<br />
<br />
DTW is defined over doubles, initialising the recursion with 0.0 at position (0,0) and all other border cells with infinity. The ranking adds the squared distance between the current elements to the insertion, deletion and match scores and then takes the minimum. It is easy to see how this can be generalised to multivariate sequences: the type parameter becomes Seq[Double] and the Euclidean distance is implemented for vectors. One can also use kernels
or the cosine similarity instead.
The next class implements the edit distance for discrete sequence elements (here characters):
<br />
<br />
<pre class="brush:scala"> class Levenshtein extends Alignment[Char](
init = (i: Int, j: Int) => if(i == 0 && j == 0) 0.0 else if(i == 0) j else i,
rank = (insert: Double, matching: Double, delete: Double, x: Char, y: Char) =>
if(insert < matching && insert < delete) insert + 1
else if (delete < insert && delete < matching) delete + 1
else matching + (if(x == y) 0.0 else 1.0)
)
</pre>
<br />
<br />
The function is initialised with the index at the borders and again takes the minimum
of insertion, deletion and match. Insertions and deletions count as an error, so we add 1.
A match can also be a substitution, so we add 1 to the error if the characters do not match.
The last algorithm is the Needleman-Wunsch algorithm. Instead of minimising a distance,
we maximise a score. The similarity between any two symbols can be defined freely, and the 'gap' penalty
is not 1 as in the edit distance but a given constant.
<br />
<br />
<pre class="brush:scala"> final val Sim = Map(
('a', 'a') -> 10,
('a', 'g') -> -1,
('g', 'a') -> -1,
('a', 'c') -> -3,
('c', 'a') -> -3,
('a', 't') -> -4,
('t', 'a') -> -4,
('g', 'g') -> 7,
('g', 'c') -> -5,
('c', 'g') -> -5,
('g', 't') -> -3,
('t', 'g') -> -3,
('c', 'c') -> 9,
('c', 't') -> 0,
('t', 'c') -> 0,
('t', 't') -> 8
)
final val Gap = -5
class NeedlemanWunsch extends Alignment[Char](
init = (i: Int, j: Int) => if(i == 0 && j == 0) 0.0 else if(i == 0) j * Gap else i * Gap,
rank = (insert: Double, matching: Double, delete: Double, x: Char, y: Char) =>
if(insert > matching && insert > delete) insert + Gap
else if (delete > insert && delete > matching) delete + Gap
else matching + Sim(x, y)
)
</pre>
<br />
<br />
If one needs the actual alignment, we can backtrack through the alignment process in all of the above cases.
The code can be found here [<a href="https://gist.github.com/dkohlsdorf/46c6cca6b37fdac0e0f677c2aae9eeff">GitHub</a>]
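As an illustration of that backtracking step (sketched in Python for brevity, and not code from the repository): keep the dynamic programming table and walk back from (n, m), at each cell taking the move that produced its score. Here for the edit distance:

```python
def edit_alignment(x, y):
    # fill the dynamic programming table for the edit distance
    n, m = len(x), len(y)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1): dp[i][0] = i
    for j in range(m + 1): dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,                       # insertion
                           dp[i][j - 1] + 1,                       # deletion
                           dp[i - 1][j - 1] + (x[i-1] != y[j-1]))  # match / substitution
    # walk back from (n, m) to (0, 0), always taking a move that explains the
    # current cell's score; the visited cells form the alignment path
    cells, i, j = [], n, m
    while i > 0 or j > 0:
        cells.append((i, j))
        if i > 0 and j > 0 and dp[i][j] == dp[i-1][j-1] + (x[i-1] != y[j-1]):
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i-1][j] + 1:
            i -= 1
        else:
            j -= 1
    cells.append((0, 0))
    return dp[n][m], list(reversed(cells))
```

The same walk works for DTW and Needleman-Wunsch by replacing the per-cell check with the corresponding ranking function.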
dkohl, Hamburg, Germany