Wednesday, June 13, 2018

Character Wise Generation From The Little Book Of Calm

If you saw the show Black Books, you know the calming powers of The Little Book of Calm.
Surprisingly, the book actually exists. I extracted the text from the Kindle version and will now train an LSTM on it in order to generate an infinite stream of calming advice. The actual code can be found as a Gist.
For copyright reasons I cannot attach the text. So if you want to train this too, you need to buy the Kindle version.
I split the text into windows of 40 characters and train an LSTM to predict the following character. I use a bidirectional LSTM with a 128-dimensional cell state. The output mode of the LSTM is many-to-many: each input at each time step (the current character) has a corresponding output (the next character). The output is pushed through a dense softmax layer over the alphabet. The model definition is shown below:
 
from keras.models import Sequential
from keras.layers import LSTM, Bidirectional, Dense, TimeDistributed
from keras.optimizers import RMSprop

length = 40  # character window size
step   = 1   # stride between consecutive windows

model = Sequential()
# n_char is the size of the character vocabulary.
model.add(Bidirectional(LSTM(128, return_sequences=True), input_shape=(length, n_char)))
model.add(TimeDistributed(Dense(n_char, activation='softmax')))
optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
model.summary()

The training log is interesting. One can see that the LSTM gets better and better at generating text, which can be seen in the notebook. Below I listed some samples. For the sampling, we initialise the generation with a random sequence of 40 characters and let the LSTM generate the rest. The details of the sampling can be seen in the notebook. The text is not great, but one can see that the LSTM aims to generate sentences similar to the ones in the book. One might be able to improve its performance with more layers or more epochs.
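
The sampling itself follows the usual character-RNN recipe. A minimal sketch of the idea (the temperature value and helper names such as seed and idx_to_char are my own; the exact version is in the notebook):

import numpy as np

def sample(preds, temperature=1.0):
    # Sharpen or flatten the softmax output, then draw one character index.
    preds = np.log(np.asarray(preds, dtype='float64') + 1e-8) / temperature
    preds = np.exp(preds) / np.sum(np.exp(preds))
    return np.argmax(np.random.multinomial(1, preds, 1))

generated = ''
window = seed                                  # (length, n_char) one-hot matrix of the seed text
for _ in range(500):
    probs = model.predict(window[np.newaxis, :, :])[0]
    next_idx = sample(probs[-1])               # distribution at the last time step
    generated += idx_to_char[next_idx]
    one_hot = np.zeros((1, n_char))
    one_hot[0, next_idx] = 1.0
    window = np.vstack([window[1:], one_hot])  # slide the window one character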

However, since the model works on a character level, I am not unhappy with the results:


  1. "thes the upset child, just as a kiss or handshake takes the pleasures eale them to poins of almous to be calming end it carting dive you’ll find calm phosting work is acyies all little forgsimutarly aboveruts a phopees will tell you, nothed (which your speech trangess there are tho defullame and eyebrows phon places of person, joyous every meal with the fores or nacciols, and discontrom to the hear postrous into them most powerm, when they gretophsints to calm. sho exercises when they laverds with the concenturies an much you think as"
  2. "eel calm. recognise addictions for what your has a little choice. devers of a favour things which the worldor tak every moments worry habirothing – you half the way to feel. emplay of the lights – quiet – entoy not worrierate – beting worry king before you think about still, bow feelinggs there is enliet before yourself calm if you feel calm your earl goonicoly disary taskn take the time is efly. kink well assential astimugars asout peace – and have all tall you to can calm. the way you look at that your happenng, puts of orange blome"
  3. "e stand straighter and taller than you beliefs, you will fool your hand, mixitates even the back sede tans bighing take and the ennable treates the beauty the add to think worty as major troutten if you combscents are among the humone situations: you treat at attating them. trank und harty thing an ised. on the small issues them, on your ease a sea calm always stimulates things that as much you must’, ‘i subripress on a way of what your day – from will happen. life for have to live a timens ael add the con rowhing wholly on tat hapte,"
  4. "then you can be calm. sip warm water a glass than politens, put as you tonkelow yourself up. player. sel they is possible not to calm. then you connald ablity to live a cemplemants and englavoide. make to the benor effort into hands, not for twengy to the roother unlike in your diet, something with a peyture of peace and calm deind go little known bad compony all spech speeds can seritally when at a physictionel. your hearly, small issues them, soothes of a hand at also keolitas the moment your feet, then review then ptsic a fatticula"
  5. "learning something you want to know. and you hueld’ go limp. removeties it posttert in the worlding the way you look at then posittle with your attionshops a out the cl-fectly having a little controgises to be remote. by massal favour bady them. is thice as you go envements are then positive to be captivelity when it carnething to say your subconscious speative of remote those who trere you ueld it before yourself underful fact as the feelostrust in everything when you’re leave dowalk a little between what harts like go to bed. whish "
  6. "the moment when you concentrate your attention on a way attiony what it’s only of relaxing those calm, even up, by so fhols, behaw recognise them bach someone, alwings a calm people about beauty phor people als your thoughts, will take whice a groutts a calm person, a floathing down on a grass to help you know, by replyating widg a few chill, your speech, resubcents, you feel calm toom to your can most used the moments art teads. wear wearious from time to be a stimulations in whatever you grust – then reony something your faving to w"
  7. "light one stop early if you consciously start is to relaxed with your speech speeds. by a coach even the most commot flowsh and exprect. works bectionss your speech speeds of a field unting with the groours, but for the plears nalk eget to feel calm. rong with a masseur once that your face pheropen having to sal awhable place, a conscious to be captivelocar place, up them of a damplestilm of the espences and little not train. every sayols around the back ood be sad, them do. then massage them ind tee… as you busly petson time to time "
  8. "time in the world to do whatever you choose a feet and contribute to than wessed for yourself untethink, acchut for the recogail oilk, and the ritimeting snack being (iss than peritups, it can be pressude yourself underle – chrisical that suches of patting your efficienals at, pu… child, joyous pios from time to time little only a timend for hard-working postable full you work, uphilly to go know wholatge them. puss, pursom efficiently you achiely that happens interivation. its massage – then recom in your breet routine your typewrite"
  9. "even more peaceful. break the pattern when you dwell twan fow times on explcomess with nich and feel negative for the rescom only a time. change specially, then ete someon, treards to your faith noat as muchically time sometimes works mication, but for they like gardesly how – you wouldn’t normally your hail, relaxed. wearing a stimulation. be computer you niel. be absold of time, whaner always a compliment possible future, with your speech spee, minutes to times out of standing, deciduerble when the concedints are atssetic eation wit"
  10. "make an apapointment with yourself to deal with worries that as major stressed when you’re helaxed with a resalk post your breathing, when you’re lead to calm. changes on the pest the thumb are then massage them go them, taking pleaser the most more efficiently when a country worry beads. play , any time – hose negne a softet for deeps a chomples. take ba hothing them. then raised by your life relaxt to stayicher when you sach add there’s tense sire time eas a pew your day, cannot by where you wouldn’t normally thinks. peace will not b"

Downloading All NIPS Papers

Recently I wanted to download a large number of NIPS papers, so I decided to write a little web scraper in python. Not much more to say, [here] it is. There are two scripts: one downloading the papers and one downloading the abstracts. The script basically lists all books from the NIPS page, then lists all papers, and from each paper page it downloads the pdf or abstract.

I also put the script for the paper download below. In order to use it, you need the following libraries and Python 3:
 
from lxml import html
import requests
import urllib.request


BASE_URL = 'https://papers.nips.cc/'

# Collect the links to all books (one per NIPS edition).
page = requests.get(BASE_URL)
tree = html.fromstring(page.content)
books = [href.attrib['href'] for href in tree.xpath('//a') if 'book' in href.attrib['href']]


for book in books:
    # Each book page lists all papers of that edition.
    book_page = requests.get(BASE_URL + book)
    tree = html.fromstring(book_page.content)
    papers = [href.attrib['href'] for href in tree.xpath('//a') if 'paper' in href.attrib['href']]
    for paper in papers:
        # The paper page links to the pdf itself.
        paper_page = requests.get(BASE_URL + paper)
        tree = html.fromstring(paper_page.content)
        links = [href.attrib['href'] for href in tree.xpath('//a') if 'pdf' in href.attrib['href']]
        for link in links:
            local = link.split('/')[-1]  # save under the file name from the url
            urllib.request.urlretrieve(BASE_URL + link, local)


Sunday, March 4, 2018

Porting Keras Models


I implemented an LSTM in Java that can read a trained Keras model in order
to use it in some other Java projects. You can check the repository at [:Github:]

The ipython notebook exports the weights of an LSTM and also exports a test case:
a random input sequence and the resulting output. The Java implementation reads
the exported weights and then applies them. All matrix operations are implemented
using jBlas.

  • The numeric helper class implements matrix loading and slicing.
  • Tanh and the sigmoid functions are also in the helper class.
  • The LSTM state is a separate class holding the cell state and the hidden activation.
  • The LSTM holds all the weights and applies them to sequences.

The basic slicing of the weights is implemented using the same weight ordering as Keras [:Github:]:

        self.kernel_i = self.kernel[:, :self.units]
        self.kernel_f = self.kernel[:, self.units: self.units * 2]
        self.kernel_c = self.kernel[:, self.units * 2: self.units * 3]
        self.kernel_o = self.kernel[:, self.units * 3:]

        self.recurrent_kernel_i = self.recurrent_kernel[:, :self.units]
        self.recurrent_kernel_f = self.recurrent_kernel[:, self.units: self.units * 2]
        self.recurrent_kernel_c = self.recurrent_kernel[:, self.units * 2: self.units * 3]
        self.recurrent_kernel_o = self.recurrent_kernel[:, self.units * 3:]

        if self.use_bias:
            self.bias_i = self.bias[:self.units]
            self.bias_f = self.bias[self.units: self.units * 2]
            self.bias_c = self.bias[self.units * 2: self.units * 3]
            self.bias_o = self.bias[self.units * 3:]
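
For reference, the recurrence the Java port has to reproduce is compact. A numpy sketch of a single LSTM step with the gate ordering above (my own illustration, not code from the repository):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, kernel, recurrent_kernel, bias, units):
    # Pre-activations of all four gates, in Keras' i, f, c, o order.
    z = x @ kernel + h @ recurrent_kernel + bias
    i = sigmoid(z[:units])                # input gate
    f = sigmoid(z[units: units * 2])      # forget gate
    g = np.tanh(z[units * 2: units * 3])  # candidate cell state
    o = sigmoid(z[units * 3:])            # output gate
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new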

Tuesday, December 12, 2017

Implementing Gradient Boosted Trees

In this post I will share my experience on implementing gradient boosted trees as seen in XGBoost in order to get a better understanding of the method. As in previous posts, the code won't be as efficient as the original. Furthermore, I will use scala again. The basic idea behind boosting is to construct an additive model of a collection of simpler models (base models) for classification or regression. If you are unfamiliar with the general boosting idea, please consider (Hastie: The Elements of Statistical Learning) for a great introduction. This post is mainly based on this blog post and these slides.

In boosting, the model is constructed iteratively, that is, in an iteration $t$ the new output given a datapoint $x$ is: $$\hat{y}^{(t)} = \hat{y}^{(t - 1)} + f_t(x)$$ The idea of gradient boosting is that in each iteration the next model is determined so that it minimizes a given loss function $L$. Specifically, the minimization is performed using the first and second order derivatives of the loss function. Each added model can also be seen as a step in the direction of the gradient. In other words, the iterations can be seen as a form of gradient descent.

When extending this idea to regression tree base models and regularizing the objective function of the optimization, we end up with a powerful algorithm that won multiple Kaggle challenges as well as the Xing hosted recommender systems challenge two years in a row. In the next part, I will derive the optimization step used for decision trees with a cross entropy loss and then "massage" in the regularization.


Cross Entropy Loss for Binary Classification

For binary classification, the go-to output function is the sigmoid function: $$P(x = 1) = \frac{1}{1 + e^{-x}}$$ Prominent usage examples include neural networks and logistic regression. Given that our output $P(x = 1)$ is $\hat{y}$ and the true label is $y$, we can measure the goodness of our classifier on our dataset using the cross entropy loss: $$L(y, \hat{y}) = \sum_i -y_i \log(\hat{y_i}) - (1 - y_i) \log(1 - \hat{y_i})$$ with first derivative $g_i = \hat{y_i} - y_i$ and second derivative $h_i = \hat{y_i} (1 - \hat{y_i})$, both taken with respect to the raw prediction that goes into the sigmoid.
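
A quick finite-difference sanity check of the stated gradient (my own check in python, not part of the derivation):

import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
loss = lambda t, y: -y * np.log(sigmoid(t)) - (1 - y) * np.log(1 - sigmoid(t))

y, logit = 1.0, 0.3
numeric  = (loss(logit + 1e-6, y) - loss(logit - 1e-6, y)) / 2e-6
analytic = sigmoid(logit) - y
print(numeric, analytic)  # both approximately -0.4256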


Regularization

One of the ideas implemented in XGBoost is regularisation. Intuitively, we want to penalize large parameters that dominate our result as well as model complexity. Given a parameter set $\theta$, the L1 regularization is $\Omega(\theta) = \sum_i ||\theta_i||_1$, where $||x||_1$ is the L1 norm, or simply the absolute value. In L2 regularization we switch to $||\theta_i||_2^2$, which is simply $\theta_i^2$. For a decision tree, we can measure the complexity as the number of leaf nodes $T$, and the parameters $\theta$ are the output scores of the leaf nodes (we will later use regression trees). For example, we could penalize large trees by $\gamma T$ and large outputs by $\frac{1}{2} \lambda \sum_i ||\theta_i||_2^2$. We can control the penalty amount by the two parameters $\gamma$ and $\lambda$.
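
As a toy illustration, the penalty for a single tree under this scheme could be computed as follows (a sketch in python; leaf_scores holds the leaf outputs $w_j$):

def omega(leaf_scores, gamma, lam):
    # gamma * T  +  0.5 * lambda * sum_j w_j^2
    return gamma * len(leaf_scores) + 0.5 * lam * sum(w * w for w in leaf_scores)

# e.g. omega([0.2, -0.1, 0.4], gamma=1.0, lam=1.0)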


Gradient boosted trees

The combined objective using regularization and some loss function is: $$O = L(y, \hat{y}) + \Omega(\theta)$$ Since we are boosting, we define the objective in terms of the predictions in iteration $t$: $$O^t = L(y, \hat{y}^{(t - 1)} + f_t(x)) + \Omega(\theta)$$ where $\hat{y}^{(t - 1)} + f_t(x)$ follows directly from the boosting formula $\hat{y}^{(t)} = \hat{y}^{(t - 1)} + f_t(x)$.

Since we want to use the gradients in order to optimize $f_t(x)$, we can apply the Taylor expansion to the loss. Remember that the Taylor expansion is defined as $f(x + \delta x) \approx f(x) + f'(x) \delta x + \frac{1}{2} f''(x) \delta x^2$, so the loss becomes: $$O^t = \sum_i [L(y_i, \hat{y_i}^{(t - 1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^2] + \Omega(\theta)$$

Assuming a decision tree base classifier, we can use the example regularization above, and the objective becomes: $$O^t = \sum_i [L(y_i, \hat{y_i}^{(t - 1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^2] + \gamma T + \frac{1}{2} \lambda \sum_j w_j^2$$

With a few transformations, we can redefine the loss in a way we can directly optimize the function with the standard CART learning algorithm. First we define $\mathbb{I}(i \rightarrow j)$ as an indicator function that returns 1 if example $i$, when pushed through the decision tree, ends up in leaf node $j$. Then the result of $f_t(x_i)$ is $w_j$, the score of the leaf node that $x_i$ ends up in. Dropping the term $L(y_i, \hat{y_i}^{(t - 1)})$, which is constant in this iteration, we can rearrange the objective function over the leaf nodes: $$O^t = \sum_j [(\sum_i \mathbb{I}(i \rightarrow j) g_i) w_j + \frac{1}{2} (\sum_i \mathbb{I}(i \rightarrow j) h_i + \lambda) w_j^2] + \gamma T$$ Redefining the per-leaf sums of the gradients as $G_j$ and $H_j$, we get: $$O^t = \sum_j [G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2] + \gamma T$$

Now we can find the optimal leaf score from the gradient statistics by setting the derivative of the objective to zero: $$G_j + (H_j + \lambda) w_j = 0$$ and then $w_j = -\frac{G_j}{H_j + \lambda}$. Intuitively, each leaf models the gradient step to take for the inputs that fall into it, normalized by the second order statistics. So each leaf prediction pushes the total output of the ensemble further towards the wanted result.

In order to construct a tree, we can use the same gradient statistics to define a gain function and then learn the tree like a normal decision tree. The gradient "amplitude" before the split is $\frac{(G_l + G_r)^2}{H_l + H_r + \lambda}$ and after the split it is $\frac{G_l^2}{H_l + \lambda} + \frac{G_r^2}{H_r + \lambda}$, where $G_l$ and $H_l$ ($G_r$ and $H_r$) are the sums of the gradients and hessians in the left (right) child. The gain is then simply: $$Gain = \frac{1}{2} [\frac{G_l^2}{H_l + \lambda} + \frac{G_r^2}{H_r + \lambda} - \frac{(G_l + G_r)^2}{H_l + H_r + \lambda}] - \gamma$$ Now we can implement these functions:
package gbt

import math._

trait Loss {

  def gradient(truth: Double, prediction: Double): Double

  def hessian(truth: Double, prediction: Double): Double

  def loss(truth: Double, prediction: Double): Double

  def output(x: Double): Double

}


class Logistic extends Loss {

  private def sigmoid(x: Double): Double = 1.0 / (1.0 + exp(-x))

  override def loss(truth: Double, prediction: Double): Double = {
    val isOne  = -truth * log(sigmoid(prediction))
    val isZero = (1 - truth) * log(1.0 - sigmoid(prediction))
    (isOne - isZero)
  }

  override def gradient(truth: Double, prediction: Double): Double = sigmoid(prediction) - truth

  override def hessian(truth: Double, prediction: Double): Double =
    sigmoid(prediction) * (1.0 - sigmoid(prediction))

  override def output(x: Double) = sigmoid(x)

}


class Squared extends Loss {

  override def loss(truth: Double, prediction: Double): Double = pow(truth - prediction, 2.0)

  override def gradient(truth: Double, prediction: Double): Double = 2.0 * (prediction - truth)

  override def hessian(truth: Double, prediction: Double): Double = 2.0

  override def output(x: Double): Double = x

}


object LossUtils {

  def score(gradients: Vector[Double], hessians: Vector[Double], lambda: Double): Double = {
    // Optimal leaf weight: w_j = -G_j / (H_j + lambda)
    val G = gradients.sum
    val H = hessians.sum
    -G / (H + lambda)
  }

  def gain(
    gradients: Vector[Double],
    hessians: Vector[Double],
    left: Vector[Int],
    right: Vector[Int],
    lambda: Double,
    gamma: Double
  ): Double = {
    val GL = left.map(i => gradients(i)).sum
    val HL = left.map(i => hessians(i)).sum
    val GR = right.map(i => gradients(i)).sum
    val HR = right.map(i => hessians(i)).sum 
    0.5 * (
      (pow(GL, 2) / (HL + lambda)) + (pow(GR, 2) / (HR + lambda)) - (pow(GL + GR, 2) / (HL + HR + lambda))
    ) - gamma
  }

}
The full code can be found on github.

Wednesday, November 29, 2017

Refactoring My Previous Deep Learning Library Ideas

In some previous posts I looked at how I could implement a simple deep learning library to gain a better understanding of the inner workings of: Auto Differentiation, Weight Sharing and a good Back Propagation implementation from the deep learning book.

In this post I will present my ideas on a refactored version. The main point is that the graph representation I implemented in the first two posts made it easy to build the computation graph,
but the code to run back propagation, especially with weight sharing, was quite messy. So this refactored version first creates a computation tree as before, only that the actual computations are decoupled from the tree. The tree is later translated into the instructions and parent / child relationships used in the book version of the back propagation algorithm.

Now in the first step we write the model setup code, for example a neural network for mnist classification. The full listing is in the repository.

As in tensorflow, I define placeholders, variables and computation nodes, along with their names.
Furthermore, the variables get a shape and an initialisation method. In the background this method constructs a computation tree rooted at a5 using the following DSL.
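
Roughly, building the tree reads like this (a hypothetical sketch of such a DSL; the node and helper names are my own reconstruction, the real code is on github):

class Node:
    # Each node keeps a name, an instruction name, its children and an optional shape.
    def __init__(self, name, instruction, children=(), shape=None, init=None):
        self.name, self.instruction = name, instruction
        self.children, self.shape, self.init = list(children), shape, init

def placeholder(name, shape):       return Node(name, 'placeholder', shape=shape)
def variable(name, shape, init):    return Node(name, 'variable', shape=shape, init=init)
def node(name, instruction, *kids): return Node(name, instruction, children=kids)

x  = placeholder('x',  shape=(784, None))
w1 = variable('w1', shape=(128, 784), init='xavier')
b1 = variable('b1', shape=(128, 1),   init='zeros')
w2 = variable('w2', shape=(10, 128),  init='xavier')
b2 = variable('b2', shape=(10, 1),    init='zeros')

a1 = node('a1', 'multiply', w1, x)
a2 = node('a2', 'add',      a1, b1)
a3 = node('a3', 'sigmoid',  a2)
a4 = node('a4', 'multiply', w2, a3)
a5 = node('a5', 'add',      a4, b2)  # the computation tree is rooted at a5
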
Each node keeps track of its name, an instruction name and a shape, if needed.
We can then convert this tree into a neural network ready to be used in the back propagation code from the original post. We build four maps:
  1. Mapping from node names to computations (like add, multiply, ...)
  2. Mapping from node names to their parents node names
  3. Mapping from node names to their child node names
  4. Mapping from node names that are weights to matrices.
All computations such as add or multiply implement the forward pass through the variable 
as well as the backward pass.
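
The shape of such a computation, sketched here for a matrix multiply (an illustration of the idea, not the code from the repository):

import numpy as np

class Multiply:
    # Forward: remember the inputs and return the product.
    def forward(self, a, b):
        self.a, self.b = a, b
        return self.a @ self.b

    # Backward: given the gradient at the output, return the
    # gradients with respect to both parents.
    def backward(self, grad):
        return grad @ self.b.T, self.a.T @ grad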

Using a class I called TensorStore that maps from names to matrices, I implemented the forward and the backward pass as follows:
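
Roughly, the two passes walk the maps in topological order (again a hypothetical sketch; the exact implementation is on github):

def forward(store, computations, children, order):
    # Leaves (placeholders and weights) are preloaded into the store;
    # order lists the computation nodes children-first.
    for name in order:
        inputs = [store[child] for child in children[name]]
        store[name] = computations[name].forward(*inputs)

def backward(computations, children, grads, order):
    # grads is seeded with the loss gradient at the root node; accumulating
    # into the children also covers weight sharing (several parents per child).
    for name in reversed(order):
        child_grads = computations[name].backward(grads[name])
        for child, grad in zip(children[name], child_grads):
            grads[child] = grads.get(child, 0.0) + grad
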
The code can be found alongside the other versions on github. Examples for the spiral dataset as well as the mnist dataset are attached.

Of course the code is not at all up to the popular libraries in terms of speed or functionality, so for me it is more an ongoing investigation of how these libraries are implemented and what I can reproduce. :) On Mnist the given network achieves between 96 and 98% accuracy, depending on the starting condition.


Sunday, September 10, 2017

Variational Fashion Encoder

In this experiment I want to try out Variational Auto Encoders, an unsupervised feature learning algorithm, on a new fashion classification dataset from Zalando. Since I wanted to play around with both more, I will build a generic Variational Auto Encoder and then train it on the new fashion-mnist dataset.
Code can be found on github.


Variational Auto Encoder

The bottom part of the model embeds the input X into a mean and a variance vector.
The mean and variance represent the parameters of a Gaussian that is trained to be close to a standard normal distribution N(0, I). The decoder network tries to reconstruct the input from a sample from said distribution. The encoder as well as the decoder network can have multiple layers, in my case 3 each.


The loss function is a combination of the reconstruction error and the Kullback-Leibler divergence between the distribution from the encoder network and a standard normal distribution, basically centering the latent space around 0 with standard deviation 1. This can be seen as a form of regularisation. Furthermore, the decoder
does not reconstruct directly from the encoder output but from a sample of the distribution in the latent space.
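
The sampling step and the KL term look roughly as follows (a numpy sketch of the usual VAE formulation; z_mean and z_log_var stand for the encoder outputs, the names are mine):

import numpy as np

def sample_z(z_mean, z_log_var):
    # Reparameterization trick: sample eps ~ N(0, I) and shift / scale it,
    # so gradients can flow through z_mean and z_log_var.
    eps = np.random.normal(size=z_mean.shape)
    return z_mean + np.exp(0.5 * z_log_var) * eps

def kl_loss(z_mean, z_log_var):
    # KL( N(mean, var) || N(0, I) ), summed over the latent dimensions.
    return -0.5 * np.sum(1 + z_log_var - np.square(z_mean) - np.exp(z_log_var), axis=-1)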

Fashion Reconstruction

The images in the fashion dataset are 28 by 28 pixels, in the same way as the MNIST dataset.
The fashion data consists of 10 classes of clothes:
  1. tshirt
  2. trouser
  3. pullover
  4. dress
  5. coat
  6. sandal
  7. shirt
  8. sneaker
  9. bag
  10. ankle boot
After training for 5 epochs on the training set, I plotted the reconstructions from the Variational Auto Encoder.

As one can see, the reconstructions work quite well, indicating that the learned latent space is useful.

Tuesday, August 1, 2017

Investigating Non Linear Optimization in Python

I recently skimmed through some class videos by Tucker Balch, Professor at Georgia Tech, called "Machine Learning For Trading". One interesting part was portfolio optimisation.
That is, if you want to invest some amount of money, how do you distribute it
over a selection of assets in a way that minimises risk while at the same time
letting you earn money through your investment? A portfolio is basically an allocation of the money into different options, or in other words, the percentage of the total amount of money allocated to an option. Anyway, in the class as well as in the course book, the optimisation is performed using a
python method for constrained non-linear optimisation. The algorithm used to crunch these problems is called Sequential Least Squares Programming, which is implemented as a wrapper around fortran code.
The original algorithm can be found in the ACM Algorithms catalog (Algorithm 733),
written by D. Kraft. The code is automatically translated using the fortran to python (f2py)
library.
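
A minimal example of this kind of constrained optimisation with scipy's SLSQP method (toy returns and a Sharpe-ratio objective of my own choosing, not the course's exact code):

import numpy as np
from scipy.optimize import minimize

# Toy daily returns for 4 assets; in practice these come from price data.
np.random.seed(0)
returns = np.random.normal(0.0005, 0.01, size=(250, 4))

def neg_sharpe(weights):
    # Minimising the negative Sharpe ratio maximises the Sharpe ratio.
    portfolio = returns @ weights
    return -portfolio.mean() / portfolio.std()

constraints = ({'type': 'eq', 'fun': lambda w: w.sum() - 1.0},)  # fully invested
bounds = [(0.0, 1.0)] * 4                                        # no short selling

result = minimize(neg_sharpe, x0=np.ones(4) / 4, method='SLSQP',
                  bounds=bounds, constraints=constraints)
print(result.x)  # the optimised allocation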