Saturday, July 28, 2018

Simple Gaussian Processes For Regression

In this post I am implementing a simple Gaussian Process Regressor. A Gaussian Process is a
probabilistic kernel model. Basically I am following the "Machine Learning: A probabilistic Perspective" book by Kevin Murphy.

The Basics

For an input $x_i$ and output $y_i = f(x_i)$, infer a distribution over functions of the data:
$p(f | X, y)$. In order to make predictions $y*$ for a new data point $x*$ given all the previous observations, we define:

$p(y*|x*, X, y) = \int p(y*|f, x*) p(f|x, y) df$

For our data $X = x_1 ... x_n$ the Gaussian process assumption is that the the distributions over functions $p(f_1 ... f_N)$ is jointly Gaussian with mean $\mu(x)$ and a covariance function $\Sigma_{ij} = k(x_i, x_j)$ and a kernel $k$.

Covariance structure

It is quite common to leaf the mean function at zero. Given a kernel, we need to compute the covariance function between all training and test instances. However, we structure the kernel matrix in the following way. 

The sub-matrix $K$ describes the affinity between all training instances and the sub-matrix $K*$ describes the affinity between training and test instances. Finally, the submatrix $K**$ describes the affinity between test instances to itself. The image above shows the final covariance structure. With that, our function $f*$ can be inferred as: $(f, f*) ~ N(0, \Sigma)$.

In order to obtain the prediction calculate: $f* = K_*  \bullet K^{-1} \bullet f$.
If we want the find the variance of the predictions we calculate: $trace(K_{**} - K_*  \bullet K^{-1} \bullet  K_*^T)$

In my opinion the method is very similar to support vector machines and to instance beased learning methods in general. The difference is that a SVM is only using a small set of training examples called support vectors, while the Gaussian process compares every example to all others in order to make a prediction.

A Naive Gaussian Process

Below I implemented a naive Gaussian process using numpy. Basically, I am building the three components of the covariance structure and then solve the prediction equations above. I call my method naive since computing the inverse has a worst case complexity of $O(n^3)$
So it is not advisable to use this method for larger datasets.
class GaussianProcess:
    def __init__(self, kernel):
        self.kernel = kernel
    def predict(self, train_x, train_y, test_x):
        (n, _) = train_x.shape
        (m, _) = test_x.shape
        covar_train      = np.zeros((n, n))
        covar_train_test = np.zeros((n, m))
        covar_test       = np.zeros((m, m))
        for i in range(0, n):
            for j in range(0, n):
                covar_train[i, j] = self.kernel.k(train_x[i], train_x[j], i == j)
        for i in range(0, m):
            for j in range(0, m):
                covar_test[i, j] = self.kernel.k(test_x[i], test_x[j], i == j)
        for i in range(0, n):
            for j in range(0, m):
                covar_train_test[i, j] = self.kernel.k(train_x[i], test_x[j], False)
        covar_inv  = np.linalg.inv(covar_train)
        prediction =, covar_inv), train_y)
        confidence = covar_test -,, covar_train_test))
        confidence = np.sqrt(np.array([confidence[i, i] for i in range(0, m)]))
        return prediction, confidence 

Squared Exponential Kernel

For my experiments I use a squared exponential kernel and also add a variance parameter for for noisy observations:

$k(x_i, x_j) = \sigma_f^2 e^{-\frac{1}{2l^2} || x_i - x_j ||_2^2} + \sigma_y^2 \mathbb{I}(i == j)$

The parameters are as follows:
  • $\sigma_f^2$: Variance giving the scale of the output
  • $\sigma_y^2 \mathbb{I}(i == j)$: Noise variance applied to diagnoal
  • $l$: Scale on input
again, the implementation is quite simple in numpy.
class SquaredExponential:

    def __init__(self, variance, alpha, noise):
        self.alpha = alpha
        self.variance = variance
        self.noise = noise
        self.scaler = -1.0 / (2 * alpha**2)
    def k(self, a, b, on_diag):
        error  = np.sum(np.square(a - b))
        f      = np.exp(self.scaler * error)
        scaled = self.variance * f
        if on_diag:
            scaled += self.noise
        return scaled   

Results on some simple example datasets

I tried my method on three datasets, 2 times, I am intepolating stock marktet data and one time a sine wave. Each time, the index is the input to the method and the output for the stock market prediction is the closing price.
Sine Wave Interpolation:

Stock Interpolation Dax:

Stock Interpolation US:

Saturday, July 21, 2018

Financial Machine Learning Experiement

As a fun project, I tried to predict the stock market and implemented a small library todo so.
Basically, I want to use a sliding window of historic prices of several companies or funds and
predict the price of a single company. In the end, I achieve decent results in a 10 day window and
predicting 5 days into the future. However, my error is still several euros large :)

The code can be found here: [:Github:]

Learning Problem, Features and Modeling

First we define the return of interest (ROI) as:

 $roi(t, t + 1) = \frac{x_t - x_{t + 1}}{x_t} $

Which is the percentage of earnings at a time $t + 1$ measured to a previous investment
at $t$. In order to extract the features, we compute sliding windows over our historic prices.
For each sample in a window, we compute the ROI to the start of the window, which represents
the earnings to the start of the window. For a window of $T$ steps and $n$ stocks, we flatten the sliding windows and get a feature vector of size $T x n$ of ROI entries. Our target variable we want
to predict is a stock price $d$ days after the last day of the window. The traget is converted
to the ROI, too. So we try to predict the earnings after $d$ days after the last day of the window
which represents a potential investment.

Red: roi in the window, Blue: ROI for prediction

Some Experimental Results

First we downloaded several historic price datasets from yahoo finance and load them using pandas. We take the date column as the index and interpolate missing values:
data = [
    ('euroStoxx50', pd.read_csv('data/stoxx50e.csv', index_col=0, na_values='null').interpolate('linear')),
    ('dax',         pd.read_csv('data/EL4A.F.csv',   index_col=0, na_values='null').interpolate('linear')),
    ('us',          pd.read_csv('data/EL4Z.F.csv',   index_col=0, na_values='null').interpolate('linear')),
    ('xing',        pd.read_csv('data/O1BC.F.csv',   index_col=0, na_values='null').interpolate('linear')),
    ('google',      pd.read_csv('data/GOOGL.csv',    index_col=0, na_values='null').interpolate('linear')),
    ('facebook',    pd.read_csv('data/FB2A.DE.csv',  index_col=0, na_values='null').interpolate('linear')),
    ('amazon',      pd.read_csv('data/AMZN.csv',     index_col=0, na_values='null').interpolate('linear'))

We learn the following classifier on our data.
predictors = [
    ('RF', RandomForestRegressor(n_estimators=250)),
    ('GP', GaussianProcessRegressor(kernel=RBF(length_scale=2.5))),
    ('NN', KNeighborsRegressor(n_neighbors=80)),
    ('NE', KerasPredictor(model, 10, 512, False)),
    ('GB', GradientBoostingRegressor(n_estimators=250))

The neural network's architecture (named model above):
hidden = [256, 128, 64, 32]
inp    = (len(data.stoxx) * WIN,)
model  = Sequential()
model.add(Dense(hidden[0], activation='relu', input_shape=inp))
for h in hidden[1:]:
    model.add(Dense(h, activation='relu'))
model.add(Dense(1, activation='linear'))
model.compile('adam', 'mse')

Below we show some prediction results for our classifiers.
We also not the root mean square error in euros. 

Monday, June 25, 2018

Keras: Convolutional LSTM

Stacking recurrent layers on top of convolutional layers can be used to generate sequential output (like text) from structured input (like images or audio) [1].

If we want to stack an LSTM on top of a convolutional layers, we can simply do so, but we need to
reshape the output of the convolutions to the LSTM's expected input. The code below shows an implementation in Keras:
T = 128
D = 64

LSTM_T     = int(T / 8)
LSTM_D     = int(D / 2)
POOL       = (8, 2)

i = Input(shape=(T, D, 1))                         # (None, 128, 64, 1)   
x = Conv2D(N_FILTERS, (3, 3), padding = 'same')(i) # (None, 128, 64, 32)
x = MaxPooling2D(pool_size=POOL)(x)                # (None, 16, 32, 32) 
x = Reshape((LSTM_T, LSTM_D * N_FILTERS))(x)       # (None, 16, 1024) 
x = LSTM(LSTM_STATE, return_sequences=True)(x)     # (None, 16, 128)     

model = Model(i, x)

In this example we want to learn the convolutional LSTM on sequences of length 128 with 64 dimensional samples. The first layer is a convolutional layer with 32 filters. So the outputs are 32
sequences, one for each filter. We pool the sequences with a (8, 2) window. So our 32 sequences are now of size (128 / 8 = 16,  64 / 2 = 32). Now we have to combine the dimensions and the filter responses into a single dimension of size (32 * 32 = 1024) so we can feed a sequence into the LSTM which requires a rank 2 ( or 3 with batch) tensor with the first dimension being the time step and the second each frame. Finally we add the LSTM layer.

Wednesday, June 13, 2018

Character Wise Generation From The Little Book Of Calm

If you saw the show black books, you saw the calming powers of the little book of calm.
Surprisingly the book actually exists. I extracted the text from the Kindle version and will now train an LSTM with it in order to generate an infinity stream of calming advice. The actual code can be found as a Gist.
For copyright reasons I can not attach the text. So if you want to train this too, you need to buy the Kindle version.
I split all the text into windows of 40 characters and train an LSTM to predict the following character. I use a bidirectional LSTM with a 128 dimensional cell-state. The output mode for the LSTM is many-2-many, which means each input at each time step (the current character) also has an output (the next character). The output is pushed through a dense softmax layer. The LSTM is shown below:
length = 40
step   = 1
model = Sequential()
model.add(Bidirectional(LSTM(128, input_shape=(length, n_char), return_sequences=True), input_shape=(length, n_char)))
model.add(TimeDistributed(Dense(n_char, activation='softmax')))
optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

The Training log is interesting. One can see that the LSTM gets better and better at generation text which can be seen in the notebook. Below I listed some samples. For the sampling, we initialise the generation with a random sequence of 40 characters and let the LSTM generate the rest. The details of the sampling can be seen in the notebook. The text is not great but one can see that the LSTM aims to generate sentences similar to the ones in the book. One might be able to improve it's performance with more layers or more epochs.

However, since the model works on a character level, I am not unhappy with the results:

  1. "thes the upset child, just as a kiss or handshake takes the pleasures eale them to poins of almous to be calming end it carting dive you’ll find calm phosting work is acyies all little forgsimutarly aboveruts a phopees will tell you, nothed (which your speech trangess there are tho defullame and eyebrows phon places of person, joyous every meal with the fores or nacciols, and discontrom to the hear postrous into them most powerm, when they gretophsints to calm. sho exercises when they laverds with the concenturies an much you think as"
  2. "eel calm. recognise addictions for what your has a little choice. devers of a favour things which the worldor tak every moments worry habirothing – you half the way to feel. emplay of the lights – quiet – entoy not worrierate – beting worry king before you think about still, bow feelinggs there is enliet before yourself calm if you feel calm your earl goonicoly disary taskn take the time is efly. kink well assential astimugars asout peace – and have all tall you to can calm. the way you look at that your happenng, puts of orange blome"
  3. "e stand straighter and taller than you beliefs, you will fool your hand, mixitates even the back sede tans bighing take and the ennable treates the beauty the add to think worty as major troutten if you combscents are among the humone situations: you treat at attating them. trank und harty thing an ised. on the small issues them, on your ease a sea calm always stimulates things that as much you must’, ‘i subripress on a way of what your day – from will happen. life for have to live a timens ael add the con rowhing wholly on tat hapte,"
  4. "then you can be calm. sip warm water a glass than politens, put as you tonkelow yourself up. player. sel they is possible not to calm. then you connald ablity to live a cemplemants and englavoide. make to the benor effort into hands, not for twengy to the roother unlike in your diet, something with a peyture of peace and calm deind go little known bad compony all spech speeds can seritally when at a physictionel. your hearly, small issues them, soothes of a hand at also keolitas the moment your feet, then review then ptsic a fatticula"
  5. "learning something you want to know. and you hueld’ go limp. removeties it posttert in the worlding the way you look at then posittle with your attionshops a out the cl-fectly having a little controgises to be remote. by massal favour bady them. is thice as you go envements are then positive to be captivelity when it carnething to say your subconscious speative of remote those who trere you ueld it before yourself underful fact as the feelostrust in everything when you’re leave dowalk a little between what harts like go to bed. whish "
  6. "the moment when you concentrate your attention on a way attiony what it’s only of relaxing those calm, even up, by so fhols, behaw recognise them bach someone, alwings a calm people about beauty phor people als your thoughts, will take whice a groutts a calm person, a floathing down on a grass to help you know, by replyating widg a few chill, your speech, resubcents, you feel calm toom to your can most used the moments art teads. wear wearious from time to be a stimulations in whatever you grust – then reony something your faving to w"
  7. "light one stop early if you consciously start is to relaxed with your speech speeds. by a coach even the most commot flowsh and exprect. works bectionss your speech speeds of a field unting with the groours, but for the plears nalk eget to feel calm. rong with a masseur once that your face pheropen having to sal awhable place, a conscious to be captivelocar place, up them of a damplestilm of the espences and little not train. every sayols around the back ood be sad, them do. then massage them ind tee… as you busly petson time to time "
  8. "time in the world to do whatever you choose a feet and contribute to than wessed for yourself untethink, acchut for the recogail oilk, and the ritimeting snack being (iss than peritups, it can be pressude yourself underle – chrisical that suches of patting your efficienals at, pu… child, joyous pios from time to time little only a timend for hard-working postable full you work, uphilly to go know wholatge them. puss, pursom efficiently you achiely that happens interivation. its massage – then recom in your breet routine your typewrite"
  9. "even more peaceful. break the pattern when you dwell twan fow times on explcomess with nich and feel negative for the rescom only a time. change specially, then ete someon, treards to your faith noat as muchically time sometimes works mication, but for they like gardesly how – you wouldn’t normally your hail, relaxed. wearing a stimulation. be computer you niel. be absold of time, whaner always a compliment possible future, with your speech spee, minutes to times out of standing, deciduerble when the concedints are atssetic eation wit"
  10. "make an apapointment with yourself to deal with worries that as major stressed when you’re helaxed with a resalk post your breathing, when you’re lead to calm. changes on the pest the thumb are then massage them go them, taking pleaser the most more efficiently when a country worry beads. play , any time – hose negne a softet for deeps a chomples. take ba hothing them. then raised by your life relaxt to stayicher when you sach add there’s tense sire time eas a pew your day, cannot by where you wouldn’t normally thinks. peace will not b"

Downloading All NIPS papers

Recently I wanted to download a large amount of NIPS papers, so I decided to write a little web - scraper in python. Not much more to say, [here] it is. There are two scripts. One downloading the papers and one downloading the abstracts. The script basically lists all books from the nips page,
then lists all papers and from the paper page, it downloads the pdfs or abstract.

I also put the script for the paper download, below. In order to use it, you need the following libraries and python 3:
from lxml import html
import requests
import urllib.request

page = requests.get(BASE_URL)
tree = html.fromstring(page.content)
books = [ href.attrib['href'] for href in tree.xpath('//a') if 'book' in href.attrib['href']]

for book in books:
    book_page = requests.get(BASE_URL + book)
    tree = html.fromstring(book_page.content)
    papers = [ href.attrib['href'] for href in tree.xpath('//a') if 'paper' in href.attrib['href']]
    for paper in papers:
        paper_page = requests.get(BASE_URL + paper)        
        tree = html.fromstring(paper_page.content)
        links = [ href.attrib['href'] for href in tree.xpath('//a') if 'pdf' in href.attrib['href']]
        for link in links:
            local = link.split('/')[-1]
            urllib.request.urlretrieve(BASE_URL + link, local)

Sunday, March 4, 2018

Porting Keras Models

I implemented an LSTM in Java that can read a trained Keras models in order
to use it in some other Java projects. You can check the repository at [:Github:]

The ipython notebook exports the weights of an LSTM
and also exports some test for a random input sequence and the resulting output.
The Java implementation reads the exported weights and then applies them. All
matrix operations are implemented using jBlas.

  • The numeric helper class implements matrix loading and slicing.
  • Tanh and the sigmoid functions are also in the helper class.
  • The LSTM state is a separate class holding the cell state and the hidden activation.
  • The LSTM holds all the weights and applies them to sequences.

The basic slicing of the weights is implemented using the same weight
coding as Keras [:Github:]

        self.kernel_i = self.kernel[:, :self.units]
        self.kernel_f = self.kernel[:, self.units: self.units * 2]
        self.kernel_c = self.kernel[:, self.units * 2: self.units * 3]
        self.kernel_o = self.kernel[:, self.units * 3:]

        self.recurrent_kernel_i = self.recurrent_kernel[:, :self.units]
        self.recurrent_kernel_f = self.recurrent_kernel[:, self.units: self.units * 2]
        self.recurrent_kernel_c = self.recurrent_kernel[:, self.units * 2: self.units * 3]
        self.recurrent_kernel_o = self.recurrent_kernel[:, self.units * 3:]

        if self.use_bias:
            self.bias_i = self.bias[:self.units]
            self.bias_f = self.bias[self.units: self.units * 2]
            self.bias_c = self.bias[self.units * 2: self.units * 3]
            self.bias_o = self.bias[self.units * 3:]

Tuesday, December 12, 2017

Implementing Gradient Boosted Trees

Implementing Gradient Boosted Trees

In this post I will share my experience on implementing gradient boosted trees as seen in XGBoost in order to get a better understanding of the method. As in previous posts, the code won't be as efficient as the original. Furthermore, I will use scala again. The basic idea behind boosting is to construct an additive model of a collection of simpler models (base models) for classification or regresion. If you are unfamiliar with the general boosting idea plase consider (Hastie: Elements of Statistics Learning) for a great introduction. This post is mainly based on this blog post and these slides. In boosting, the model is constructed iteratively that is, in an iteration $t$, the new output given a datapoint $x$ is: $$ \hat{y}^t = \hat{y}^{(t - 1)} - f_t(x) $$ The idea of gradient boosting is that in each iteration the next model is determined so it minimizes a given loss function $\mathbb{L}$. Specifically the minimization is performed using the first and second order derivatives of the loss function. Each added model can also be seen as a step in the direction of the gradient. In other words, the iterations can be seen as a form of gradient descent. When extending this idea to regression tree base models and when regularizing the objective function of the optimization, we end up with a powerful algorithm that one multiple Kaggle challenges as well as the Xing hosted recommender systems challenge to years in a row. In the next part, I will derive the optimization step used for decision trees with a cross entropy loss and then "massage" in the regularization.

Cross Entropy loss for Binay Classification

For binary classification, the go-to output function is the sigmoid function: $$P(x = 1) = \frac{1}{1 = e^{-x}}$$ Prominent usage examples include Neural Networks and logistic regressions. Given that our output $P(x = 1)$ is $\hat{y}$ and the true label is $y$, we can measure the goodness of our classifier on our dataset using the cross entropy loss: $$ L(y, \hat{y}) = \sum_i -y_i * log(\hat{y_i}) - (1.0 - y_i) * log(1.0 - \hat{y_i})$$ with its first derivative at $g_i = \hat{y_i} - y_i$ and it's second derivative at $h_i = \hat{y_i} * (1.0 - \hat{y_i})$.


One of the ideas implemented in XGBosst is regularisation. Intuitavely we want to penalize large parameters that dominate our result and model complexity. Given a parameter set $\theta$, the L1 regularization is $\Omega(\theta) = \sum_i ||\theta_i||_1$, where $||x||_1$ is the L1 norm or simply the absolute value. In L2 regularization, we switch to $||\theta_i||_2$ where $||x||_2$ is $x^2$. Then for a decision tree, we could measure the complexity as the number of leaf nodes $T$, and the parameters $\theta$ are the output scores of a leaf nodes (we will later use regression trees). For example, we could penalize large trees by $\gamma T$ and large output by $\frac{1}{2}\sum_i \lambda ||\theta_i||_2$. We can control the penalty amount by two parameters $\gamma$ and $\lambda$.

Gradient boosted trees

The combined objective using regularization and some loss function is: $$O = L(y, \hat{y}) + \Omega(\theta)$$ Since we are boosting, we define the objective in terms of the predictions in iteration $t$: $$O^t = L(y, \hat{y}^{t - 1} + f(x)) + \Omega(\theta)$$ while $\hat{y}^{t - 1} + f(x)$ follows directly from the boosting formular $\hat{y}^t = \hat{y}^(t - 1) - f_t(x)$. Since we want to define the gradient in order to optimize f(x), we can apply the Taylor expansion to the loss. Remember that the Taylor expansion is defined as $f(x + \delta x) \approx f(x) + f'(x) * \delta x + \frac{1}{2} f''(x) \delta x^2$ So the loss becomes $$O^t = (L(y, \hat{y}^{t - 1}) + g_i * f(x) + \frac{1}{2} * h_i * f(x)^2) + \Omega(\theta))$$ Assuming a decision tree base classifier we can use the example regularization above and the objective becomes: $$O^t = (L(y, \hat{y}^{t - 1}) + g_i * f(x) + \frac{1}{2} * h_i * f(x)^2) + \gamma T \frac{1}{2}\sum_i \lambda * ||\theta_i||_2$$ With a few transformations, we can redefine the loss in a way we can directly optimize the function with the standard CART learning algorithm. First we defined $\mathbb{I}(j == i)$ as an indicator function that returns 1 if example $i$ when pushed trough the decision tree, ends up in leaf node $j$. Then the result of $f(x_i)$ is $w_j$ the leaf node score at node $j$ ig $x_i$ evaluates true. Now we can rearange the objective function towards the leaf nodes as: $$O^t = \sum_j [(\sum_i \mathbb{I}(j == i) * g_i) * w_j + \frac{1}{2} (\sum_i \mathbb{I}(j == i) * h_i + \lambda) * w_j^2] + \gamma T$$ redefining the sums of the gradients to $G_j$ and $H_j$ we get: $$O^t = \sum_j [G_j * w_j + \frac{1}{2} (H_j + lambda) * w_j^2] + \gamma T$$ Now we can update a leaf node using the gradient statistics by setting the derivative of the objective to zero: $$Gj + (H_j + \lambda) = 0$$ and then $w_j = -\frac{G_j}{H_j + \lambda}$. Intuitively, we model the gradient to take given an input in each leaf node. And these gradients are normalized by their second order statistics. So each leaf prediction will push the total function output of the ensemble more towards the wanted result. In order to construct a tree, we can use the same gradients to define a gain function and then learn the trees like a normal decision tree. The formular is the gradient "amplitude" before the split $(G_l + G_h)^2 / (H_l + H_r + \lambda)$ and after the split $G_l^2 / (H_l + \lambda)$, where $G_l$ and $H_l$ are the sum of the gradients in the left node or the right node. Then the gain is simply: $$Gain = \frac{1}{2} [ \frac{G_l^2}{H_l + \lambda} + \frac{G_r^2}{H_r + \lambda} - \frac{(G_l + G_h)^2}{H_l + H_r + \lambda}]$$ Now we can implement these functions:
package gbt

import math._

trait Loss {

  def gradient(truth: Double, prediction: Double): Double

  def hessian(truth: Double, prediction: Double): Double

  def loss(truth: Double, prediction: Double): Double

  def output(x: Double): Double


class Logistic extends Loss {

  private def sigmoid(x: Double): Double = 1.0 / (1.0 + exp(-x))

  override def loss(truth: Double, prediction: Double): Double = {
    val isOne  = -truth * log(sigmoid(prediction))
    val isZero = (1 - truth) * log(1.0 - sigmoid(prediction))
    (isOne - isZero)

  override def gradient(truth: Double, prediction: Double): Double = sigmoid(prediction) - truth

  override def hessian(truth: Double, prediction: Double): Double =
    sigmoid(prediction) * (1.0 - sigmoid(prediction))

  override def output(x: Double) = sigmoid(x)


class Squared extends Loss {

  override def loss(truth: Double, prediction: Double): Double = pow(truth - prediction, 2.0)

  override def gradient(truth: Double, prediction: Double): Double = 2.0 * (prediction - truth)

  override def hessian(truth: Double, prediction: Double): Double = 2.0

  override def output(x: Double): Double = x


object LossUtils {

  def score(gradients: Vector[Double], hessians: Vector[Double], lambda: Double): Double = {
    val G = gradients.sum
    val H = hessians.sum 
    - G / (H + lambda)

  def gain(
    gradients: Vector[Double],
    hessians: Vector[Double],
    left: Vector[Int],
    right: Vector[Int],
    lambda: Double,
    gamma: Double
  ): Double = {
    val GL = => gradients(i)).sum
    val HL = => hessians(i)).sum
    val GR = => gradients(i)).sum
    val HR = => hessians(i)).sum 
    0.5 * (
      (pow(GL, 2) / (HL + lambda)) + (pow(GR, 2) / (HR + lambda)) - (pow(GL + GR, 2) / (HL + HR + lambda))
    ) - gamma

The full code can be found on github