tag:blogger.com,1999:blog-10405454173016461132024-03-18T20:36:14.279-07:00Daniel@World{Machine Learning, Data Science, Computer Vision} in the trenchesdkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.comBlogger65125tag:blogger.com,1999:blog-1040545417301646113.post-11816242488384646882020-09-27T09:49:00.006-07:002020-09-27T09:49:55.607-07:00Implementing Clustering Loss For Deep Learning in Tensorflow 2.0<h4 style="text-align: left;"><br /></h4><h4 style="text-align: left;"><span style="font-weight: normal;">I recently found an interesting new loss function that can be used for clustering [1]. The idea is to build a neural network with a softmax output and train it with the loss below. </span></h4><h4 style="text-align: left;"><br /><span style="font-weight: normal;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0kIaESjttuUr6Rmwy_FBrhyphenhyphentDctWgasjLsWaJ3BW-nIG6Tb3-1iqYignJEtZpssok8IHKAWKrfLnPk6Wk44XaDNGEHTrFklXLBe9dr126KpEBJrS3NN982kaKqM-pARpfRK_I50J8yPo/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="198" data-original-width="896" height="89" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0kIaESjttuUr6Rmwy_FBrhyphenhyphentDctWgasjLsWaJ3BW-nIG6Tb3-1iqYignJEtZpssok8IHKAWKrfLnPk6Wk44XaDNGEHTrFklXLBe9dr126KpEBJrS3NN982kaKqM-pARpfRK_I50J8yPo/w400-h89/Screen+Shot+2020-09-27+at+6.42.11+PM.png" width="400" /></a></div><br /></span></h4><div>Each example <i>x<sub>i</sub></i> in a batch is first embedded as <i>f(x<sub>i</sub>)</i> and then pushed through the softmax. The first term is the average entropy of these per-example softmax probabilities; minimizing it encourages a confident assignment of each example to a single cluster.</div><div>The second term is the entropy of the average probability of each cluster over the complete batch. Since it enters the loss with a negative sign, it is maximized, which ensures that all clusters are used and discourages the assignment of all instances to the same cluster.
Below is my implementation of the loss function in TensorFlow.</div><div><br /></div><div><br /></div><div><br /></div>
<pre class="brush:python">
import tensorflow as tf
from tensorflow.keras.losses import Loss


class ClusteringLoss(Loss):

    def __init__(self, gamma=1.0):
        super().__init__()
        self.gamma = gamma

    def call(self, true, pred):
        # Add a small constant to avoid log(0)
        y = pred + 1e-6
        # H1: average entropy of the per-example softmax outputs,
        # encourages confident assignment to a single cluster
        h1 = -1.0 * tf.reduce_sum(tf.math.log(y) * y, axis=-1)
        H1 = tf.math.reduce_mean(h1)
        # H2: entropy of the average assignment over the batch,
        # encourages usage of all clusters
        total_y = tf.math.reduce_mean(y, axis=0)
        H2 = -1.0 * tf.reduce_sum(tf.math.log(total_y) * total_y)
        return H1 - self.gamma * H2
</pre>
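To build some intuition for the two entropy terms, here is a small plain-Python sketch (independent of the TensorFlow class above; the toy batches and helper names are my own, not from the paper). A batch of confident assignments spread over both clusters gets a lower loss than a batch of uniform, undecided assignments:

```python
import math

def entropy(p):
    # Shannon entropy of a probability vector (natural log)
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def clustering_loss(batch, gamma=1.0):
    # H1: average entropy of the per-example softmax outputs
    h1 = sum(entropy(p) for p in batch) / len(batch)
    # H2: entropy of the average assignment over the batch
    avg = [sum(p[k] for p in batch) / len(batch) for k in range(len(batch[0]))]
    return h1 - gamma * entropy(avg)

# Confident assignments spread over both clusters: low H1, high H2 -> low loss
good = [[0.99, 0.01], [0.01, 0.99]]
# Undecided assignments: high H1 -> higher loss
bad = [[0.5, 0.5], [0.5, 0.5]]
print(clustering_loss(good) < clustering_loss(bad))  # True
```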
<h4 style="text-align: left;">REFERENCES</h4><div>[1] Jansen et al.: "Coincidence, Categorization, and Consolidation: Learning to Recognize Sounds with Minimal Supervision", ICASSP 2020.</div>dkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.com0Bremen, Germany53.079296199999987 8.8016936-23.939518639191633 -131.8233064 90 149.4266936tag:blogger.com,1999:blog-1040545417301646113.post-3504922668770016452020-03-12T04:49:00.002-07:002020-03-12T04:49:48.429-07:00Recursive Auto Encoders in Tensorflow 2.0In my <a href="https://daniel-at-world.blogspot.com/2019/09/recursive-neural-networks.html">last post</a>, I described recursive neural networks and how they can be used in PyTorch.<br />
<div>
In this post, I will describe the recursive autoencoder and my implementation in TensorFlow 2.0 using the new eager execution functionality and the gradient tape.</div>
<div>
</div>
<h3>
What is a recursive autoencoder? </h3>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXtQ-NNZsWiWls8kP6SVFZA_V9favwOnAFnE89SugKUfyeTvLs5V26gAalhsQ01xwlBRQHtfhIVsdrdpuCal7rFO6-1qn3XjM528SXOjj9CWStPWtBrmHLgITQA2YGitpbdvTPPxSzkt0/s1600/Screenshot+from+2020-03-12+12-28-20.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="310" data-original-width="324" height="306" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXtQ-NNZsWiWls8kP6SVFZA_V9favwOnAFnE89SugKUfyeTvLs5V26gAalhsQ01xwlBRQHtfhIVsdrdpuCal7rFO6-1qn3XjM528SXOjj9CWStPWtBrmHLgITQA2YGitpbdvTPPxSzkt0/s320/Screenshot+from+2020-03-12+12-28-20.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure from the paper [1] explaining recursive autoencoders</td></tr>
</tbody></table>
<br />
As with the recursive neural network from the last post, we build a neural network that decides whether two inputs should be merged. Instead of predicting a supervised target, in the autoencoder case we simply reconstruct the input and greedily merge the pair of nodes with the lowest reconstruction error.<br />
<div>
<div>
<br /></div>
<h3>
Implementation</h3>
<div>
First, we define the auto encoder:</div>
<pre class="brush:python">
import numpy as np
import tensorflow as tf


def merge_encoder(n_in):
    a = tf.keras.layers.Input(n_in)
    b = tf.keras.layers.Input(n_in)
    c = tf.keras.layers.Concatenate()([a, b])
    # h: hidden representation of the merged pair
    h = tf.keras.layers.Dense(n_in)(c)
    # o: reconstruction of the concatenated inputs
    o = tf.keras.layers.Dense(n_in * 2)(h)
    merge = tf.keras.models.Model(inputs=[a, b], outputs=[h, c, o])
    return merge
</pre>
The model takes the two inputs that should be merged, concatenates them, and feeds them through a dense layer resulting in the hidden representation.
A second dense layer reconstructs the concatenated input from the hidden representation.
In order to keep track of the merges, I define a node class.
<br />
<pre class="brush:python">
class Node:

    def __init__(self, i, embedding, score, l=None, r=None):
        self.i = i
        self.score = score
        self.embedding = embedding
        self.left = l
        self.right = r

    def print(self, offset=""):
        print("{} {} {} {}".format(offset, self.i, self.score, np.mean(self.embedding)))
        if self.left is not None and self.right is not None:
            self.left.print(offset + "\t")
            self.right.print(offset + "\t")

    def merge(self, other, merger):
        # Push both embeddings through the autoencoder
        merged = merger([self.embedding, other.embedding])
        h = merged[0]  # hidden representation of the parent
        c = merged[1]  # concatenated inputs
        y = merged[2]  # reconstruction of the concatenated inputs
        # Reconstruction error plus the accumulated error of the children
        score = tf.nn.l2_loss(y - c) + self.score + other.score
        return Node(-1, h, score, self, other)


def ts2leafs(ts):
    return [Node(i, ts[i], tf.constant(0.0)) for i in range(0, len(ts))]
</pre>
A node holds its embedding, an id (for leaf nodes) as well as the merge score.
Merging two nodes involves computing the hidden representation for the new node and
its score from the reconstruction error. At the beginning, every vector in a sequence
is simply a leaf node.
Now we can greedily merge the sequence by finding the minimum possible reconstruction error.
We keep merging until only one node is left.
<br />
<pre class="brush:python">
def merge(x, m):
    if len(x) > 1:
        min_loss = float('inf')
        min_node = None
        min_i = 0
        min_j = 0
        # Try all pairs and keep the merge with the lowest score
        for i in range(len(x)):
            for j in range(len(x)):
                if i < j:
                    node = x[i].merge(x[j], m)
                    if node.score < min_loss:
                        min_node = node
                        min_loss = node.score
                        min_i = i
                        min_j = j
        # Replace the two merged nodes by their parent
        next_x = x.copy()
        next_x[min_i] = min_node
        next_x = [next_x[idx] for idx in range(0, len(x)) if idx != min_j]
        return merge(next_x, m)
    else:
        return x[0]
</pre>
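To see the greedy strategy in isolation, here is a plain-Python sketch on scalars (my own toy example, independent of the TensorFlow model): the "reconstruction error" is just the squared difference, and the merged "embedding" is the mean of the pair. The two closest values are always merged first:

```python
def greedy_merge(values):
    # Repeatedly merge the pair with the smallest squared difference,
    # replacing it by its mean (a stand-in for the learned hidden vector)
    order = []
    while len(values) > 1:
        i, j = min(((i, j) for i in range(len(values))
                    for j in range(i + 1, len(values))),
                   key=lambda ij: (values[ij[0]] - values[ij[1]]) ** 2)
        order.append((values[i], values[j]))
        merged = (values[i] + values[j]) / 2.0
        values = [v for k, v in enumerate(values) if k not in (i, j)] + [merged]
    return values[0], order

root, order = greedy_merge([1.0, 1.1, 5.0, 5.2])
print(order[0])  # (1.0, 1.1) -- the two closest values are merged first
```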
Now we can learn the model using TensorFlow's gradient tape, which records all
computations during our merges so that we can compute the gradient and update the model.
<br />
<pre class="brush:python">
m = merge_encoder(10)
x = ts2leafs([np.ones((1, 10)) * i for i in range(0, 10)]
             + [np.ones((1, 10)) * i for i in range(0, 10)])
optimizer = tf.keras.optimizers.Adam()
for i in range(0, 100):
    with tf.GradientTape(watch_accessed_variables=True) as tape:
        tape.watch(m.variables)
        node = merge(x, m)
        print(node.score)
    g = tape.gradient(node.score, m.variables)
    optimizer.apply_gradients(zip(g, m.variables))
m.save('merger.h5')
</pre>
<div>
<br /></div>
<div>
<br /></div>
<span style="font-weight: normal;">My code can be found in this <a href="https://gist.github.com/dkohlsdorf/64cf9c3915010e00ec2a475a09a4220f">gist</a></span><br />
<h3>
REFERENCES</h3>
<div>
<a href="https://www.socher.org/uploads/Main/SocherPenningtonHuangNgManning_EMNLP2011.pdf">[1] Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, Christopher D. Manning: "Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions", EMNLP 2011</a></div>
</div>
dkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.com0Bremen, Germany53.079296199999987 8.801693652.774353199999986 8.1562466 53.384239199999989 9.4471406tag:blogger.com,1999:blog-1040545417301646113.post-9130150963480623302019-09-28T08:34:00.001-07:002019-09-28T08:37:08.487-07:00Recursive Neural Networks<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgc0uNSP-i6nX8Aj54__kIlX4gxqU97y2DYhTS5ETyARfOzc1DiVozEhUpHNEXZmpNmUgscfA9qkBtz78OlUxqdDDXVJiEVWZY0LhlmG84UmpxrdPHaW6fB1lwexcr7YipJeB2hmTsr91k/s1600/Screen+Shot+2019-09-25+at+5.56.26+PM.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="372" data-original-width="545" height="272" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgc0uNSP-i6nX8Aj54__kIlX4gxqU97y2DYhTS5ETyARfOzc1DiVozEhUpHNEXZmpNmUgscfA9qkBtz78OlUxqdDDXVJiEVWZY0LhlmG84UmpxrdPHaW6fB1lwexcr7YipJeB2hmTsr91k/s400/Screen+Shot+2019-09-25+at+5.56.26+PM.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A Recursive Neural Network</td></tr>
</tbody></table>
<br />
Recursive Neural Networks can learn to structure data such as text or images hierarchically. The network works similarly to a context-free grammar and its output is a parse tree.<br />
The figure below (from this <a href="https://ai.stanford.edu/~ang/papers/icml11-ParsingWithRecursiveNeuralNetworks.pdf">paper</a>) shows examples of parse trees for an image (top) and a sentence (bottom).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhr-YSA2lYhKoB6F_zY7FMTArZnhKBu-18L8H0Jk8GUibg_gAXyht41G7oLhlcBG-wAvTkA86OSxxdYmNfRDumMKNnPBbD8Sp6u3MxdDbgGyF7uMIMvBlqf4DLOsLPxevE5MFdbZSPuLmU/s1600/Illustration-of-our-recursive-neural-network-architecture-which-parses-images-and-natural.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="870" data-original-width="850" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhr-YSA2lYhKoB6F_zY7FMTArZnhKBu-18L8H0Jk8GUibg_gAXyht41G7oLhlcBG-wAvTkA86OSxxdYmNfRDumMKNnPBbD8Sp6u3MxdDbgGyF7uMIMvBlqf4DLOsLPxevE5MFdbZSPuLmU/s320/Illustration-of-our-recursive-neural-network-architecture-which-parses-images-and-natural.png" width="312" /></a></div>
<br />
A recursive neural network for sequences always merges two consecutive input vectors and predicts if these two vectors can be merged. If so, we replace the pair with the best merging score by the hidden vector responsible for the prediction. Applied recursively, this constructs a parse tree.<br />
<br />
Formally we can construct the network as follows. Given an input sequence $x = x_1 ... x_T, x_i \in \mathbb{R}^D$ then for two neighboring inputs $x_t, x_{t + 1}$ we predict a hidden representation:<br />
<br />
$h_t = \sigma(x_t W_l + b_l)$<br />
$h_{t + 1} = \sigma(x_{t + 1} W_r + b_r)$<br />
<br />
The hidden layer used for replacement is computed from the concatenation of the two previous hidden layers: $h=\sigma([h_t, h_{t + 1}] W_h + b_h)$. Finally, the prediction is simply another layer predicting the merging score.<br />
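These equations can be sketched directly in NumPy with toy dimensions (my own sketch; all weights are random and the dimensions are chosen arbitrarily for illustration, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 4, 3  # toy input and hidden dimensions

# Parameters: W_l, b_l, W_r, b_r, W_h, b_h and the scoring layer W_s, b_s
Wl, bl = rng.normal(size=(D, H)), np.zeros(H)
Wr, br = rng.normal(size=(D, H)), np.zeros(H)
Wh, bh = rng.normal(size=(2 * H, H)), np.zeros(H)
Ws, bs = rng.normal(size=(H, 1)), np.zeros(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def merge_score(xt, xt1):
    ht = sigmoid(xt @ Wl + bl)     # h_t = sigma(x_t W_l + b_l)
    ht1 = sigmoid(xt1 @ Wr + br)   # h_{t+1} = sigma(x_{t+1} W_r + b_r)
    h = sigmoid(np.concatenate([ht, ht1]) @ Wh + bh)
    score = h @ Ws + bs            # the merging score
    return h, score

h, score = merge_score(rng.normal(size=D), rng.normal(size=D))
print(h.shape, score.shape)  # (3,) (1,)
```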
<br />
For the implementation, we define the neural network with the merging decision first.
Given a node with its two child representations (for leaf nodes these are the input vectors),
we build the hidden representations and then compute the merging score.
<br />
<pre class="brush:python">import torch
import torch.nn as nn

# IN: input dimension, H: hidden dimension (defined globally)

class NeuralGrammar(nn.Module):
    '''
    Given a tree node we project the left child and right child into
    a feature space cl and cr. The concatenated [cl, cr] is projected
    into the representation of that node. Furthermore, we predict
    the label from the hidden representation.
    '''

    def __init__(self):
        super(NeuralGrammar, self).__init__()
        self.left = nn.Linear(IN, H, bias=True)
        self.right = nn.Linear(IN, H, bias=True)
        self.parent = nn.Linear(2 * H, IN, bias=True)
        self.projection = nn.Linear(IN, 1, bias=True)

    def forward(self, node):
        l = node.left_child.representation
        r = node.right_child.representation
        x = torch.cat([torch.relu(self.left(l)), torch.relu(self.right(r))], 0)
        p = torch.tanh(self.parent(x))
        score = self.projection(p)
        return (p, score)
</pre>
Now we can implement our node class, which also handles the parsing by greedily merging up to a tree.
A node in the tree holds its representation as well as a left and a right child. In other words,
each node represents a merging decision, with the two children being the merged nodes and the parent
representation being the hidden layer in the merger grammar. Parsing a tree involves merging the pair
with the highest score and replacing the two nodes with their parent node.
<br />
<pre class="brush:python">
class TreeNode:
    '''
    A tree node in a neural grammar.
    Represented as a binary tree. Each node is also represented by
    an embedding vector.
    '''

    def __init__(self, representation=None):
        self.representation = representation
        self.left_child = None
        self.right_child = None

    @classmethod
    def greedy_tree(cls, x, grammar):
        '''
        Greedily merges bottom up using a feature extractor and grammar
        '''
        T, D = x.shape
        leafs = [cls(x[i, :]) for i in range(T)]
        while len(leafs) >= 2:
            max_score = float('-inf')
            max_hypothesis = None
            max_idx = None
            for i in range(1, len(leafs)):
                # Propose merging the two neighboring nodes
                hypothesis = cls()
                hypothesis.left_child = leafs[i - 1]
                hypothesis.right_child = leafs[i]
                representation, score = grammar(hypothesis)
                hypothesis.representation = representation
                if score > max_score:
                    max_score = score
                    max_hypothesis = hypothesis
                    max_idx = i
            # Replace the two merged nodes by their parent
            leafs[max_idx - 1:max_idx + 1] = [max_hypothesis]
        return leafs[0]
</pre>
During learning, we collect the scores of the subtrees and use labels to decide if a merge was correct or not.
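One simple way to turn the collected scores into a training signal is a hinge loss over the merge decisions. This is my own sketch under the assumption that each merge is labeled +1 (agrees with the ground-truth tree) or -1 (does not); the helper name and margin are illustrative, not from the paper:

```python
def merge_hinge_loss(scores, labels, margin=1.0):
    # scores: merging scores collected while building the tree
    # labels: +1 for merges that agree with the ground truth, -1 otherwise
    # Penalizes correct merges with low scores and incorrect merges with high scores
    return sum(max(0.0, margin - y * s) for s, y in zip(scores, labels))

# A confident correct merge (2.0, +1) and a low-scoring incorrect merge (-1.5, -1)
# cost nothing; the incorrect merge with a positive score (0.2, -1) is penalized.
loss = merge_hinge_loss([2.0, -1.5, 0.2], [1, -1, -1])
print(loss)  # 1.2
```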
dkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.com0Bremen, Germany53.079296199999987 8.801693600000021452.774353199999986 8.15624660000002 53.384239199999989 9.4471406000000222tag:blogger.com,1999:blog-1040545417301646113.post-26249006959017297562019-08-01T06:26:00.000-07:002019-08-01T06:29:27.091-07:00Code Along Books: Bitcoin, Go Game, Ray Tracers<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgcn7sdFXZiJGAPMZ0eEsqmmQdArI7S0gGD0Q9Dd_P137qeVASkMBA4ywNTaWql6-sheFSHQTeH9aaWVsxPaQAHkSZ4HPHl80W-Arp17524HwP-V_mwMFlh7AIu4raScDDeyVWsPjM4gck/s1600/scene.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="256" data-original-width="512" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgcn7sdFXZiJGAPMZ0eEsqmmQdArI7S0gGD0Q9Dd_P137qeVASkMBA4ywNTaWql6-sheFSHQTeH9aaWVsxPaQAHkSZ4HPHl80W-Arp17524HwP-V_mwMFlh7AIu4raScDDeyVWsPjM4gck/s400/scene.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A scene rendered with your own ray tracer</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br />
I like learning by doing. In my opinion, the only real self-test of whether you understood something is a working prototype. For example, when learning more about the details of deep learning or XGBoost, I implemented my own <a href="http://daniel-at-world.blogspot.com/2017/03/auto-differentiation-implementing-book.html">auto differentiation</a> and <a href="http://daniel-at-world.blogspot.com/2017/12/implementing-gradient-boosted-trees.html">gradient boosted trees</a>.<br />
<div>
<div>
<br /></div>
<div>
Recently I found several books that guide you through implementations of fairly complex pieces of software. Finishing a chapter feels like making progress on a project, it is easy to follow along, and you don't have to hunt down all the references yourself. You still get the benefit of learning by doing.<br />
<div>
<br /></div>
<div>
The book-based projects I found are:</div>
<div>
<ol>
<li>A ray tracer [1]</li>
<li>A go AI [2]</li>
<li>A bitcoin implementation [3]</li>
<li>A Prolog interpreter [4]</li>
</ol>
<div>
The Prolog interpreter needed a little more adjustment, since the book covers classic<br />
AI in Lisp and I wanted to implement my Prolog in Scala.<br />
<br />
My earlier projects on neural networks were based on Goodfellow's Deep Learning book [5] and many other references such as online classes, so it was not as easy to code along. The same is true for the gradient boosting implementation, which was mostly based on papers and books.<br />
<br />
The other projects, however, simply involved writing code, and the book was the only thing required to get into the topic. What all of these books share is that they are not collections of separate code pieces explaining a topic; instead, the topics are explained through a coherent code base that you develop yourself from scratch.<br />
<br />
<h3>
1. A Ray Tracer</h3>
The book "The Ray Tracer Challenge: A Test-Driven Guide to Your First 3D Renderer" [1] gives you no actual code, only behavior specifications (Cucumber); the code has to be written by yourself.<br />
Harder parts include pseudo code. The book covers the basic math library first and then uses the defined primitives to construct more and more complex features.<br />
The end result is a full ray tracer with shadows, reflections, and multiple primitives.<br />
<br />
<h3>
2. A Go AI</h3>
<div>
In the next book, "Deep Learning And The Game Of Go" [2], you build a Go AI using the code provided by the authors. The book walks you through the implementation of the Go board and rules first. Then the authors walk through classic game-AI algorithms such as Monte Carlo tree search. In the end, the authors introduce game-state evaluation using convolutional neural networks in Keras.</div>
<div>
The resulting system resembles an early version of AlphaGo.</div>
<br />
<br />
<h3>
3. Bitcoin</h3>
<div>
The book "Programming Bitcoin" [3] is a mix between the first two. Lots of code is provided in the form of skeletons, but the reader still has to fill in the blanks for several assignments. The book covers everything from basic cryptography to scripting Bitcoin. The end result is a full Bitcoin implementation.</div>
<div>
<br /></div>
<div>
<br /></div>
<h3>
Wrap up</h3>
<div>
The great thing about all the books mentioned is that they also cover the primitives needed. You don't need to rely on a library; the authors walk you through front to back.</div>
<br />
My code bases are all on GitHub. Some are work in progress [WIP] since<br />
I did not finish the books yet. For example, in books 2 and 3 I only went through the first chapters:<br />
<br />
<div style="-webkit-text-stroke-width: 0px; color: black; font-family: Times; font-size: medium; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
</div>
<br />
<ol style="-webkit-text-stroke-width: 0px; color: black; font-family: Times; font-size: medium; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
<li><a href="https://github.com/dkohlsdorf/RayTracer">The ray tracer</a></li>
<li><a href="https://github.com/dkohlsdorf/gameplay">The go AI</a></li>
<li><a href="https://github.com/dkohlsdorf/pycoin">The bitcoin implementation</a></li>
<li><a href="https://github.com/dkohlsdorf/Scalog">The Prolog interpreter</a></li>
</ol>
</div>
<div>
<br />
<h3>
REFERENCES:</h3>
</div>
<div>
<br /></div>
<div>
[1] Buck, "The Ray Tracer Challenge: A Test-Driven Guide to Your First 3D Renderer", Pragmatic Bookshelf, 2019</div>
<div>
[2] Pumperla, Ferguson, "Deep Learning And The Game Of Go", Manning, 2019</div>
<div>
[3] Song, "Programming Bitcoin: Learn How To Program Bitcoin From Scratch", O'Reilly, 2019</div>
<div>
[4] Norvig: "Paradigms of Artificial Intelligence Programming: Case Studies in Common Lisp", Morgan Kaufmann, 1992<br />
[5] Goodfellow, Bengio, Courville, "Deep Learning", MIT Press, 2016</div>
</div>
</div>
</div>
dkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.com0Bremen, Germany53.079296199999987 8.801693600000021452.774353199999986 8.15624660000002 53.384239199999989 9.4471406000000222tag:blogger.com,1999:blog-1040545417301646113.post-16789143647320129192019-07-09T07:23:00.002-07:002019-07-09T07:23:57.984-07:00WebRTC and Tensorflow.js<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0Quc8izFlr33Aa9BZ5FRAOEePcKNpeBRe4SQB5q1dwKT1wYyPThbX2bt9eTJKMyHt0u7bs7pqW6zCHVhQushPIBTocHaoZEB8dQvB-cVK7lAkAHf5bCsIcsU8Nn1EZRamPLi0jlByFR0/s1600/Screen+Shot+2019-07-09+at+4.21.04+PM.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="468" data-original-width="558" height="167" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0Quc8izFlr33Aa9BZ5FRAOEePcKNpeBRe4SQB5q1dwKT1wYyPThbX2bt9eTJKMyHt0u7bs7pqW6zCHVhQushPIBTocHaoZEB8dQvB-cVK7lAkAHf5bCsIcsU8Nn1EZRamPLi0jlByFR0/s200/Screen+Shot+2019-07-09+at+4.21.04+PM.png" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">WebCam image classification on web page</td></tr>
</tbody></table>
<br /><br />
I will give a quick tutorial on how to connect a webcam in <i><a href="https://en.wikipedia.org/wiki/WebRTC">HTML5/WebRTC</a> </i>to <a href="https://www.tensorflow.org/js">Tensorflow.js</a> for image classification. We will load a pre-trained MobileNet and then pass it frames from the camera. First, we define our HTML page and specify a video element in which we later stream the video from the camera,<br />
a label field, and also start the WebRTC + neural network code. The code can be found at this <a href="https://gist.github.com/dkohlsdorf/b52861b2ef2319683fd0af19ee288e06">gist</a>.<br />
<br />
<br />
<pre class="brush:xhtml"><!DOCTYPE html>
<html>
<head>
<title> Hello WebRTC </title>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@1.0.0/dist/tf.min.js"></script>
</head>
<body>
<video id="cam" width="224" height="224" autoplay playsinline></video> <br/>
<script src="camnet.js"></script><br/>
<font color="red" size="5"> LABEL: </font><br/>
<div id="label"> </div>
</body>
</html>
</pre>
<br />
<br />
Then we download the pretrained mobile net and initialize the webcam. <br />
<pre class="brush:javascript">async function init() {
    try {
        const net = await tf.loadLayersModel(MOBILENET_MODEL_PATH);
        const constraints = window.constraints = {audio: false, video: true};
        const stream = await navigator.mediaDevices.getUserMedia(constraints);
        onSuccess(stream, net);
    } catch (e) {
        onError(e);
    }
}
</pre>
<br />
<br />
The camera stream is retrieved with the getUserMedia function; the onError method simply writes an error to the console.
If we are successful, we get the video element from the DOM and bind the stream to it. We then start the detection
loop with a method called onFrame.
<br />
<pre class="brush:javascript">function onSuccess(stream, net) {
    const video = document.querySelector('video');
    const videoTracks = stream.getVideoTracks();
    console.log('Got stream with constraints:', constraints);
    console.log(`Using video device: ${videoTracks[0].label}`);
    window.stream = stream;
    video.srcObject = window.stream;
    onFrame(video, net);
}
</pre>
<br />
<br />
onFrame's inner function processFrame loops indefinitely by re-scheduling itself with
requestAnimationFrame. We grab the video frame by frame and push each one into a classify
method along with the neural network, the video element as well as the label element.
<br />
<pre class="brush:javascript">function onFrame(video, net) {
    var label_element = document.getElementById('label');
    console.log(net.summary());
    async function processFrame() {
        classify(video, label_element, net);
        requestAnimationFrame(processFrame);
    }
    processFrame();
}
</pre>
The last method transforms the camera image into a tensor, normalizes the colors and then
constructs a batch with only one example. Based on the prediction from the MobileNet,
we extract the best class and write it into the label element.
<br />
<pre class="brush:javascript">async function classify(img_element, label_element, net) {
    const img = tf.browser.fromPixels(img_element).toFloat();
    const offset = tf.scalar(127.5);
    const normalized = img.sub(offset).div(offset);
    const batched = normalized.reshape([1, IMAGE_SIZE, IMAGE_SIZE, 3]);
    const prediction = await net.predict(batched).data();
    var max_i = 0;
    var max_v = prediction[0];
    for (let i = 0; i < prediction.length; i++) {
        if (prediction[i] > max_v) {
            max_v = prediction[i];
            max_i = i;
        }
    }
    const label = IMAGENET_CLASSES[max_i];
    if (max_v > 0.5) {
        label_element.innerHTML = label + " [" + max_v + "]";
    }
}
</pre>
<br />
<br />
<br />dkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.com0Bremen, Germany53.079296199999987 8.801693600000021452.774353199999986 8.15624660000002 53.384239199999989 9.4471406000000222tag:blogger.com,1999:blog-1040545417301646113.post-7561501718786789702019-07-07T04:50:00.003-07:002019-07-07T04:50:53.552-07:00Implementing Triplet Loss Function in Tensorflow 2.0In this post I will go through an implementation of the triplet loss for siamese neural network architectures in keras (tensorflow 2.0).<br />
<br />
The goal of training a neural network with a triplet loss is to learn a metric embedding: examples that are conceptually close are also close in Euclidean space, and examples that are conceptually distant are far apart in Euclidean space, too. The triplet loss was introduced in the <a href="https://arxiv.org/abs/1503.03832">FaceNet</a> paper from Google. When learning with the triplet loss, we choose three examples. The first example is called the anchor, which can be any example from our dataset. The positive example is one that is conceptually close. For example, a conceptually close example for face recognition might be a picture of the same person as shown in the anchor image. The negative example might simply be a picture of any other person. The embedding network can have any architecture. During training, we push all three examples through the network, compute the loss function with the three embeddings, and adjust the weights, which in each epoch pushes negative examples further from their anchors and positive examples closer.<br />
<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtsR2g9M0XBY4qGN4GjYQFYZo9JnD9Eb-Z9Keg3zCqYYDcSGdd8iW-ohH3timyC4EVuXzgNQUSO1aafI3xNzLXLdfEPXNooy8zOolYBwliQD-TqKi_56q7BpuV8MMYBmRA_z9Z541PHOc/s1600/TripletLoss.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="360" data-original-width="480" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtsR2g9M0XBY4qGN4GjYQFYZo9JnD9Eb-Z9Keg3zCqYYDcSGdd8iW-ohH3timyC4EVuXzgNQUSO1aafI3xNzLXLdfEPXNooy8zOolYBwliQD-TqKi_56q7BpuV8MMYBmRA_z9Z541PHOc/s400/TripletLoss.gif" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 1: The Triplet loss forces negative examples further from the anchor and positive examples closer. </td></tr>
</tbody></table>
An example is shown below. The loss consists of three parts. The first is the Euclidean distance between the anchor and its positive example. The second is the Euclidean distance between the anchor and its negative example. Since we want to minimize the loss, the first term should be small and the second term should be large. Furthermore, we introduce a margin between the examples as a regularizer.<br />
<div>
<br />
<div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><img border="0" data-original-height="842" data-original-width="1600" height="336" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhI3ktcsqPgnlbfSQdd5iVK5SyqNDp_kNqovB4LAT_RShB5glIFSOZZzvQiDed-WSn19YSfoKK6h8sMNNsPGQ7drtmqvr7N0Izq5CcE2m9K8VlV4Hn31pLwudiAR7DYjlLtRqIxWu16vys/s640/Screen+Shot+2019-07-07+at+1.24.36+PM.png" style="margin-left: auto; margin-right: auto;" width="640" /></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 2: We embed the anchor, positive and negative examples with the same network and apply the triplet loss. During backpropagation we achieve higher Euclidean distances to the negative examples and lower distances to the positive examples. </td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhI3ktcsqPgnlbfSQdd5iVK5SyqNDp_kNqovB4LAT_RShB5glIFSOZZzvQiDed-WSn19YSfoKK6h8sMNNsPGQ7drtmqvr7N0Izq5CcE2m9K8VlV4Hn31pLwudiAR7DYjlLtRqIxWu16vys/s1600/Screen+Shot+2019-07-07+at+1.24.36+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"></a><br /></div>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhI3ktcsqPgnlbfSQdd5iVK5SyqNDp_kNqovB4LAT_RShB5glIFSOZZzvQiDed-WSn19YSfoKK6h8sMNNsPGQ7drtmqvr7N0Izq5CcE2m9K8VlV4Hn31pLwudiAR7DYjlLtRqIxWu16vys/s1600/Screen+Shot+2019-07-07+at+1.24.36+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"></a><br />
For <a href="https://arxiv.org/abs/1711.02209">unsupervised audio embedding</a>, one way to sample triplets is to pick a window from the audio as the anchor and a close window in time to the anchor as positive (since audio does not change that rapidly). The negative example is simply a sample from another file.<br />
<br />
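A sketch of this sampling scheme in plain Python (my own illustration; the helper name, window length, and maximum offset are arbitrary choices, not from the paper):

```python
import random

def sample_triplet(files, window=100, max_offset=50):
    # files: list of 1-D audio signals (lists of samples)
    i = random.randrange(len(files))
    j = random.choice([k for k in range(len(files)) if k != i])
    # Anchor: a random window from file i
    a_start = random.randrange(len(files[i]) - window - max_offset)
    # Positive: a window close in time to the anchor in the same file
    p_start = a_start + random.randrange(1, max_offset)
    # Negative: a window from a different file
    n_start = random.randrange(len(files[j]) - window)
    anchor = files[i][a_start:a_start + window]
    positive = files[i][p_start:p_start + window]
    negative = files[j][n_start:n_start + window]
    return anchor, positive, negative

files = [[float(t) for t in range(1000)], [float(-t) for t in range(1000)]]
a, p, n = sample_triplet(files)
print(len(a), len(p), len(n))  # 100 100 100
```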
In order to implement such a loss in Tensorflow 2.0 / Keras, we can implement the <a href="https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/losses/Loss">Loss base class</a>.<br />
<br />
<pre class="brush:python">import tensorflow as tf
from tensorflow.keras.losses import Loss


class TripletLoss(Loss):

    def __init__(self, margin):
        super().__init__()
        self.margin = margin

    def call(self, y_true, y_pred):
        anchor = y_pred[0]
        pos = y_pred[1]
        neg = y_pred[2]
        # Squared euclidean distances between the anchor and its examples
        pos_dist = tf.reduce_sum(tf.square(tf.subtract(anchor, pos)), axis=-1)
        neg_dist = tf.reduce_sum(tf.square(tf.subtract(anchor, neg)), axis=-1)
        # Hinge: only penalize triplets where the negative is not
        # at least `margin` further away than the positive
        basic_loss = tf.add(tf.subtract(pos_dist, neg_dist), self.margin)
        loss = tf.reduce_sum(tf.maximum(basic_loss, 0.0))
        return loss
</pre>
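To check the behavior, we can evaluate the same formula on toy embeddings in plain Python (my own toy numbers, independent of the class above):

```python
def triplet_loss(anchor, pos, neg, margin=0.2):
    # Squared euclidean distances, then the hinged margin loss
    d_pos = sum((a - p) ** 2 for a, p in zip(anchor, pos))
    d_neg = sum((a - n) ** 2 for a, n in zip(anchor, neg))
    return max(d_pos - d_neg + margin, 0.0)

# Negative already further away than the positive plus the margin: no loss
print(triplet_loss([0.0, 0.0], [0.1, 0.0], [2.0, 0.0]))  # 0.0
# Negative as close as the positive: the margin is violated, positive loss
print(triplet_loss([0.0, 0.0], [1.0, 0.0], [1.0, 0.0]))  # 0.2
```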
<br />
As you can see from the code, we do not need the ground truth, so we simply pass dummy values for `y_true` during training (Keras will still check its dimensions). In order to construct a model using the triplet loss, we can build an embedder model and then apply that model three times in the triplet loss model, once per input. During backpropagation, the three gradients are summed and then passed through the embedder model (<a href="https://www.deeplearningbook.org/contents/mlp.html">deep learning book chapter 6</a>, Algorithm 6.6). A full example on<br />
audio data can be found in this <a href="https://gist.github.com/dkohlsdorf/90df4721cd70c6f4420b7e049796280b">gist</a>.<br />
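The arithmetic inside the loss is easy to check in isolation. Below is the same hinge computation for a single triplet, sketched outside of Tensorflow in plain Rust (helper names are mine, not from the gist):

```rust
// Squared euclidean distance between two embedding vectors.
fn squared_dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

// Triplet hinge loss: max(0, ||a - p||^2 - ||a - n||^2 + margin).
fn triplet_loss(anchor: &[f32], pos: &[f32], neg: &[f32], margin: f32) -> f32 {
    (squared_dist(anchor, pos) - squared_dist(anchor, neg) + margin).max(0.0)
}

fn main() {
    let anchor = [0.0, 0.0];
    let pos = [0.1, 0.0]; // close to the anchor
    let neg = [1.0, 1.0]; // far from the anchor
    // pos_dist is small, neg_dist is large, so the hinge clips the loss to zero
    println!("{}", triplet_loss(&anchor, &pos, &neg, 0.5));
}
```

If the negative sits closer to the anchor than the positive, the loss becomes positive and the gradient pushes the embeddings apart.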
<br />
<br />
<br />
<br />
<br /></div>
</div>
dkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.com0Bremen, Germany53.079296199999987 8.801693600000021452.774353199999986 8.15624660000002 53.384239199999989 9.4471406000000222tag:blogger.com,1999:blog-1040545417301646113.post-67618457361814419802019-04-01T06:10:00.003-07:002019-07-09T07:33:17.080-07:00Navigable Small World Graphs for Approximate Nearest Neighbors: Implementation in Rust and Time Series Experiments.<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5iiZnz9Rjlod3g6C-k9Q9ymmkqKsZVwrbkpiok9GhLi6u3V05HdobVmpGu9-IDAUxFkx7QZjEtp8vKiMG0nf9pVRKzoBDWXGLC7S5bsXq-D6ODKhXIMwr0DSrAqG924jD7TrJ7ZFZvPo/s1600/Screenshot+2019-03-31+at+23.20.56.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="602" data-original-width="1238" height="311" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5iiZnz9Rjlod3g6C-k9Q9ymmkqKsZVwrbkpiok9GhLi6u3V05HdobVmpGu9-IDAUxFkx7QZjEtp8vKiMG0nf9pVRKzoBDWXGLC7S5bsXq-D6ODKhXIMwr0DSrAqG924jD7TrJ7ZFZvPo/s640/Screenshot+2019-03-31+at+23.20.56.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A navigable small world graph. Image from <a href="https://publications.hse.ru/mirror/pubs/share/folder/x5p6h7thif/direct/128296059" style="font-size: medium; text-align: start;">NSWG</a></td></tr>
</tbody></table>
<br />
A small world graph, or network, is a mathematical graph with a particular property: most nodes are not neighbors of one another, yet the neighbors of any node are likely to be neighbors of each other, and any node can be reached from any other in a small number of edge transitions.<br />
The idea is to use such a structure to support fast approximate nearest neighbor queries.<br />
In a navigable small world graph (<a href="https://publications.hse.ru/mirror/pubs/share/folder/x5p6h7thif/direct/128296059">NSWG</a>) each node or vertex is associated with a vector in $\mathbb{R}^d$. During search, we check the distance to all neighbors of the current node and transition to the closest one, repeating until we reach a local minimum. In this post we discuss how to compute the $k$ closest neighbors and explore the method using the <a href="https://www.cs.ucr.edu/~eamonn/time_series_data_2018/">UCR time series dataset</a>. The code can be found on <a href="https://github.com/dkohlsdorf/NSWG">github</a>.<br />
<br />
<h3>
Modelling the Graph and Search Results</h3>
<div>
The edges in the graph are implemented using a map. For each vertex, we map to the
neighboring vertices. The struct also holds the flattened vectors associated with each vertex.
<br />
<pre class="language-rust"><code class="language-rust">/// A navigable small world graph defined
/// by a sparse edge set and a flat vector set, one for each vertex.
pub struct NavigableSmallWorldGraph {
pub edges: HashMap<usize, Vec<usize>>,
pub instances: Vec<f32>,
pub dim: usize
}
</code></pre>
A search result is simply implemented as a struct with a node id as well as a distance
to the query.
<br />
<pre class="language-rust"><code class="language-rust">pub struct SearchResult {
pub node: usize,
pub distance: f32
}
</code></pre>
<br /></div>
Furthermore, we implement comparison of search results by their distance. In
order to manage the set of candidates during knn search, we implement a
result set as a binary tree set. This way, extracting the top neighbors is
fast.
<br />
<pre class="language-rust"><code class="language-rust">pub struct ResultStore {
ordered: BTreeSet<SearchResult>
}
impl ResultStore {
pub fn new() -> ResultStore {
ResultStore { ordered: BTreeSet::new() }
}
pub fn len(&self) -> usize {
self.ordered.len()
}
pub fn take(&self, k: usize) -> Vec<SearchResult> {
self.ordered.iter().take(k).map(|x| x.clone()).collect()
}
pub fn best(&self, at: usize) -> SearchResult {
let k_best = self.ordered.iter().take(at);
if k_best.len() < at {
SearchResult::none()
} else {
k_best.last().unwrap().clone()
}
}
pub fn insert(&mut self, result: SearchResult) {
self.ordered.insert(result);
}
pub fn remove(&mut self, result: &SearchResult) {
self.ordered.remove(result);
}
}
</code></pre>
Insertions into this data structure automatically keep it ordered, and
removals likewise preserve the order. Iterating the elements therefore
yields the candidates sorted by distance.
<br />
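One detail the snippet above glosses over: `BTreeSet` requires a total order, and `f32` only implements `PartialOrd`. A minimal sketch of how `SearchResult` could implement `Ord` over its distance (using `f32::total_cmp` with the node id as tie breaker is my choice here, not necessarily what the repository does):

```rust
use std::cmp::Ordering;
use std::collections::BTreeSet;

#[derive(Clone, PartialEq)]
struct SearchResult {
    node: usize,
    distance: f32,
}

impl Eq for SearchResult {}

impl PartialOrd for SearchResult {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl Ord for SearchResult {
    fn cmp(&self, other: &Self) -> Ordering {
        // order by distance, break ties by node id so distinct nodes are kept
        self.distance
            .total_cmp(&other.distance)
            .then(self.node.cmp(&other.node))
    }
}

fn main() {
    let mut set = BTreeSet::new();
    set.insert(SearchResult { node: 1, distance: 3.0 });
    set.insert(SearchResult { node: 2, distance: 1.0 });
    set.insert(SearchResult { node: 3, distance: 2.0 });
    let order: Vec<usize> = set.iter().map(|r| r.node).collect();
    println!("{:?}", order); // iteration comes back sorted by distance
}
```

Without such a total order the set could not sort the results, which is exactly the property the candidate and result stores rely on.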
<h3>
KNN-Search and Graph Construction</h3>
First we define the distances we support. For this post, I implement the euclidean distance between
a query vector and a node in the graph, as well as the dynamic time warping (DTW) distance, which treats the
vectors as time series. If we pass a warping band to the distance function, we compute the DTW distance; otherwise we use the euclidean distance.
<br />
<pre class="language-rust"><code class="language-rust"> fn distance(&self, query: &[f32], node: usize, align_band: Option<usize>) -> f32 {
match align_band {
Some(w) => self.dtw(query, node, w),
None => self.euclidean(query, node)
}
}
fn euclidean(&self, query: &[f32], node: usize) -> f32 {
assert!(query.len() == self.dim);
let y = &self.instances[node * self.dim .. (node + 1) * self.dim];
let mut distance = 0.0;
for i in 0 .. query.len() {
distance += f32::powf(query[i] - y[i], 2.0);
}
f32::sqrt(distance)
}
fn dtw(&self, query: &[f32], node: usize, w: usize) -> f32 {
    let y = &self.instances[node * self.dim .. (node + 1) * self.dim];
    let n = query.len();
    let m = y.len();
    // allocating here is the main bottleneck
    let mut dp = vec![std::f32::INFINITY; (n + 1) * (m + 1)];
    dp[0] = 0.0;
    for i in 1 .. n + 1 {
        // sub is a saturating subtraction helper; together with the min below
        // we only fill cells inside the Sakoe Chiba band of width w
        for j in usize::max(NavigableSmallWorldGraph::sub(i, w), 1) .. usize::min(i + w, m + 1) {
            let distance = f32::powf(query[i - 1] - y[j - 1], 2.0);
            // the dp matrix has m + 1 columns, so the row stride is m + 1
            dp[i * (m + 1) + j] = distance + NavigableSmallWorldGraph::min3(
                dp[(i - 1) * (m + 1) + j],
                dp[i * (m + 1) + (j - 1)],
                dp[(i - 1) * (m + 1) + (j - 1)]
            )
        }
    }
    dp[dp.len() - 1]
}
</code></pre>
We can now turn to the search algorithm. During the search, we
maintain an ordered candidate list, an ordered result list and an unordered
visited set. A search run starts at a random entry node, which we add
as the first candidate. In each following step, we pop the best node from
the candidate list and add all of its unvisited neighbors as candidates.
If the best candidate is further away than the $k$-th best neighbor found so far, we stop.
We restart the search $n$ times from different random entry points.
<br />
<pre class="language-rust"><code class="language-rust"> /// Given a query find the k-nearest neighbors using the nswg graph
/// * `query` The query vector
/// * `n_searches` number of restarts
/// * `k_neighbors` number of results
/// * `align_band` Sakoe Chiba Band for Dynamic Time Warping, if not set we use Euclidean Distance
pub fn search(
&self,
query: &[f32],
n_searches: usize,
k_neighbors: usize,
align_band: Option<usize>,
) -> (Vec<SearchResult>, usize) {
let mut candidates = ResultStore::new();
let mut results = ResultStore::new();
let mut visited = HashSet::new();
let mut rng = rand::thread_rng();
let mut n_steps = 0;
for _attempt in 0..n_searches {
let mut bsf = ResultStore::new();
// stop if we visited all of the nodes
if self.n_instances() == visited.len() {
break;
}
// sample entry point, make sure we never used it before
// and insert the point as a candidate
let mut entry_point = rng.gen_range(0, self.n_instances());
while visited.contains(&entry_point) {
entry_point = rng.gen_range(0, self.n_instances());
}
let distance = self.distance(query, entry_point, align_band);
candidates.insert(SearchResult::new(entry_point, distance));
while candidates.len() > 0 {
// check if we can stop: abandon the search if the candidate is worse than the k-th best result so far
let candidate = candidates.best(1);
let kth = f32::min(
results.best(k_neighbors).distance,
bsf.best(k_neighbors).distance,
);
candidates.remove(&candidate);
if candidate.distance > kth {
break;
}
// expand all edges of the candidate
for edge in self.edges[&candidate.node].iter() {
if !visited.contains(edge) {
n_steps += 1;
visited.insert(edge);
let edge = *edge;
let distance = self.distance(query, edge, align_band);
candidates.insert(SearchResult::new(edge, distance));
bsf.insert(SearchResult::new(edge, distance));
}
}
}
for res in bsf.take(k_neighbors) {
results.insert(res);
}
}
(results.take(k_neighbors), n_steps)
}
</code></pre>
<div>
<br /></div>
<h3>
Testing the Code on UCR Time Series</h3>
In order to test the code, we run it against multiple datasets from the UCR time series archive.
Our accuracy is similar to the baselines reported for the UCR datasets, while the number of distance computations is far lower.
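The construction step itself is not shown above. In an NSWG the graph is typically built incrementally: each new point is inserted and connected bidirectionally to its (approximately) nearest existing points. A brute force sketch of that idea, simplified by me (the repository instead searches the partial graph to find the neighbors):

```rust
use std::collections::HashMap;

// Connect each newly inserted point to its k nearest existing points
// (bidirectionally). Brute force over all previous points for clarity.
fn build_graph(points: &[Vec<f32>], k: usize) -> HashMap<usize, Vec<usize>> {
    let mut edges: HashMap<usize, Vec<usize>> = HashMap::new();
    for i in 0..points.len() {
        edges.entry(i).or_default();
        let mut dists: Vec<(f32, usize)> = (0..i)
            .map(|j| (euclidean(&points[i], &points[j]), j))
            .collect();
        dists.sort_by(|a, b| a.0.total_cmp(&b.0));
        for &(_, j) in dists.iter().take(k) {
            edges.get_mut(&i).unwrap().push(j);
            edges.get_mut(&j).unwrap().push(i);
        }
    }
    edges
}

fn euclidean(x: &[f32], y: &[f32]) -> f32 {
    x.iter().zip(y).map(|(a, b)| (a - b) * (a - b)).sum::<f32>().sqrt()
}

fn main() {
    let points = vec![vec![0.0], vec![0.1], vec![5.0], vec![5.1]];
    let edges = build_graph(&points, 1);
    println!("{:?}", edges[&3]); // point 3 connects to its nearest point, 2
}
```

Because early points are inserted when few candidates exist, they pick up long-range edges, which is what gives the graph its small world shortcuts.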
dkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.com0Bremen, Germany53.079296199999987 8.801693600000021452.774353199999986 8.15624660000002 53.384239199999989 9.4471406000000222tag:blogger.com,1999:blog-1040545417301646113.post-35305223865054758812019-03-16T10:00:00.001-07:002019-03-16T10:03:02.344-07:00Avro Rust: A first look at data serialisation in rust using avro<a href="https://avro.apache.org/docs/current/">Avro</a> is a data serialisation system in which you can define a schema (basically the fields in a record you want to serialise along with their type) in json and then send the data binary.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5zFL-uLaur4DDWuewcwaKFYdtSWmVUfWxBnjs3PQBLBZZPOzAZVR5c1NplrNtVoBKchrRkk3njb975Naq94PSyr3iUZ8eLc9wHEhkBfpC42oX2kS6pllKgHrMqYhZG5lN6ucdgvlNM7c/s1600/Screenshot+2019-03-16+at+18.02.15.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="610" data-original-width="654" height="298" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5zFL-uLaur4DDWuewcwaKFYdtSWmVUfWxBnjs3PQBLBZZPOzAZVR5c1NplrNtVoBKchrRkk3njb975Naq94PSyr3iUZ8eLc9wHEhkBfpC42oX2kS6pllKgHrMqYhZG5lN6ucdgvlNM7c/s320/Screenshot+2019-03-16+at+18.02.15.png" width="320" /></a></div>
<br />
<br />
<br />
In the Rust language, structs can be serialised to multiple formats using a library called <a href="https://github.com/serde-rs/serde">serde</a>. In the<br />
<a href="http://daniel-at-world.blogspot.com/2019/03/word2vec-implementation-in-rust.html">last post </a> we saved our word2vec model into a binary format using serde and bincode. In this post, we will look into serialising with serde and <a href="https://github.com/flavray/avro-rs">avro-rs</a>, an avro implementation in rust and implement sending serialised objects over a tcp stream using <a href="https://avro.apache.org/docs/1.8.1/spec.html#Message+Framing">framing</a>.<br />
<br />
As usual the code is on <a href="https://gist.github.com/dkohlsdorf/a17039267ebeba3da183a04049b6ebfd#file-avro_tcp_ip_frame-rs">[:github:]</a><br />
<br />
<br />
<h3>
Data (De) Serialisation </h3>
<div>
Avro supports several data types for serialisation including more complex types such as arrays.<br />
<br />
<br />
<pre class="language-rust"><code class="language-rust"> #[derive(Debug, Deserialize, Serialize)]
pub struct Test {
a: i32,
b: f64,
c: String,
d: Vec<i32>
}
fn schema() -> Schema {
let raw_schema = r#"
{
"type": "record",
"name": "test",
"fields": [
{"name": "a", "type": "int", "default": 0},
{"name": "b", "type": "double", "default": 1.0},
{"name": "c", "type": "string", "default": "dansen"},
{"name": "d", "type": {"type": "array", "items": "int"}}
]
}
"#;
avro_rs::Schema::parse_str(raw_schema).unwrap()
}
</code></pre>
In the code above, we define the struct that we want to serialise and specify the serialisation schema in JSON;
nothing more is needed. With this information we can serialise and deserialise the `Test` struct, and we
can even (de)serialise a collection of these structs. Serialisation produces a vector of bytes, and
given this vector of bytes we can deserialise back into a vector of our original struct.
<br />
<pre class="language-rust"><code class="language-rust">
fn write_data(data: &[Test], schema: Schema) -> Vec<u8> {
let mut writer = Writer::with_codec(&schema, Vec::new(), Codec::Deflate);
for x in data.iter() {
writer.append_ser(x).unwrap();
}
writer.flush().unwrap();
writer.into_inner()
}
fn read_data(data: &[u8], schema: Schema) -> Vec<Test> {
let reader = Reader::with_schema(&schema, data).unwrap();
reader
.map(|record| from_value::<Test>(&record.unwrap()).unwrap())
.collect()
}
</code></pre>
<br />
<h3>
Framed Writing and Reading over TCP/IP</h3>
</div>
<div>
In order to send avro over the network, it is recommended to frame the message: we
break the byte stream into chunks, send each chunk preceded by its length, and terminate
the sequence with a zero sized chunk.
<q>Avro messages are framed as a list of buffers.
Framing is a layer between messages and the transport. It exists to optimize certain operations.
The format of framed message data is:
a series of buffers, where each buffer consists of:
a four-byte, big-endian buffer length, followed by
that many bytes of buffer data.
A message is always terminated by a zero-length buffer.
Framing is transparent to request and response message formats (described below). Any message may be presented as a single or multiple buffers.
Framing can permit readers to more efficiently get different buffers from different sources and for writers to more efficiently store different buffers to different destinations. In particular, it can reduce the number of times large binary objects are copied. For example, if an RPC parameter consists of a megabyte of file data, that data can be copied directly to a socket from a file descriptor, and, on the other end, it could be written directly to a file descriptor, never entering user space.
A simple, recommended, framing policy is for writers to create a new segment whenever a single binary object is written that is larger than a normal output buffer. Small objects are then appended in buffers, while larger objects are written as their own buffers. When a reader then tries to read a large object the runtime can hand it an entire buffer directly, without having to copy it. </q> from <a href="https://avro.apache.org/docs/1.8.1/spec.html#Message+Framing"> Avro Page</a> .<br />
<br />
Writing over the network with a fixed buffer or chunk size can be implemented by sending slices of the given size
over the network, each preceded by its length (which only differs from the buffer size for the last slice). As described above,
we send the size of each buffer as a four-byte, big-endian integer.
<br />
<br />
<br />
<pre class="language-rust"><code class="language-rust"> pub fn send_framed(objects: &[Test], address: String, schema: Schema, buf_size: usize) {
let data = write_data(objects, schema);
let mut stream = TcpStream::connect(address).unwrap();
let n = data.len();
for i in (0..n).step_by(buf_size) {
// determine size of bytes to write
let start = i;
let stop = usize::min(i + buf_size, n);
// send length of buffer
let mut buffer_length = [0; 4];
BigEndian::write_u32(&mut buffer_length, (stop - start) as u32);
stream.write_all(&buffer_length).unwrap();
// write actual data
stream.write_all(&data[start..stop]).unwrap();
}
// terminate by 0 sized
let mut buffer_length = [0; 4];
BigEndian::write_u32(&mut buffer_length, 0);
stream.write_all(&buffer_length).unwrap();
stream.flush().unwrap();
}
</code></pre>
<br />
<br />
When reading, we simply read the size of the next buffer, then read that many bytes and append them to
a vector of bytes, until we hit the zero sized buffer marking the end. Then we use the method above to deserialise into a vector of structs.
<br />
<pre class="language-rust"><code class="language-rust">
pub fn read_framed(stream: &mut TcpStream, schema: Schema) -> Vec<Test> {
let mut message: Vec<u8> = Vec::new();
let mut first_buf = [0; 4];
stream.read_exact(&mut first_buf).unwrap();
let mut next = BigEndian::read_u32(&first_buf);
// read while we see non empty buffers
while next > 0 {
// append all bytes to final object
let mut object = vec![0; next as usize];
stream.read_exact(&mut object).unwrap();
message.append(&mut object);
// read next buffer size
let mut next_buf = [0; 4];
stream.read_exact(&mut next_buf).unwrap();
next = BigEndian::read_u32(&next_buf);
}
read_data(&message, schema)
}
</code></pre>
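The framing format itself can be checked without opening a socket. Here is a minimal in-memory round trip of the length-prefixed buffers, using only the standard library (no byteorder crate):

```rust
use std::convert::TryInto;

// Split `data` into length-prefixed chunks (big-endian u32 lengths),
// terminated by a zero-length buffer.
fn frame(data: &[u8], buf_size: usize) -> Vec<u8> {
    let mut out = Vec::new();
    for chunk in data.chunks(buf_size) {
        out.extend_from_slice(&(chunk.len() as u32).to_be_bytes());
        out.extend_from_slice(chunk);
    }
    out.extend_from_slice(&0u32.to_be_bytes()); // terminator
    out
}

// Parse the length-prefixed chunks back into one contiguous byte vector.
fn deframe(framed: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut pos = 0;
    loop {
        let len = u32::from_be_bytes(framed[pos..pos + 4].try_into().unwrap()) as usize;
        pos += 4;
        if len == 0 {
            break;
        }
        out.extend_from_slice(&framed[pos..pos + len]);
        pos += len;
    }
    out
}

fn main() {
    let payload: Vec<u8> = (0..10).collect();
    let framed = frame(&payload, 4);
    assert_eq!(deframe(&framed), payload);
    println!("round trip ok, framed {} bytes", framed.len());
}
```

Ten payload bytes in chunks of four become three buffers plus the terminator: each prefix costs four bytes, so the framed message is 26 bytes in total.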
</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
That's it: now we can send framed avro over the network and deserialise it on the other end.</div>
dkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.com1Bremen, Germany53.079296199999987 8.801693600000021452.774353199999986 8.15624660000002 53.384239199999989 9.4471406000000222tag:blogger.com,1999:blog-1040545417301646113.post-11633455757080060552019-03-08T02:59:00.001-08:002019-03-10T07:28:43.513-07:00Word2Vec Implementation in Rust: Continuous Bag of Words<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmfAp5instwzu93kvHlL0FttD26GMpGpoD7Ymqub01GyodcwDLGh4dKbaOMKFXS3550y4VI_fry48s83GvTRXd7LTuL0kJEN0YguyP0LTgGGHVbMxohwsi5grfX7AdZJCV1Z1uvdgh5sM/s1600/Screenshot+2019-03-08+at+11.38.55.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="852" data-original-width="1600" height="340" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmfAp5instwzu93kvHlL0FttD26GMpGpoD7Ymqub01GyodcwDLGh4dKbaOMKFXS3550y4VI_fry48s83GvTRXd7LTuL0kJEN0YguyP0LTgGGHVbMxohwsi5grfX7AdZJCV1Z1uvdgh5sM/s640/Screenshot+2019-03-08+at+11.38.55.png" width="640" /></a></div>
<br />
I recently started to get serious about learning rust and I needed a small to medium project to train on.<br />
One method I explained in this <a href="https://daniel-at-world.blogspot.com/2016/10/reading-through-word2vec-code.html">blog</a> before, walking through the original code, is Word2Vec, a method to learn word embeddings. In this post I will explain the method again using my new rust code on <a href="https://github.com/dkohlsdorf/RustWord2Vec">github</a>.<br />
<br />
<h3>
Word Similarity - Basic Idea </h3>
<div>
In the following, I will describe the basic idea behind continuous bag of words (CBOW), one way to estimate word vectors. Word vectors are a dense vectorised word representation that captures the similarity of words in euclidean space. We have one vector for each word in our vocabulary, and the euclidean distance between the vectors of two similar words is smaller than the distance between the vectors
of two less similar words. The assumption in Word2Vec is that words that appear in similar
contexts are similar. We shift a sliding window over the text and try to predict the center word
of the window from the word vectors of the surrounding context words. </div>
<br />
<pre class="language-rust"><code class="language-rust">pub struct Window<'a> {
pub words: &'a [usize],
pub predict_pos: usize
}
pub struct Document {
pub words: Vec<usize>
}
impl<'a> Window<'a> {
pub fn label(&self) -> usize {
self.words[self.predict_pos]
}
}
impl Document {
pub fn window(&self, pos: usize, size: usize) -> Option<Window> {
let start = sub(pos, size);
let stop = usize::min(pos + size, self.words.len() - 1);
if stop - start == 2 * size {
Some(Window {words: &self.words[start as usize .. stop as usize], predict_pos: size})
} else {
None
}
}
}</code></pre>
<br />
<div>
The code above defines a document as a vector of word ids and allows shifting a sliding window over it. The return type is not a vector but a slice, which is convenient since nothing is copied. In the following, the center word in the window is called the label $l$ and the other words are called the context: $c_1 .. c_m$.<br />
<br /></div>
<h3>
Word Vectors</h3>
<div>
Now that we have defined the window, we can start implementing a representation for the word vectors. There will be one vector of $d$ dimensions or columns per word in the dictionary.<br />
In code we save the word vectors as a flat vector of doubles. We can then access the $i$-th vector as the slice $[i * cols .. (i + 1) * cols]$.<br />
<br />
<br />
<br /></div>
<div>
<pre class="language-rust"><code class="language-rust">#[derive(Serialize, Deserialize)]
pub struct ParameterStore {
pub cols: usize,
pub weights: Vec<f64>
}
impl ParameterStore {
pub fn seeded(rows: usize, cols: usize) -> ParameterStore {
let mut rng = rand::thread_rng();
let mut weights = Vec::new();
for _i in 0 .. (rows * cols) {
let uniform = (rng.gen_range(0.0, 1.0) - 0.5) / cols as f64;
weights.push(uniform);
}
ParameterStore {cols: cols, weights: weights}
}
pub fn at(&self, i: usize) -> &[f64] {
let from = i * self.cols;
let to = (i + 1) * self.cols;
&self.weights[from .. to]
}
pub fn update(&mut self, i: usize, v: &Vec<f64>) {
let from = i * self.cols;
for i in 0 .. self.cols {
self.weights[i + from] += v[i];
}
}
pub fn avg(&self, result: &mut Vec<f64>, indices: Vec<usize>) {
for col in 0 .. self.cols {
result[col] = 0.0;
for row in indices.iter() {
let from = row * self.cols;
result[col] += self.weights[col + from];
}
result[col] /= indices.len() as f64;
}
}
}
</code></pre>
We initialise each weight with a random number in the range $\frac{Uniform([-0.5;0.5])}{d}$, the standard initialisation for word2vec. The update method simply adds a gradient to a word vector. There is also a method to compute the average of multiple word vectors, which is needed later during learning.<br />
<br />
<h3>
Putting It All Together: Computing the Gradients</h3>
<div>
So now we can compute windows from the text and average the word vectors of the context words. </div>
<div>
The averaged vectors are our hidden layer's activations. From there we compute the prediction and then backpropagate the error. So we need two sets of parameters: one holds the word vectors, initialised as above, and one holds the prediction weights, initialised to zeros. </div>
<br />
<pre class="language-rust"><code class="language-rust">#[derive(Serialize, Deserialize)]
pub struct CBOW {
pub embed: ParameterStore,
pub predict: ParameterStore
}
</code></pre>
<br />
Given the average embedding for the context as $h$ and the vector for word $i$ from the predictions $w_i$, the forward pass on the neural network is:<br />
<br />
$a_i = \sigma(h * w_i)$.<br />
<br />
We are predicting the center word of the window, which means we model the output layer as a multinomial where the center word is one-hot encoded. So the output should be $1$ for the center word<br />
and $0$ for all others. The error for the center word is then:<br />
<br />
$e_i = -log(a_i)$<br />
<br />
and for all other words it is<br />
<br />
$e_i = -log(1 - a_i)$.<br />
<br />
The gradient on the loss is simply $\delta_i = (y - a_i) * \lambda$, with $\lambda$ being the learning rate.<br />
<br />
<pre class="language-rust"><code class="language-rust"> fn gradient(&self, label: usize, h: &Vec<f64>, truth: f64, rate: f64) -> (f64, f64) {
let w = self.predict.at(label);
let a = sigmoid(dot(&h, &w));
let d = (truth - a) * rate;
let e = -f64::ln(if truth == 1.0 {a} else {1.0 - a});
(d, e)
}
</code></pre>
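The `sigmoid` and `dot` helpers used above are not shown in the post; plausible std-only definitions would be:

```rust
// Logistic sigmoid, squashes any real number into (0, 1).
fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

// Inner product of two equally sized slices.
fn dot(x: &[f64], y: &[f64]) -> f64 {
    assert_eq!(x.len(), y.len());
    x.iter().zip(y).map(|(a, b)| a * b).sum()
}

fn main() {
    println!("{}", sigmoid(0.0));                  // 0.5
    println!("{}", dot(&[1.0, 2.0], &[3.0, 4.0])); // 11
}
```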
<br />
The gradient applied to the output parameters is: $\delta_i * h$.
The gradient for the embeddings is: $\delta_i * w_i$, which is applied to all context words.
So all words in the same context are updated with the same gradient, in turn making them more similar, which
was the main goal for word embeddings. There is one more detail:
normally, we would have to compute the losses over the complete dictionary, but for a million words that would take very long. Instead we use negative sampling: we compute the gradient for the center word as the positive example and sample a few other words as negative examples.<br />
The final computation for learning is shown below:
<br />
<pre class="language-rust"><code class="language-rust"> pub fn negative_sampeling(&mut self, window: &Window, rate: f64, n_samples: usize, unigrams: &Sampler) -> f64 {
let mut gradient_embed = vec![0.0; self.embed.cols];
let h = self.embedding(self.ids(window));
let (pos, pos_err) = self.gradient(window.label(), &h, 1.0, rate);
let mut error = pos_err;
let mut gradient_pos_predict = vec![0.0; self.predict.cols];
for i in 0 .. self.embed.cols {
gradient_embed[i] += pos * self.predict.at(window.label())[i];
gradient_pos_predict[i] += pos * h[i];
}
self.predict.update(window.label(), &gradient_pos_predict);
for _sample in 0 .. n_samples {
let token = unigrams.multinomial();
let (neg, neg_error) = self.gradient(token, &h, 0.0, rate);
error += neg_error;
let mut gradient_neg_predict = vec![0.0; self.predict.cols];
for i in 0 .. self.embed.cols {
gradient_embed[i] += neg * self.predict.at(token)[i];
gradient_neg_predict[i] += neg * h[i];
}
self.predict.update(token, &gradient_neg_predict);
}
let ids = self.ids(window);
for i in 0 .. ids.len() {
self.embed.update(ids[i], &gradient_embed);
}
error
}
</code></pre>
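The `Sampler` with its `multinomial()` method is also not shown above. Word2vec conventionally samples negatives from the unigram distribution raised to the power $0.75$. Below is a self-contained sketch using a cumulative table; the tiny linear congruential generator merely stands in for a real RNG (both choices are mine, not from the repository):

```rust
// Negative sampler over a unigram distribution, smoothed with the
// customary exponent 0.75 from the word2vec paper.
struct Sampler {
    cumulative: Vec<f64>, // normalised cumulative weights per word id
    state: u64,           // LCG state, stands in for a proper RNG
}

impl Sampler {
    fn new(counts: &[f64], seed: u64) -> Sampler {
        let weights: Vec<f64> = counts.iter().map(|c| c.powf(0.75)).collect();
        let total: f64 = weights.iter().sum();
        let mut acc = 0.0;
        let cumulative = weights.iter().map(|w| { acc += w / total; acc }).collect();
        Sampler { cumulative, state: seed }
    }

    // Draw one word id proportional to the smoothed unigram weights.
    fn multinomial(&mut self) -> usize {
        // advance the LCG and map the high bits to [0, 1)
        self.state = self.state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        let u = (self.state >> 11) as f64 / (1u64 << 53) as f64;
        self.cumulative
            .iter()
            .position(|&c| u < c)
            .unwrap_or(self.cumulative.len() - 1)
    }
}

fn main() {
    let mut sampler = Sampler::new(&[100.0, 10.0, 1.0], 42);
    let draws: Vec<usize> = (0..5).map(|_| sampler.multinomial()).collect();
    println!("{:?}", draws); // word ids; frequent words are drawn more often
}
```

The exponent flattens the distribution a little, so rare words are sampled as negatives more often than their raw frequency would suggest.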
</div>
<br />
<h3>
Results</h3>
<div>
I trained the net on Quora questions; searching for the vectors closest to a query word returns similar words:</div>
<div>
<br /></div>
<div>
<pre style="background-color: #f6f8fa; border-radius: 3px; box-sizing: border-box; color: #24292e; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 13.6px; line-height: 1.45; margin-bottom: 16px; overflow-wrap: normal; overflow: auto; padding: 16px;"><code style="background: transparent; border-radius: 3px; border: 0px; box-sizing: border-box; display: inline; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 13.6px; line-height: inherit; margin: 0px; overflow-wrap: normal; overflow: visible; padding: 0px; word-break: normal;">Searching: India
kashmir: 11.636631273392045
singapore: 12.155846288647625
kerala: 12.31677547739736
dubai: 12.597659271137639
syria: 12.630908898646624
pradesh: 12.657158343619573
pok: 12.667834993874745
malaysia: 12.708228117809442
taiwan: 12.75190112716203
afghanistan: 12.772290465531222
africa: 12.772372266369985
bjp: 12.802207718902666
karnataka: 12.817502941769622
europe: 12.825095372204384
</code></pre>
</div>
<div>
<code style="background: transparent; border-radius: 3px; border: 0px; box-sizing: border-box; display: inline; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 13.6px; line-height: inherit; margin: 0px; overflow-wrap: normal; overflow: visible; padding: 0px; word-break: normal;"><br /></code></div>
dkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.com0Bremen, Germany53.079296199999987 8.801693600000021452.774353199999986 8.15624660000002 53.384239199999989 9.4471406000000222tag:blogger.com,1999:blog-1040545417301646113.post-62946725397964777132019-01-10T15:58:00.000-08:002019-01-12T08:52:23.372-08:00Scala3D: Implementing A Wolfenstein Like Ray Casting Engine<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/9U5KTrdYfXI/0.jpg" frameborder="0" height="266" src="https://www.youtube.com/embed/9U5KTrdYfXI?feature=player_embedded" width="320"></iframe></div>
<h3>Introduction</h3>
<div class="separator" style="clear: both; text-align: left;">
I recently got interested in 3D game engines for realistic images. I am currently reading through<br />
a book describing a complete ray tracing algorithm [3]. However, since there is a lot of coding to do and a lot of complex geometry, I searched for a simpler side project. I then started to look into 2D<br />
ray casting to generate 3D environments, as used in the Wolfenstein 3D engine. There are several<br />
descriptions on how to build one [1,2]. However, the Wolfenstein black book [1] talks a lot about memory optimisations and execution speed; many chapters include assembly code and C snippets dealing with the old hardware. Implementing something like this is much easier now:<br />
computers are far faster and programming languages far more convenient. So I decided to<br />
build a clone in Scala. The result can be seen in the video above. The left side is the 3D view, and the right side shows a 2D
map view. The red lines are the rays from the ray casting algorithm which represent what the player sees and the
green line is the camera. The code can be found here <a href="https://github.com/dkohlsdorf/Scala3D"> [:github:]</a>. If you want to move in the map use "WASD".
</div><br/>
<h4>Assumptions About The World</h4>
<div class="separator" style="clear: both; text-align: left;">
The world in Wolfenstein 3D is represented by a flat grid. Different heights in the map are
not possible. All objects stand on the same plane and have to be boxes. This assumption makes
a lot of things easier. In a 3D engine, the main task is to determine what is seen from a specific
view (the player's camera). The nice thing is that we do not have to determine what is visible in any possible direction, because we are moving on a plane, which restricts us to a 2D problem. The second
assumption of a world on a grid will simplify the calculation of ray to box intersection later.
</div><br/>
<h4 style="clear: both;"> Implementing The Player And Camera</h4>
<div class="separator" style="clear: both; text-align: left;">
The basic idea is to describe the player in 2D by its position: $p = (p_x, p_y)$.
Furthermore, we save the direction the player is looking in: $d = (d_x, d_y)$.
We also save the image plane: $i = (i_x, i_y)$, which is perpendicular to the viewing direction.
<br/>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVApdPWX1p5egGfsqlxo0Rtmf7xdxzYoRLlb-WGFFJ9f3TB_m93lj2rX7ACqirHzTJZSWWvmAVex8yvMnR3aLf82a5sWCpTicW0nNeS8Udmnr94_Hgd3n1_SwU1ocpwBKOwKbHGwvqJ8g/s1600/Screenshot+2019-01-11+at+16.06.36.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVApdPWX1p5egGfsqlxo0Rtmf7xdxzYoRLlb-WGFFJ9f3TB_m93lj2rX7ACqirHzTJZSWWvmAVex8yvMnR3aLf82a5sWCpTicW0nNeS8Udmnr94_Hgd3n1_SwU1ocpwBKOwKbHGwvqJ8g/s400/Screenshot+2019-01-11+at+16.06.36.png" width="400" height="269" data-original-width="872" data-original-height="586" /></a></div>
<br/>
</div>
In the image above the image plane is marked in green, and the rays used for collision are marked in red. In order to
calculate the direction for each ray, we first calculate the x coordinate in camera space as $c_x = 2 * x / w - 1$, where $x$
is the column in the image and $w$ is the width of the image, so $c_x$ ranges over $[-1, 1]$. Then we compute the ray from the player that goes through that
position as:
<ul>
<li> $r_{dx} = d_x + i_x * c_x$ </li>
<li> $r_{dy} = d_y + i_y * c_x$ </li>
</ul>
Now we can follow each ray until we hit a wall.
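Those two formulas translate directly into code; sketched here in Rust rather than the engine's Scala, with a made-up direction and plane (note the conventional $-1$ shift that centers camera space on the view direction):

```rust
// Per-column ray direction for a raycaster: `dir` is where the player looks,
// `plane` is the camera plane perpendicular to `dir`. Numbers are made up.
fn ray_direction(x: usize, width: usize, dir: (f64, f64), plane: (f64, f64)) -> (f64, f64) {
    let camera_x = 2.0 * x as f64 / width as f64 - 1.0; // in [-1, 1]
    (dir.0 + plane.0 * camera_x, dir.1 + plane.1 * camera_x)
}

fn main() {
    let dir = (-1.0, 0.0);   // looking down negative x
    let plane = (0.0, 0.66); // roughly a 66 degree field of view
    // the center column's ray coincides with the view direction
    println!("{:?}", ray_direction(320, 640, dir, plane));
}
```

The ratio of plane length to direction length fixes the field of view, which is why $0.66$ is the customary choice for roughly 66 degrees.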
<br/>
<br/>
<h4> Digital Differential Analysis Ray Casting</h4>
<br/>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPzJBdkAz8KkmHYAcfc5T7XFnSpnWdu6FpT55R65Mmg-bLKPsbEhcwix6ITzhi7PdmRLYdusrd-z-0fml1s6ctq6hCIsToXKlQaQ6YhPdXvP4gp482BcLhvCHcVlOUkUn2VKqQobRn9IU/s1600/Screenshot+2019-01-12+at+17.17.31.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPzJBdkAz8KkmHYAcfc5T7XFnSpnWdu6FpT55R65Mmg-bLKPsbEhcwix6ITzhi7PdmRLYdusrd-z-0fml1s6ctq6hCIsToXKlQaQ6YhPdXvP4gp482BcLhvCHcVlOUkUn2VKqQobRn9IU/s400/Screenshot+2019-01-12+at+17.17.31.png" width="358" height="400" data-original-width="1330" data-original-height="1488" /></a></div>
<br/>
In order to find which wall we see at the image column $c_x$, we follow the ray until it hits a wall.
We initialize the search by calculating the distance until the ray first crosses a grid line along the x direction (in the image this is the segment $a$ to $b$)
and until it first crosses a grid line along the y direction (in the image this is the segment $a$ to $c$). From there we compute the step sizes $\delta = (\delta_x, \delta_y)$ in x and y, which tell
us how far we have to travel along the ray to cross each grid line again. We then step to the closest next grid position and check whether that position is a
wall hit.
<br/>
<pre class="brush:scala">
@tailrec
private def traceRay(
  grid: Vector2D[Int],
  sideDist: Vector2D[Double],
  side: Int,
  deltaDist: Vector2D[Double],
  step: Vector2D[Int],
  hit: Boolean,
): Hit =
  if (hit) Hit(grid, side)
  else sideDist match {
    case Vector2D(x, y) if x < y =>
      traceRay(
        Vector2D[Int](grid.x + step.x, grid.y),
        Vector2D[Double](sideDist.x + deltaDist.x, sideDist.y),
        0,
        deltaDist,
        step,
        world.collids(grid.x + step.x, grid.y)
      )
    case Vector2D(x, y) if x >= y =>
      traceRay(
        Vector2D[Int](grid.x, grid.y + step.y),
        Vector2D[Double](sideDist.x, sideDist.y + deltaDist.y),
        1,
        deltaDist,
        step,
        world.collids(grid.x, grid.y + step.y)
      )
  }
</pre>
In the implementation we move through the grid by the delta distance and the current side distance (the distance traveled along the $x$ and $y$ component). We also save which side ($x$ or $y$) of the wall we hit. In the recursion we follow the closest next map square until we
find a hit, and then return the position of the hit $h = (h_x, h_y)$ as well as the side.
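The same tail-recursive walk can be sketched iteratively in Python (illustrative only; plain tuples stand in for the `Vector2D`s and `is_wall` for `world.collids`):

```python
def trace_ray(grid, side_dist, delta_dist, step, is_wall):
    """Walk the map grid until is_wall(x, y) is true.

    grid:       (x, y) starting map square
    side_dist:  distance along the ray to the next x / y grid line
    delta_dist: distance along the ray between two x / y grid lines
    step:       (+1 or -1, +1 or -1) walking direction in the map
    Returns ((x, y), side), side = 0 for an x-side hit, 1 for a y-side hit.
    """
    (gx, gy), (sx, sy) = grid, side_dist
    while True:
        if sx < sy:            # the next x grid line is closer
            gx += step[0]
            sx += delta_dist[0]
            side = 0
        else:                  # the next y grid line is closer
            gy += step[1]
            sy += delta_dist[1]
            side = 1
        if is_wall(gx, gy):
            return (gx, gy), side
```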
<br/>
<br/>
<h4> Rendering A Scene </h4>
<br/>
So for each image column, we run the ray caster until we hit a wall. We then compute the projected distance to the wall $w_d$ in order to
calculate how high the wall appears in the image, using the player position, the hit position, the step size and the direction:
<br/>
<pre class="brush:scala">
private def projectedDistance(
  hit: Hit,
  pos: Vector2D[Double],
  step: Vector2D[Int],
  dir: Vector2D[Double],
): Double =
  if (hit.side == 0) (hit.pos.x - pos.x + (1 - step.x) / 2.0) / dir.x
  else (hit.pos.y - pos.y + (1 - step.y) / 2.0) / dir.y
</pre>
The height of the wall line is then simply $l_h = \frac{h}{w_d}$, where $h$ is the image height. We then compress the slice of the texture we hit to that height
and paint it into the image at the column position we sent the ray from.
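The projection step can be sketched as follows (a sketch; `screen_h` is the image height $h$, `dist` the projected distance $w_d$, and the exact clipping conventions vary):

```python
def wall_slice(screen_h, dist):
    """Pixel height of a wall column at projected distance dist, plus the
    start and end rows for drawing it centred on the horizon."""
    line_h = int(screen_h / dist) if dist > 0 else screen_h
    start = max(0, screen_h // 2 - line_h // 2)
    end = min(screen_h - 1, screen_h // 2 + line_h // 2)
    return line_h, start, end
```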
<h3>References</h3>
<div>[1] GAME ENGINE BLACK BOOK: WOLFENSTEIN 3D, 2ND EDITION, 2018</div>
<div>[2] <a href="https://lodev.org/cgtutor/raycasting.html">RayCasting Tutorial in CPP</a></div>
<div>[3] <a href="http://www.pbr-book.org/3ed-2018/contents.html">Physically Based Rendering: From Theory To Implementation</a></div>
<h3>Lindenmayer system: Implementation (2018-11-06)</h3><div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqmliwFWyU97hTDzQXppabjl6b1olOFRdJ0dR5EzVzxot86oppjgXTDY4sc9y1_KalireCS0P_S7UDOvafJr-MsgQYush6CxWKzDvhu1_uFRW9xP5840y48y8k0qsMWeFImm_6ulypVd4/s1600/Screen+Shot+2018-11-06+at+22.28.10.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="764" data-original-width="1600" height="304" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqmliwFWyU97hTDzQXppabjl6b1olOFRdJ0dR5EzVzxot86oppjgXTDY4sc9y1_KalireCS0P_S7UDOvafJr-MsgQYush6CxWKzDvhu1_uFRW9xP5840y48y8k0qsMWeFImm_6ulypVd4/s640/Screen+Shot+2018-11-06+at+22.28.10.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
I recently got interested in Lindenmayer systems (L-systems for short) and wrote some <a href="https://github.com/dkohlsdorf/LindenmayerSystem/">code</a>. These are generative systems that produce strings, which in turn can be interpreted as turtle graphics. For example, you can model Koch curves and the famous dragon curve. This <a href="http://algorithmicbotany.org/papers/abop/abop.pdf">book</a> describes how to generate plants like the ones you can see below.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVCBAngfBMxvLdbYBfDyBYSBdgCs1J-akz95k58T8MbqPJJ-SXSyvHIj3NUTauGFOxwtD49oDhgepQDw_jt9LakIhyiIwSf27dEU6AWYnn9drxfZ__DOImVi4XGJ7XP9ADzUDYEZEcMS0/s1600/Screen+Shot+2018-11-06+at+22.28.05.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1158" data-original-width="1506" height="246" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVCBAngfBMxvLdbYBfDyBYSBdgCs1J-akz95k58T8MbqPJJ-SXSyvHIj3NUTauGFOxwtD49oDhgepQDw_jt9LakIhyiIwSf27dEU6AWYnn9drxfZ__DOImVi4XGJ7XP9ADzUDYEZEcMS0/s320/Screen+Shot+2018-11-06+at+22.28.05.png" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<h3 style="clear: both; text-align: left;">
Implementation</h3>
<br />
For example, given an initial string $+A-Bf$, we have terminal symbols $\{+,-,f\}$ and non-terminal symbols $\{A, B\}$ that can still be replaced
by rules such as $A \rightarrow AfA$, which reads: every $A$ in the string can be replaced by $AfA$. Furthermore, the string $+A-Bf$ can be interpreted
by a turtle: $+$ means turning right, $-$ means turning left, every capital letter means moving forward while drawing a line, and every lower case letter means
moving silently to the next position. We start by defining a symbol class.
<br />
<pre class="brush:scala">case class Symbol(id: Char) {
  def isSilent: Boolean = id.isLower
  def isTerminal: Boolean = !id.isLetter || id.isLower
}
</pre>
</pre>
<br />
Next we define a Lindenmayer system as a start string and a set of rules.
Given the start string and a depth limit, we can recursively replace
all non-terminal symbols using the associated rules. This enables us
to generate strings.
<br />
<pre class="brush:scala">case class LSystem(start: List[Symbol], rules: Map[Symbol, List[Symbol]]) {

  private def produce(symbol: Symbol): List[Symbol] =
    if (symbol.isTerminal) List(symbol)
    else rules(symbol)

  @tailrec
  private def produce(current: List[Symbol], result: List[Symbol]): List[Symbol] = current match {
    case Nil => result
    case head :: tail =>
      produce(tail, result ::: produce(head))
  }

  @tailrec
  private def produce(current: List[Symbol], stepsLeft: Int): List[Symbol] =
    if (stepsLeft == 0) current
    else produce(produce(current, Nil), stepsLeft - 1)

  def produce(depth: Int): List[Symbol] = produce(start, depth)
}
</pre>
</pre>
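For illustration, the same production process can be sketched in Python on plain strings (a sketch; upper-case letters are treated as non-terminals and everything else as terminals, mirroring the `Symbol` class above):

```python
def produce(start, rules, depth):
    """Rewrite `start` `depth` times: every upper-case letter that has a
    rule is replaced by its right-hand side, everything else is kept."""
    current = start
    for _ in range(depth):
        current = ''.join(rules.get(c, c) if c.isupper() else c
                          for c in current)
    return current
```

With the rule $A \rightarrow AfA$ from above, two production steps on the start string $A$ give `AfAfAfA`.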
Now we are done with the definition of the grammar system. The only thing left is to define a turtle and
an interpreter that converts strings from L-systems into a sequence of turtles. I define a turtle
by its pose $[x, y, \alpha]$, where $x, y$ is the turtle's position and the orientation is defined
by the angle $\alpha$. For moving forward, the next position is defined as:
<br />
<ul>
<li> $x + \delta \cos(\alpha)$ </li>
<li> $y + \delta \sin(\alpha)$ </li>
</ul>
A rotation simply adds a fixed angle to $\alpha$.
<br />
<pre class="brush:scala">case class Turtle(x: Int, y: Int, angle: Double, sil: Boolean) {

  def move(distance: Int): Turtle =
    Turtle(
      math.round(x + distance * math.cos(math.toRadians(angle))).toInt,
      math.round(y + distance * math.sin(math.toRadians(angle))).toInt, angle, false)

  def moveSil(distance: Int): Turtle =
    Turtle(
      math.round(x + distance * math.cos(math.toRadians(angle))).toInt,
      math.round(y + distance * math.sin(math.toRadians(angle))).toInt, angle, true)

  def turn(by: Double): Turtle =
    Turtle(x, y, (angle + by) % 360, false)

  def silence: Turtle = Turtle(x, y, angle, true)
}
</pre>
Then given a string, we can recursively build a sequence of positions by
applying the action associated with each symbol.
<br />
<pre class="brush:scala">class TurtleInterpreter(moveBy: Int, turnBy: Double) {

  import TurtleInterpreter._

  private def execute(cmd: List[Symbol], state: Turtle, stack: List[Turtle], drawable: List[Turtle]): List[Turtle] = cmd match {
    case Nil => drawable
    case RotateRight :: tail => execute(tail, state.turn(turnBy), stack, drawable)
    case RotateLeft :: tail => execute(tail, state.turn(-turnBy), stack, drawable)
    case PushStack :: tail => execute(tail, state, state :: stack, drawable)
    case PopStack :: tail => execute(tail, stack.head.silence, stack.tail, stack.head.silence :: drawable)
    case head :: tail =>
      val next = if (head.isSilent) state.moveSil(moveBy) else state.move(moveBy)
      execute(tail, next, stack, next :: drawable)
  }

  def execute(cmd: List[Symbol], init: Turtle): List[Turtle] =
    execute(cmd, init, List.empty[Turtle], List(init))
}
</pre>
<pre class="brush:scala">object TurtleInterpreter {
  final val RotateRight = Symbol('+')
  final val RotateLeft = Symbol('-')
  final val PushStack = Symbol('[')
  final val PopStack = Symbol(']')
}
</pre>
With these definitions we can build grammars and paint pictures like the ones shown above. I wrote some code that
plots all the above images; it can be found on <a href="https://github.com/dkohlsdorf/LindenmayerSystem/">github</a>.
<h3>Exploring Super Resolution With Neural Networks: Blade Runner Style (2018-11-03)</h3>
In <a href="https://www.imdb.com/title/tt0083658/">Blade Runner</a> there is a scene in which Deckard (the protagonist) inspects one of the replicant Leon's photos.
The scene features an insane zoom, for which this <a href="https://typesetinthefuture.com/2016/06/19/bladerunner/">blog post</a> estimates a final magnification factor of <em style="border: 0px; box-sizing: border-box; font-family: inherit; font-size: 18px; font-weight: inherit; margin: 0px; outline: 0px; padding: 0px; vertical-align: baseline;">667.9</em>.<br />
<br />
While this is science fiction (obviously), there were some breakthroughs in super resolution recently.
Basically, we use machine learning models to upscale an image without introducing pixelated artefacts.<br />
I upscaled the same photo from the movie using publicly available <a href="https://raw.githubusercontent.com/alexjc/neural-enhance/master/enhance.py">code</a> built on the <a href="https://github.com/Lasagne/Lasagne">Lasagne</a> neural network framework.<br />
<br />
<div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjf2MXRx8iWsn16QV5TYL-ud6VAjKgVPAeDis9Rg9hdfYYJoCK7cA25BNK2ZCj4gBufQcPOWk0kBHle5emiXYDTDfiSz04A5wwoySM0TkaBKmnrSj_S6Ljs8IZ0nAybCupMPMYTJPxXDTo/s1600/Screen+Shot+2018-11-02+at+18.12.56.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1020" data-original-width="1600" height="255" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjf2MXRx8iWsn16QV5TYL-ud6VAjKgVPAeDis9Rg9hdfYYJoCK7cA25BNK2ZCj4gBufQcPOWk0kBHle5emiXYDTDfiSz04A5wwoySM0TkaBKmnrSj_S6Ljs8IZ0nAybCupMPMYTJPxXDTo/s400/Screen+Shot+2018-11-02+at+18.12.56.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The top image shows a reflection in a cabinet that is itself a reflection in the round mirror.<br />
The bottom image shows the first zoom in Deckard performs.</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Below are the original image (top) and the result (bottom) I got when running the 4x upscaling neural net on the image from Blade Runner. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDQu7MxOtxbjOkg9TgO8c0F11qgNXmymY4A7nINkuU0WknNSzMS9wyM0uuB057oSSVixakjtSse5ITxmdVzKofrs5bWY6iCwBTx3iQr9qbNR-aTgD9GpWGwgZVQ-QqVW-TrYIx_A1JtEg/s1600/leons_photo.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="380" data-original-width="500" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDQu7MxOtxbjOkg9TgO8c0F11qgNXmymY4A7nINkuU0WknNSzMS9wyM0uuB057oSSVixakjtSse5ITxmdVzKofrs5bWY6iCwBTx3iQr9qbNR-aTgD9GpWGwgZVQ-QqVW-TrYIx_A1JtEg/s1600/leons_photo.jpg" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMa3nciUdzYYAus7PdySjnYmY_LJEocPeC7wQehYQwUxzkI6wLoAoKzr3rxlZzD2wTm7Dw9LrefCSxif6PxXdp9amb6Vhy-HfqTjKYtwYLWvVZ2RtfM9LZyBXCrnk9r2x6Jb0_5u5JS6g/s1600/result4x.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1216" data-original-width="1600" height="484" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMa3nciUdzYYAus7PdySjnYmY_LJEocPeC7wQehYQwUxzkI6wLoAoKzr3rxlZzD2wTm7Dw9LrefCSxif6PxXdp9amb6Vhy-HfqTjKYtwYLWvVZ2RtfM9LZyBXCrnk9r2x6Jb0_5u5JS6g/s640/result4x.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
If you zoom into the mirror in the reconstructed image (open it in a new tab to see it at its original resolution) and use your imagination, I believe you can see a face in it. I do not believe</div>
<div class="separator" style="clear: both; text-align: left;">
it is actually there; rather, it is a reconstruction artefact. But who knows ... </div>
</div>
<div>
<br /></div>
<h3>Fast Threshold Estimation In Decision Trees (2018-10-08)</h3>
In decision trees and related methods such as random forests and gradient boosting, we need<br />
to find thresholds in order to split continuous attributes. Finding the optimal threshold on a feature using information gain can be computationally expensive, but the operation can be performed fast with the histogram method described in [1].<br />
<br />
<h3>
Algorithm Sketch</h3>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjEYqImaQ1ZpfnPlb8s4DjbQDNAzwUIsZYMvLraVvoRoTMkuVXOSMfQM6oNymXaV7yvEvhYGqCzjm5uJhzeUB64FQGJtasdFxX4hx4uFOsLeUkJx-COOrCcmEkEyq-ZsS93XHa7IhovkbA/s1600/Screen+Shot+2018-10-08+at+18.33.12.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="914" data-original-width="1552" height="235" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjEYqImaQ1ZpfnPlb8s4DjbQDNAzwUIsZYMvLraVvoRoTMkuVXOSMfQM6oNymXaV7yvEvhYGqCzjm5uJhzeUB64FQGJtasdFxX4hx4uFOsLeUkJx-COOrCcmEkEyq-ZsS93XHa7IhovkbA/s400/Screen+Shot+2018-10-08+at+18.33.12.png" width="400" /></a></div>
Basically, we build a table in which we bin each value of a certain attribute and store the number of occurrences of each label in said bin. Each bin boundary represents a threshold we could apply. Once everything is indexed, we compute the total number of entries for each class. In the example, class one occurs 33 times in total and class two 96 times. We then scan from the lowest threshold upwards, keeping a running count of the values below the current one. For each threshold we can then easily compute the number of instances of each class below and<br />
above it, and use these numbers to compute the information gain.<br />
<br />
<h3>
Implementation</h3>
</div>
<div>
First we need to be able to find a bin given a sorted list of thresholds and an attribute value.
One way is an adjusted binary search: we check the center bin of the current range; if it is
our bin, we are done, otherwise we continue the search to the left or the right.
</div>
<pre class="brush:python">
def bisection(val, th, start, stop):
    # Binary search for the bin of val in the sorted threshold list th,
    # restricted to the index range [start, stop).
    center = int((stop - start) / 2 + start)
    n = len(th) - 1
    if center <= 0:
        return 0
    elif center >= n:
        return n
    elif val > th[center - 1] and val <= th[center]:
        return center
    elif val < th[center]:
        return bisection(val, th, start, center)
    else:
        return bisection(val, th, center, stop)
</pre>
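For interior values, this search behaves like Python's standard library `bisect`; a compact equivalent can be sketched as follows (clamping to the outermost bins at the boundaries):

```python
from bisect import bisect_left

def find_bin(val, thresholds):
    """Index of the bin (thresholds[i-1], thresholds[i]] containing val,
    clamped to the first and last bin at the boundaries."""
    n = len(thresholds) - 1
    return max(0, min(n, bisect_left(thresholds, val)))
```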
<div>
The other thing we need is an entropy calculation. Since we are talking about a classification problem, we will implement
the discrete entropy calculation.
</div>
<pre class="brush:python">
import numpy as np
from math import log2

def entropy(x):
    # Discrete entropy (in bits) of a vector of class counts.
    scaler = x.sum()
    if scaler == 0:
        return 0.0
    else:
        logScaler = log2(scaler)
        h = 0
        for i in range(0, len(x)):
            if x[i] > 0:
                p = x[i] / scaler
                log_p = np.log2(x[i]) - logScaler
                h += p * log_p
        return -h
</pre>
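As a quick sanity check, a vectorized variant of the same computation (a standalone sketch) should give $\log_2 k$ for uniform counts over $k$ classes and $0$ for a pure node:

```python
import numpy as np

def entropy_bits(counts):
    """Discrete entropy in bits of a vector of class counts."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    if total == 0:
        return 0.0
    p = counts[counts > 0] / total
    return float(-(p * np.log2(p)).sum())
```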
<div>
Finally, we can implement the algorithm sketch from above.
</div>
<pre class="brush:python">
class LabelHistogram:

    def __init__(self, thresholds, n_classes):
        self.thresholds = thresholds
        self.n_bins = len(thresholds)
        self.n_classes = n_classes
        self.bins = np.zeros((self.n_bins, n_classes))

    def insert(self, attr, label):
        i = bisection(attr, self.thresholds, 0, self.n_bins)
        self.bins[i, label] += 1

    def gain(self, parent_gain):
        # Class counts over the complete histogram.
        total = np.sum(self.bins, axis=0)
        cur = np.zeros(self.n_classes)
        max_gain = float('-inf')
        max_th = None
        for i in range(0, self.n_bins):
            # Running class counts below (left of) the current threshold.
            cur += self.bins[i]
            left = cur
            right = total - cur
            cleft = left.sum()
            cright = right.sum()
            n = cleft + cright
            gain = parent_gain - ((cleft / n) * entropy(left) + (cright / n) * entropy(right))
            if gain > max_gain:
                max_gain = gain
                max_th = self.thresholds[i]
        return (max_gain, max_th)
</pre>
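To see the split score in isolation, here is a standalone sketch that computes the information gain of a single candidate threshold directly with numpy (illustrative, not the class above):

```python
import numpy as np

def entropy_bits(counts):
    # Discrete entropy in bits of a vector of class counts.
    total = counts.sum()
    if total == 0:
        return 0.0
    p = counts[counts > 0] / total
    return float(-(p * np.log2(p)).sum())

def split_gain(values, labels, threshold, n_classes=2):
    """Information gain of splitting the data at values <= threshold."""
    left = np.bincount(labels[values <= threshold], minlength=n_classes).astype(float)
    right = np.bincount(labels[values > threshold], minlength=n_classes).astype(float)
    n = left.sum() + right.sum()
    parent = entropy_bits(left + right)
    return parent - (left.sum() / n) * entropy_bits(left) \
                  - (right.sum() / n) * entropy_bits(right)
```

A perfect split yields the full parent entropy as gain; a split that separates nothing yields zero.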
<div>
<br /></div>
<div>
<br /></div>
<h3>
References</h3>
<div>
[1] Criminisi, Shotton: "Decision Forests for Computer Vision and Medical Image Analysis", Springer, 2013</div>
<h3>Simple Gaussian Processes For Regression (2018-07-28)</h3>
In this post I am implementing a simple Gaussian Process Regressor. A Gaussian Process is a<br />
probabilistic kernel model. Basically I am following the "<a href="https://mitpress.mit.edu/books/machine-learning-1">Machine Learning: A probabilistic Perspective</a>" book by Kevin Murphy.<br />
<br />
<h3>
The Basics</h3>
For an input $x_i$ and output $y_i = f(x_i)$, we infer a distribution over functions given the data:<br />
$p(f | X, y)$. In order to make a prediction $y_*$ for a new data point $x_*$ given all previous observations, we define:<br />
<br />
$p(y_*|x_*, X, y) = \int p(y_*|f, x_*) p(f|X, y) df$<br />
<br />
For our data $X = x_1 ... x_n$, the Gaussian process assumption is that the distribution over functions $p(f_1 ... f_N)$ is jointly Gaussian with mean function $\mu(x)$ and a covariance matrix $\Sigma_{ij} = k(x_i, x_j)$ given by a kernel $k$.<br />
<h3>
Covariance structure</h3>
<div>
It is quite common to leave the mean function at zero. Given a kernel, we need to compute the covariance function between all training and test instances. We structure the kernel matrix in the following way. </div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgLUF2jS1k7dAyv0ZI7L4qBakI6NT3XeJQ1mxq3g5raNMafa8UeM7Me7JBgoM63SV11X5SzExGHVpflFh_R_HPkSfF3aVcCc_rXCj5C4K6UGgYNPA2_l7zRuG8_qSm7hV8lACvXr33PlBU/s1600/kernel_matrix.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="852" data-original-width="924" height="295" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgLUF2jS1k7dAyv0ZI7L4qBakI6NT3XeJQ1mxq3g5raNMafa8UeM7Me7JBgoM63SV11X5SzExGHVpflFh_R_HPkSfF3aVcCc_rXCj5C4K6UGgYNPA2_l7zRuG8_qSm7hV8lACvXr33PlBU/s320/kernel_matrix.png" width="320" /></a></div>
<br />
<br />
The sub-matrix $K$ describes the affinity between all training instances, the sub-matrix $K_*$ describes the affinity between training and test instances, and the sub-matrix $K_{**}$ describes the affinity of the test instances to themselves. The image above shows the final covariance structure. With that, our function values $f_*$ can be inferred from $(f, f_*) \sim N(0, \Sigma)$.<br />
<br />
In order to obtain the predictions we calculate: $f_* = K_* \bullet K^{-1} \bullet f$.<br />
If we want the variance of the predictions, we take the diagonal of $K_{**} - K_* \bullet K^{-1} \bullet K_*^T$.<br />
<br />
In my opinion the method is very similar to support vector machines and to instance-based learning methods in general. The difference is that an SVM only uses a small set
of training examples called support vectors, while the Gaussian process compares every test example to all training examples in order to make a prediction.
<br />
<h3>
A Naive Gaussian Process</h3>
Below I implemented a naive Gaussian process using numpy.
Basically, I build the three components of the covariance
structure and then solve the prediction equations above.
I call my method naive since computing the inverse
has a worst-case complexity of $O(n^3)$,<br />
so it is not advisable to use this method for larger datasets.
<br />
<pre class="brush:python">class GaussianProcess:

    def __init__(self, kernel):
        self.kernel = kernel

    def predict(self, train_x, train_y, test_x):
        (n, _) = train_x.shape
        (m, _) = test_x.shape
        # Build the three blocks of the covariance structure.
        covar_train = np.zeros((n, n))
        covar_train_test = np.zeros((n, m))
        covar_test = np.zeros((m, m))
        for i in range(0, n):
            for j in range(0, n):
                covar_train[i, j] = self.kernel.k(train_x[i], train_x[j], i == j)
        for i in range(0, m):
            for j in range(0, m):
                covar_test[i, j] = self.kernel.k(test_x[i], test_x[j], i == j)
        for i in range(0, n):
            for j in range(0, m):
                covar_train_test[i, j] = self.kernel.k(train_x[i], test_x[j], False)
        # Posterior mean and per-point standard deviation.
        covar_inv = np.linalg.inv(covar_train)
        prediction = np.dot(np.dot(covar_train_test.T, covar_inv), train_y)
        confidence = covar_test - np.dot(covar_train_test.T, np.dot(covar_inv, covar_train_test))
        confidence = np.sqrt(np.array([confidence[i, i] for i in range(0, m)]))
        return prediction, confidence
</pre>
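One common refinement (a sketch, not part of the code above) is to replace the explicit inverse with `np.linalg.solve`, which avoids forming $K^{-1}$ and is numerically safer; a small jitter term on the diagonal keeps the system well conditioned:

```python
import numpy as np

def gp_predict(K, K_star, K_star_star, y, noise=1e-6):
    """GP posterior mean and standard deviation without an explicit inverse.

    K:           (n, n) train/train covariance
    K_star:      (n, m) train/test covariance
    K_star_star: (m, m) test/test covariance
    y:           (n,)   training targets
    """
    A = K + noise * np.eye(K.shape[0])   # jitter keeps A well conditioned
    alpha = np.linalg.solve(A, y)        # solves A alpha = y, i.e. K^-1 y
    mean = K_star.T @ alpha
    V = np.linalg.solve(A, K_star)       # K^-1 K_*
    var = np.diag(K_star_star - K_star.T @ V)
    return mean, np.sqrt(np.maximum(var, 0.0))
```

For even larger problems, a Cholesky factorization of $A$ can be reused for both solves.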
<br />
<br />
<h3>
Squared Exponential Kernel</h3>
For my experiments I use a squared exponential kernel, and I also add a variance parameter
for noisy observations:
<br />
<br />
$k(x_i, x_j) = \sigma_f^2 e^{-\frac{1}{2l^2} || x_i - x_j ||_2^2} + \sigma_y^2 \mathbb{I}(i == j)$
<br />
<br />
The parameters are as follows:
<br />
<ul>
<li> $\sigma_f^2$: Variance giving the scale of the output</li>
<li> $\sigma_y^2 \mathbb{I}(i == j)$: Noise variance applied to the diagonal</li>
<li> $l$: Scale on input </li>
</ul>
Again, the implementation is quite simple in numpy.
<br />
<pre class="brush:python">class SquaredExponential:

    def __init__(self, variance, alpha, noise):
        self.alpha = alpha        # length scale l
        self.variance = variance  # output scale sigma_f^2
        self.noise = noise        # noise variance sigma_y^2
        self.scaler = -1.0 / (2 * alpha**2)

    def k(self, a, b, on_diag):
        error = np.sum(np.square(a - b))
        f = np.exp(self.scaler * error)
        scaled = self.variance * f
        if on_diag:
            # Add observation noise on the diagonal.
            scaled += self.noise
        return scaled
</pre>
<br />
<br />
<h3>
Results on some simple example datasets</h3>
I tried my method on three example datasets: twice I am interpolating
stock market data, and once a sine wave. Each time, the index
is the input to the method; for the stock market datasets the output
is the closing price.
<br />
Sine Wave Interpolation:<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXRl0nYWbbOb3rZIG4IluBr8b9UIs4R84Hc6-TVFi_tR7qAojH51VdSliSqeuw598iXLe0QcwXeA6vm6E7lWRUiFTpVKNc2oLSOCOl9FodlSmBQmvUlcQOPhDFKIV23JkX0ENuveIWSeA/s1600/sine.png" imageanchor="1"><img border="0" data-original-height="288" data-original-width="432" height="213" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXRl0nYWbbOb3rZIG4IluBr8b9UIs4R84Hc6-TVFi_tR7qAojH51VdSliSqeuw598iXLe0QcwXeA6vm6E7lWRUiFTpVKNc2oLSOCOl9FodlSmBQmvUlcQOPhDFKIV23JkX0ENuveIWSeA/s320/sine.png" width="320" /></a> <br />
Stock Interpolation Dax:
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmeTHeJqo9r-NSSasfYDNY7G8_TfFzv-rGNgBdG68rNObVCAyW61Pkidi-32O6LjTG4KZMlY_K2gldbyOzGi7x0VO7w_LWZllAS1PzTdoBmy4YAg8H9KXvAhGSk5mTtBGEwHp2ocF2xGA/s1600/dax.png" imageanchor="1"><img border="0" data-original-height="288" data-original-width="432" height="213" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmeTHeJqo9r-NSSasfYDNY7G8_TfFzv-rGNgBdG68rNObVCAyW61Pkidi-32O6LjTG4KZMlY_K2gldbyOzGi7x0VO7w_LWZllAS1PzTdoBmy4YAg8H9KXvAhGSk5mTtBGEwHp2ocF2xGA/s320/dax.png" width="320" /></a> <br />
Stock Interpolation US:
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTRi2PoeRi_G9KtMtWQzFjz8S3QH3n5AELhvAIJNP6ozbpimwHSdbQRCPPMDwrq42SzmDieckvBVJB8QDUoCIZzV8ALGrLlElUIdP4NFuNuXwQgibXW-fR0rHQgvoBEYOfRyY2pdJfopw/s1600/us.png" imageanchor="1"><img border="0" data-original-height="288" data-original-width="432" height="213" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTRi2PoeRi_G9KtMtWQzFjz8S3QH3n5AELhvAIJNP6ozbpimwHSdbQRCPPMDwrq42SzmDieckvBVJB8QDUoCIZzV8ALGrLlElUIdP4NFuNuXwQgibXW-fR0rHQgvoBEYOfRyY2pdJfopw/s320/us.png" width="320" /></a> dkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.com0Bremen, Germany53.079296199999987 8.801693600000021452.774353199999986 8.15624660000002 53.384239199999989 9.4471406000000222tag:blogger.com,1999:blog-1040545417301646113.post-41559052384181117222018-07-21T03:40:00.000-07:002018-07-21T11:00:12.635-07:00Financial Machine Learning ExperiementAs a fun project, I tried to predict the stock market and implemented a small library todo so.<br />
Basically, I use a sliding window of historic prices of several companies or funds and<br />
predict the price of a single company. In the end, I achieve decent results with a 10-day window,<br />
predicting 5 days into the future. However, my error is still several euros large :)<br />
<br />
The code can be found here: <a href="https://github.com/dkohlsdorf/StockPrediction">[:Github:]</a><br />
<h3>
Learning Problem, Features and Modeling</h3>
<div>
First we define the return on investment (ROI) as:</div>
<div>
<br /></div>
<div>
$roi(t, t + 1) = \frac{x_{t + 1} - x_t}{x_t} $</div>
<div>
<br /></div>
<div>
This is the percentage of earnings at time $t + 1$ measured against a previous investment
at $t$. In order to extract the features, we compute sliding windows over our historic prices.
For each sample in a window, we compute the ROI relative to the start of the window, which represents
the earnings since the start of the window. For a window of $T$ steps and $n$ stocks, we flatten the sliding windows and get a feature vector of size $T \times n$ of ROI entries. The target variable we want
to predict is the stock price $d$ days after the last day of the window. The target is converted
to an ROI, too. So we try to predict the earnings $d$ days after the last day of the window,
which represents a potential investment.</div>
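The windowing described above can be sketched for a single stock as follows (illustrative; `win` and `horizon` correspond to $T$ and $d$):

```python
import numpy as np

def roi_windows(prices, win, horizon):
    """Sliding-window ROI features for a single price series.

    Each feature row holds the ROI of every day in a length-`win` window
    relative to the window's first day; the target is the ROI of the day
    `horizon` days after the window's last day.
    """
    X, y = [], []
    for t in range(len(prices) - win - horizon + 1):
        base = prices[t]
        X.append((prices[t:t + win] - base) / base)
        y.append((prices[t + win + horizon - 1] - base) / base)
    return np.array(X), np.array(y)
```

For several stocks, the per-stock feature rows are concatenated into one vector of size $T \times n$, as described above.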
<div>
<br /></div>
<div>
<br /></div>
<div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjp320r_Q7ynirazmGRvHexPrHX1iGUy5sl4Rw4Bu3neo_qojyZn3o9uKbXcC8JjW4h7OKmJn17csuhBDmr1RrTJaJynRKZRExmlJeyZr50eQQTFQiQibPKRbAIkmL26hykjV27WANUbC0/s1600/features.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="449" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjp320r_Q7ynirazmGRvHexPrHX1iGUy5sl4Rw4Bu3neo_qojyZn3o9uKbXcC8JjW4h7OKmJn17csuhBDmr1RrTJaJynRKZRExmlJeyZr50eQQTFQiQibPKRbAIkmL26hykjV27WANUbC0/s640/features.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Red: roi in the window, Blue: ROI for prediction</td></tr>
</tbody></table>
<br /></div>
<h3>
Some Experimental Results</h3>
<div>
First we download several historic price datasets from yahoo finance and load them using pandas.
We take the date column as the index and interpolate missing values:
<br />
<pre class="brush:python">
data = [
    ('euroStoxx50', pd.read_csv('data/stoxx50e.csv', index_col=0, na_values='null').interpolate('linear')),
    ('dax', pd.read_csv('data/EL4A.F.csv', index_col=0, na_values='null').interpolate('linear')),
    ('us', pd.read_csv('data/EL4Z.F.csv', index_col=0, na_values='null').interpolate('linear')),
    ('xing', pd.read_csv('data/O1BC.F.csv', index_col=0, na_values='null').interpolate('linear')),
    ('google', pd.read_csv('data/GOOGL.csv', index_col=0, na_values='null').interpolate('linear')),
    ('facebook', pd.read_csv('data/FB2A.DE.csv', index_col=0, na_values='null').interpolate('linear')),
    ('amazon', pd.read_csv('data/AMZN.csv', index_col=0, na_values='null').interpolate('linear'))
]
</pre>
<br />
We fit the following regressors on our data.
<br />
<pre class="brush:python">
predictors = [
    ('RF', RandomForestRegressor(n_estimators=250)),
    ('GP', GaussianProcessRegressor(kernel=RBF(length_scale=2.5))),
    ('NN', KNeighborsRegressor(n_neighbors=80)),
    ('NE', KerasPredictor(model, 10, 512, False)),
    ('GB', GradientBoostingRegressor(n_estimators=250))
]
</pre>
<br />
The neural network's architecture (named model above):
<br />
<pre class="brush:python">
hidden = [256, 128, 64, 32]
inp = (len(data.stoxx) * WIN,)
model = Sequential()
model.add(Dense(hidden[0], activation='relu', input_shape=inp))
for h in hidden[1:]:
    model.add(Dense(h, activation='relu'))
model.add(Dense(1, activation='linear'))
model.compile('adam', 'mse')
</pre>
<br />
Below we show some prediction results for our regressors.<br />
We also note the root mean square error in euros. </div>
<div>
<br /></div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMN3v_tlK-hscEPtejOABtTB_8aRW6uqCSW7jVeKUTPCEYZPiTTvTVbFyoHRzVziO6uw9WRK-yr9ds3B6-L-nUKQHoW1Tfux_vPrxKQS39TsDNW6Q1p0wQ9OybukEZnJ78F7ZiXTv9WTw/s1600/amazon.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMN3v_tlK-hscEPtejOABtTB_8aRW6uqCSW7jVeKUTPCEYZPiTTvTVbFyoHRzVziO6uw9WRK-yr9ds3B6-L-nUKQHoW1Tfux_vPrxKQS39TsDNW6Q1p0wQ9OybukEZnJ78F7ZiXTv9WTw/s640/amazon.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNZ9u8Vvxk54vc2m90oVBRtdSBJ_RuwZ3Vrx4uxb_MkMWd3SM6wBs-c8wLCHq9bZLtkUb81CpWnYNrGFV0kn7nQJzoZnax3sn5X-F13UPOyLshb10biL3eSd3Oqsa5yHgkYRQwetlYD7w/s1600/dax.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNZ9u8Vvxk54vc2m90oVBRtdSBJ_RuwZ3Vrx4uxb_MkMWd3SM6wBs-c8wLCHq9bZLtkUb81CpWnYNrGFV0kn7nQJzoZnax3sn5X-F13UPOyLshb10biL3eSd3Oqsa5yHgkYRQwetlYD7w/s640/dax.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhLXVhPLsjPmeTjDUlbjRptIposrbPkyNeTOBw11xfFp8MN5Q-EPZFTiVOtsGr_y1g80E3Ub43UPGt1xRGGFf2ccR3ihrje9Pkrc_Qot91zHM5dsOGLG_SWW_FY5VhaDJNT1B59WFV9Pfo/s1600/euroStoxx50.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhLXVhPLsjPmeTjDUlbjRptIposrbPkyNeTOBw11xfFp8MN5Q-EPZFTiVOtsGr_y1g80E3Ub43UPGt1xRGGFf2ccR3ihrje9Pkrc_Qot91zHM5dsOGLG_SWW_FY5VhaDJNT1B59WFV9Pfo/s640/euroStoxx50.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiG5M7688CqTNFeMlsJ7WDaAC7qWXRAAIl5I0P8h3mnNJaf_SP1-_L6iKEiyhvdv3XJx0EQffQ3h-4HUYsnWjEvetOACFLJH2tMq2JizBd6CzjOeGxfrdrOO0Zm0omuYeXEYeauX1mHWK0/s1600/facebook.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiG5M7688CqTNFeMlsJ7WDaAC7qWXRAAIl5I0P8h3mnNJaf_SP1-_L6iKEiyhvdv3XJx0EQffQ3h-4HUYsnWjEvetOACFLJH2tMq2JizBd6CzjOeGxfrdrOO0Zm0omuYeXEYeauX1mHWK0/s640/facebook.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPM7Xnp2ttVj8U4-tbV69uBP-X1TW8ZehBfHQ89Z-NpXLRmiHBeCP_ktVG596Xnxkqg4sIRVgvxojGf2E7xa1oI9Yt0NrpQ-MryW4Xuw5Q1_95K3R5ZlCN0cOGwPFvuDi9mR4xRVbe2U4/s1600/google.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPM7Xnp2ttVj8U4-tbV69uBP-X1TW8ZehBfHQ89Z-NpXLRmiHBeCP_ktVG596Xnxkqg4sIRVgvxojGf2E7xa1oI9Yt0NrpQ-MryW4Xuw5Q1_95K3R5ZlCN0cOGwPFvuDi9mR4xRVbe2U4/s640/google.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEUmxVrcukkH-NUQ4RQ8gNwd4dTgKcMAqCgAa4BOGo2ZRs6J9Lf_la_P3Q0X9jO9cJOlaHcqTFQjqJdsSx4VHMWUlABIB8ZBUvQSwNXDDp-mHhCpADpAVgWmnVu5mOI34UYO-XTdy7mwo/s1600/us.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEUmxVrcukkH-NUQ4RQ8gNwd4dTgKcMAqCgAa4BOGo2ZRs6J9Lf_la_P3Q0X9jO9cJOlaHcqTFQjqJdsSx4VHMWUlABIB8ZBUvQSwNXDDp-mHhCpADpAVgWmnVu5mOI34UYO-XTdy7mwo/s640/us.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgF_I1XHXKEhDLoyMiHpryHEKNs0i5sFypAq7d8HA6KpyKxpldBvcNLiWT69BnKK1ukGYComlhn-wpcPq6SKx-_VwqFiMy6FE9jYX8I98HpFeTmlUg4keQxgTur6qVh7k_haPJYnxoFiw/s1600/xing.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgF_I1XHXKEhDLoyMiHpryHEKNs0i5sFypAq7d8HA6KpyKxpldBvcNLiWT69BnKK1ukGYComlhn-wpcPq6SKx-_VwqFiMy6FE9jYX8I98HpFeTmlUg4keQxgTur6qVh7k_haPJYnxoFiw/s640/xing.png" width="640" /></a></div>
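As a hedged sketch (not the original evaluation code), combining the regressors' predictions by simple averaging and reporting the RMSE in euros could look like this; the prediction arrays below are made up for illustration:

```python
import numpy as np

def rmse(truth, prediction):
    # root mean square error, in the same unit as the target (here: euros)
    return np.sqrt(np.mean((truth - prediction) ** 2))

def ensemble_prediction(predictions):
    # simple unweighted average over the individual regressors
    return np.mean(np.stack(predictions), axis=0)

# toy example with made-up predictions from three regressors
truth = np.array([10.0, 12.0, 11.0])
preds = [np.array([9.0, 12.5, 11.0]),
         np.array([10.5, 11.5, 10.5]),
         np.array([10.0, 12.0, 12.0])]
combined = ensemble_prediction(preds)
```

Averaging often lowers the error compared to any single member, which is one motivation for using an ensemble here.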
dkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.com0Bremen, Germany53.079296199999987 8.801693600000021452.774353199999986 8.15624660000002 53.384239199999989 9.4471406000000222tag:blogger.com,1999:blog-1040545417301646113.post-63053145267258825082018-06-25T06:16:00.000-07:002018-06-25T06:16:42.140-07:00Keras: Convolutional LSTMStacking recurrent layers on top of convolutional layers can be used to generate sequential output (like text) from structured input (like images or audio) [<a href="https://arxiv.org/pdf/1411.4389.pdf">1</a>].<br />
<br />
If we want to stack an LSTM on top of convolutional layers, we can simply do so, but we need to<br />
reshape the output of the convolutions to the LSTM's expected input. The code below shows an implementation in Keras:
<br />
<pre class="brush:python">
from keras.layers import Input, Conv2D, MaxPooling2D, Reshape, LSTM
from keras.models import Model

T = 128
D = 64
N_FILTERS = 32
LSTM_T = int(T / 8)
LSTM_D = int(D / 2)
LSTM_STATE = 128
POOL = (8, 2)
i = Input(shape=(T, D, 1)) # (None, 128, 64, 1)
x = Conv2D(N_FILTERS, (3, 3), padding = 'same')(i) # (None, 128, 64, 32)
x = MaxPooling2D(pool_size=POOL)(x) # (None, 16, 32, 32)
x = Reshape((LSTM_T, LSTM_D * N_FILTERS))(x) # (None, 16, 1024)
x = LSTM(LSTM_STATE, return_sequences=True)(x) # (None, 16, 128)
model = Model(i, x)
model.summary()
</pre>
<br />
<br />
In this example we want to learn the convolutional LSTM on sequences of length 128 with 64-dimensional samples. The first layer is a convolutional layer with 32 filters. So the outputs are 32<br />
sequences, one for each filter. We pool the sequences with a (8, 2) window, so our 32 sequences are now of size (128 / 8 = 16, 64 / 2 = 32). Now we have to combine the feature dimensions and the filter responses into a single dimension of size (32 * 32 = 1024) so we can feed a sequence into the LSTM, which requires a rank-2 (or rank-3 with batch) tensor with the first dimension being the time step and the second each frame. Finally we add the LSTM layer.<br />
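The dimension arithmetic above can be verified with a quick sanity check, independent of Keras:

```python
T, D, N_FILTERS = 128, 64, 32
POOL = (8, 2)

# after pooling, the time and feature axes shrink by the pool factors
pooled_t = T // POOL[0]   # 128 / 8 = 16
pooled_d = D // POOL[1]   # 64 / 2 = 32

# the Reshape merges the feature axis and the filter axis into one
lstm_input_dim = pooled_d * N_FILTERS   # 32 * 32 = 1024
```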
<br />dkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.com0Bremen, Germany53.079296199999987 8.801693600000021452.774353199999986 8.15624660000002 53.384239199999989 9.4471406000000222tag:blogger.com,1999:blog-1040545417301646113.post-53355405376678468182018-06-13T07:10:00.002-07:002018-06-13T07:17:59.934-07:00Character Wise Generation From The Little Book Of CalmIf you saw the show <i>black books</i>, you saw the calming powers of the little book of calm.<br />
Surprisingly the <a href="https://www.amazon.de/Little-Book-Calm-Paul-Wilson-ebook/dp/B00APNQ114/ref=tmm_kin_swatch_0?_encoding=UTF8&qid=1528812260&sr=8-1">book</a> actually exists. I extracted the text from the Kindle version and will now train an LSTM with it in order to generate an infinite stream of calming advice. The actual code can be found as a <a href="https://gist.github.com/dkohlsdorf/992f63b11779d8647727377c436533c7">Gist</a>.<br />
For copyright reasons I cannot attach the text. So if you want to train this too, you need to buy the Kindle version.
<br />
<div style="text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTkhElQ3MIp1nJgR4m4JJ4ZxBmfeAQg4UIxhorlXsvDWhCq71J7tGbu1JKZkcGjvPIcgUUBJgX1X6Zuq0yHe032XoTMpYpjJFpLNJnIh1Vez_M7_jMyEUcN3J3uLg-fjM022o-m8SWHp4/s1600/calm.gif" imageanchor="1"><img border="0" data-original-height="289" data-original-width="500" height="185" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTkhElQ3MIp1nJgR4m4JJ4ZxBmfeAQg4UIxhorlXsvDWhCq71J7tGbu1JKZkcGjvPIcgUUBJgX1X6Zuq0yHe032XoTMpYpjJFpLNJnIh1Vez_M7_jMyEUcN3J3uLg-fjM022o-m8SWHp4/s320/calm.gif" width="320" /></a>
</div>
I split all the text into windows of 40 characters and train an LSTM to predict the following character.
I use a bidirectional LSTM with a 128-dimensional cell state. The output mode of the LSTM is many-to-many,
which means each input at each time step (the current character) also has an output (the next character).
The output is pushed through a dense softmax layer.
The LSTM is shown below:<br />
<pre class="brush:python">
from keras.models import Sequential
from keras.layers import LSTM, Dense, Bidirectional, TimeDistributed
from keras.optimizers import RMSprop

length = 40
step = 1
model = Sequential()
model.add(Bidirectional(LSTM(128, return_sequences=True), input_shape=(length, n_char)))
model.add(TimeDistributed(Dense(n_char, activation='softmax')))
optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
model.summary()
</pre>
<br />
The training log is interesting: one can see in the <a href="https://gist.github.com/dkohlsdorf/992f63b11779d8647727377c436533c7">notebook</a> how the LSTM gets better and better at generating text.
Below I listed some samples. For the sampling, we initialise the generation with a random sequence of 40 characters and let the LSTM generate the rest. The details of the sampling can be seen in the notebook.
The text is not great, but one can see that the LSTM aims to generate sentences similar to the ones in the book.
One might be able to improve its performance with more layers or more epochs.<br />
<br />
However, since the model works on a character level, I am not unhappy with the
results:
<br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><br /></span>
<br />
<ol>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">"thes the upset child, just as a kiss or handshake takes the pleasures eale them to poins of almous to be calming end it carting dive you’ll find calm phosting work is acyies all little forgsimutarly aboveruts a phopees will tell you, nothed (which your speech trangess there are tho defullame and eyebrows phon places of person, joyous every meal with the fores or nacciols, and discontrom to the hear postrous into them most powerm, when they gretophsints to calm. sho exercises when they laverds with the concenturies an much you think as"</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">"eel calm. recognise addictions for what your has a little choice. devers of a favour things which the worldor tak every moments worry habirothing – you half the way to feel. emplay of the lights – quiet – entoy not worrierate – beting worry king before you think about still, bow feelinggs there is enliet before yourself calm if you feel calm your earl goonicoly disary taskn take the time is efly. kink well assential astimugars asout peace – and have all tall you to can calm. the way you look at that your happenng, puts of orange blome"</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">"e stand straighter and taller than you beliefs, you will fool your hand, mixitates even the back sede tans bighing take and the ennable treates the beauty the add to think worty as major troutten if you combscents are among the humone situations: you treat at attating them. trank und harty thing an ised. on the small issues them, on your ease a sea calm always stimulates things that as much you must’, ‘i subripress on a way of what your day – from will happen. life for have to live a timens ael add the con rowhing wholly on tat hapte,"</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">"then you can be calm. sip warm water a glass than politens, put as you tonkelow yourself up. player. sel they is possible not to calm. then you connald ablity to live a cemplemants and englavoide. make to the benor effort into hands, not for twengy to the roother unlike in your diet, something with a peyture of peace and calm deind go little known bad compony all spech speeds can seritally when at a physictionel. your hearly, small issues them, soothes of a hand at also keolitas the moment your feet, then review then ptsic a fatticula"</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">"learning something you want to know. and you hueld’ go limp. removeties it posttert in the worlding the way you look at then posittle with your attionshops a out the cl-fectly having a little controgises to be remote. by massal favour bady them. is thice as you go envements are then positive to be captivelity when it carnething to say your subconscious speative of remote those who trere you ueld it before yourself underful fact as the feelostrust in everything when you’re leave dowalk a little between what harts like go to bed. whish "</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">"the moment when you concentrate your attention on a way attiony what it’s only of relaxing those calm, even up, by so fhols, behaw recognise them bach someone, alwings a calm people about beauty phor people als your thoughts, will take whice a groutts a calm person, a floathing down on a grass to help you know, by replyating widg a few chill, your speech, resubcents, you feel calm toom to your can most used the moments art teads. wear wearious from time to be a stimulations in whatever you grust – then reony something your faving to w"</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">"light one stop early if you consciously start is to relaxed with your speech speeds. by a coach even the most commot flowsh and exprect. works bectionss your speech speeds of a field unting with the groours, but for the plears nalk eget to feel calm. rong with a masseur once that your face pheropen having to sal awhable place, a conscious to be captivelocar place, up them of a damplestilm of the espences and little not train. every sayols around the back ood be sad, them do. then massage them ind tee… as you busly petson time to time "</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">"time in the world to do whatever you choose a feet and contribute to than wessed for yourself untethink, acchut for the recogail oilk, and the ritimeting snack being (iss than peritups, it can be pressude yourself underle – chrisical that suches of patting your efficienals at, pu… child, joyous pios from time to time little only a timend for hard-working postable full you work, uphilly to go know wholatge them. puss, pursom efficiently you achiely that happens interivation. its massage – then recom in your breet routine your typewrite"</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">"even more peaceful. break the pattern when you dwell twan fow times on explcomess with nich and feel negative for the rescom only a time. change specially, then ete someon, treards to your faith noat as muchically time sometimes works mication, but for they like gardesly how – you wouldn’t normally your hail, relaxed. wearing a stimulation. be computer you niel. be absold of time, whaner always a compliment possible future, with your speech spee, minutes to times out of standing, deciduerble when the concedints are atssetic eation wit"</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">"make an apapointment with yourself to deal with worries that as major stressed when you’re helaxed with a resalk post your breathing, when you’re lead to calm. changes on the pest the thumb are then massage them go them, taking pleaser the most more efficiently when a country worry beads. play , any time – hose negne a softet for deeps a chomples. take ba hothing them. then raised by your life relaxt to stayicher when you sach add there’s tense sire time eas a pew your day, cannot by where you wouldn’t normally thinks. peace will not b"</span></li>
</ol>
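The sampling step itself can be sketched as follows (a hedged sketch, not the notebook code): the softmax output for the last time step is reweighted by a temperature and the next character index is drawn from it. A call like `model.predict` on the current 40-character window would supply `probs`:

```python
import numpy as np

def reweight(probs, temperature=1.0):
    # temperature < 1 sharpens the distribution, > 1 flattens it
    logp = np.log(np.clip(probs, 1e-10, 1.0)) / temperature
    p = np.exp(logp - logp.max())
    return p / p.sum()

def sample_next(probs, temperature=1.0):
    # draw the index of the next character from the reweighted softmax output
    return np.random.choice(len(probs), p=reweight(probs, temperature))
```

In the generation loop one would repeatedly predict on the current window, call `sample_next` on the final time step's distribution, append the sampled character, and slide the window one step forward.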
dkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.com0Bremen, Germany53.079296199999987 8.801693600000021452.774353199999986 8.15624660000002 53.384239199999989 9.4471406000000222tag:blogger.com,1999:blog-1040545417301646113.post-19038026689946896202018-06-13T06:49:00.001-07:002019-03-12T07:44:17.476-07:00Downloading All NIPS papersRecently I wanted to download a large number of NIPS papers, so I decided to write a little web scraper in Python. Not much more to say, [<a href="https://gist.github.com/dkohlsdorf/3aa0d19c1afc7c3adc41e0eff0a104b0#file-nips-py">here</a>] it is. There are two scripts: one downloads the papers and one downloads the abstracts. The script basically lists all books from the NIPS page,<br />
then lists all the papers, and from each paper page it downloads the PDF or abstract.<br />
<div align="center">
</div>
I also put the script for the paper download below. In order to use it, you need Python 3 and the following libraries:
<br />
<ul>
<li> <a href="http://lxml.de/"> lxml </a></li>
<li> <a href="http://docs.python-requests.org/en/master/"> requests </a> </li>
</ul>
<br />
<pre class="brush:python">
from lxml import html
import requests
import urllib.request

BASE_URL = 'https://papers.nips.cc/'

page = requests.get(BASE_URL)
tree = html.fromstring(page.content)
books = [href.attrib['href'] for href in tree.xpath('//a') if 'book' in href.attrib['href']]
for book in books:
    book_page = requests.get(BASE_URL + book)
    tree = html.fromstring(book_page.content)
    papers = [href.attrib['href'] for href in tree.xpath('//a') if 'paper' in href.attrib['href']]
    for paper in papers:
        paper_page = requests.get(BASE_URL + paper)
        tree = html.fromstring(paper_page.content)
        links = [href.attrib['href'] for href in tree.xpath('//a') if 'pdf' in href.attrib['href']]
        for link in links:
            # save the pdf under its file name in the current directory
            local = link.split('/')[-1]
            urllib.request.urlretrieve(BASE_URL + link, local)
</pre>
<br />dkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.com0Bremen, Germany53.079296199999987 8.801693600000021452.774353199999986 8.15624660000002 53.384239199999989 9.4471406000000222tag:blogger.com,1999:blog-1040545417301646113.post-29047348512558704412018-03-04T08:29:00.000-08:002018-03-04T08:29:16.252-08:00Porting Keras Models<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbmTQRx76Bd797wsZMpf-PmYeL1aMq_-mIrxil0qtkR3r19IT4j-3p3hmxR_rvZeEbdVIpQ-lixRI23h_xJcaT6JxR73ayblAd5VqFwHkXi1NxBt8P-jNwLFBoncGhBi1f7EvF_EXzsl0/s1600/lstm.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="866" data-original-width="1600" height="173" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbmTQRx76Bd797wsZMpf-PmYeL1aMq_-mIrxil0qtkR3r19IT4j-3p3hmxR_rvZeEbdVIpQ-lixRI23h_xJcaT6JxR73ayblAd5VqFwHkXi1NxBt8P-jNwLFBoncGhBi1f7EvF_EXzsl0/s320/lstm.png" width="320" /></a></div>
<br />
I implemented an LSTM in Java that can read a trained Keras model in order<br />
to use it in other Java projects. You can check the repository at <a href="https://github.com/dkohlsdorf/KerasLSTM">[:Github:]</a><br />
<br />
The ipython notebook exports the weights of an LSTM<br />
and also exports some test for a random input sequence and the resulting output.<br />
The Java implementation reads the exported weights and then applies them. All<br />
matrix operations are implemented using jBlas.<br />
<br />
<ul>
<li>The numeric helper class implements matrix loading and slicing.</li>
<li>Tanh and the sigmoid functions are also in the helper class.</li>
<li>The LSTM state is a separate class holding the cell state and the hidden activation.</li>
<li>The LSTM holds all the weights and applies them to sequences.</li>
</ul>
<br />
The basic slicing of the weights is implemented using the same weight<br />
coding as Keras <a href="https://github.com/keras-team/keras/blob/2.1.4/keras/layers/recurrent.py#L1776">[:Github:]</a><br />
<br />
<pre class="brush:python">
self.kernel_i = self.kernel[:, :self.units]
self.kernel_f = self.kernel[:, self.units: self.units * 2]
self.kernel_c = self.kernel[:, self.units * 2: self.units * 3]
self.kernel_o = self.kernel[:, self.units * 3:]

self.recurrent_kernel_i = self.recurrent_kernel[:, :self.units]
self.recurrent_kernel_f = self.recurrent_kernel[:, self.units: self.units * 2]
self.recurrent_kernel_c = self.recurrent_kernel[:, self.units * 2: self.units * 3]
self.recurrent_kernel_o = self.recurrent_kernel[:, self.units * 3:]

if self.use_bias:
    self.bias_i = self.bias[:self.units]
    self.bias_f = self.bias[self.units: self.units * 2]
    self.bias_c = self.bias[self.units * 2: self.units * 3]
    self.bias_o = self.bias[self.units * 3:]
</pre>
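To make the gate arithmetic concrete, here is a hedged numpy sketch of a single LSTM step using the i/f/c/o slicing above. The weight layout mirrors the Keras convention (inputs in rows, the four gates concatenated along the columns); this is an illustration, not the Java code from the repository:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, kernel, recurrent_kernel, bias, units):
    # project input and previous hidden state, then slice into the four gates
    z = x @ kernel + h @ recurrent_kernel + bias
    i = sigmoid(z[:units])                    # input gate
    f = sigmoid(z[units: units * 2])          # forget gate
    g = np.tanh(z[units * 2: units * 3])      # candidate cell state
    o = sigmoid(z[units * 3:])                # output gate
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new
```

Running this over a sequence, carrying `(h, c)` from step to step, reproduces the forward pass the Java implementation performs with jBlas.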
<div>
<br /></div>
dkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.com0Bremen, Germany53.079296199999987 8.801693600000021452.774353199999986 8.15624660000002 53.384239199999989 9.4471406000000222tag:blogger.com,1999:blog-1040545417301646113.post-10039668340004713462017-12-12T07:06:00.002-08:002017-12-15T04:21:18.867-08:00Implementing Gradient Boosted Trees<h1>
Implementing Gradient Boosted Trees</h1>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTOE2IFihev7DsyM51HIaTp6rMBlngXPxK5OfzG6UaJQm3OZeoFCCEBTa3uP29CrkoKj1xPkaIq9RhCol0dVNUoCRcvkLZVJThpLQw-356Kyfu394iq5g7BAodLOeqFb-f6dbqoNzWREI/s1600/header.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="740" data-original-width="1356" height="174" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTOE2IFihev7DsyM51HIaTp6rMBlngXPxK5OfzG6UaJQm3OZeoFCCEBTa3uP29CrkoKj1xPkaIq9RhCol0dVNUoCRcvkLZVJThpLQw-356Kyfu394iq5g7BAodLOeqFb-f6dbqoNzWREI/s320/header.png" width="320" /></a></div>
<br />
<br />
In this post I will share my experience of implementing
gradient boosted trees as seen in XGBoost in order to
get a better understanding of the method. As in previous
posts, the code won't be as efficient as the original.
Furthermore, I will use Scala again.
The basic idea behind boosting is to construct an
additive model out of a collection of simpler models (base models)
for classification or regression. If you are unfamiliar with the
general boosting idea, please consider (Hastie: The Elements of Statistical Learning)
for a great introduction. This post is mainly based on this blog <a href="http://www.stokastik.in/programming-gradient-boosted-trees-from-scratch/">post </a>
and these <a href="https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf">slides</a>. In boosting, the model is constructed iteratively,
that is, in iteration $t$, the new output given a datapoint $x$ is:
$$
\hat{y}^t = \hat{y}^{(t - 1)} + f_t(x)
$$
The idea of gradient boosting is that in each iteration the next model
is determined so that it minimizes a given loss function $\mathbb{L}$.
Specifically, the minimization is performed using the first and second order derivatives
of the loss function. Each added model can also be seen as a step in the direction
of the gradient. In other words, the iterations can be seen as a form
of gradient descent. When extending this idea to regression tree base models and
regularizing the objective function of the optimization, we end up with
a powerful algorithm that won multiple Kaggle challenges as well as
the Xing-hosted recommender systems challenge two years in a row.
In the next part, I will derive the optimization step used for decision trees with a cross entropy
loss and then "massage" in the regularization.
<br />
<h2>
<br /></h2>
<h2>
Cross Entropy Loss for Binary Classification</h2>
For binary classification, the go-to output function is the sigmoid function:
$$P(x = 1) = \frac{1}{1 + e^{-x}}$$
Prominent usage examples include neural networks and logistic regression.
Given that our output $P(x = 1)$ is $\hat{y}$ and the true label is $y$,
we can measure the goodness of our classifier on our dataset using the
cross entropy loss:
$$ L(y, \hat{y}) = \sum_i -y_i * log(\hat{y_i}) - (1.0 - y_i) * log(1.0 - \hat{y_i})$$
with its first derivative $g_i = \hat{y_i} - y_i$ and its second derivative $h_i = \hat{y_i} * (1.0 - \hat{y_i})$.
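As a quick numerical sanity check of these derivatives (a sketch, not part of the original implementation), finite differences should reproduce $g_i$ and $h_i$:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def loss(y, x):
    # cross entropy of the sigmoid output for raw score x and label y
    p = sigmoid(x)
    return -y * math.log(p) - (1.0 - y) * math.log(1.0 - p)

def gradient(y, x):
    return sigmoid(x) - y          # g_i = y_hat - y

def hessian(y, x):
    p = sigmoid(x)
    return p * (1.0 - p)           # h_i = y_hat * (1 - y_hat)

# central finite differences at an arbitrary point
y, x, eps = 1.0, 0.3, 1e-5
num_grad = (loss(y, x + eps) - loss(y, x - eps)) / (2 * eps)
num_hess = (gradient(y, x + eps) - gradient(y, x - eps)) / (2 * eps)
```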
<br />
<h2>
<br /></h2>
<h2>
Regularization</h2>
One of the ideas implemented in XGBoost is regularization. Intuitively, we want to
penalize large parameters that dominate our result as well as model complexity.
Given a parameter set $\theta$, the L1 regularization is $\Omega(\theta) = \sum_i ||\theta_i||_1$,
where $||x||_1$ is the L1 norm or simply the absolute value. In L2 regularization, we switch to
$||\theta_i||_2$, where $||x||_2$ is defined as $x^2$.
For a decision tree, we could measure the complexity as the number of leaf nodes $T$,
and the parameters $\theta$ are the output scores of the leaf nodes (we will later use regression trees).
For example, we could penalize large trees by $\gamma T$ and large outputs by $\frac{1}{2}\sum_i \lambda ||\theta_i||_2$,
and we can control the penalty amount by the two parameters $\gamma$ and $\lambda$.
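In a hedged sketch (function name hypothetical), the resulting penalty for a tree with leaf scores $w$ combines both terms:

```python
def tree_penalty(leaf_scores, gamma, lam):
    # gamma * T penalizes the number of leaves,
    # 0.5 * lambda * sum(w^2) penalizes large leaf outputs
    T = len(leaf_scores)
    l2 = 0.5 * lam * sum(w * w for w in leaf_scores)
    return gamma * T + l2
```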
<br />
<h2>
<br /></h2>
<h2>
Gradient boosted trees</h2>
The combined objective using regularization and some loss function is:
$$O = L(y, \hat{y}) + \Omega(\theta)$$
Since we are boosting, we define the objective in terms
of the predictions in iteration $t$:
$$O^t = L(y, \hat{y}^{t - 1} + f(x)) + \Omega(\theta)$$
where $\hat{y}^{t - 1} + f(x)$ follows directly from the boosting formula $\hat{y}^t = \hat{y}^{(t - 1)} + f_t(x)$.
Since we want to optimize over $f(x)$, we can apply the Taylor expansion to the loss.
Remember that the Taylor expansion is defined as $f(x + \delta x) \approx f(x) + f'(x) * \delta x + \frac{1}{2} f''(x) \delta x^2$.
So the loss becomes
$$O^t = \sum_i [L(y_i, \hat{y_i}^{t - 1}) + g_i * f(x_i) + \frac{1}{2} * h_i * f(x_i)^2] + \Omega(\theta)$$
Assuming a decision tree base classifier, we can use the example regularization above and the objective becomes:
$$O^t = \sum_i [L(y_i, \hat{y_i}^{t - 1}) + g_i * f(x_i) + \frac{1}{2} * h_i * f(x_i)^2] + \gamma T + \frac{1}{2}\sum_j \lambda * ||\theta_j||_2$$
With a few transformations, we can redefine the loss in a way that lets us directly optimize the function
with the standard CART learning algorithm.
First we define $\mathbb{I}(j == i)$ as an indicator function that returns 1 if example $i$, when pushed through
the decision tree, ends up in leaf node $j$. Then the result of $f(x_i)$ is $w_j$, the leaf node score at node $j$,
if $x_i$ ends up in leaf $j$. Now we can rearrange the objective function towards the leaf nodes as:
$$O^t = \sum_j [(\sum_i \mathbb{I}(j == i) * g_i) * w_j + \frac{1}{2} (\sum_i \mathbb{I}(j == i) * h_i + \lambda) * w_j^2] + \gamma T$$
Redefining the sums of the gradients as $G_j$ and $H_j$ we get:
$$O^t = \sum_j [G_j * w_j + \frac{1}{2} (H_j + \lambda) * w_j^2] + \gamma T$$
Now we can update a leaf node using the gradient statistics by setting the derivative of the objective to zero:
$$G_j + (H_j + \lambda) * w_j = 0$$
and then $w_j = -\frac{G_j}{H_j + \lambda}$.
Intuitively, each leaf node models the gradient step to take for the examples that fall into it. And these gradients are normalized by their second
order statistics. So each leaf prediction will push the total output of the ensemble closer towards the wanted result.
In order to construct a tree, we can use the same gradients to define a gain function and then learn the trees like a
normal decision tree. The formula compares the gradient "amplitude" before the split, $(G_l + G_r)^2 / (H_l + H_r + \lambda)$,
with the amplitude after the split, $G_l^2 / (H_l + \lambda) + G_r^2 / (H_r + \lambda)$, where $G_l$, $H_l$, $G_r$ and $H_r$ are the sums of the gradients and Hessians in the left and
right node. Then the gain is simply:
$$Gain = \frac{1}{2} [ \frac{G_l^2}{H_l + \lambda} + \frac{G_r^2}{H_r + \lambda} - \frac{(G_l + G_r)^2}{H_l + H_r + \lambda}] - \gamma$$
Now we can implement these functions:
<pre class="brush:scala">
package gbt

import math._

trait Loss {
  def gradient(truth: Double, prediction: Double): Double
  def hessian(truth: Double, prediction: Double): Double
  def loss(truth: Double, prediction: Double): Double
  def output(x: Double): Double
}

class Logistic extends Loss {

  private def sigmoid(x: Double): Double = 1.0 / (1.0 + exp(-x))

  override def loss(truth: Double, prediction: Double): Double = {
    val isOne  = -truth * log(sigmoid(prediction))
    val isZero = (1 - truth) * log(1.0 - sigmoid(prediction))
    (isOne - isZero)
  }

  override def gradient(truth: Double, prediction: Double): Double = sigmoid(prediction) - truth

  override def hessian(truth: Double, prediction: Double): Double =
    sigmoid(prediction) * (1.0 - sigmoid(prediction))

  override def output(x: Double) = sigmoid(x)

}

class Squared extends Loss {
  override def loss(truth: Double, prediction: Double): Double = pow(truth - prediction, 2.0)
  override def gradient(truth: Double, prediction: Double): Double = 2.0 * (prediction - truth)
  override def hessian(truth: Double, prediction: Double): Double = 2.0
  override def output(x: Double): Double = x
}

object LossUtils {

  def score(gradients: Vector[Double], hessians: Vector[Double], lambda: Double): Double = {
    val G = gradients.sum
    val H = hessians.sum
    - G / (H + lambda)
  }

  def gain(
    gradients: Vector[Double],
    hessians: Vector[Double],
    left: Vector[Int],
    right: Vector[Int],
    lambda: Double,
    gamma: Double
  ): Double = {
    val GL = left.map(i => gradients(i)).sum
    val HL = left.map(i => hessians(i)).sum
    val GR = right.map(i => gradients(i)).sum
    val HR = right.map(i => hessians(i)).sum
    0.5 * (
      (pow(GL, 2) / (HL + lambda)) + (pow(GR, 2) / (HR + lambda)) - (pow(GL + GR, 2) / (HL + HR + lambda))
    ) - gamma
  }

}
</pre>
The full code can be found on <a href = "https://github.com/dkohlsdorf/GradientBoostedTrees">github</a>
dkohlhttp://www.blogger.com/profile/12693638025719077089noreply@blogger.com0Hamburg, Germany53.5510846 9.993681899999955952.9475551 8.7027883999999567 54.1546141 11.284575399999955tag:blogger.com,1999:blog-1040545417301646113.post-44143309604485251052017-11-29T07:47:00.000-08:002019-07-12T04:00:58.295-07:00Refactoring My Previous Deep Learning Library Ideas<h2>Update: My Neural Network Posts</h2>
<ul>
<li> Basic Auto Differentiation: <a href="http://daniel-at-world.blogspot.com/2017/02/ai-implement-your-own-automatic.html"> Post </a> </li>
<li> Basic Auto Differentiation with Weight Sharing: <a href="http://daniel-at-world.blogspot.com/2017/02/implementation-notes-on-my-auto.html"> Post </a></li>
<li> Auto Differentiation Code from Book: <a href="http://daniel-at-world.blogspot.com/2017/03/auto-differentiation-implementing-book.html"> Post </a></li>
<li> Refactoring The Ideas: <a href="http://daniel-at-world.blogspot.com/2017/11/refactoring-my-previous-deep-learning.html"> Post </a></li>
</ul>
<div>
In some previous posts I looked at how I could implement a simple deep learning library to gain a better understanding of the inner workings of: <b><a href="http://daniel-at-world.blogspot.de/2017/02/ai-implement-your-own-automatic.html">Auto Differentiation</a></b>, <b><a href="http://daniel-at-world.blogspot.de/2017/02/implementation-notes-on-my-auto.html">Weight Sharing</a></b> and a good <b><a href="http://daniel-at-world.blogspot.de/2017/03/auto-differentiation-implementing-book.html">Back Propagation</a></b> implementation from the <b><a href="http://www.deeplearningbook.org/">deep learning book</a></b>.</div>
<div>
<br /></div>
<div>
In this post I will present my ideas for a refactored version. The main point is that the graph representation I implemented in the first two posts made it easier to build the computation graph, </div>
<div>
but the code to run back propagation, especially with weight sharing, was quite messy. So this refactored version first creates a computation tree as before, only that the actual computations are decoupled from the tree. The tree is later translated into the instructions and parent / child relationships used in the book version of the back propagation algorithm. </div>
<div>
<br /></div>
<div>
Now in the first step we write model setup code. For example, a neural network for mnist classification:</div>
<div>
<br /></div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyK4WzXIMOM6eQbOKg0ZIwMmXphNJ5-_XiPPfiHSNURm-uNo9dJPYEAFXpjRFrYxDj6rXDg2wSYDsRpWliDli5WjXVU4sEUr3suJN-f40QIOdIwLd8BlEpha4KDVjmV9Cqchijs-u93GY/s1600/Screen+Shot+2017-11-29+at+16.51.17.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="646" data-original-width="1316" height="195" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyK4WzXIMOM6eQbOKg0ZIwMmXphNJ5-_XiPPfiHSNURm-uNo9dJPYEAFXpjRFrYxDj6rXDg2wSYDsRpWliDli5WjXVU4sEUr3suJN-f40QIOdIwLd8BlEpha4KDVjmV9Cqchijs-u93GY/s400/Screen+Shot+2017-11-29+at+16.51.17.png" width="400" /></a></div>
<br /></div>
<div>
<br /></div>
<div>
As in tensorflow I define placeholders, variables and computation nodes, along with their name.</div>
<div>
Furthermore, the variables get a shape and an initialisation method. In the background this method constructs a computation tree rooted at a5 using the following DSL.</div>
<div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6mmegN6rRkfhtFDC71atgPLgaQ8LMwalq_5iLhSqW-t5Uj8XARN1Hb40e4GshZRopzZSgrFrHmUlbJR73EUY_Kl75q2M-FPni_Jiy8_oHCn-TfsEzok_Tj5QiXdQ8qzsx7ov9xig1GX0/s1600/Screen+Shot+2017-11-29+at+16.51.07.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="643" data-original-width="1312" height="195" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6mmegN6rRkfhtFDC71atgPLgaQ8LMwalq_5iLhSqW-t5Uj8XARN1Hb40e4GshZRopzZSgrFrHmUlbJR73EUY_Kl75q2M-FPni_Jiy8_oHCn-TfsEzok_Tj5QiXdQ8qzsx7ov9xig1GX0/s400/Screen+Shot+2017-11-29+at+16.51.07.png" width="400" /></a></div>
<br />
<br /></div>
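Since the DSL itself is only shown as a screenshot, here is a rough sketch of what building such a computation tree could look like. All names and signatures below are my own assumptions for illustration, not the actual DSL:

```python
# Hypothetical sketch of a computation tree as described above; the node
# types, helper names and shapes are assumptions, not the actual DSL.

class Node:
    def __init__(self, name, instruction, children=(), shape=None):
        self.name = name                # unique node name, e.g. 'a5'
        self.instruction = instruction  # 'placeholder', 'variable', 'add', ...
        self.children = list(children)  # child nodes feeding into this one
        self.shape = shape              # only needed for variables

def placeholder(name):
    return Node(name, 'placeholder')

def variable(name, shape):
    return Node(name, 'variable', shape=shape)

def op(name, instruction, *children):
    return Node(name, instruction, children)

# sigmoid(x * w1 + b1), rooted at the node named 'a1'
x  = placeholder('x')
w1 = variable('w1', (784, 100))
b1 = variable('b1', (1, 100))
a1 = op('a1', 'sigmoid', op('z1', 'add', op('m1', 'mul', x, w1), b1))
```

Note that no computation happens here; the tree only records names, instructions and the parent/child structure, which is exactly what the translation step below needs.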
<div>
Each node keeps track of its name, an instruction name and, if needed, a shape.</div>
<div>
We can then convert this tree into a neural network ready to be used in the back propagation code from the original post. We build four maps keeping:</div>
<div>
<ol>
<li>Mapping from node names to computations (like add, multiply, ...)</li>
<li>Mapping from node names to their parents node names</li>
<li>Mapping from node names to their child node names</li>
<li>Mapping from node names that are weights to matrices.</li>
</ol>
<div>
All computations such as add or multiply implement the forward pass through the variable </div>
<div>
as well as the backward pass.</div>
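As an illustration of the four maps, here is a minimal sketch that derives them from a small tree of named nodes. The tree representation and the map names are my assumptions, not the actual code:

```python
from collections import namedtuple
import numpy as np

# Assumed tree representation: each node carries a name, an instruction,
# a list of child nodes and (for variables) a shape.
Node = namedtuple('Node', 'name instruction children shape')

def build_maps(root, rng):
    computations, parents, children, weights = {}, {}, {}, {}

    def visit(node):
        computations[node.name] = node.instruction
        children[node.name] = [c.name for c in node.children]
        for child in node.children:
            parents.setdefault(child.name, []).append(node.name)
            visit(child)
        if node.instruction == 'variable':
            # weight matrices get initialised here, e.g. with gaussian noise
            weights[node.name] = rng.normal(0.0, 0.01, node.shape)

    visit(root)
    return computations, parents, children, weights

# tiny example: a single multiplication node m = x * w
x = Node('x', 'placeholder', [], None)
w = Node('w', 'variable', [], (2, 2))
m = Node('m', 'mul', [x, w], None)
computations, parents, children, weights = build_maps(m, np.random.default_rng(0))
```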
<div>
<br /></div>
<div>
Using a class I called TensorStore that maps from names to matrices, I implemented the forward and the backward pass as follows:</div>
</div>
<div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVN7guiYphyphenhyphenBJnYVfL3EeB0_12MIhERS4NwfXVuAUC2nHIBZfPOyJDZ2r0dFtzkbNaBMfnNy8U_AhzOTYjrYQyV_Fpav3fitYFMBVFlanLaOcGM5Ywo6BnygLJSpp8XtMz4QDtLDhgtC0/s1600/Screen+Shot+2017-11-29+at+16.50.38.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="864" data-original-width="1054" height="326" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVN7guiYphyphenhyphenBJnYVfL3EeB0_12MIhERS4NwfXVuAUC2nHIBZfPOyJDZ2r0dFtzkbNaBMfnNy8U_AhzOTYjrYQyV_Fpav3fitYFMBVFlanLaOcGM5Ywo6BnygLJSpp8XtMz4QDtLDhgtC0/s400/Screen+Shot+2017-11-29+at+16.50.38.png" width="400" /></a></div>
<br />
<br /></div>
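To make the idea concrete, here is a hedged sketch of such a forward and backward pass driven by the maps, using a plain dict in place of the TensorStore. The two example instructions and the fixed topological order are simplifications for illustration, not the actual implementation:

```python
import numpy as np

# store: name -> matrix; grads: name -> gradient of the loss w.r.t. that node

def forward(order, instr, children, store):
    for name in order:                      # children come before parents
        if instr[name] == 'mul':
            a, b = children[name]
            store[name] = store[a] @ store[b]
        elif instr[name] == 'sigmoid':
            store[name] = 1.0 / (1.0 + np.exp(-store[children[name][0]]))

def backward(order, instr, children, store, grads):
    for name in reversed(order):            # parents come before children
        g = grads[name]
        if instr[name] == 'mul':
            a, b = children[name]
            grads[a] = grads.get(a, 0.0) + g @ store[b].T
            grads[b] = grads.get(b, 0.0) + store[a].T @ g  # accumulation also covers weight sharing
        elif instr[name] == 'sigmoid':
            c = children[name][0]
            grads[c] = grads.get(c, 0.0) + g * store[name] * (1.0 - store[name])

# tiny network: a = sigmoid(x * w)
instr    = {'x': 'input', 'w': 'variable', 'm': 'mul', 'a': 'sigmoid'}
children = {'x': [], 'w': [], 'm': ['x', 'w'], 'a': ['m']}
store    = {'x': np.ones((1, 3)), 'w': np.full((3, 2), 0.1)}
order    = ['x', 'w', 'm', 'a']

forward(order, instr, children, store)
grads = {'a': np.ones((1, 2))}              # gradient of the loss w.r.t. the output
backward(order, instr, children, store, grads)
```

Because gradients are accumulated with `grads.get(..., 0.0) +`, a weight appearing as the child of several nodes simply collects contributions from all of them, which is the weight-sharing case that was messy in the earlier version.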
<div>
The code can be found alongside the other versions on <a href="https://github.com/dkohlsdorf/RefactoredDeepLearningIdeas">github</a>. Examples for the spiral dataset as well as the MNIST dataset are attached.<br />
<br />
Of course, the code is not at all up to par with the popular libraries in terms of speed or functionality, so for me it is more an ongoing investigation into how these libraries are implemented and what I can reproduce. :) On MNIST the given network achieves between 96% and 98% accuracy depending on the starting conditions.<br />
<br /></div>
<div>
<br /></div>
dkohl, Bremen, Germany
2017-09-10<h2>
Variational Fashion Encoder</h2>
<br />
<div class="cell border-box-sizing code_cell rendered" style="-webkit-box-align: stretch; -webkit-box-orient: vertical; -webkit-text-stroke-width: 0px; align-items: stretch; background-color: white; border: 1px solid transparent; box-sizing: border-box; color: black; display: flex; flex-direction: column; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: normal; letter-spacing: normal; margin: 0px; orphans: 2; outline: none; padding: 5px 5px 5px 0px; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; width: 928px; word-spacing: 0px;">
<span style="text-align: justify;">In this experiment I want to try out <a href="https://arxiv.org/pdf/1606.05908.pdf" style="color: #0088cc; display: inline !important; margin: 0px; padding: 0px;">Variational Auto Encoders</a>, an unsupervised feature learning algorithm </span><span style="text-align: justify;">on a new fashion classification dataset from <a href="https://github.com/zalandoresearch/fashion-mnist" style="color: #0088cc; display: inline !important; margin: 0px; padding: 0px;">Zalando</a>. Since I wanted to play around with both more, </span><span style="text-align: justify;">I will build a generic Variational Auto Encoder and then learn it on the new fashion-mnist dataset.</span></div>
Code can be found on <a href="https://github.com/dkohlsdorf/Variational-Fashion-Encoder">github</a><br />
<br />
<br />
<h3>
Variational Auto Encoder</h3>
<div>
<span style="background-color: white; font-family: "helvetica neue" , "helvetica" , "arial" , sans-serif; font-size: 14px;">The bottom part of the model is embedding the input X into a mean and variance vector.</span></div>
<div>
<span style="background-color: white; font-family: "helvetica neue" , "helvetica" , "arial" , sans-serif; font-size: 14px;">The mean and variance represent the parameters of a Gaussian that is trained to be close to a standard normal distribution N(0, I). </span><span style="background-color: white; font-family: "helvetica neue" , "helvetica" , "arial" , sans-serif; font-size: 14px;">The decoder network tries to reconstruct the input from a sample of said distribution. The encoder as well as the decoder network can have multiple layers, in my case three each.</span></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirVja3i1xexmcTDOg9HRvnHJvbbgojUdKKThPKImqtJ6gBzI19zNK7oIdam1SuzPzuuVERlct5KTPEsKPRdW_wfDQfKrpamWi5sM5G3AWj_AMDDUBjeFeOZ6yrIu1ldgCA9kyUjBGfe4I/s1600/vae.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1140" data-original-width="1076" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirVja3i1xexmcTDOg9HRvnHJvbbgojUdKKThPKImqtJ6gBzI19zNK7oIdam1SuzPzuuVERlct5KTPEsKPRdW_wfDQfKrpamWi5sM5G3AWj_AMDDUBjeFeOZ6yrIu1ldgCA9kyUjBGfe4I/s320/vae.png" width="302" /></a></div>
<br />
The loss function is a combination of the reconstruction error and the Kullback-Leibler divergence between the distribution from the encoder network and a standard normal distribution, basically centering the latent space around 0 with standard deviation 1. This can be seen as a form of regularisation. Furthermore, the decoder<br />
does not reconstruct directly from the encoder output but from a sample of the distribution in the latent space. <br />
<br />
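The two ingredients just described, the closed-form KL term against N(0, I) and sampling from the latent distribution, can be sketched in NumPy as follows (a minimal illustration of the maths, not the actual model code; the function names are my own):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    # closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), one value per example
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)

def sample_latent(mu, log_var, rng):
    # reparameterisation: z = mu + sigma * eps with eps ~ N(0, I); the decoder
    # reconstructs from this sample, not directly from the encoder output
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# a batch of 4 examples with an 8-dimensional latent space
mu = np.zeros((4, 8))
log_var = np.zeros((4, 8))
z = sample_latent(mu, log_var, np.random.default_rng(0))
```

The training loss is then the reconstruction error of the decoder output plus the KL term, averaged over the batch.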
<h3>
Fashion Reconstruction</h3>
<div>
The images in the fashion dataset are 28 pixels by 28 pixels, in the same way as the MNIST dataset.</div>
<div>
The fashion data consists of 10 classes of clothes: </div>
<div>
<ol>
<li>tshirt</li>
<li>trouser</li>
<li>pullover</li>
<li>dress</li>
<li>coat</li>
<li>sandal</li>
<li>shirt</li>
<li>sneaker</li>
<li>bag</li>
<li>ankle boot</li>
</ol>
</div>
<div>
After training for 5 epochs on the training set, I plotted the reconstructions from the Variational Auto Encoder. <br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzLEW-OY4PRwb6uhAKirtqBQonhVdd8iU-gK21qu33CbSVn2QQKgFSUhN-kDn4VY69i7AifITCXSuYgEmONEqJ4bVjXWqOsr03rAzihtd96kvKTyQezgCrMmqfe0DnulJ9dm2ATTzMPDo/s1600/prediction.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1600" data-original-width="1600" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzLEW-OY4PRwb6uhAKirtqBQonhVdd8iU-gK21qu33CbSVn2QQKgFSUhN-kDn4VY69i7AifITCXSuYgEmONEqJ4bVjXWqOsr03rAzihtd96kvKTyQezgCrMmqfe0DnulJ9dm2ATTzMPDo/s640/prediction.png" width="640" /></a></div>
As one can see, the reconstructions work quite well, indicating that the learned latent space is useful.</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />dkohl, Hamburg, Germany
2017-08-01<h2>
Investigating Non Linear Optimization in Python</h2><div style="-webkit-text-stroke-width: 0px; color: black; font-family: Times; font-size: medium; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
I recently skimmed through some class videos from "<a href="https://www.udacity.com/course/machine-learning-for-trading--ud501">Machine Learning For Trading</a>", a course by Tucker Balch, Professor at Georgia Tech. One interesting part was portfolio optimisation.</div>
<div>
That is, if you want to invest some amount of money, how do you distribute it over a selection of assets in a way that minimises risk while at the same time letting you earn money through your investment? A portfolio is basically an allocation of the money into different options or, in other words, the percentage of the total amount of money allocated to an option. Anyway, in the class as well as the <a href="http://shop.oreilly.com/product/0636920032441.do">course book</a>, the optimisation is performed using a<br />
Python method for constrained non-linear optimisation. The algorithm used to crunch these problems is called Sequential Least Squares Programming, which is implemented using a Fortran wrapper.<br />
<div>
The original algorithm can be found in the <a href="http://calgo.acm.org/">ACM Algorithms catalog</a> (<a href="https://github.com/scipy/scipy/edit/master/scipy/optimize/slsqp/slsqp_optmz.f">Algorithm 733</a>), written by D. Kraft. The code is automatically translated to Python using the Fortran-to-Python (<a href="http://cens.ioc.ee/projects/f2py2e/">f2py</a>) library. </div>
</div>
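As a quick illustration of the kind of constrained problem SLSQP solves, here is a sketch of a minimum-variance portfolio using SciPy's `minimize`; the covariance numbers for the three hypothetical assets are made up:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up covariance matrix for three hypothetical assets; minimising the
# portfolio variance w' C w is just one possible risk objective.
cov = np.array([[0.10, 0.02, 0.04],
                [0.02, 0.08, 0.01],
                [0.04, 0.01, 0.09]])

def risk(w):
    return w @ cov @ w

n = cov.shape[0]
result = minimize(
    risk,
    x0=np.ones(n) / n,                            # start from an equal allocation
    method='SLSQP',
    bounds=[(0.0, 1.0)] * n,                      # no short selling
    constraints=[{'type': 'eq', 'fun': lambda w: w.sum() - 1.0}],
)
weights = result.x
```

The equality constraint forces the allocations to sum to one; maximising return for a given risk would simply swap in a different objective under the same constraints.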
dkohl, Hamburg, Germany
2017-07-23<h2>
Black Box Variational Inference and PyFlux</h2>I found an interesting paper on black box variational inference. Variational inference is an<br />
approximate inference algorithm for factored probabilistic models (such as Bayesian networks).<br />
<br />
In probabilistic inference, we are normally interested in the posterior density of a latent variable<br />
given all our data. The basic idea is to turn the inference problem into an optimisation problem.<br />
That is, we build an approximate probability distribution that is differentiable with respect to its parameters. We then find the parameter set that minimises the difference between the true posterior and the approximate distribution.<br />
<br />
The problem is that in order to do this for a new model, we still have to do a lot of manual derivations<br />
and work through a lot of math before we can actually use this algorithm. This new paper<br />
however, shows an algorithm that is easier to use given a new model. In my opinion,<br />
one of the successes in deep learning is auto differentiation, where we do not<br />
have to develop all gradients on our own but let the computer do it (as described in earlier posts).<br />
One library that implements this inference method for complex time series is called pyFlux [2].<br />
<br />
The proposed algorithm [1] works as follows. First, we sample the latent variables in our model<br />
from the approximate distribution. We then use the samples to compute a gradient<br />
for each parameter that minimises the Kullback-Leibler divergence or, more specifically, maximises the Evidence Lower Bound (ELBO).<br />
<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhKkTatKcSu4udgzar7-mQBExOz1G8F-0tQo_pK5Qsmp7R0-ojpycVh1O35rjABhUwoV64oxSvz9HiKEIZ6Pej2EnsJGWsbx5ROgalto04xw8rhhqRIp9nS6eWEKiLVKrck6VxFjJtFHts/s1600/Screen+Shot+2017-07-23+at+12.25.32.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="479" data-original-width="662" height="231" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhKkTatKcSu4udgzar7-mQBExOz1G8F-0tQo_pK5Qsmp7R0-ojpycVh1O35rjABhUwoV64oxSvz9HiKEIZ6Pej2EnsJGWsbx5ROgalto04xw8rhhqRIp9nS6eWEKiLVKrck6VxFjJtFHts/s320/Screen+Shot+2017-07-23+at+12.25.32.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Reproduced from original paper [1]</td></tr>
</tbody></table>
The z[s] are samples from the latent space. Each z[s] includes a value for each of the latent variables. These samples are used to build the gradient.<br />
The Robbins-Monro sequence models an adaptive learning rate schedule<br />
that guarantees convergence. The gradient has multiple components: the joint probability of the original distribution, the log of the approximate distribution and the derivative of the approximate distribution. In many cases, these values are easy to compute. First note that, when using the samples, all variables in the original joint distribution are observed. For a Bayesian network that means the joint distribution is simply the product of the variables' conditional distributions:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5UEDn9M3LuW2WoRhOjnnvjnsCEA7DXSlcVkzXu4B-FBuOW_1YAr9skPKcR_T_B0Yo29Uqi8vjK2xVCFhS-95FWttcVWbsIksYDpVXTmsTXa-Rsr7aDS5pGnWjtsHf8lLtxQvmxPjXFyw/s1600/CodeCogsEqn.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="27" data-original-width="259" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5UEDn9M3LuW2WoRhOjnnvjnsCEA7DXSlcVkzXu4B-FBuOW_1YAr9skPKcR_T_B0Yo29Uqi8vjK2xVCFhS-95FWttcVWbsIksYDpVXTmsTXa-Rsr7aDS5pGnWjtsHf8lLtxQvmxPjXFyw/s1600/CodeCogsEqn.gif" /></a></div>
<br />
In the paper and in the pyFlux library, the approximate distribution is implemented as a mean field,<br />
which means that the variables in our model are treated as independent. This makes the sampling and evaluation of the model very easy: each variable can be sampled individually and the joint density is simply the product over the individual variables. Finally, the pyFlux library implements each variable in the approximate distribution as a one-dimensional Gaussian. Putting all that together,<br />
the final gradient becomes:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtg8hCdZCvmjYWsvX8dslaCdXnkccBacwtc8z8-gWqbMuxDbGsg6D4Xj4N0NZtUpyBJ-Dnh3WYLORBbOvzGoEIT2T0apMHR4MHQq4fRmzaHXVogJw_Y_Wrzc67ZdKDDX7Tq1NpbMOHajU/s1600/Screen+Shot+2017-07-23+at+12.40.43.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="144" data-original-width="612" height="75" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtg8hCdZCvmjYWsvX8dslaCdXnkccBacwtc8z8-gWqbMuxDbGsg6D4Xj4N0NZtUpyBJ-Dnh3WYLORBbOvzGoEIT2T0apMHR4MHQq4fRmzaHXVogJw_Y_Wrzc67ZdKDDX7Tq1NpbMOHajU/s320/Screen+Shot+2017-07-23+at+12.40.43.png" width="320" /></a></div>
<br />
Where the lambda parameters will be the mean and variance of each of the gaussians used<br />
with the gradients:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKzLsuiqcyeLNrhXla4dOfEZXCfNzvZtXqgKOh7jJbShISvK_l8J6W42UHGawSQ8hztkbEwz3_f7fCi9zDQlHUrsFVWnsZMPBEX8z8aDBGIRrTaHhH6s7uf8wJJ-9Ks1ypIfpWQRxPhVk/s1600/bbvi_c94353c8104dd4ec7b9178a2ce3fb6915c7684a4.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="17" data-original-width="262" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKzLsuiqcyeLNrhXla4dOfEZXCfNzvZtXqgKOh7jJbShISvK_l8J6W42UHGawSQ8hztkbEwz3_f7fCi9zDQlHUrsFVWnsZMPBEX8z8aDBGIRrTaHhH6s7uf8wJJ-9Ks1ypIfpWQRxPhVk/s1600/bbvi_c94353c8104dd4ec7b9178a2ce3fb6915c7684a4.png" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhAy9Q2cA-wOenlvjOGvmUaprNyxUD8pPZspSl79Lxl1RvV4bGaJ8PPEWSPBXRDFrb1yNfnf7JWy5jraqs70EX-a38CCIcZ5_D1_NPazbJHieQ1b8sxy9gfSvxHcF53AR-C2ozhJJxXOQY/s1600/bbvi_f2414c5132312bfcba075cae71d0befa6dee885b.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="17" data-original-width="160" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhAy9Q2cA-wOenlvjOGvmUaprNyxUD8pPZspSl79Lxl1RvV4bGaJ8PPEWSPBXRDFrb1yNfnf7JWy5jraqs70EX-a38CCIcZ5_D1_NPazbJHieQ1b8sxy9gfSvxHcF53AR-C2ozhJJxXOQY/s1600/bbvi_f2414c5132312bfcba075cae71d0befa6dee885b.png" /></a></div>
<br />
<br />
<br />
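Putting the estimator together for a toy model, a sketch might look as follows. The model (a standard normal prior on a single latent z with a unit-variance Gaussian likelihood) and all function names are assumptions chosen purely for illustration:

```python
import numpy as np

def log_p(x, z):
    # log p(x, z) = log N(z; 0, 1) + sum_i log N(x_i; z, 1), up to constants
    return -0.5 * z**2 - 0.5 * np.sum((x - z)**2)

def log_q(z, mu, log_sigma):
    # log density of the one-dimensional Gaussian q(z; mu, sigma)
    sigma = np.exp(log_sigma)
    return -0.5 * ((z - mu) / sigma)**2 - log_sigma - 0.5 * np.log(2 * np.pi)

def grad_log_q(z, mu, log_sigma):
    # score function of q with respect to (mu, log_sigma)
    sigma = np.exp(log_sigma)
    return np.array([(z - mu) / sigma**2, ((z - mu) / sigma)**2 - 1.0])

def elbo_gradient(x, mu, log_sigma, num_samples=64, rng=None):
    # Monte Carlo estimate: average grad log q(z[s]) * (log p(x, z[s]) - log q(z[s]))
    if rng is None:
        rng = np.random.default_rng(0)
    grad = np.zeros(2)
    for _ in range(num_samples):
        z = mu + np.exp(log_sigma) * rng.standard_normal()   # z[s] ~ q
        grad += grad_log_q(z, mu, log_sigma) * (log_p(x, z) - log_q(z, mu, log_sigma))
    return grad / num_samples

g = elbo_gradient(np.array([1.0, 2.0]), mu=0.0, log_sigma=0.0)
```

Note that only `log_p` depends on the model; swapping in a different model requires no new derivations, which is exactly the "black box" property of the method.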
<h2>
References:</h2>
[1] <a href="http://www.cs.columbia.edu/~blei/papers/RanganathGerrishBlei2014.pdf">Ranganath, Gerrish, Blei: "Black Box Variational Inference", AISTATS 2014, 2014</a><br />
<a href="http://www.pyflux.com/black-box-variational-inference/">[2] PyFlux</a>dkohl, Hamburg, Germany
2017-05-26<h2>
A Generic Sequence Alignment Algorithm</h2>In this post I will describe how a general sequence alignment could be implemented in Scala.<br />
For readers that don't know what sequence alignment is, I recommend reading my<br />
<a href="http://daniel-at-world.blogspot.de/2012/11/building-gesture-recognizer-part-1.html">post</a> on building a gesture recogniser under Dynamic Time Warping distance (DTW).<br />
<br />
Sequence alignment is often used in several contexts including speech recognition,<br />
gesture recognition, approximate string matching, typing correction, biological<br />
sequence alignment and time series classification and even stereo vision.<br />
All these algorithms are based on a technique called dynamic programming and the algorithms<br />
are all very similar. However, in the literature this alignment technique is often found under different names (Dynamic Time Warping, Levenshtein Distance, Needleman-Wunsch).<br />
<br />
I will first formulate my idea of a general alignment algorithm with a Scala implementation and afterwards describe implementations of the three algorithms above.<br />
<br />
Given two sequences X = [x1 .. xi .. xN] and Y = [y1 .. yj .. yM], we use the following<br />
recursion to find the alignment score S(x, y). At each position i in sequence X and position j in sequence Y we look backwards at the matching score in X only, in Y only, and in X and Y simultaneously, resulting in the following recursion:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlmtcNqXj2biCN-xYcTKdqUzF-so0H6A4k4FpqPgncnYxL1vdJKgIOvqX2rkZHCAnhkTxnS7ZXs93WOSyEVcY6jyrnl_ccPoXEFJbbzRYcEneN1p1RWLFNrOgCzAZggQT61bmrRcWefTM/s1600/png.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="20" data-original-width="458" height="16" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlmtcNqXj2biCN-xYcTKdqUzF-so0H6A4k4FpqPgncnYxL1vdJKgIOvqX2rkZHCAnhkTxnS7ZXs93WOSyEVcY6jyrnl_ccPoXEFJbbzRYcEneN1p1RWLFNrOgCzAZggQT61bmrRcWefTM/s400/png.png" width="400" /></a></div>
<br />
The rank function scores the alignment at position i and j based on the previous alignment scores and<br />
the current elements. The rank function is the main component that changes in all three algorithms.<br />
Looking back in X direction is also called an insertion, in Y direction a deletion and for both it is<br />
called a match so the ranking function becomes:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBpE9Dn0je0lgCEV-AoCHKsJ5b22zikSblI_2wvuWWrXg4RP1aLLHpkWtbiJvCA4_anE_D0VhoNN1jVfmUFIx9dTAYfwnU9YThsOZMi4FlYwgDzOTYYQWQnHEaoOc5DEcJRX69oIlw72w/s1600/png+%25281%2529.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="18" data-original-width="290" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBpE9Dn0je0lgCEV-AoCHKsJ5b22zikSblI_2wvuWWrXg4RP1aLLHpkWtbiJvCA4_anE_D0VhoNN1jVfmUFIx9dTAYfwnU9YThsOZMi4FlYwgDzOTYYQWQnHEaoOc5DEcJRX69oIlw72w/s1600/png+%25281%2529.png" /></a></div>
<br />
<br />
Furthermore we will also implement a generic initialisation function that starts our recursion.<br />
Ok now the Scala code:
<br />
<br />
<pre class="brush:scala">class Alignment[A](
init: (Int, Int) => Double,
rank: (Double, Double, Double, A, A) => Double
) {
def score(x: Array[A], y: Array[A]): Double = {
val n = x.size
val m = y.size
val dp = Array.ofDim[Double](n + 1, m + 1)
for(i <- 0 until n + 1) dp(i)(0) = init(i, 0)
for(j <- 0 until m + 1) dp(0)(j) = init(0, j)
for(i <- 1 until n + 1) {
for(j <- 1 until m + 1) {
dp(i)(j) = rank(
dp(i - 1)(j), // insertion
dp(i - 1)(j - 1), // match
dp(i)(j - 1), // deletion
x(i - 1), // current sample in x
y(j - 1) // current sample in y
)
}
}
dp(n)(m)
}
}
</pre>
<br />
<br />
The alignment class has a type parameter specifying the type of the elements in the sequences we want
to align and takes the initialising function and the ranking function as input. Now DTW and the Edit distance are just specialisations with a different type parameter and scoring function:
<br />
<br />
<pre class="brush:scala"> class DTW extends Alignment[Double](
init = (i: Int, j: Int) => if(i == 0 && j == 0) 0.0 else Double.PositiveInfinity,
rank = (insert: Double, matching: Double, delete: Double, x: Double, y: Double) =>
if(insert < matching && insert < delete) insert + math.pow(x - y, 2)
else if (delete < insert && delete < matching) delete + math.pow(x - y, 2)
else matching + math.pow(x - y, 2)
)
</pre>
<br />
<br />
DTW is defined over doubles, initialising the recursion with 0.0 at position (0,0) and all other border cells with infinity. The ranking adds the squared distance between the current elements to the insertion, deletion and match scores and then takes the minimum. It is easy to see how this can be generalised to multivariate sequences: the type parameter becomes Seq[Double] and the Euclidean distance is implemented for vectors. One can also use kernels
or the cosine similarity instead.
The next class implements the edit distance for discrete sequence elements (here characters):
<br />
<br />
<pre class="brush:scala"> class Levenshtein extends Alignment[Char](
init = (i: Int, j: Int) => if(i == 0 && j == 0) 0.0 else if(i == 0) j else i,
rank = (insert: Double, matching: Double, delete: Double, x: Char, y: Char) =>
if(insert < matching && insert < delete) insert + 1
else if (delete < insert && delete < matching) delete + 1
else matching + (if(x == y) 0.0 else 1.0)
)
</pre>
<br />
<br />
The function is initialised with the index at the borders and again takes the minimum
of insertion, deletion and match. Insertions and deletions count as an error, so we add 1.
A match can also be a substitution, so we add 1 to the error if the characters do not match.
The last algorithm is the Needleman-Wunsch algorithm. Instead of minimising a distance,
we maximise a score. The similarity between any two symbols can be defined freely, and the 'gap' penalty
is not 1 as in the edit distance but a given constant.
<br />
<br />
<pre class="brush:scala"> final val Sim = Map(
('a', 'a') -> 10,
('a', 'g') -> -1,
('g', 'a') -> -1,
('a', 'c') -> -3,
('c', 'a') -> -3,
('a', 't') -> -4,
('t', 'a') -> -4,
('g', 'g') -> 7,
('g', 'c') -> -5,
('c', 'g') -> -5,
('g', 't') -> -3,
('t', 'g') -> -3,
('c', 'c') -> 9,
('c', 't') -> 0,
('t', 'c') -> 0,
('t', 't') -> 8
)
final val Gap = -5
class NeedlemanWunsch extends Alignment[Char](
init = (i: Int, j: Int) => if(i == 0 && j == 0) 0.0 else if(i == 0) j * Gap else i * Gap,
rank = (insert: Double, matching: Double, delete: Double, x: Char, y: Char) =>
if(insert > matching && insert > delete) insert + Gap
else if (delete > insert && delete > matching) delete + Gap
else matching + Sim(x, y)
)
</pre>
<br />
<br />
If one needs the actual alignment, we can backtrack through the alignment process in all of the above cases.
The code can be found here [<a href="https://gist.github.com/dkohlsdorf/46c6cca6b37fdac0e0f677c2aae9eeff">GitHub</a>]
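As an illustration of that backtracking step (sketched in Python for brevity, and not code from the repository): keep the dynamic programming table and walk back from (n, m), at each cell taking the move that produced its score. Here for the edit distance:

```python
def edit_alignment(x, y):
    # fill the dynamic programming table for the edit distance
    n, m = len(x), len(y)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1): dp[i][0] = i
    for j in range(m + 1): dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,                       # insertion
                           dp[i][j - 1] + 1,                       # deletion
                           dp[i - 1][j - 1] + (x[i-1] != y[j-1]))  # match / substitution
    # walk back from (n, m) to (0, 0), always taking a move that explains the
    # current cell's score; the visited cells form the alignment path
    cells, i, j = [], n, m
    while i > 0 or j > 0:
        cells.append((i, j))
        if i > 0 and j > 0 and dp[i][j] == dp[i-1][j-1] + (x[i-1] != y[j-1]):
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i-1][j] + 1:
            i -= 1
        else:
            j -= 1
    cells.append((0, 0))
    return dp[n][m], list(reversed(cells))
```

The same walk works for DTW and Needleman-Wunsch by replacing the per-cell check with the corresponding ranking function.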
dkohl, Hamburg, Germany