Thursday, November 1, 2012

Building a Gesture Recognizer - Part 1: Dynamic Time Warping

I recently noticed that a lot of people want to do "something with gestures" but don't know where to start.
So I thought I would write a two-part tutorial on how to recognize gestures. The first part
describes the problems of gesture recognition and a very simple algorithm, while
the second part gives insight into what most gesture recognition systems currently use.

The goal of this tutorial is to understand a common problem in gesture recognition called warping
and to build a simple, distance-based gesture recognizer.

So you want to recognize gestures. For simplicity, let us assume you want to recognize gestures
performed with a mouse. If you record the position of your mouse over time, you get
a two-dimensional time series. In the picture below you can see an example:
I recorded a gesture and plotted the x and y positions separately.
TOP: a recorded gesture, CENTER: the position on the x-axis, BOTTOM: the position on the y-axis.
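If you want to record such a time series yourself, a minimal sketch using Python's built-in tkinter could look like this (the canvas size is an arbitrary choice, and in practice you would sample at a fixed rate rather than per event):

```python
import tkinter as tk

points = []  # the gesture as a 2-D time series: one (x, y) sample per event

root = tk.Tk()
canvas = tk.Canvas(root, width=400, height=400, bg="white")
canvas.pack()

def on_drag(event):
    # Sample the mouse position while the left button is held down.
    points.append((event.x, event.y))
    canvas.create_oval(event.x - 1, event.y - 1, event.x + 1, event.y + 1)

canvas.bind("<B1-Motion>", on_drag)
root.mainloop()
```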

If we want to build a recognizer, we have to collect data first. For example, if we have four different
gestures, we could record ten users performing each gesture twenty times, so the resulting data set includes 800 recordings. One problem is that the examples for a gesture will not
look exactly the same. For example, some users will perform the gesture faster or slower, resulting
in an effect called "warping". One solution to this problem is to align the time series. An alignment
is the association of each point in one time series with a point in the other time series.


Two examples of the same gesture (x-axis only). One example
is warped. The purple lines show the alignment of the two time series.

In order to compare two warped time series we will use Dynamic Time Warping.
Dynamic Time Warping compares two time series while accounting for the warp using Dynamic Programming. In general, the warping is resolved by accounting for added samples (the second example is performed slower) or deleted samples (the second example is performed faster).

The distance is calculated by constructing an (M+1) x (N+1) matrix V, where M is the length of one time series X = x1 ... xM and N the length of the other time series Y = y1 ... yN. In our case the xi and yj are two-dimensional vectors (the positions of the mouse). Each matrix entry v(i, j)
holds the minimum distance between the time series up to position i in X and position j in Y, given by the following recursion:

v(i, j) = d(xi, yj) + min( v(i-1, j), v(i, j-1), v(i-1, j-1) )

where d(xi, yj) is the distance between the two points (for example the Euclidean distance),
v(0, 0) is initialized as 0, and v(i, 0) as well as v(0, j) are initialized as infinity.
The final value v(M, N) is the Dynamic Time Warping distance between X and Y.
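
As a minimal sketch (assuming each sample is an (x, y) tuple and plain Euclidean point distances), the recursion translates into a few lines of Python:

```python
import math

def dtw_distance(X, Y):
    """Dynamic Time Warping distance between two 2-D time series,
    each given as a list of (x, y) mouse positions."""
    M, N = len(X), len(Y)
    INF = float("inf")
    # v[i][j]: minimal cost of aligning the first i samples of X
    # with the first j samples of Y.
    v = [[INF] * (N + 1) for _ in range(M + 1)]
    v[0][0] = 0.0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            d = math.dist(X[i - 1], Y[j - 1])  # Euclidean point distance
            v[i][j] = d + min(v[i - 1][j],      # a sample was deleted
                              v[i][j - 1],      # a sample was added
                              v[i - 1][j - 1])  # samples are matched
    return v[M][N]
```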
Now that we have a distance measure, we can implement a very simple recognizer:
a Nearest Neighbor Classifier using the distance described above.
For a newly recorded gesture we calculate the Dynamic Time Warping distance
to each example in our pre-recorded gesture set. The gesture associated with the example
that has the minimum distance is the recognized gesture.


Two gestures shown in the distance space from a new example. 
One important question to ask is: "What happens when the new example is not a gesture?"
In that case we apply a threshold on the distance. If the distance to the nearest example is too high (above the threshold),
we do not recognize the input as any gesture.
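
Putting both ideas together, a sketch of the recognizer (using the dtw_distance function from above, with the rejection threshold left as a tuning parameter) could look like this:

```python
def recognize(new_gesture, templates, threshold):
    """Nearest neighbor classification: `templates` is a list of
    (label, time_series) pairs from the pre-recorded gesture set."""
    best_label, best_dist = None, float("inf")
    for label, series in templates:
        dist = dtw_distance(new_gesture, series)
        if dist < best_dist:
            best_label, best_dist = label, dist
    # Reject inputs that are too far from every template.
    if best_dist > threshold:
        return None
    return best_label
```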

So we built a very simple recognizer, and the implementation should be straightforward. In one of the next posts I will walk through gesture recognition using Hidden Markov Models. Since Hidden Markov Models are not easy to implement, we will use an existing toolkit called HTK3, and I will write a step-by-step tutorial.


Tuesday, March 27, 2012

Bayes Meal Planning and Reach for Friends

It has been a long time since my last post. I figured you might be wondering what I am currently doing and what it is like to study at Georgia Tech. So I thought I would give you a quick look at what is going on in one of my classes (yes, in the U.S. people have to take classes for their PhD).


In my current graduate AI class (Gatech CS 6601) we had to perform two mini research projects.
The course structure is very nice and refreshing: you read the textbook [1] chapter by chapter at
home and do some of the homework from the book, while the actual lectures cover the state of the art in AI,
and you are required to perform three research projects (proposal + implementation + evaluation +
paper) in a bi-weekly cycle.


In the first project I evaluated how to find influential users in a Twitter network, and in the second we predicted how much a group of people will like a dish using a Bayes net. I thought both were worth a blog post.
The first project I performed as a team with Patrick Mize.


Reach for Friends

Our goal in the project was to identify influential users in a social graph. A social graph represents friendships between users: each node in such a graph is an individual and each edge represents a friendship between two people. In order to generate a social graph, we used the Twitter API, extracted 1000 nodes and stored them in a database. The database holds two tables, one for users (with their Twitter id and name) and one for friendships (mapping a user to another user). We "crawled" the Twitter network by recursively searching the friends of a user. In order to control the growth of the tree, we restricted the depth of the search to 5 and the breadth of the tree to 4, so we estimated the total number of nodes to be at most 4096. Because some users had fewer than 4 friends and because we did not continue searching duplicate users, we collected 1600 nodes in the end.

Our measure of influence is the reach of a node / user. Reach is computed as follows: consider all shortest paths that pass through a node; each such path is split by the node into two halves, and the length of the shorter half is inserted into a set. The largest distance in that set is the reach of that node (see Figure 1).
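
As a rough sketch of the bounded crawl described above (with `get_friends` standing in as a hypothetical wrapper around the actual Twitter API call), the recursion could look like this:

```python
def crawl(user_id, get_friends, graph=None, depth=5, breadth=4):
    """Recursively crawl a social graph with bounded depth and breadth.
    `graph` maps each visited user to the list of friends we kept."""
    if graph is None:
        graph = {}
    if depth == 0 or user_id in graph:  # stop at the depth limit / duplicates
        return graph
    friends = get_friends(user_id)[:breadth]  # cap the branching factor
    graph[user_id] = friends
    for friend in friends:
        crawl(friend, get_friends, graph, depth - 1, breadth)
    return graph
```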



Figure 1: In the graph above, the reach of v is determined by the shortest path between s and g.
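
A brute-force sketch of this computation for an unweighted graph (it computes all-pairs shortest paths first, so it is only feasible for graphs of a modest size like ours):

```python
from collections import deque

def bfs_distances(adj, src):
    """Hop distances from src; adj maps node -> iterable of neighbors."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def reach(adj, v):
    """Max over all shortest s-t paths through v of the shorter half,
    min(dist(s, v), dist(v, t))."""
    dist = {u: bfs_distances(adj, u) for u in adj}
    best = 0
    for s in adj:
        for t in adj:
            # v lies on a shortest s-t path iff the distances add up.
            on_path = (v in dist[s] and t in dist[v] and t in dist[s]
                       and dist[s][v] + dist[v][t] == dist[s][t])
            if on_path:
                best = max(best, min(dist[s][v], dist[v][t]))
    return best
```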
Since computing reach for all nodes requires computing the shortest paths from all nodes to all other nodes,
we use a landmark-based heuristic to search faster. The landmarks for our heuristic are N distinct nodes from our social graph which have more than 4 connections each. From those N nodes we computed the shortest paths to all other nodes in the graph using uniform cost search (Dijkstra's algorithm). Once we have computed the landmarks, we can speed up future searches by using them as a heuristic [2]. The landmark heuristic is the distance from the current node to a landmark minus the distance from that landmark to the goal node; both distances can be looked up in the table we built while computing the landmarks. Since we want to get as close as possible to the actual distance, we take the maximum over all landmarks as the value of the heuristic function:
  
h(x, y) = max_i ( dist(x, landmark_i) − dist(landmark_i, y) )
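
A sketch of the precomputation and heuristic lookup, assuming a weighted adjacency representation and a connected, undirected graph (so dist(x, L) can be read from landmark L's table):

```python
import heapq

def dijkstra(adj, src):
    """Uniform cost search; adj maps node -> {neighbor: edge_weight}."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for w, cost in adj[u].items():
            if d + cost < dist.get(w, float("inf")):
                dist[w] = d + cost
                heapq.heappush(heap, (d + cost, w))
    return dist

def build_landmark_tables(adj, landmarks):
    """Precompute the full distance table for each landmark once."""
    return [dijkstra(adj, lm) for lm in landmarks]

def landmark_heuristic(tables, x, goal):
    """Lower bound on dist(x, goal) via the triangle inequality:
    dist(x, L) - dist(L, goal) <= dist(x, goal) for every landmark L."""
    return max(table[x] - table[goal] for table in tables)
```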

In the end we calculated the influence of some people in my network and took the ones showing the
highest reach. We ended up with 3 of my professors, which makes sense. However, through a few friends
we found that I am connected to Obama, and he scored lower than the professors. The explanation
for this puzzling result is that the network was gathered starting from me, so it is a local network centered around me.

In the second project, Woody Folsom and I explored people's preferences for food.


A Bayesian Approach to Collaborative Dish Selection

Bayesian catering, a use case: Imagine you run a catering service and have to plan an event with a customer. You can create a variety of dishes and now you want to discuss with your clients which ones to serve. In order to get a better idea of the preferences and needs your clients will have, you let them fill out a survey in advance, where they rate a small number of your dishes on a scale from 1 to 10 and inform you about hard constraints like allergies, religious constraints or vegetarianism. You then use those results to predict the ratings for the rest of your dishes and present the clients the top k results. If such a system works, it will save time and lead to better customer satisfaction, since you can present dishes the clients will most probably like but still surprise them (since you do not present them what they already rated). After the dinner, participants could rate the dishes served at the party, which would improve your predictions for that client in the future.

We model the diners' various taste preferences using a Bayes net. The net consists of three node types, which we call "control nodes", "taste nodes" and "rating nodes". A taste node models the probability of a diner's preference for an ingredient (P(likes tomato), P(likes potato)) or a category (P(likes meat)). These variables are discrete. The ingredients are conditionally independent from each other but conditioned on the food category they belong to (see the two top layers in Figure 2). A control node can definitely reject a dish by evaluating to 0 under certain conditions. For example, if someone is vegetarian and the presented dish contains meat, the control variable for vegetarian will evaluate to 0 and so the probability for the whole dish becomes 0; the vegetarian variable is therefore conditioned on meat. The third node type, the rating node, is continuous and models the dish rating given a set of ingredients. The overall net is shown in Figure 2.
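
To make the control-node idea concrete, here is a tiny hypothetical sketch of the vegetarian veto (the ingredient-to-category map is made up for illustration):

```python
CATEGORY = {"pork": "meat", "beef": "meat",
            "tomato": "vegetable", "potato": "vegetable"}

def control_factor(diner_is_vegetarian, dish_ingredients):
    """Control node: evaluates to 0 when a vegetarian diner faces a
    dish containing meat, forcing the whole dish probability to 0."""
    contains_meat = any(CATEGORY[i] == "meat" for i in dish_ingredients)
    if diner_is_vegetarian and contains_meat:
        return 0.0
    return 1.0
```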




Figure 2: Our Bayesian net modeling user preferences. The top layer describes the categories Meat and Vegetable. We have a control variable vegetarian for Meat, such that it will always evaluate to 0 when there is meat involved in a dish and the diner is vegetarian. The middle layer describes the preferences for different ingredients. The last layer is a Gaussian predicting the user's rating.


In order to estimate the model parameters, the system is trained with statistics about tastes and preferences, given a set of dishes with ratings from multiple users. The training set is generated from the questionnaires we distributed. An example of a survey output could look like this (ingredients, rating): "Pork, Potatoes, 8". In order to perform normal Maximum Likelihood Learning [3] we have to have information about all variables ("Pork, Potatoes, Tomatoes, Beef, Meat, Vegetables, Rating"), so we perform several steps to transform the survey input into a training instance. First we discretize the values such that all given variables (in our case Pork and Potatoes) are set to "true" if the rating is above a certain threshold (in our experiments 5) and "false" otherwise. In that way, liked ingredients (rated above 5) appear more often in the training set and will be assigned a higher probability. We then add categories by including the category of each ingredient from the survey: if the ingredient is liked, the category is liked too, and if it is not, neither is the category. The last step is to add all variables that are not in the recipe as "false" to the training instance.

From a set of those preprocessed assignments, we can directly calculate the probabilities for the ingredients using Maximum Likelihood Learning [4]. Having estimated the probabilities of such a net, we can infer the maximum likelihood rating of an unseen dish while observing only a set of ingredients. To do so, we iterate over all possible ratings (1 to 10) and compute the probability of each rating; the rating with maximum probability is the maximum likelihood rating for that dish. We use the enumerateAll algorithm [1] for the probability calculations. In a final experiment we showed that we are able to predict the ratings of users with a root mean square error of 1.92.
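
A rough sketch of the preprocessing and counting steps under these assumptions (the ingredient list and the `score` function are illustrative stand-ins, not our actual implementation):

```python
from collections import Counter

# Hypothetical ingredient-to-category map for illustration.
CATEGORIES = {"Pork": "Meat", "Beef": "Meat",
              "Potatoes": "Vegetables", "Tomatoes": "Vegetables"}

def preprocess(ingredients, rating, threshold=5):
    """Turn one survey answer, e.g. (["Pork", "Potatoes"], 8), into a
    complete boolean assignment over all variables plus the rating."""
    liked = rating > threshold
    instance = {var: False for var in CATEGORIES}                   # ingredients
    instance.update({cat: False for cat in set(CATEGORIES.values())})
    for ing in ingredients:
        instance[ing] = liked
        instance[CATEGORIES[ing]] = liked  # category follows the ingredient
    instance["Rating"] = rating
    return instance

def ml_estimate(instances, variable):
    """Maximum likelihood estimate of P(variable = true) by counting;
    conditional tables work the same way, counting per parent value."""
    counts = Counter(inst[variable] for inst in instances)
    return counts[True] / (counts[True] + counts[False])

def ml_rating(score, ingredients):
    """Pick the maximum likelihood rating by trying each rating 1-10;
    `score` stands in for the enumeration-based probability."""
    return max(range(1, 11), key=lambda r: score(r, ingredients))
```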




[1] Russell and Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 2010.
[2] Andrew V. Goldberg and Chris Harrelson. Computing the shortest path: A* search meets graph theory. Technical report, Microsoft Research, 2003.
[3] Wang, Knight, and Elder. On computer viral infection and the effect of immunization. In Proceedings of ACSAC '00, 2000.
[4] Kevin Murphy. A Brief Introduction to Graphical Models and Bayesian Networks.