Daniel@World: Black Box Variational Inference and PyFlux

Sunday, July 23, 2017

Black Box Variational Inference and PyFlux

I found an interesting paper on black box variational inference. Variational inference is an
approximate inference algorithm for factored probabilistic models (such as bayesian networks).

In probabilistic inference, we are normally interested in the posterior density of a latent variable
given all our data. The basic idea is to turn the inference problem into an optimisation problem.
That is, we build an approximate probability distribution that is differentiable with respect to it 's parameters. We then find the parameter set that minimises the difference between the true posterior and the approximate distribution.

The problem is that in order to do this for a new model, we still have to do a lot of manual derivations
and work through a lot of math before we can actually use this algorithm. This new paper
however, shows an algorithm that is easier to use given a new model. In my opinion,
one of the successes in deep learning is auto differentiation, where we do not
have to develop all gradients on our own but let the computer do it (as described in earlier posts).
One library that implements this inference method for complex time series is called pyFlux [2].

The proposed algorithm [1] works as follows. First, we sample the latent variables in our model
from the approximate distribution. We then use the samples to compute a gradient
for each parameter that minimises the Kullback-Leibler divergence or more specifically the Expectational Lower Bound (ELBO).

Reproduced from original paper [1]

The z[s] are samples from the latent space. Each z[s] includes a values for each of the latent variables. These samples are used to build the gradient.
The value of the Robbins - Monroe Sequence, is modelling an adaptive learning rate schedule
to guarantee convergence. The gradient has multiple components, including the joint probability of the original distribution, the log of the approximate distribution and the derivative of the approximate distribution. For a lot of cases, these values are easy to compute. First note that when using the samples, all variables in the original joint distribution are observed. For a Bayesian network that means that the joint distribution is simply the product of all variables:

In the paper and in the pyFlux library, the approximate distribution is implemented as a mean field.
Which means that the variables in our model are all independent. This makes the sampling and evaluation of the model very easy. Each variable can be sampled individually and the conditional distribution is simply the product of all the variables. Finally, the pyFlux library implements each variable in the approximate distribution as a one dimensional gaussian. Putting all that together,
the final gradient becomes: