Saturday, July 28, 2018

Simple Gaussian Processes For Regression

In this post I am implementing a simple Gaussian Process Regressor. A Gaussian Process is a
probabilistic kernel model. Basically I am following the "Machine Learning: A probabilistic Perspective" book by Kevin Murphy.

The Basics

For an input $x_i$ and output $y_i = f(x_i)$, infer a distribution over functions of the data:
$p(f | X, y)$. In order to make predictions $y*$ for a new data point $x*$ given all the previous observations, we define:

$p(y*|x*, X, y) = \int p(y*|f, x*) p(f|x, y) df$

For our data $X = x_1 ... x_n$ the Gaussian process assumption is that the the distributions over functions $p(f_1 ... f_N)$ is jointly Gaussian with mean $\mu(x)$ and a covariance function $\Sigma_{ij} = k(x_i, x_j)$ and a kernel $k$.

Covariance structure

It is quite common to leaf the mean function at zero. Given a kernel, we need to compute the covariance function between all training and test instances. However, we structure the kernel matrix in the following way. 



The sub-matrix $K$ describes the affinity between all training instances and the sub-matrix $K*$ describes the affinity between training and test instances. Finally, the submatrix $K**$ describes the affinity between test instances to itself. The image above shows the final covariance structure. With that, our function $f*$ can be inferred as: $(f, f*) ~ N(0, \Sigma)$.

In order to obtain the prediction calculate: $f* = K_*  \bullet K^{-1} \bullet f$.
If we want the find the variance of the predictions we calculate: $trace(K_{**} - K_*  \bullet K^{-1} \bullet  K_*^T)$

In my opinion the method is very similar to support vector machines and to instance beased learning methods in general. The difference is that a SVM is only using a small set of training examples called support vectors, while the Gaussian process compares every example to all others in order to make a prediction.


A Naive Gaussian Process

Below I implemented a naive Gaussian process using numpy. Basically, I am building the three components of the covariance structure and then solve the prediction equations above. I call my method naive since computing the inverse has a worst case complexity of $O(n^3)$
So it is not advisable to use this method for larger datasets.
class GaussianProcess:
    
    def __init__(self, kernel):
        self.kernel = kernel
        
    def predict(self, train_x, train_y, test_x):
        (n, _) = train_x.shape
        (m, _) = test_x.shape
        covar_train      = np.zeros((n, n))
        covar_train_test = np.zeros((n, m))
        covar_test       = np.zeros((m, m))
        for i in range(0, n):
            for j in range(0, n):
                covar_train[i, j] = self.kernel.k(train_x[i], train_x[j], i == j)
        for i in range(0, m):
            for j in range(0, m):
                covar_test[i, j] = self.kernel.k(test_x[i], test_x[j], i == j)
        for i in range(0, n):
            for j in range(0, m):
                covar_train_test[i, j] = self.kernel.k(train_x[i], test_x[j], False)
        covar_inv  = np.linalg.inv(covar_train)
        prediction = np.dot(np.dot(covar_train_test.T, covar_inv), train_y)
        confidence = covar_test - np.dot(covar_train_test.T, np.dot(covar_inv, covar_train_test))
        confidence = np.sqrt(np.array([confidence[i, i] for i in range(0, m)]))
        return prediction, confidence 


Squared Exponential Kernel

For my experiments I use a squared exponential kernel and also add a variance parameter for for noisy observations:

$k(x_i, x_j) = \sigma_f^2 e^{-\frac{1}{2l^2} || x_i - x_j ||_2^2} + \sigma_y^2 \mathbb{I}(i == j)$

The parameters are as follows:
  • $\sigma_f^2$: Variance giving the scale of the output
  • $\sigma_y^2 \mathbb{I}(i == j)$: Noise variance applied to diagnoal
  • $l$: Scale on input
again, the implementation is quite simple in numpy.
class SquaredExponential:

    def __init__(self, variance, alpha, noise):
        self.alpha = alpha
        self.variance = variance
        self.noise = noise
        self.scaler = -1.0 / (2 * alpha**2)
        
    def k(self, a, b, on_diag):
        error  = np.sum(np.square(a - b))
        f      = np.exp(self.scaler * error)
        scaled = self.variance * f
        if on_diag:
            scaled += self.noise
        return scaled   


Results on some simple example datasets

I tried my method on three datasets, 2 times, I am intepolating stock marktet data and one time a sine wave. Each time, the index is the input to the method and the output for the stock market prediction is the closing price.
Sine Wave Interpolation:

Stock Interpolation Dax:

Stock Interpolation US:

No comments:

Post a Comment