probabilistic kernel model. Basically I am following the "Machine Learning: A probabilistic Perspective" book by Kevin Murphy.

### The Basics

For an input $x_i$ and output $y_i = f(x_i)$, infer a distribution over functions of the data:$p(f | X, y)$. In order to make predictions $y*$ for a new data point $x*$ given all the previous observations, we define:

$p(y*|x*, X, y) = \int p(y*|f, x*) p(f|x, y) df$

For our data $X = x_1 ... x_n$ the Gaussian process assumption is that the the distributions over functions $p(f_1 ... f_N)$ is jointly Gaussian with mean $\mu(x)$ and a covariance function $\Sigma_{ij} = k(x_i, x_j)$ and a kernel $k$.

### Covariance structure

It is quite common to leaf the mean function at zero. Given a kernel, we need to compute the covariance function between all training and test instances. However, we structure the kernel matrix in the following way.

The sub-matrix $K$ describes the affinity between all training instances and the sub-matrix $K*$ describes the affinity between training and test instances. Finally, the submatrix $K**$ describes the affinity between test instances to itself. The image above shows the final covariance structure. With that, our function $f*$ can be inferred as: $(f, f*) ~ N(0, \Sigma)$.

In order to obtain the prediction calculate: $f* = K_* \bullet K^{-1} \bullet f$.

If we want the find the variance of the predictions we calculate: $trace(K_{**} - K_* \bullet K^{-1} \bullet K_*^T)$

In my opinion the method is very similar to support vector machines and to instance beased learning methods in general. The difference is that a SVM is only using a small set of training examples called support vectors, while the Gaussian process compares every example to all others in order to make a prediction.

### A Naive Gaussian Process

Below I implemented a naive Gaussian process using numpy. Basically, I am building the three components of the covariance structure and then solve the prediction equations above. I call my method naive since computing the inverse has a worst case complexity of $O(n^3)$So it is not advisable to use this method for larger datasets.

class GaussianProcess: def __init__(self, kernel): self.kernel = kernel def predict(self, train_x, train_y, test_x): (n, _) = train_x.shape (m, _) = test_x.shape covar_train = np.zeros((n, n)) covar_train_test = np.zeros((n, m)) covar_test = np.zeros((m, m)) for i in range(0, n): for j in range(0, n): covar_train[i, j] = self.kernel.k(train_x[i], train_x[j], i == j) for i in range(0, m): for j in range(0, m): covar_test[i, j] = self.kernel.k(test_x[i], test_x[j], i == j) for i in range(0, n): for j in range(0, m): covar_train_test[i, j] = self.kernel.k(train_x[i], test_x[j], False) covar_inv = np.linalg.inv(covar_train) prediction = np.dot(np.dot(covar_train_test.T, covar_inv), train_y) confidence = covar_test - np.dot(covar_train_test.T, np.dot(covar_inv, covar_train_test)) confidence = np.sqrt(np.array([confidence[i, i] for i in range(0, m)])) return prediction, confidence

### Squared Exponential Kernel

For my experiments I use a squared exponential kernel and also add a variance parameter for for noisy observations:$k(x_i, x_j) = \sigma_f^2 e^{-\frac{1}{2l^2} || x_i - x_j ||_2^2} + \sigma_y^2 \mathbb{I}(i == j)$

The parameters are as follows:

- $\sigma_f^2$: Variance giving the scale of the output
- $\sigma_y^2 \mathbb{I}(i == j)$: Noise variance applied to diagnoal
- $l$: Scale on input

class SquaredExponential: def __init__(self, variance, alpha, noise): self.alpha = alpha self.variance = variance self.noise = noise self.scaler = -1.0 / (2 * alpha**2) def k(self, a, b, on_diag): error = np.sum(np.square(a - b)) f = np.exp(self.scaler * error) scaled = self.variance * f if on_diag: scaled += self.noise return scaled

### Results on some simple example datasets

I tried my method on three datasets, 2 times, I am intepolating stock marktet data and one time a sine wave. Each time, the index is the input to the method and the output for the stock market prediction is the closing price.Sine Wave Interpolation:

Stock Interpolation Dax:

Stock Interpolation US: