Saturday, July 21, 2018

Financial Machine Learning Experiement

As a fun project, I tried to predict the stock market and implemented a small library todo so.
Basically, I want to use a sliding window of historic prices of several companies or funds and
predict the price of a single company. In the end, I achieve decent results in a 10 day window and
predicting 5 days into the future. However, my error is still several euros large :)

The code can be found here: [:Github:]

Learning Problem, Features and Modeling

First we define the return of interest (ROI) as:

 $roi(t, t + 1) = \frac{x_t - x_{t + 1}}{x_t} $

Which is the percentage of earnings at a time $t + 1$ measured to a previous investment
at $t$. In order to extract the features, we compute sliding windows over our historic prices.
For each sample in a window, we compute the ROI to the start of the window, which represents
the earnings to the start of the window. For a window of $T$ steps and $n$ stocks, we flatten the sliding windows and get a feature vector of size $T x n$ of ROI entries. Our target variable we want
to predict is a stock price $d$ days after the last day of the window. The traget is converted
to the ROI, too. So we try to predict the earnings after $d$ days after the last day of the window
which represents a potential investment.

Red: roi in the window, Blue: ROI for prediction

Some Experimental Results

First we downloaded several historic price datasets from yahoo finance and load them using pandas. We take the date column as the index and interpolate missing values:
data = [
    ('euroStoxx50', pd.read_csv('data/stoxx50e.csv', index_col=0, na_values='null').interpolate('linear')),
    ('dax',         pd.read_csv('data/EL4A.F.csv',   index_col=0, na_values='null').interpolate('linear')),
    ('us',          pd.read_csv('data/EL4Z.F.csv',   index_col=0, na_values='null').interpolate('linear')),
    ('xing',        pd.read_csv('data/O1BC.F.csv',   index_col=0, na_values='null').interpolate('linear')),
    ('google',      pd.read_csv('data/GOOGL.csv',    index_col=0, na_values='null').interpolate('linear')),
    ('facebook',    pd.read_csv('data/FB2A.DE.csv',  index_col=0, na_values='null').interpolate('linear')),
    ('amazon',      pd.read_csv('data/AMZN.csv',     index_col=0, na_values='null').interpolate('linear'))

We learn the following classifier on our data.
predictors = [
    ('RF', RandomForestRegressor(n_estimators=250)),
    ('GP', GaussianProcessRegressor(kernel=RBF(length_scale=2.5))),
    ('NN', KNeighborsRegressor(n_neighbors=80)),
    ('NE', KerasPredictor(model, 10, 512, False)),
    ('GB', GradientBoostingRegressor(n_estimators=250))

The neural network's architecture (named model above):
hidden = [256, 128, 64, 32]
inp    = (len(data.stoxx) * WIN,)
model  = Sequential()
model.add(Dense(hidden[0], activation='relu', input_shape=inp))
for h in hidden[1:]:
    model.add(Dense(h, activation='relu'))
model.add(Dense(1, activation='linear'))
model.compile('adam', 'mse')

Below we show some prediction results for our classifiers.
We also not the root mean square error in euros. 

No comments:

Post a Comment