Neural Networks and the Stock Market Pt. 3 – Training and Performance

See Part 2 of the series here.

So in the last entry, I detailed the code I wrote to implement my neural network, which was a feed-forward network that backpropagates errors. The focus for this entry is to try to make some predictions based on freely available stock price data (from Yahoo! Finance), and get some rough estimates of how well the network is able to forecast.

Code From Previous Entries

Since this entry builds on the previous two, I figured it would be helpful to present that code in one place below.

import sys
import os
import random
import math
import numpy as np
import pandas as pd
import yahoo_finance as yf
import time

def getHistoricalData(symbol_name, start_date, end_date, save_data=False, pathname=""):
    symbol     = yf.Share(symbol_name)
    price_data = symbol.get_historical(start_date, end_date)
    price_df   = pd.DataFrame(price_data)

    if save_data:
        if len(pathname) > 0:
            if not os.path.exists(pathname):
                os.makedirs(pathname)
            filename = os.path.join(pathname, symbol_name + "_" + start_date + "_" + end_date + ".csv")
            print "Ticker data for",symbol_name,"saved to:", pathname 
        else:
            filename = symbol_name + "_" + start_date + "_" + end_date + ".csv"
            print "Ticker data for",symbol_name,"saved to local directory"
        price_df.to_csv(filename)

    return price_df

class Node(object):
    def __init__(self,number_of_inputs):
        self.inputs  = number_of_inputs
        self.bias    = np.random.uniform(0.0,1.0)
        # draw an independent starting weight for each input
        self.weights = np.random.uniform(0.0,1.0,number_of_inputs)
        self.output  = 0.0

    def getOutput(self):
        # named getOutput because the self.output attribute would shadow a method called output
        return self.output

    def debug_info(self):
        info =  "Bias: %f ; Weights:"%(self.bias)
        for w in self.weights:
            info += "%f," %(w)
        return info

    def getWeightAtIdx(self,idx):
        return self.weights[idx]

    def getBias(self):
        return self.bias

    def calculateActivity(self,input_vector):
        #linear basis function
        activity = self.bias
        activity += np.dot(input_vector,self.weights)
        return activity

    def activationFunction(self,input_value):
        # Sigmoid Activation
        return 1.0/(1.0 + math.exp(-input_value))    

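    def calculate(self,input_vector):
        # remember the inputs so updateWeights can scale each weight's step by its own input
        self.last_input = np.array(input_vector, dtype=float)
        activity_value = self.calculateActivity(input_vector)
        self.output = self.activationFunction(activity_value)

    def updateWeights(self,alpha,delta):
        # delta rule: the bias moves by alpha*delta, each weight by alpha*delta*input
        self.bias = self.bias + alpha * delta
        self.weights = self.weights + alpha * delta * self.last_input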

class FeedForwardNet(object):
    def __init__(self,no_of_inputs,no_of_hidden_layers,nodes_in_hiddens,no_of_outputs,learning_rate):
        self.number_of_inputs        = no_of_inputs
        self.number_of_hidden_layers = no_of_hidden_layers
        self.hidden_nodes            = []
        self.hidden_outputs          = []
        self.hidden_nodes.append(np.array([Node(no_of_inputs) for x in range(nodes_in_hiddens[0])]))
        self.hidden_outputs.append(np.array([0.0 for x in range(nodes_in_hiddens[0])]))
        if no_of_hidden_layers > 1:
            for i in range(1,len(nodes_in_hiddens)):
                self.hidden_nodes.append(np.array([Node(nodes_in_hiddens[i-1]) for x in range(nodes_in_hiddens[i])]))
                self.hidden_outputs.append(np.array([0.0 for x in range(nodes_in_hiddens[i])]))

        self.hidden_node_list        = nodes_in_hiddens

        self.output_layer        = np.array([Node(nodes_in_hiddens[-1]) for i in range(no_of_outputs)])


        self.number_of_outputs       = no_of_outputs
        self.network_output          = np.array([0.0 for i in range(no_of_outputs)])
        self.errors                  = np.array([0.0 for i in range(no_of_outputs)])
        self.alpha                   = learning_rate

    def getNetOutputs(self):
        return self.network_output

    def debug_info(self):
        print "Number of Inputs: ", self.number_of_inputs
        print "Number of Hidden Nodes: ", self.hidden_node_list
        print "Number of Outputs: ", self.number_of_outputs

        print "Hidden Layer Node Weights:"
        count = 1
        for layer in self.hidden_nodes:
            print "Hidden Layer",count,":"
            count +=1
            for node in layer:
                print node.debug_info()

        print "Ouput Layer Node Weights:"
        for node in self.output_layer:
            print node.debug_info()

        print "Output from network:"
        print self.network_output
        print "Network Errors:"
        print self.errors


    def FeedForward(self,input_vector,true_outputs=None,Training=False):

        for y in range(len(self.hidden_nodes)):
            layer  = self.hidden_nodes[y]
            output = self.hidden_outputs[y]
            for x in range (len(layer)):
                layer[x].calculate(input_vector)
                output[x] = layer[x].output
            input_vector = output
        hidden_output = self.hidden_outputs[-1]
        for x in range(self.number_of_outputs):
            self.output_layer[x].calculate(hidden_output)
            self.network_output[x] = self.output_layer[x].output
            if Training:
                # record the error for every output node, not only the last one
                self.errors[x] = true_outputs[x] - self.output_layer[x].output

        if Training:
            self.BackPropagate()
        else:
            return self.network_output

    def BackPropagate(self):
        deltas_for_layer = []
        for i in range(self.number_of_outputs):
            output = self.network_output[i]
            delta_o = self.errors[i] * (output * (1.0-output))
            self.output_layer[i].updateWeights(self.alpha,delta_o)
            deltas_for_layer.append(delta_o)
        prev_layer = self.output_layer
        for y in range(len(self.hidden_nodes)):
            layer  = self.hidden_nodes[-(1+y)]
            current_layer_deltas = []
            for j in range(len(layer)):
                output = layer[j].output
                # reset the accumulated downstream error for each node
                prev_layer_factor = 0
                for x in range(len(prev_layer)):
                    prev_layer_factor += prev_layer[x].getWeightAtIdx(j) * deltas_for_layer[x]

                delta_h = (output * (1.0-output)) * prev_layer_factor
                current_layer_deltas.append(delta_h)
                layer[j].updateWeights(self.alpha,delta_h)
            prev_layer = layer            
            deltas_for_layer = current_layer_deltas



In addition to that code, I also have the following two helper functions. The first simply seeds both the Python Standard Library random number generator and the NumPy random number generator. To accurately recreate results from any algorithm that employs randomness, seeding your random number generator and saving the seed is extremely important, and this function provides a mechanism to do so.

The second function provides stats for a Pandas DataFrame. Its purpose is basically to provide me with debugging information about my datasets.

def Seed_RNG(seed_val):      
    print "RANDOM SEED: ",seed_val
    # seed from the argument rather than the global random_seed
    random.seed(seed_val)
    np.random.seed(seed_val)

def examine_data_frame(df):
    for name in df.columns:
        print "----------"
        print df[name].dtype
        # object columns get value counts, numeric columns get summary stats
        if df[name].dtype == np.dtype('O'):
            print df[name].value_counts()
            print "Name: ", name
        else:
            print df[name].describe()

Initial Thoughts On Training

As a starting point, I think that the S&P 500 ETF (exchange-traded fund) SPY is a good first choice for a ‘stock’ to attempt to forecast. SPY (and the S&P 500) tend to be used as benchmarks for trading and investing strategies. Quantopian uses it as the benchmark to beat, and (at least in my opinion) it is a fairly low-volatility fund to follow. I think this low volatility may make the task of prediction easier.

The data provided from Yahoo! is the opening, closing, and adjusted closing price, the daily high, the daily low, and the trading volume for each day. The next question is: from these 6 variables, what should be presented to the network? I think that the daily highs and lows should be avoided. This data can sometimes be a bit unreliable, as different sources may report different values. Additionally, if I were to implement some type of system where the price is monitored in real time (like with a live feed of price data), I wouldn’t know the high/low for the day until the day is over. My concern with using high/low is that it seems fairly easy to introduce future data into the system, so I’m not going to use it. For now, I’ll just be using the open/close prices and the trading volume as inputs.

Finally, I think that there are two things I’m going to try to predict: whether a stock will close higher tomorrow and whether a stock will close higher a week from today (which is 5 trading days).

spy = "SPY"

start = "2013-01-01"
end   = "2017-01-01"

spy_df = getHistoricalData("SPY",start,end)
spy_df[['Open','Close','Volume']] = spy_df[['Open','Close','Volume']].apply(pd.to_numeric)
spy_df.info()

RangeIndex: 1008 entries, 0 to 1007
Data columns (total 8 columns):
Adj_Close    1008 non-null object
Close        1008 non-null float64
Date         1008 non-null object
High         1008 non-null object
Low          1008 non-null object
Open         1008 non-null float64
Symbol       1008 non-null object
Volume       1008 non-null int64
dtypes: float64(2), int64(1), object(5)
memory usage: 63.1+ KB
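
Before any look-back logic is applied, it may be worth confirming which direction the rows run. The sample rows printed later in this entry list 2016-11-16 ahead of 2016-11-15, which suggests the frame arrives newest-first, while any step that treats row i-1 as "yesterday" assumes oldest-first. Here is a minimal sketch of a guard. The helper name sort_oldest_first is hypothetical (it is not part of the original code), the Date column name is taken from the output above, and the demo values are made up.

```python
import pandas as pd

def sort_oldest_first(df, date_col="Date"):
    """Return a copy of df ordered from the oldest date to the newest."""
    out = df.copy()
    out[date_col] = pd.to_datetime(out[date_col])   # strings -> timestamps
    out = out.sort_values(date_col)                 # ascending by default
    return out.reset_index(drop=True)               # row 0 is now the oldest day

# Tiny made-up frame in newest-first order, mimicking the layout above.
demo = pd.DataFrame({"Date": ["2016-11-16", "2016-11-15", "2016-11-14"],
                     "Close": [3.0, 2.0, 1.0]})
demo_sorted = sort_oldest_first(demo)
```

Applied to the SPY frame, this would be something like spy_df = sort_oldest_first(spy_df) ahead of the scaling step.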

Preprocessing the data

So this data looks pretty much as I would expect. The data does need some preprocessing, however, namely some normalization and standardization. This isn’t strictly necessary for a neural network, but it is still a very good idea. Doing so, along with the random initialization of weights, allows the network to reduce errors more quickly. Additionally, the open/close prices and trading volume have very different scales, so normalizing gets us closer to an ‘apples to apples’ situation with our variables. To do this, I’ve taken the most basic approach, which is to center about the mean by subtracting the mean from the value and then dividing by 2x the standard deviation.

However, I don’t think that the mean and standard deviations should be computed for the entire data set. Since this is a time series, at time t, you wouldn’t know the mean or standard deviation for the entire dataset, as it includes data from time t+1 and greater. To get around introducing future data to the network, I think the best approach would be to use a moving average and standard deviation for normalization.

The code below does this. The code takes in a number of days to scale on, the variable to scale, and the data frame. A scaled version of this variable is added to the dataframe. Then the function will loop through the data and compute a mean and standard deviation, including the past input number of days. The value is then scaled based on that and added to the frame.

def scale_on_lookback_window(num_of_days,variable,dataframe):
    scaled_var = variable + "_scaled"
    dataframe[scaled_var] = np.nan
    # .values avoids the deprecated as_matrix(); assumes the default 0..n-1 index
    var_array = dataframe[variable].values
    for i in range(num_of_days,len(dataframe)):
        # statistics come only from the num_of_days rows before row i
        data_slice = var_array[(i-num_of_days):i]
        data_avg = np.mean(data_slice)
        data_std = np.std(data_slice)
        # .loc writes to the frame itself, so no SettingWithCopyWarning is raised
        dataframe.loc[i, scaled_var] = (var_array[i] - data_avg) / (2.0*data_std)


scale_window = 30
scale_on_lookback_window(scale_window,"Open",spy_df)
scale_on_lookback_window(scale_window,"Close",spy_df)
scale_on_lookback_window(scale_window,"Volume",spy_df)
spy_df.drop(spy_df.index[0:scale_window], inplace=True)
spy_df.reset_index(drop=True,inplace=True)
spy_df.info()
spy_df.head()
RangeIndex: 978 entries, 0 to 977
Data columns (total 11 columns):
Adj_Close        978 non-null object
Close            978 non-null float64
Date             978 non-null object
High             978 non-null object
Low              978 non-null object
Open             978 non-null float64
Symbol           978 non-null object
Volume           978 non-null int64
Open_scaled      978 non-null float64
Close_scaled     978 non-null float64
Volume_scaled    978 non-null float64
dtypes: float64(5), int64(1), object(5)
memory usage: 84.1+ KB


    Adj_Close       Close        Date        High         Low        Open  \
0  216.593374  217.869995  2016-11-16  218.139999  217.419998  217.559998
1  217.000975  218.279999  2016-11-15  218.279999  216.800003  217.039993
2  215.320876  216.589996  2016-11-14  217.270004  215.720001  217.029999
3  215.151874  216.419998  2016-11-11  216.699997  215.320007  216.080002
4  215.648944  216.919998  2016-11-10  218.309998  215.220001  217.300003

  Symbol     Volume  Open_scaled  Close_scaled  Volume_scaled
0    SPY   65617700    -0.992043     -0.975247      -0.280858
1    SPY   91652600    -0.985221     -0.819481       0.189329
2    SPY   94580000    -0.895887     -1.030917       0.219208
3    SPY  100552700    -0.969539     -0.957311       0.306738
4    SPY  172113300    -0.707726     -0.802213       1.601762
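
The same look-back scaling can also be written without an explicit loop. The sketch below is an assumption about how that might look using pandas' rolling and shift methods. The helper name rolling_scale is hypothetical and does not appear in the original code. The shift(1) keeps the current row out of its own statistics, which mirrors the slice [(i-num_of_days):i] used in scale_on_lookback_window, and ddof=0 matches the default of np.std.

```python
import numpy as np
import pandas as pd

def rolling_scale(series, window):
    """Scale each value by the mean and std of the `window` rows before it."""
    prior = series.shift(1)                        # exclude the current row
    mean = prior.rolling(window).mean()
    std = prior.rolling(window).std(ddof=0)        # population std, like np.std
    return (series - mean) / (2.0 * std)

# Made-up values: each point sits 1.5 above the mean of the two rows before it,
# and those two rows have a std of 0.5, so every defined result is 1.5 / 1.0.
demo = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
scaled = rolling_scale(demo, 2)
```

The first window rows come back as NaN, just as in the loop version. A window whose values are all identical has a standard deviation of zero and would need separate handling.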

The last bit of preprocessing I’ll be doing to the dataset is adding the true values for what we’re trying to predict. I’ll add variables representing tomorrow’s close, next week’s close, and binary variables indicating if the closing price is up from today. You don’t necessarily need to do this, but I think this will make processing the data easier.

print spy_df.head()

spy_df["tomorrow_close"] = np.nan
spy_df["week_close"]     = np.nan
spy_df["tomorrow_up"]    = np.nan
spy_df["week_up"]        = np.nan

for i in range(1,len(spy_df)):
    # .loc assigns on the frame directly and avoids SettingWithCopyWarning
    spy_df.loc[i-1,"tomorrow_close"] = spy_df["Close_scaled"][i]
    if spy_df["Close_scaled"][i] > spy_df["Close_scaled"][i-1]:
        spy_df.loc[i-1,"tomorrow_up"] = 1
    else:
        spy_df.loc[i-1,"tomorrow_up"] = 0

for i in range(5,len(spy_df)):
    spy_df.loc[i-5,"week_close"] = spy_df["Close_scaled"][i]
    if spy_df["Close_scaled"][i] > spy_df["Close_scaled"][i-5]:
        spy_df.loc[i-5,"week_up"] = 1
    else:
        spy_df.loc[i-5,"week_up"] = 0

spy_df.drop(spy_df.index[-5:], inplace=True)

spy_df.info()
spy_df.head()

    Adj_Close       Close        Date        High         Low        Open  \
0  216.593374  217.869995  2016-11-16  218.139999  217.419998  217.559998   
1  217.000975  218.279999  2016-11-15  218.279999  216.800003  217.039993   
2  215.320876  216.589996  2016-11-14  217.270004  215.720001  217.029999   
3  215.151874  216.419998  2016-11-11  216.699997  215.320007  216.080002   
4  215.648944  216.919998  2016-11-10  218.309998  215.220001  217.300003   

  Symbol     Volume  Open_scaled  Close_scaled  Volume_scaled  
0    SPY   65617700    -0.992043     -0.975247      -0.280858  
1    SPY   91652600    -0.985221     -0.819481       0.189329  
2    SPY   94580000    -0.895887     -1.030917       0.219208  
3    SPY  100552700    -0.969539     -0.957311       0.306738  
4    SPY  172113300    -0.707726     -0.802213       1.601762  

Int64Index: 973 entries, 0 to 972
Data columns (total 15 columns):
Adj_Close         973 non-null object
Close             973 non-null float64
Date              973 non-null object
High              973 non-null object
Low               973 non-null object
Open              973 non-null float64
Symbol            973 non-null object
Volume            973 non-null int64
Open_scaled       973 non-null float64
Close_scaled      973 non-null float64
Volume_scaled     973 non-null float64
tomorrow_close    973 non-null float64
week_close        973 non-null float64
tomorrow_up       973 non-null float64
week_up           973 non-null float64
dtypes: float64(9), int64(1), object(5)
memory usage: 121.6+ KB
    Adj_Close       Close        Date        High         Low        Open  \
0  216.593374  217.869995  2016-11-16  218.139999  217.419998  217.559998
1  217.000975  218.279999  2016-11-15  218.279999  216.800003  217.039993
2  215.320876  216.589996  2016-11-14  217.270004  215.720001  217.029999
3  215.151874  216.419998  2016-11-11  216.699997  215.320007  216.080002
4  215.648944  216.919998  2016-11-10  218.309998  215.220001  217.300003

  Symbol     Volume  Open_scaled  Close_scaled  Volume_scaled  tomorrow_close  \
0    SPY   65617700    -0.992043     -0.975247      -0.280858       -0.819481
1    SPY   91652600    -0.985221     -0.819481       0.189329       -1.030917
2    SPY   94580000    -0.895887     -1.030917       0.219208       -0.957311
3    SPY  100552700    -0.969539     -0.957311       0.306738       -0.802213
4    SPY  172113300    -0.707726     -0.802213       1.601762       -0.820529

   week_close  tomorrow_up  week_up
0   -0.820529          1.0      1.0
1   -1.078715          0.0      0.0
2   -1.105524          1.0      0.0
3   -1.597586          1.0      0.0
4   -1.325594          0.0      0.0
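
The labeling loops above can also be expressed with pandas' shift method, which pulls the value from a fixed number of rows ahead onto the current row. This is only a sketch under the assumption that "ahead" means a larger row index. The helper add_forward_labels and the column names it generates are hypothetical, and the demo values are made up.

```python
import numpy as np
import pandas as pd

def add_forward_labels(df, column, horizon):
    """Add the value `horizon` rows ahead plus a 1.0/0.0 'is it higher' flag."""
    out = df.copy()
    value_col = "%s_%dd" % (column, horizon)
    up_col = value_col + "_up"
    future = out[column].shift(-horizon)           # value `horizon` rows ahead
    up = (future > out[column]).astype(float)
    up[future.isnull()] = np.nan                   # no label where the future is unknown
    out[value_col] = future
    out[up_col] = up
    # the last `horizon` rows have no future value, so they are dropped
    return out.dropna(subset=[value_col]).reset_index(drop=True)

demo = pd.DataFrame({"Close_scaled": [1.0, 2.0, 1.5, 3.0]})
labeled = add_forward_labels(demo, "Close_scaled", 1)
```

Unlike the loops, this version returns a new frame instead of modifying the one passed in.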

Training and Predictions

Below is my code for running the whole dataset created above. The basic process here is:
* Set a number of days to use in training
* Take that number of days – 1 to train the network on
* Use the last day in the window to make a prediction
* If the prediction is greater than that day’s close: the prediction is 1
* Else: the prediction is 0

This style of training the network is called online learning, where you feed it examples sequentially, as opposed to training with batch-style techniques. Additionally, we don’t keep the network from day to day, meaning each day we develop a new model for the stock price. The main reason I’m doing this is to try to prevent some amount of overfitting. My thought is that if there is some correlation between a stock’s past price and its future price, it will be heavily weighted toward the most recent price data, so why include old data at all?
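
To make the windowing concrete, here is a small sketch of which rows are used for training and which row is predicted at each step. It mirrors the index arithmetic of the training loop in this entry (idx = i - lookback_window + x, with x running over lookback_window - 1 values), but the generator walk_forward_windows itself is a hypothetical helper, not part of the original code.

```python
def walk_forward_windows(n_rows, window):
    """Yield (training_rows, prediction_row) pairs for a sliding window."""
    for i in range(window, n_rows):
        training_rows = list(range(i - window, i - 1))   # window - 1 rows
        yield training_rows, i                           # a fresh model is fit per pair

# With 6 rows and a window of 3, rows 3, 4 and 5 each get a prediction.
pairs = list(walk_forward_windows(6, 3))
```

With a window of 3, each fresh network sees only two training rows, which is worth keeping in mind when reading the results.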

Additionally, you’ll notice that most of the parameters and meta-parameters of this training and prediction function seem chosen in fairly arbitrary fashion (like only looking back 3 days, the number of nodes in the network, etc.). That’s because, quite frankly, they are arbitrary. To get good values for these, I think you would really need to create validation curves for each of these parameters. I will probably end up doing this. I don’t think, however, that I’ll include that as a blog entry, and certainly not in this one.

random_seed = int(time.time())
Seed_RNG(random_seed)

num_of_iterations = 250
hidden_layers = 1
lookback_window = 3

daily_residuals  = [-1.0 for i in range(lookback_window)]
weekly_residuals = [-1.0 for i in range(lookback_window)]
for i in range(lookback_window,len(spy_df["tomorrow_close"])):
    #no_of_inputs,no_of_hidden_layers,nodes_in_hiddens,no_of_outputs,learning_rate
    stock_net = FeedForwardNet(3,1,[7],2,0.35)
    for iterations in range(num_of_iterations):
        for x in range(lookback_window-1):
            idx = i - lookback_window + x
            training_vector = np.array([float(spy_df["Open_scaled"][idx]),float(spy_df["Close_scaled"][idx]),
                                        float(spy_df["Volume_scaled"][idx])])
            training_output = [float(spy_df["tomorrow_close"][idx]),float(spy_df["week_close"][idx])]

            stock_net.FeedForward(training_vector,training_output,Training=True)
    # the prediction vector uses day i for all three inputs
    pred_vector = [float(spy_df["Open_scaled"][i]),float(spy_df["Close_scaled"][i]),float(spy_df["Volume_scaled"][i])]
    pred_closes = stock_net.FeedForward(pred_vector)
    if pred_closes[0] > spy_df["Close_scaled"][i]:
        daily_residuals.append(1)
        #print "here"
    else:
        daily_residuals.append(0)

    if pred_closes[1] > spy_df["Close_scaled"][i]:
        weekly_residuals.append(1)
    else:
        weekly_residuals.append(0)
    if (i % 100 == 0):
        print i, "iterations"
print len(weekly_residuals)

spy_df["daily_residuals"]  = daily_residuals 
spy_df["weekly_residuals"] = weekly_residuals
spy_df.info()
spy_df.head()

RANDOM SEED:  1484778788
100 iterations
200 iterations
300 iterations
400 iterations
500 iterations
600 iterations
700 iterations
800 iterations
900 iterations
973

Int64Index: 973 entries, 0 to 972
Data columns (total 17 columns):
Adj_Close           973 non-null object
Close               973 non-null float64
Date                973 non-null object
High                973 non-null object
Low                 973 non-null object
Open                973 non-null float64
Symbol              973 non-null object
Volume              973 non-null int64
Open_scaled         973 non-null float64
Close_scaled        973 non-null float64
Volume_scaled       973 non-null float64
tomorrow_close      973 non-null float64
week_close          973 non-null float64
tomorrow_up         973 non-null float64
week_up             973 non-null float64
daily_residuals     973 non-null float64
weekly_residuals    973 non-null float64
dtypes: float64(11), int64(1), object(5)
memory usage: 136.8+ KB
    Adj_Close       Close        Date        High         Low        Open  \
0  216.593374  217.869995  2016-11-16  218.139999  217.419998  217.559998
1  217.000975  218.279999  2016-11-15  218.279999  216.800003  217.039993
2  215.320876  216.589996  2016-11-14  217.270004  215.720001  217.029999
3  215.151874  216.419998  2016-11-11  216.699997  215.320007  216.080002
4  215.648944  216.919998  2016-11-10  218.309998  215.220001  217.300003

  Symbol     Volume  Open_scaled  Close_scaled  Volume_scaled  tomorrow_close  \
0    SPY   65617700    -0.992043     -0.975247      -0.280858       -0.819481
1    SPY   91652600    -0.985221     -0.819481       0.189329       -1.030917
2    SPY   94580000    -0.895887     -1.030917       0.219208       -0.957311
3    SPY  100552700    -0.969539     -0.957311       0.306738       -0.802213
4    SPY  172113300    -0.707726     -0.802213       1.601762       -0.820529

   week_close  tomorrow_up  week_up  daily_residuals  weekly_residuals
0   -0.820529          1.0      1.0             -1.0              -1.0
1   -1.078715          0.0      0.0             -1.0              -1.0
2   -1.105524          1.0      0.0             -1.0              -1.0
3   -1.597586          1.0      0.0              1.0               1.0
4   -1.325594          0.0      0.0              1.0               1.0

Initial Performance

Here’s how the network did for both the daily predictions and weekly predictions.

print "STATS FOR DAILY PREDICTIONS"
error_count     = 0
false_pos_count = 0
false_neg_count = 0
up_preds        = 0
down_preds      = 0
for i in range (0,len(spy_df["tomorrow_up"])):
    if spy_df["daily_residuals"][i] != spy_df["tomorrow_up"][i]:
        error_count += 1
    if spy_df["daily_residuals"][i] > spy_df["tomorrow_up"][i]:
        false_pos_count += 1
    if spy_df["daily_residuals"][i] < spy_df["tomorrow_up"][i]:
        false_neg_count += 1       
    if spy_df["daily_residuals"][i] == 1.0:
        up_preds += 1
    else:
        down_preds += 1

error_rate = error_count / float(len(spy_df["daily_residuals"]))
false_pos_rate = false_pos_count / float(up_preds)
false_neg_rate = false_neg_count / float(down_preds)

print error_count, len(spy_df["daily_residuals"])

print "Error rate: ", error_rate
print "Accuracy  : ", 1 - error_rate
print "False Positive Rate: ", false_pos_rate
print "False Negative Rate: ", false_neg_rate
print "Count of upward predictions: ", up_preds
print "Count of down predictions:", down_preds 

STATS FOR DAILY PREDICTIONS
484 973
Error rate:  0.497430626927
Accuracy  :  0.502569373073
False Positive Rate:  0.506922257721
False Negative Rate:  0.235294117647
Count of upward predictions:  939
Count of down predictions: 34
print "STATS FOR WEEKLY PREDICTIONS"
error_count     = 0
false_pos_count = 0
false_neg_count = 0
up_preds        = 0
down_preds      = 0
for i in range (0,len(spy_df["week_up"])):
    if spy_df["weekly_residuals"][i] != spy_df["week_up"][i]:
        error_count += 1
    if spy_df["weekly_residuals"][i] > spy_df["week_up"][i]:
        false_pos_count += 1
    if spy_df["weekly_residuals"][i] < spy_df["week_up"][i]:
        false_neg_count += 1       
    if spy_df["weekly_residuals"][i] == 1.0:
        up_preds += 1
    else:
        down_preds += 1

error_rate = error_count / float(len(spy_df["weekly_residuals"]))
false_pos_rate = false_pos_count / float(up_preds)
false_neg_rate = false_neg_count / float(down_preds)

print error_count, len(spy_df["weekly_residuals"])

print "Error rate: ", error_rate
print "Accuracy  : ", 1 - error_rate
print "False Positive Rate: ", false_pos_rate
print "False Negative Rate: ", false_neg_rate
print "Count of upward predictions: ", up_preds
print "Count of down predictions:", down_preds 

STATS FOR WEEKLY PREDICTIONS
332 973
Error rate:  0.34121274409
Accuracy  :  0.65878725591
False Positive Rate:  0.410666666667
False Negative Rate:  0.107623318386
Count of upward predictions:  750
Count of down predictions: 223

So attempting to predict daily Up/Downs is a wash, with an error rate of roughly 50%. The weekly predictions, I think, are far more interesting, with an error rate down at 34%. That works out to a correct prediction roughly 2 out of every 3 times. It’s also very interesting to note that the false negative rate of the weekly predictions is low, at just under 11%.

Some caveats to the above results: we’ve only included data from the past 4 years in this analysis, so the results may not be indicative of other time periods. In other words, is this method (and the parameters selected) merely optimized for this specific set of data? Additionally, the low false negative rate for the weekly predictions is based on only 200+ “down” predictions, so that low rate may not hold up as more “downs” are predicted.
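
One note on reading the rates printed above: each is divided by the number of predictions of that class (false positives over all "up" predictions, false negatives over all "down" predictions). The more common convention divides by the number of actual downs and actual ups instead. Below is a minimal sketch that reports both. The helper direction_metrics and its key names are hypothetical, and the inputs are made-up labels rather than output from the network.

```python
def direction_metrics(predicted, actual):
    """Summarise binary up/down predictions (1 = up, 0 = down)."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)
    tn = sum(1 for p, a in pairs if p == 0 and a == 0)
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)

    def ratio(num, den):
        # avoid a ZeroDivisionError when a class is never predicted or never occurs
        return float(num) / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        # rates as computed in this entry: errors per prediction of that class
        "wrong_share_of_up_preds": ratio(fp, tp + fp),
        "wrong_share_of_down_preds": ratio(fn, tn + fn),
        # conventional rates: errors per actual member of the class
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, fn + tp),
    }

# Made-up labels: 2 true positives, 2 false positives, 1 true negative, 1 false negative.
demo = direction_metrics([1, 1, 1, 0, 0, 1], [1, 0, 1, 0, 1, 0])
```

The two conventions can give quite different numbers when the predictions are lopsided, as they are for the daily forecasts, where 939 of 973 predictions are "up".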

“Canning” This Process

To explore this process further, I think it makes sense to create functions out of the above code, so that this process can be repeated any number of times with any arbitrary ticker symbol, any start and stop dates, and with arbitrary network and training parameters. The four functions below do this. The first function normalizes the variables Open, Close, and Volume. The second adds variables to the dataset for predicting a given number of days ahead. The third is essentially the training/predicting process that was run above for SPY. The final function then provides some basic metrics for how well the network was able to make classifications.

def getAndSmoothData(symbol,start_day,end_day,lookback_window):
    # use the function's own arguments rather than the globals start, end and scale_window
    df = getHistoricalData(symbol,start_day,end_day)
    df[['Open','Close','Volume']] = df[['Open','Close','Volume']].apply(pd.to_numeric)
    scale_on_lookback_window(lookback_window,"Open",df)
    scale_on_lookback_window(lookback_window,"Close",df)
    scale_on_lookback_window(lookback_window,"Volume",df)
    df.drop(df.index[0:lookback_window], inplace=True)
    df.reset_index(drop=True,inplace=True)
    return df

def addForecastingVariable(days_to_predict,df,var_to_predict):
    var_name = "%s_%d_day_forecast" % (var_to_predict,days_to_predict)
    #print var_name
    df[var_name] = np.nan
    df[(var_name + "_up")] = np.nan

    for i in range(days_to_predict,len(df)):
        # .loc assigns on the frame directly and avoids SettingWithCopyWarning
        df.loc[i-days_to_predict,var_name] = df[var_to_predict][i]
        if df[var_to_predict][i] > df[var_to_predict][i-days_to_predict]:
            df.loc[i-days_to_predict,var_name + "_up"] = 1
        else:
            df.loc[i-days_to_predict,var_name + "_up"] = 0

    df.drop(df.index[-days_to_predict:], inplace=True)
    df.reset_index(drop=True,inplace=True)
    #print df.info()
    #print df.head()

def run_data(df,input_vars,output_vars,hidden_nodes_list,learning_rate,training_window,training_iterations):

    residuals = []
    print len(df["Close_scaled"])
    for x in range(len(output_vars)):
        residuals.append([-1.0 for i in range(training_window)])
    for i in range(training_window,len(df["Close_scaled"])):
    #no_of_inputs,no_of_hidden_layers,nodes_in_hiddens,no_of_outputs,learning_rate
        stock_net = FeedForwardNet(len(input_vars),len(hidden_nodes_list),hidden_nodes_list,
                                   len(output_vars),learning_rate)
        for iterations in range(training_iterations):
            for x in range(training_window-1):
                idx = i - training_window + x
                training_list = []
                for var in input_vars:
                    training_list.append(float(df[var][idx]))
                training_vector = np.array(training_list)
                out_list = []
                for var in output_vars:
                    out_list.append(float(df[var][idx]))                     
                training_output = np.array(out_list)
                stock_net.FeedForward(training_vector,training_output,Training=True)
        # build the prediction vector from day i, the day being forecast from
        pred_list = []
        for var in input_vars:
            pred_list.append(float(df[var][i]))
        pred_vector = np.array(pred_list)
        pred_closes = stock_net.FeedForward(pred_vector)
        for x in range(len(output_vars)):
            #print len(residuals[x])
            if pred_closes[x] > df["Close_scaled"][i]:
                residuals[x].append(1)
            else:
                residuals[x].append(0)                 
    for x in range(len(output_vars)):
        df[(output_vars[x] + "_residuals")]  = residuals[x] 

def Forecast_Stats(df,residuals,actuals):
    print "STATS FOR", residuals
    error_count     = 0
    false_pos_count = 0
    false_neg_count = 0
    up_preds        = 0
    down_preds      = 0
    for i in range (0,len(df[actuals])):
        if df[residuals][i] != df[actuals][i]:
            error_count += 1
        if df[residuals][i] > df[actuals][i]:
            false_pos_count += 1
        if df[residuals][i] < df[actuals][i]:
            false_neg_count += 1       
        if df[residuals][i] == 1.0:
            up_preds += 1
        else:
            down_preds += 1

    error_rate = error_count / float(len(df[residuals]))
    false_pos_rate = false_pos_count / float(up_preds)
    false_neg_rate = false_neg_count / float(down_preds)

    print error_count, len(df[residuals])

    print "Error rate: ", error_rate
    print "Accuracy  : ", 1 - error_rate
    print "False Positive Rate: ", false_pos_rate
    print "False Negative Rate: ", false_neg_rate
    print "Count of upward predictions: ", up_preds
    print "Count of down predictions:", down_preds 

With these functions, I think it would be interesting to try out some different stocks and ETFs and see how well the network is able to make correct predictions. Again, like the SPY analysis I did above, there are a lot of arbitrary parameters used here that could (and in all honesty should) be optimized.

In addition, I’m also adding a 10-day prediction. Since the weekly predictions were more accurate than the daily predictions, maybe the network picks up more easily on longer-term trends.

ticker_symbols = ["FAS","BAC","JNUG","DUST","GOOG","TQQQ","ANGL","CHK","WMT"]
start = "2013-01-01"
end   = "2017-01-01"
for ticker_symbol in ticker_symbols:
    print "---------------" + ticker_symbol + "-----------------------------------"
    data = getAndSmoothData(ticker_symbol,start,end,15)
    addForecastingVariable(1,data,"Close_scaled")
    addForecastingVariable(5,data,"Close_scaled")
    addForecastingVariable(10,data,"Close_scaled")
    input_variables = ["Open_scaled","Close_scaled","Volume_scaled"]
    output_vars = ["Close_scaled_1_day_forecast","Close_scaled_5_day_forecast","Close_scaled_10_day_forecast"]
    #print data.info()
    #print data.head()

    run_data(data,input_variables,output_vars,[7,5],0.25,10,100)

    for output in output_vars:
        Forecast_Stats(data,(output + "_residuals"),(output + "_up"))
E:\Anaconda2\lib\site-packages\ipykernel\__main__.py:11: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


---------------FAS-----------------------------------
962
STATS FOR Close_scaled_1_day_forecast_residuals
477 962
Error rate:  0.495841995842
Accuracy  :  0.504158004158
False Positive Rate:  0.514318442153
False Negative Rate:  0.314606741573
Count of upward predictions:  873
Count of down predictions: 89
STATS FOR Close_scaled_5_day_forecast_residuals
424 962
Error rate:  0.440748440748
Accuracy  :  0.559251559252
False Positive Rate:  0.467963386728
False Negative Rate:  0.170454545455
Count of upward predictions:  874
Count of down predictions: 88
STATS FOR Close_scaled_10_day_forecast_residuals
227 962
Error rate:  0.235966735967
Accuracy  :  0.764033264033
False Positive Rate:  0.282051282051
False Negative Rate:  0.150887573964
Count of upward predictions:  624
Count of down predictions: 338
---------------BAC-----------------------------------
962
STATS FOR Close_scaled_1_day_forecast_residuals
410 962
Error rate:  0.426195426195
Accuracy  :  0.573804573805
False Positive Rate:  0.447306791569
False Negative Rate:  0.259259259259
Count of upward predictions:  854
Count of down predictions: 108
STATS FOR Close_scaled_5_day_forecast_residuals
381 962
Error rate:  0.39604989605
Accuracy  :  0.60395010395
False Positive Rate:  0.429078014184
False Negative Rate:  0.155172413793
Count of upward predictions:  846
Count of down predictions: 116
STATS FOR Close_scaled_10_day_forecast_residuals
251 962
Error rate:  0.260914760915
Accuracy  :  0.739085239085
False Positive Rate:  0.30407523511
False Negative Rate:  0.175925925926
Count of upward predictions:  638
Count of down predictions: 324
---------------JNUG-----------------------------------
772
STATS FOR Close_scaled_1_day_forecast_residuals
318 772
Error rate:  0.411917098446
Accuracy  :  0.588082901554
False Positive Rate:  0.450079239303
False Negative Rate:  0.241134751773
Count of upward predictions:  631
Count of down predictions: 141
STATS FOR Close_scaled_5_day_forecast_residuals
288 772
Error rate:  0.373056994819
Accuracy  :  0.626943005181
False Positive Rate:  0.4224
False Negative Rate:  0.163265306122
Count of upward predictions:  625
Count of down predictions: 147
STATS FOR Close_scaled_10_day_forecast_residuals
206 772
Error rate:  0.266839378238
Accuracy  :  0.733160621762
False Positive Rate:  0.312629399586
False Negative Rate:  0.190311418685
Count of upward predictions:  483
Count of down predictions: 289
---------------DUST-----------------------------------
962
STATS FOR Close_scaled_1_day_forecast_residuals
404 962
Error rate:  0.419958419958
Accuracy  :  0.580041580042
False Positive Rate:  0.445442875481
False Negative Rate:  0.311475409836
Count of upward predictions:  779
Count of down predictions: 183
STATS FOR Close_scaled_5_day_forecast_residuals
343 962
Error rate:  0.356548856549
Accuracy  :  0.643451143451
False Positive Rate:  0.402313624679
False Negative Rate:  0.163043478261
Count of upward predictions:  778
Count of down predictions: 184
STATS FOR Close_scaled_10_day_forecast_residuals
218 962
Error rate:  0.226611226611
Accuracy  :  0.773388773389
False Positive Rate:  0.273770491803
False Negative Rate:  0.144886363636
Count of upward predictions:  610
Count of down predictions: 352
---------------GOOG-----------------------------------
962
STATS FOR Close_scaled_1_day_forecast_residuals
412 962
Error rate:  0.428274428274
Accuracy  :  0.571725571726
False Positive Rate:  0.447058823529
False Negative Rate:  0.285714285714
Count of upward predictions:  850
Count of down predictions: 112
STATS FOR Close_scaled_5_day_forecast_residuals
421 962
Error rate:  0.43762993763
Accuracy  :  0.56237006237
False Positive Rate:  0.473004694836
False Negative Rate:  0.163636363636
Count of upward predictions:  852
Count of down predictions: 110
STATS FOR Close_scaled_10_day_forecast_residuals
253 962
Error rate:  0.262993762994
Accuracy  :  0.737006237006
False Positive Rate:  0.318529862175
False Negative Rate:  0.145631067961
Count of upward predictions:  653
Count of down predictions: 309
---------------TQQQ-----------------------------------
962
STATS FOR Close_scaled_1_day_forecast_residuals
464 962
Error rate:  0.482328482328
Accuracy  :  0.517671517672
False Positive Rate:  0.492737430168
False Negative Rate:  0.34328358209
Count of upward predictions:  895
Count of down predictions: 67
STATS FOR Close_scaled_5_day_forecast_residuals
436 962
Error rate:  0.453222453222
Accuracy  :  0.546777546778
False Positive Rate:  0.479190101237
False Negative Rate:  0.13698630137
Count of upward predictions:  889
Count of down predictions: 73
STATS FOR Close_scaled_10_day_forecast_residuals
269 962
Error rate:  0.279625779626
Accuracy  :  0.720374220374
False Positive Rate:  0.335375191424
False Negative Rate:  0.161812297735
Count of upward predictions:  653
Count of down predictions: 309
---------------ANGL-----------------------------------
962
STATS FOR Close_scaled_1_day_forecast_residuals
454 962
Error rate:  0.471933471933
Accuracy  :  0.528066528067
False Positive Rate:  0.491228070175
False Negative Rate:  0.317757009346
Count of upward predictions:  855
Count of down predictions: 107
STATS FOR Close_scaled_5_day_forecast_residuals
419 962
Error rate:  0.435550935551
Accuracy  :  0.564449064449
False Positive Rate:  0.469964664311
False Negative Rate:  0.176991150442
Count of upward predictions:  849
Count of down predictions: 113
STATS FOR Close_scaled_10_day_forecast_residuals
270 962
Error rate:  0.280665280665
Accuracy  :  0.719334719335
False Positive Rate:  0.337060702875
False Negative Rate:  0.175595238095
Count of upward predictions:  626
Count of down predictions: 336
---------------CHK-----------------------------------
962
STATS FOR Close_scaled_1_day_forecast_residuals
400 962
Error rate:  0.4158004158
Accuracy  :  0.5841995842
False Positive Rate:  0.44099378882
False Negative Rate:  0.286624203822
Count of upward predictions:  805
Count of down predictions: 157
STATS FOR Close_scaled_5_day_forecast_residuals
370 962
Error rate:  0.384615384615
Accuracy  :  0.615384615385
False Positive Rate:  0.425373134328
False Negative Rate:  0.177215189873
Count of upward predictions:  804
Count of down predictions: 158
STATS FOR Close_scaled_10_day_forecast_residuals
246 962
Error rate:  0.255717255717
Accuracy  :  0.744282744283
False Positive Rate:  0.313114754098
False Negative Rate:  0.15625
Count of upward predictions:  610
Count of down predictions: 352
---------------WMT-----------------------------------
962
STATS FOR Close_scaled_1_day_forecast_residuals
439 962
Error rate:  0.456340956341
Accuracy  :  0.543659043659
False Positive Rate:  0.481087470449
False Negative Rate:  0.275862068966
Count of upward predictions:  846
Count of down predictions: 116
STATS FOR Close_scaled_5_day_forecast_residuals
407 962
Error rate:  0.423076923077
Accuracy  :  0.576923076923
False Positive Rate:  0.459976105137
False Negative Rate:  0.176
Count of upward predictions:  837
Count of down predictions: 125
STATS FOR Close_scaled_10_day_forecast_residuals
258 962
Error rate:  0.268191268191
Accuracy  :  0.731808731809
False Positive Rate:  0.320512820513
False Negative Rate:  0.171597633136
Count of upward predictions:  624
Count of down predictions: 338

So attempting to predict whether a stock is going to rise over the next two weeks ended up being even more accurate, with accuracy above 70% for all the stocks and ETFs in the list above, which is very interesting. Again, this may be a product of looking at this particular time period. Another thing to consider is how well this performance stacks up against other modelling and forecasting techniques. For instance, would you be able to get as good or better performance using a linear model, or by simply predicting “UP” every time? Accuracy above 70% does seem really promising, but maybe there are benchmarks out there that do better.
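
The “UP every time” baseline can actually be backed out of the numbers already printed above, without rerunning anything. The sketch below is only a rough check: the helper name `always_up_accuracy` is made up for this example, and it assumes the `_up` column marks rows where the price really did rise over the forecast window. Since the printed rates are wrong-up-calls divided by up predictions and wrong-down-calls divided by down predictions, multiplying them back out recovers the counts, and from there the number of actual “up” rows.

```python
def always_up_accuracy(up_preds, down_preds, false_pos_rate, false_neg_rate):
    """Share of rows that actually went up, rebuilt from Forecast_Stats output."""
    false_pos = int(round(false_pos_rate * up_preds))    # predicted up, was down
    false_neg = int(round(false_neg_rate * down_preds))  # predicted down, was up
    # Actual ups = correct up calls + ups the network missed
    actual_up = (up_preds - false_pos) + false_neg
    return actual_up / float(up_preds + down_preds)

# FAS, 10-day forecast: values copied from the printed output above
fas_10_day = always_up_accuracy(624, 338, 0.282051282051, 0.150887573964)
print("Always-UP accuracy for FAS (10-day): %.3f" % fas_10_day)
```

For that FAS row this works out to 499 “up” rows out of 962, or roughly 0.52, which is the number to hold the network's 0.764 accuracy against. The same arithmetic can be repeated for any of the other tickers and horizons.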

One more thing I feel that I should point out is that this isn’t really a strategy for making trades. I think that these predictions would most likely be best used to augment or help some strategy, or maybe be used to screen out stocks that are trending down (unless you’re looking for stocks to short, that is). Basically, I see this more as a tool to aid a strategy. I could be wrong though; I’m by no means an expert.

This post and the past two are really the bulk of what I wanted to cover in the blog. I think I may write one more entry with some evaluation of how this method compares performance-wise to other modelling techniques, or I may start a new series of posts about something else; I haven’t really decided yet. If there’s any topic within this idea of using a neural network with financial data you’d like to see, feel free to comment and let me know.

Thanks for reading! Please feel free to post any questions/comments/bug fixes!

7 thoughts on “Neural Networks and the Stock Market Pt. 3 – Training and Performance”

  1. Hey! Thanks for sharing. I am playing around with something similar and this has sparked some ideas for me. Just a few questions to make sure I am understanding this correctly:

    1. Are your performance results against your training data, or a separate test data set that your network hasn’t seen before?
    2. Is there a reason you chose to use a feed forward network as opposed to an LSTM?
    3. Am I correct in understanding you only fed the network the scaled open, close and volume (3 features) as training inputs? If so, why not the daily high/low as well?
    4. I am not sure I understand how you used the lookback window. Did you modify your input to become a 3×3 matrix, or a 9×1 or something else?

    I am using stateful LSTMs, with a look back window of 30 days, and using 5 features (1 – % change from previous day close to next day open, 2 – % change from same day open to high, 3 – % change from same day open to low, 4 – % change from previous day close to today close, 5 – scaled volume) and similarly predicting 10 days into the future. With some hyperparameter optimization I can get 98% accuracy on my training set, and 87% accuracy on my validation set, but my first test set attempts are around 40%, which is disappointing. My training set size is 1300 days and I am using 100 for validation and test. The issue seems to be that without a large validation or test set, measuring accuracy isn’t great as there isn’t enough data, but making them too big means the patterns the network has learned do not necessarily apply so far into the future. It’s an interesting problem either way.

    Thanks again for sharing.


  2. This might be off topic, but any idea as to why the data is not up to date when downloading? I put in today’s date (1/27/2017) as the end, and the most recent it would save in the CSV file (no processing, just downloading from Yahoo!) is 1/23/2017. Looking at where it got its data from, https://github.com/yql/yql-tables/blob/master/yahoo/finance/yahoo.finance.stocks.xml shows the URL it’s feeding from is http://finance.yahoo.com/quote/aapl/history?ltr=1 which shows up to today.

    Is this just a function of https://www.datatables.org/ being days behind with this data?

    Also, any idea why when specifying today as the end date, it can only give predictions dated 6 weeks ago (12/12/2016)?

