{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial 104: Time Series with Pandas / NumPy\n", "\n", "Contributed by: Avi Thaker https://github.com/athaker/CNTK\n", "November 20, 2016\n", "\n", "This tutorial introduces the use of the Cognitive Toolkit for time series data. We show how to prepare time series data for deep learning algorithms, then train a neural network and evaluate the resulting model. We also look at how well the model can classify the movement of an exchange-traded fund ([ETF](https://en.wikipedia.org/wiki/Exchange-traded_fund)) and, in this simplified setting, how one could trade it. This tutorial serves **only** as an example of how to use neural networks for time series analysis. \n", " \n", "It is important to note that the stock market is extremely noisy and difficult to predict; prediction is best done by professionals with domain expertise. It is more important to make sure the model is correct before setting up a trading system (there are many factors to consider, including but not limited to [curve fitting bias](https://en.wikipedia.org/wiki/Overfitting), [forward looking bias](http://www.investopedia.com/terms/l/lookaheadbias.asp?lgl=no-infinite), and profitability). The lessons and anecdotes presented in this tutorial are for illustrative purposes only, with the goal of introducing an approach to analyzing time series data.\n", "\n", "This tutorial also introduces the `pandas_datareader` package and pandas. Please note that this tutorial uses the NumPy interface to CNTK, which works well with [Pandas dataframes](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) (a structure that is well suited to time series analysis). 
\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from __future__ import print_function\n", "import os\n", "import numpy as np\n", "import cntk\n", "import cntk.ops as C\n", "\n", "from cntk.initializer import glorot_uniform\n", "from cntk.layers import default_options, Input, Dense # Layers\n", "from cntk.learner import sgd, learning_rate_schedule, UnitType\n", "from cntk.utils import get_train_eval_criterion, get_train_loss\n", "\n", "import datetime\n", "import pandas as pd\n", "pd.options.mode.chained_assignment = None # default='warn'\n", "%matplotlib inline\n", "\n", "# Select the right target device when this notebook is being tested:\n", "if 'TEST_DEVICE' in os.environ:\n", "    if os.environ['TEST_DEVICE'] == 'cpu':\n", "        cntk.device.set_default_device(cntk.device.cpu())\n", "    else:\n", "        cntk.device.set_default_device(cntk.device.gpu(0))\n", "\n", "# If you want to set the program to use a CUDA enabled GPU\n", "#from cntk.device import set_default_device, gpu\n", "#set_default_device(gpu(0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importing stock data\n", "We first retrieve stock data using the function `get_stock_data`. It downloads daily stock data from Yahoo Finance (it can be modified to get data from Google Finance and many other sources). The [Pandas datareader](http://pandas-datareader.readthedocs.io/en/latest/remote_data.html) documentation shows many use cases for this data reader." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# A function which obtains stock data from Yahoo Finance\n", "# Requires an internet connection to retrieve stock data from Yahoo Finance\n", "import time\n", "try:\n", "    from pandas_datareader import data\n", "except ImportError:\n", "    !pip install pandas_datareader\n", "    from pandas_datareader import data\n", "\n", "def get_stock_data(contract, s_year, s_month, s_day, e_year, e_month, e_day):\n", "    \"\"\"\n", "    Args:\n", "        contract (str): the name of the stock/etf\n", "        s_year (int): start year for data\n", "        s_month (int): start month\n", "        s_day (int): start day\n", "        e_year (int): end year\n", "        e_month (int): end month\n", "        e_day (int): end day\n", "    Returns:\n", "        Pandas Dataframe: Daily OHLCV bars, or None if the download fails\n", "    \"\"\"\n", "    start = datetime.datetime(s_year, s_month, s_day)\n", "    end = datetime.datetime(e_year, e_month, e_day)\n", "    \n", "    retry_cnt, max_num_retry = 0, 3\n", "    \n", "    # Retry a few times, waiting a random 1-9 seconds between attempts\n", "    while retry_cnt < max_num_retry:\n", "        try:\n", "            bars = data.get_data_yahoo(contract, start, end)\n", "            return bars\n", "        except Exception:\n", "            retry_cnt += 1\n", "            time.sleep(np.random.randint(1,10)) \n", "    \n", "    print(\"Yahoo Finance is not reachable\")\n", "    return None" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "File already exists data\\Stock\\stock_SPY.pkl\n" ] } ], "source": [ "import pickle as pkl\n", "\n", "# We look for a cached copy of the stock data for the symbol SPY. 
\n", "# Check for an environment variable defined in CNTK's test infrastructure\n", "envvar = 'CNTK_EXTERNAL_TESTDATA_SOURCE_DIRECTORY'\n", "def is_test(): return envvar in os.environ\n", "\n", "def download(data_file):\n", "    data = get_stock_data(\"SPY\", 2000, 1, 2, 2017, 1, 1)\n", "    data_dir = os.path.dirname(data_file)\n", "    \n", "    if not os.path.exists(data_dir):\n", "        os.makedirs(data_dir)\n", "    \n", "    if not os.path.isfile(data_file):\n", "        print(\"Saving\", data_file)\n", "        with open(data_file, 'wb') as f:\n", "            pkl.dump(data, f, protocol=2)\n", "    return data\n", "\n", "data_file = os.path.join(\"data\", \"Stock\", \"stock_SPY.pkl\")\n", "\n", "# Check for data in local cache\n", "if os.path.exists(data_file):\n", "    print(\"File already exists\", data_file)\n", "    data = pd.read_pickle(data_file) \n", "else: \n", "    # If not there we might be running in CNTK's test infrastructure\n", "    if is_test():\n", "        test_file = os.path.join(os.environ[envvar], 'Tutorials','data','stock','stock_SPY.pkl')\n", "        if os.path.isfile(test_file):\n", "            print(\"Reading data from test data directory\")\n", "            data = pd.read_pickle(test_file)\n", "        else:\n", "            print(\"Test data directory missing file\", test_file)\n", "            print(\"Downloading data from Yahoo Finance\")\n", "            data = download(data_file) \n", "    else:\n", "        # The local cache is not present and this is not the test environment:\n", "        # download the data from Yahoo Finance and cache it in a local directory.\n", "        # Please check that there is trade data for the chosen stock symbol during this period\n", "        data = download(data_file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Building the training parameters\n", "\n", "Stock market behavior exhibits substantial [autocorrelation](https://en.wikipedia.org/wiki/Autocorrelation) ([reference](http://epchan.blogspot.com/2016/04/mean-reversion-momentum-and-volatility.html)). We use the [ETF](http://www.investopedia.com/terms/e/etf.asp) `SPY` as a proxy for the overall stock \"market\". 
This ETF covers roughly the 500 largest companies in America by market capitalization. We will trade under the assumption that there is some short-term autocorrelation that has predictive power in the market. \n", "\n", "### Predicting\n", "* Whether the given stock/ETF will be higher or lower on the next day than on the current day.\n", "\n", "### Predictors\n", "* For each of the previous 8 days, a flag indicating whether the current day is higher than that day (`p_1` to `p_8`),\n", "\n", "* The relative change in volume from the previous day (`v_diff`),\n", "\n", "* The relative change in price from the previous day (`diff`).\n", "\n", "Note that we are not feeding the neural network the price itself. Financial time series data are noisy, so it is important not to overfit the data. There is a lot more we could do here (smoothing, adding more features, etc.), but we will keep this tutorial simple and demonstrate CNTK's ability to interface with time series data. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
| Date | Open | High | Low | Close | Volume | Adj Close | diff | v_diff | p_1 | p_2 | p_3 | p_4 | p_5 | p_6 | p_7 | p_8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2000-01-03 | 148.250000 | 148.250000 | 143.875000 | 145.437500 | 8164300 | 105.825332 | 0.000000 | 0.000000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2000-01-04 | 143.531204 | 144.062500 | 139.640594 | 139.750000 | 8089800 | 101.686912 | 0.040698 | 0.009209 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2000-01-05 | 139.937500 | 141.531204 | 137.250000 | 140.000000 | 12177900 | 101.868820 | 0.001786 | 0.335698 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2000-01-06 | 139.625000 | 141.500000 | 137.750000 | 137.750000 | 6227200 | 100.231643 | 0.016334 | 0.955598 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2000-01-07 | 140.312500 | 145.750000 | 140.062500 | 145.750000 | 8066500 | 106.052718 | 0.054889 | 0.228017 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| 2000-01-10 | 146.250000 | 146.906204 | 145.031204 | 146.250000 | 5741700 | 106.416535 | 0.003419 | 0.404898 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 2000-01-11 | 145.812500 | 146.093704 | 143.500000 | 144.500000 | 7503700 | 105.143175 | 0.012111 | 0.234817 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
| 2000-01-12 | 144.593704 | 144.593704 | 142.875000 | 143.062500 | 6907700 | 104.097201 | 0.010048 | 0.086281 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| 2000-01-13 | 144.468704 | 145.750000 | 143.281204 | 145.000000 | 5158300 | 105.506992 | 0.013362 | 0.339143 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 |
| 2000-01-14 | 146.531204 | 147.468704 | 145.968704 | 146.968704 | 7437300 | 106.939489 | 0.013395 | 0.306428 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
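
The derived columns in the table above (`diff`, `v_diff`, and `p_1` to `p_8`) could be computed along the following lines. This is a minimal sketch inferred from the first rows of the table, not necessarily the tutorial's own code: the helper name `add_features` is hypothetical, and the formulas are assumptions chosen because they reproduce the displayed values.

```python
import pandas as pd


def add_features(bars, num_days_back=8):
    """Hypothetical helper: add diff, v_diff and p_1..p_N columns to OHLCV bars."""
    out = bars.copy()
    # Absolute day-over-day change, relative to the current day's value.
    # The first row has no previous day, so its NaN is replaced with 0.
    out["diff"] = ((out["Close"] - out["Close"].shift(1)) / out["Close"]).abs().fillna(0)
    out["v_diff"] = ((out["Volume"] - out["Volume"].shift(1)) / out["Volume"]).abs().fillna(0)
    # p_i is 1 when the current close is above the close i trading days earlier.
    # Comparisons against missing history (NaN) evaluate to False, hence 0.
    for i in range(1, num_days_back + 1):
        out["p_" + str(i)] = (out["Close"] > out["Close"].shift(i)).astype(int)
    return out


# Close and Volume for the first three rows of the table above.
sample = pd.DataFrame(
    {"Close": [145.4375, 139.75, 140.0], "Volume": [8164300, 8089800, 12177900]},
    index=pd.to_datetime(["2000-01-03", "2000-01-04", "2000-01-05"]),
)
features = add_features(sample)
print(features[["diff", "v_diff", "p_1", "p_2"]].round(6))
```

None of these columns looks ahead in time: each uses only the current row and earlier rows, which avoids the forward looking bias mentioned in the introduction.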