I did a simple experiment using EURUSD OHLC 1-Day data.
My features were Open Price, Low Price, High Price, and I was trying to predict the future Closing price.
The code worked, as expected, but the results were very misleading.
I got a 99% Accuracy score, which, as we all know, is impossible.
1) So what am I doing wrong?
2) How can I correct my mistakes?
The actual system I am building would have BoP, PPI, Interest Rate, GDP, a lot of Momentum indicators, etc. as Features, some 60 features in all.
import pandas as pd
import numpy as np
#import matplotlib.pyplot as plt
#import pickle
# 1. Read the EURUSD csv data.
# 2. Process the DataFrame, using only the Open, High, Low, Close columns.
df = pd.read_csv( 'EURUSD1440.csv', index_col= 'Date' )
df = df[['Open','High','Low','Close']]
array = df.values
# Features consist of Open, High, Low column, and stored in x.
# Label is the Close column stored in y.
x = array[:,0:3]
y = array[:,3]
# Split Data into Test and Train.
# 60% Train and 40% Test.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split( x, y, test_size = 0.4 )
# 1. Train the Model using .fit method.
# 2. Predict the future Closing prices using the .predict method.
# 3. Know how Accurate the Model is using the .score method.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
model = LinearRegression()
model.fit( x_train, y_train )
forecast = model.predict( x_test )
accuracy = model.score( x_test, y_test )
print( forecast, accuracy )
Prologue:
Having spent several decades in quantitative modelling and operating a set of 4th Gen distributed systems with M/L predictors, I can guarantee that even your 60 features are overly optimistic. One should assume a feature space of about an order of magnitude higher dimensionality, containing both technical and fundamental factors, to train a model reasonably, if the ambition is to go beyond just an academic paper. Why? The Market Rules.
Your experiment exhibits two principal errors:
The first - a conceptual miss:
The Machine Learning task of predicting a continuous value is Regression ( there are no "classification" Labels, only Regression target values ), so the measure of "a prediction success" is not an accuracy score, but some sort of absolute distance measured in the PriceDOMAIN. The number that .score() returns for a LinearRegression is the coefficient of determination R², not a share of correct predictions. Yes, a distance, not a percent, because it is the distance that a trade execution translates into a monetary reward.
A percentage provides no means to compare any two Regression models against one another, and it is incoherent with highly non-linear professional risk-management.
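A minimal sketch of what a distance-based evaluation could look like. The data here are synthetic stand-ins ( an assumption, not EURUSD quotes ) and the coefficients are made up for illustration; only the metric calls matter. MAE and RMSE come out in the same units as the target, i.e. in price units.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic stand-in data (an assumption, NOT real EURUSD): linear target plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 3))
y = 1.10 + x @ np.array([0.004, 0.002, -0.001]) + rng.normal(scale=0.0005, size=500)

# Simple ordered split, 60% train and 40% test.
x_train, x_test = x[:300], x[300:]
y_train, y_test = y[:300], y[300:]

model = LinearRegression().fit(x_train, y_train)
forecast = model.predict(x_test)

# .score() of a regressor is R^2 (coefficient of determination), not an accuracy.
r2 = model.score(x_test, y_test)

# Distance measures, expressed in the units of the target (price units here).
mae = mean_absolute_error(y_test, forecast)
rmse = np.sqrt(mean_squared_error(y_test, forecast))

print(f"R^2 = {r2:.4f}, MAE = {mae:.6f}, RMSE = {rmse:.6f}")
```

An R² close to 1 only says that the model explains most of the variance of the target; how far a forecast lands from the realised price is what MAE and RMSE report.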
This post's footprint does not provide enough space to discuss the additional dependencies for defining and assessing a successful Trading TruStrategy, which operates in at least 5 dimensions of policies: the -{ Select, Detect, Act, Allocate, Terminate }-Policy. Without a full definition of the TruStrategy SDAAT-model parameters, there is no chance to compute any performance expectations for a Market ride of any trading model under review.
Next:
Your model peeks into the future. You have allowed it to learn from values that reality will never hand you at the time of prediction: a day's High and Low are known only once that day is over, and the Close always lies between them. So, short of some clairvoyance, the model is skewed in principle by its training DataSET and will never provide a fair service in real circumstances.
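One possible way to remove the peeking, sketched under stated assumptions: the OHLC frame below is a synthetic random walk standing in for EURUSD1440.csv, and the column names PrevOpen, PrevHigh, PrevLow, PrevClose and Target are hypothetical. The two ideas shown are (a) building the features for day t only from values already known before day t closes, and (b) splitting chronologically instead of shuffling.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic daily OHLC random walk (an assumption standing in for EURUSD1440.csv).
rng = np.random.default_rng(42)
n = 1000
close = 1.10 + np.cumsum(rng.normal(scale=0.005, size=n))
open_ = np.concatenate(([1.10], close[:-1]))  # each day opens at the previous close
high = np.maximum(open_, close) + np.abs(rng.normal(scale=0.002, size=n))
low = np.minimum(open_, close) - np.abs(rng.normal(scale=0.002, size=n))
df = pd.DataFrame(
    {"Open": open_, "High": high, "Low": low, "Close": close},
    index=pd.date_range("2015-01-01", periods=n, freq="B"),
)

# Features for day t come from day t-1 only, so nothing from day t leaks in.
cols = ["PrevOpen", "PrevHigh", "PrevLow", "PrevClose"]
features = df[["Open", "High", "Low", "Close"]].shift(1)
features.columns = cols
data = features.assign(Target=df["Close"]).dropna()

# Chronological split: train on the past, test on the future, no shuffling.
split = int(len(data) * 0.6)
train, test = data.iloc[:split], data.iloc[split:]

model = LinearRegression().fit(train[cols], train["Target"])
forecast = model.predict(test[cols])

# Compare against a naive baseline: today's close equals yesterday's close.
mae_model = mean_absolute_error(test["Target"], forecast)
mae_naive = mean_absolute_error(test["Target"], test["PrevClose"])
print(f"model MAE = {mae_model:.5f}, naive MAE = {mae_naive:.5f}")
```

The naive baseline is there as a yardstick: a model that cannot beat "tomorrow equals today" in price distance has learned nothing tradable, whatever its R² says.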
Epilogue:
One need not be ashamed of making this mistake, as Google has published its own Machine Learning "success" while making the very same error. ( If interested in details, search for Michal Illich + Google Machine Learning blogs on their experience ).
Ex post:
Do not give up. If your project is well funded, has a reasonable technical infrastructure in place and a reasonable grounding in the business domain, you can hire a mix of professional knowledge to get a FOREX market predictions engine working within a reasonable time and budget.
Reinventing the wheel could hardly be more expensive than it is in FOREX, given the costs of failure.