This question is about a wrongly chosen train/test splitting strategy for a RandomForest model. I know choosing the test set this way gives misleadingly good results, but I would like to understand why.
(The model looks at previous days of data and tries to predict whether the next day's close will be higher or lower than today's, i.e. a binary classification problem.)
I copied the train/test split code from another example; it simply assigns random rows to either train_set or test_set (a simplified sketch is in the code below). The raw data is daily close values of, for example, EURUSD.
I then create features from that series: each row of X is built from some number of previous data points for that day. I then train a random forest model to predict whether the next day's close will be higher or lower.
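Roughly like this — a simplified, self-contained sketch of what I do (the random-walk series, `n_lags`, and the other parameter values are just stand-ins for my real data and settings):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

n_lags = 10  # how many previous closes each row looks at

# Stand-in for the real daily EURUSD closes (a random walk,
# just so the snippet runs on its own).
rng = np.random.default_rng(0)
close = pd.Series(1.10 + np.cumsum(rng.normal(scale=0.005, size=500)))

# Row t: features are the n_lags closes up to and including day t;
# label is 1 if day t+1 closes higher than day t, else 0.
X = np.column_stack([close.shift(k) for k in range(n_lags)])[n_lags:-1]
y = (close.shift(-1) > close).astype(int).to_numpy()[n_lags:-1]

# The copied split: rows go to train/test at random, so test days
# end up interleaved with (and overlapping) the training days.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # on my real features this comes out suspiciously high
```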
The accuracy on the test_set is very high, and it increases with the number of historical points each row looks at, which seems to suggest overfitting.
When I change the train/test split so that, for example, train_set is data from January–June and test_set is data from August, i.e. completely separate periods with no possible mixing, the accuracy drops to a more realistic ~50% (sketch below).
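For completeness, this is the kind of change that brings the score back to ~50% — again a sketch with placeholder data, assuming a date index aligned with the rows of X:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Placeholder feature matrix/labels with one business day per row;
# in the real code these come from the lagged-close features above.
dates = pd.date_range("2023-01-01", "2023-08-31", freq="B")
X = np.random.normal(size=(len(dates), 10))
y = np.random.randint(0, 2, size=len(dates))

train_mask = dates < "2023-07-01"  # January-June
test_mask = dates.month == 8       # August only

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X[train_mask], y[train_mask])
print(model.score(X[test_mask], y[test_mask]))  # ~0.5 here, since y is random
```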
Again, I know the random train/test split is not correct, but can someone help me understand why?
Every time I validate a row (i.e. make one prediction on the test_set), I use features that look only at previous data to predict tomorrow's value, so how can there be overfitting?
Overfitting explained with a dumb kid example:
If you mix data between training and test, you will likely get this kind of inflated score. Think of the test set as an exam, and the training data as the material available to study for it. Then think of the machine learning algorithm as a kid who is not very smart but has a lot of memory. If 10% of the questions on the exam are exactly the same as the ones he found in his textbook, it is very likely that he will get those right, even if he never understood the logic that links questions to answers.

If he only got right the questions that were identical to the textbook's, you would not conclude that he had understood anything about the subject; he just remembered the examples in the book, even though he got a good evaluation (because the percentage of right answers is high). You didn't train an algorithm to predict future data from present data, but a dictionary that can recall examples from its training set. This would work even if the data were totally random (in fact, the result might be even better).
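Here is a tiny sketch of that exam effect (my own toy construction, not your setup): the labels below are pure coin flips, so there is genuinely nothing to learn, yet copying 10% of the training rows into the test set inflates the score, because a forest grown to full depth largely memorizes its training rows.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# "Textbook": features and labels are unrelated (labels are coin flips),
# so there is no pattern to learn at all.
X_train = rng.normal(size=(1000, 5))
y_train = rng.integers(0, 2, size=1000)

# Genuinely new "exam questions": expect ~50% accuracy on these.
X_fresh = rng.normal(size=(900, 5))
y_fresh = rng.integers(0, 2, size=900)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

print(model.score(X_fresh, y_fresh))              # ~0.50: no real skill
print(model.score(X_train[:100], y_train[:100]))  # ~1.00: memorized rows

# An "exam" where 10% of the questions are copied from the textbook:
X_test = np.vstack([X_fresh, X_train[:100]])
y_test = np.concatenate([y_fresh, y_train[:100]])
print(model.score(X_test, y_test))                # ~0.55: inflated by memory
```

With overlapping lagged features, a shuffled split does something similar, just less visibly: many test rows are near-copies of adjacent training rows (they share most of their lagged values), so the forest can answer them largely from memory. An August-only test set removes that shortcut, which is why your accuracy falls back to ~50%.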