I am building an encoder-decoder prediction model based on this paper:
https://www.sciencedirect.com/science/article/pii/S0952197623001483
It consists of a transformer encoder and a 1D CNN decoder. The model takes an input window of length L and predicts a window of length L shifted by $\delta$ time steps.
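To make the windowing concrete, here is a minimal sketch of how I understand the input/target pairs. The function name `make_windows` and the toy series are only for illustration and are not taken from the paper; the actual data pipeline may differ.

```python
import numpy as np

def make_windows(series, L, delta):
    """Pair each input window series[i:i+L] with the target
    window series[i+delta:i+delta+L] (same length, shifted by delta)."""
    series = np.asarray(series, dtype=float)
    n = len(series) - L - delta + 1  # number of complete (input, target) pairs
    if n <= 0:
        raise ValueError("series too short for the given L and delta")
    X = np.stack([series[i:i + L] for i in range(n)])
    Y = np.stack([series[i + delta:i + delta + L] for i in range(n)])
    return X, Y

# Toy example (hypothetical values): 10 points, window length 4, shift 2
X, Y = make_windows(np.arange(10), L=4, delta=2)
print(X.shape, Y.shape)
print(X[0], Y[0])
```

With `delta < L`, the input and target windows overlap, so only the last `delta` points of each target are actually unseen by the encoder.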
For now, I am using the model on 1D time series. When I train the model, it seems to do quite well as long as there is no trend in the data.
But when there is a trend, the model only does well on the training data; performance is worse on the validation data and poor on the test data.
See an example below. The first red line marks the end of the training data, and the second marks the end of the validation data.
So my question is: which model parameters or elements of the model structure might prevent the model from generalizing beyond the data seen in training?
Additional information:
I fit a sklearn MinMaxScaler on the training data and apply it to the training, validation, and test sets. I wonder whether seeing data outside the [0, 1] range after training might be the issue. I have also tried StandardScaler, but that did not help.
I have tried changing the lookback window; that does not help.
I have added dropout in several parts of the model. This improved performance, but it did not help the model learn the trend or generalize to data outside the training range.
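To illustrate the scaling concern, here is a small sketch on a synthetic series (a linear trend plus a sine wave, which is an assumption and not my real data). The split points are hypothetical, and the `scale` helper applies the same arithmetic as MinMaxScaler with its default feature range, using the minimum and maximum of the training split only.

```python
import numpy as np

# Synthetic trending series: linear trend plus a seasonal component (assumed data)
t = np.arange(300, dtype=float)
series = 0.05 * t + np.sin(2 * np.pi * t / 25)

# Hypothetical chronological split: train / validation / test
train, val, test = series[:200], series[200:250], series[250:]

# Min-max statistics come from the training split only
lo, hi = train.min(), train.max()

def scale(x):
    """Map x with the training min/max; values above hi end up above 1."""
    return (x - lo) / (hi - lo)

for name, part in [("train", train), ("val", val), ("test", test)]:
    s = scale(part)
    print(f"{name}: scaled min={s.min():.2f}, scaled max={s.max():.2f}")
```

In this toy setup the scaled validation data partly exceeds 1 and the scaled test data lies entirely above 1, so the model is asked to produce outputs in a range it never saw during training.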
Thank you very much for your help!