I've been working on a multivariate LSTM model for time series forecasting, but I'm encountering an issue where the predicted output doesn't exhibit enough variability or 'ups and downs'. The predictions tend to be too smooth or flat, particularly after the first predicted point. Here's a brief overview of my model architecture:
import tensorflow as tf
from tensorflow.keras.layers import LSTM, Dropout, Dense, Reshape

model = tf.keras.Sequential()
model.add(tf.keras.layers.InputLayer(input_shape=(60,25)))  # 60 past steps, 25 features
model.add(LSTM(256,return_sequences=True))
model.add(tf.keras.layers.LayerNormalization())
model.add(Dropout(0.2))
model.add(LSTM(256,return_sequences=True))
model.add(tf.keras.layers.LayerNormalization())
model.add(Dropout(0.2))
model.add(LSTM(256,return_sequences=False))
model.add(tf.keras.layers.LayerNormalization())
model.add(Dropout(0.2))
model.add(Dense(30 * 3, activation=tf.keras.layers.LeakyReLU(alpha=0.1)))  # 90 outputs
model.add(Reshape([30, 3]))  # reshaped to 30 future steps x 3 target features
What I am trying to achieve is to have the output layer predict 90 points, which are then reshaped into 30 future steps for three target variables. Regarding the data:
- 670 000 rows, 25 features
- using 60 past points to predict 30 points into the future for 3 target features
- the dataset is first split into training, validation and test (70:20:10), and samples are then created with a sliding window
- each sliding window is shifted by one row (shift 1) - a short sketch of the windowing is included after my question below
- using StandardScaler for scaling, as I have a couple of anomalies in the data that I would like to detect afterwards in the results
- The graph of predicted vs. true values can be seen for one feature below:
My question is: Did I correctly handle the last layers after the LSTM? My goal is a one-shot prediction of 30 values for 3 features.
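For context, here is a minimal sketch of how I build the windows (the names data and target_cols are placeholders for my actual arrays, and the real pipeline differs in details):
import numpy as np

def make_windows(data, target_cols, n_past=60, n_future=30):
    # data: array of shape (rows, 25); target_cols: column indices of the 3 targets
    X, y = [], []
    for i in range(n_past, len(data) - n_future + 1):  # shift by one row per step
        X.append(data[i - n_past:i, :])              # (60, 25) past window
        y.append(data[i:i + n_future, target_cols])  # (30, 3) future targets
    return np.array(X), np.array(y)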
What I tried:
- Hyperparameter tuning covering the following (a rough sketch of this search is shown after the list)
- Number of layers
- Optimizer (RMSprop, Adam, SGD)
- Number of units in LSTM/GRU
- Dropout rate
- Learning rate (0.01, 0.001, 0.005)
- Batch size of sliding windows (64, 128, 256)
- Warmup for the first 4 layers
- Using other scaling methods (Quantile, MinMaxScaler, PowerTransformer)
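To illustrate the kind of search I ran (shown here with KerasTuner purely as an example - the actual setup differed, and X_train / y_train / the validation arrays stand in for my real windowed data):
import keras_tuner as kt
import tensorflow as tf

def build_model(hp):
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.InputLayer(input_shape=(60, 25)))
    for _ in range(hp.Int("n_layers", 1, 3)):
        model.add(tf.keras.layers.LSTM(hp.Int("units", 64, 256, step=64),
                                       return_sequences=True))
        model.add(tf.keras.layers.Dropout(hp.Float("dropout", 0.1, 0.4, step=0.1)))
    model.add(tf.keras.layers.LSTM(hp.Int("units_last", 64, 256, step=64)))
    model.add(tf.keras.layers.Dense(30 * 3))
    model.add(tf.keras.layers.Reshape([30, 3]))
    lr = hp.Choice("lr", [0.01, 0.005, 0.001])
    opt_name = hp.Choice("optimizer", ["adam", "rmsprop", "sgd"])
    if opt_name == "adam":
        optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    elif opt_name == "rmsprop":
        optimizer = tf.keras.optimizers.RMSprop(learning_rate=lr)
    else:
        optimizer = tf.keras.optimizers.SGD(learning_rate=lr)
    model.compile(optimizer=optimizer, loss="mse")
    return model

tuner = kt.RandomSearch(build_model, objective="val_loss", max_trials=20)
# tuner.search(X_train, y_train, validation_data=(X_val, y_val), batch_size=128, epochs=30)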
Link to the LSTM documentation: https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM
Output Activation Function: First, the most striking one to me (also already mentioned by @DrSnoopy in the comments): ensure that the output activation function is appropriate for your regression task. Since you're predicting continuous values, a linear activation is often a good choice.
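For example, you could swap the LeakyReLU output head for a linear one (a small sketch based on the layers you posted):
model.add(Dense(30 * 3, activation="linear"))  # linear is also the default if you omit the argument
model.add(Reshape([30, 3]))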
Loss Function: Confirm that your loss function is suitable for regression. Mean Squared Error (MSE) is commonly used for regression tasks. You did not mention what you are using, but this really is an important choice.
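For instance, something like this (the optimizer and learning rate here are just placeholders):
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # example optimizer/LR
              loss="mse",
              metrics=["mae"])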
Normalization/Scaling: You said you are applying some normalization. a) Make sure your scaling is applied consistently during both training and inference. b) The standardization statistics should be computed from the training set only. This is definitely something you should always do: it is easy to do, and it helps your model converge faster and more stably.
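Something along these lines (assuming train_df, val_df and test_df are your splits before windowing):
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_df)  # statistics come from the training split only
val_scaled = scaler.transform(val_df)          # same transform at validation time
test_scaled = scaler.transform(test_df)        # ... and at inference time
# keep `scaler` around to inverse_transform the predictions later if needed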
Model Complexity: Consider whether your model is complex enough to capture the underlying patterns in your data. If the predictions are too smooth, it could be an indication that the model is not able to capture the variability in your data, so adding layers or units may help.
Or, going the other way:
Sequence Length: Experiment with the length of the input sequence (60 in your case). You could try decreasing it to see if it has an impact on the model's ability to capture patterns (while keeping your capacity / number of units constant).
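As a rough sketch (using a trimmed-down stand-in for your architecture; build_forecaster and the commented fit call are illustrative, not your actual code):
import tensorflow as tf

def build_forecaster(n_past, n_features=25, units=256):
    # same capacity for every run, only the input length changes
    model = tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=(n_past, n_features)),
        tf.keras.layers.LSTM(units, return_sequences=True),
        tf.keras.layers.LSTM(units),
        tf.keras.layers.Dense(30 * 3),
        tf.keras.layers.Reshape([30, 3]),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

for n_past in (20, 40, 60):
    model = build_forecaster(n_past)
    # regenerate the sliding windows with this n_past, then:
    # model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=20, batch_size=128)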