I am experimenting with stateful LSTM on a time-series regression problem by using TensorFlow. I apologize that I cannot share the dataset. Below is my code.
train_feature = train_feature.reshape((train_feature.shape[0], 1, train_feature.shape[1]))
val_feature = val_feature.reshape((val_feature.shape[0], 1, val_feature.shape[1]))
batch_size = 64
model = tf.keras.Sequential()
model.add(tf.keras.layers.LSTM(50, batch_input_shape=(batch_size, train_feature.shape[1], train_feature.shape[2]), stateful=True))
model.add(tf.keras.layers.Dense(1))
model.compile(optimizer='adam',
loss='mse',
metrics=[tf.keras.metrics.RootMeanSquaredError()])
model.fit(train_feature, train_label,
epochs=10,
batch_size=batch_size)
When I run the above code, after the end of the first epoch, I will get an error as follows.
InvalidArgumentError: [_Derived_] Invalid input_h shape: [1,64,50] [1,49,50]
[[{{node CudnnRNN}}]]
[[sequential_1/lstm_1/StatefulPartitionedCall]] [Op:__inference_train_function_1152847]
Function call stack:
train_function -> train_function -> train_function
However, the model will be successfully trained if I change the batch_size to 1, and change the code for model training to the following.
total_epochs = 10
for i in range(total_epochs):
model.fit(train_feature, train_label,
epochs=1,
validation_data=(val_feature, val_label),
batch_size=batch_size,
shuffle=False)
model.reset_states()
Nevertheless, with a very large data (1 million rows), the model training will take a very long time since the batch_size is 1.
So, I wonder, how to train a stateful LSTM with a batch size larger than 1 (e.g. 64), without getting the invalid input_h shape error?
Thanks for your answers.
The fix is to ensure batch size never changes between batches. They must all be the same size.
Method 1
One way is to use a batch size that perfectly divides your dataset into equal-sized batches. For example, if total size of data is 1500 examples, then use a batch size of 50 or 100 or some other proper divisor of 1500.
Method 2
The other way is to ignore any batch that is less than the specified size, and this can be done using the TensorFlow Dataset API and setting the
drop_remainder
toTrue
.When using the Dataset API like above, you will need to also specify how many rounds of training count as an epoch (essentially how many batches to count as 1 epoch). A
tf.data.Dataset
instance (the result fromtf.data.Dataset.from_tensor_slices
) doesn't know the size of the data that it's streaming to the model, so what constitutes as one epoch has to be manually specified withsteps_per_epoch
.Your new code will look like this:
You can also include the validation set as well, like this (not showing other code):
Caveat: This means a few datapoints will never be seen by the model. To get around that, you can shuffle the dataset each round of training, so that the datapoints left behind each epoch changes, giving everyone a chance to be seen by the model.
Why the error occurs
Stateful RNNs and their variants (LSTM, GRU, etc.) require fixed batch size. The reason is simply because statefulness is one way to realize Truncated Backprop Through Time, by passing the final hidden state for a batch as the initial hidden state of the next batch. The final hidden state for the first batch has to have exactly the same shape as the initial hidden state of the next batch, which requires that batch size stay the same across batches.
When you set the batch size to 64,
model.fit
will use the remaining data at the end of an epoch as a batch, and this may not have up to 64 datapoints. So, you get such an error because the batch size is different from what the stateful LSTM expects. You don't have the problem with batch size of 1 because any remaining data at the end of an epoch will always contain exactly 1 datapoint, so no errors. More generally, 1 is always a divisor of any integer. So, if you picked any other divisor of your data size, you should not get the error.In the error message you posted, it appears the last batch has size of 49 instead of 64. On a side note: The reason the shapes look different from the input is because, under the hood, keras works with the tensors in time_major (i.e. the first axis is for steps of sequence). When you pass a tensor of shape (10, 15, 2) that represents (batch_size, steps_per_sequence, num_features), keras reshapes it to (15, 10, 2) under the hood.