My aim is to create an LSTM model (with TensorFlow/Keras) for time-series prediction.

To train the model, the time-series data (2 features, hourly values spanning several years) needs to be prepared.

The idea is to use a certain number of input values (e.g. 200) to predict a certain number of output values (e.g. 24).

While this reshaping is pretty straightforward with pandas and numpy, the question is how the tf.data.Dataset functions can be used to do it more efficiently.

To my knowledge, the Keras LSTM model expects input of the following shape: (nr_of_samples, nr_of_time_steps, features)

This means that in my case (200 time steps, 2 features) it is: (X, 200, 2)

Example transformation code:

Create data:

import numpy as np

# two synthetic features with 2000 time steps each
f_1 = np.linspace(1.4, 2000, num=2000)
f_2 = np.linspace(10, 20000, num=2000)

d = np.vstack((f_1, f_2)).T

Reshape data:


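time_step_back = 200    # number of input time steps per sample
time_step_forward = 2   # number of output time steps per sample

# first row index of every input / output window
# (+1 so that the last complete window is included)
n_windows = d.shape[0] - time_step_back - time_step_forward + 1
row_nr_X = np.arange(0, n_windows)
row_nr_y = row_nr_X + time_step_back

nr_features_X = d.shape[1]
nr_features_y = 1  # number of target features (here: only the first column)

# index matrices of shape (n_windows, window_length), built via broadcasting
row_select_X = row_nr_X[:, None] + np.arange(time_step_back)
row_select_y = row_nr_y[:, None] + np.arange(time_step_forward)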

# create empty input matrix
X = np.zeros((row_select_X.shape[0], time_step_back, nr_features_X))

# fill input matrix with correct values
for f in np.arange(0, nr_features_X):
    X[:,:,f] = d[row_select_X,f]

# create empty result matrix
y = np.zeros((row_select_y.shape[0], time_step_forward, nr_features_y))

# fill result matrix with correct values
for f in np.arange(0,nr_features_y):
    y[:,:,f] = d[row_select_y,f]

print(X[1,:,0])
print(y[1,:,0])

print(X.shape)
print(y.shape)
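
For comparison, the same kind of windows can probably be built without explicit index matrices. The following is only a sketch, assuming NumPy 1.20 or newer (for sliding_window_view); the helper name make_windows and the variable names X_sw / y_sw are made up for illustration. It yields one sample per complete window, i.e. 2000 - 200 - 2 + 1 = 1799 samples for the toy data.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_windows(data, steps_back, steps_forward, nr_targets=1):
    """Hypothetical helper: build (X, y) windows from a (time, features) array."""
    total = steps_back + steps_forward
    # shape (n_windows, features, total): the window axis is appended last
    windows = sliding_window_view(data, window_shape=total, axis=0)
    # move the window axis in front of the feature axis -> (n_windows, total, features)
    windows = np.moveaxis(windows, -1, 1)
    X = windows[:, :steps_back, :]            # input part of each window
    y = windows[:, steps_back:, :nr_targets]  # target part, first nr_targets columns
    return X, y


# same toy data as in the question
f_1 = np.linspace(1.4, 2000, num=2000)
f_2 = np.linspace(10, 20000, num=2000)
d = np.vstack((f_1, f_2)).T

X_sw, y_sw = make_windows(d, steps_back=200, steps_forward=2)
print(X_sw.shape)  # (1799, 200, 2)
print(y_sw.shape)  # (1799, 2, 1)
```

Note that the returned arrays are read-only views into the original data, so a .copy() would be needed before modifying them in place.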

Afterwards, the data is split into a training and a test set and scaled with MinMaxScaler before it is used.
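
A rough sketch of that step, under a few assumptions that are not fixed anywhere above: the split is chronological with an 80/20 ratio, the scaler is scikit-learn's MinMaxScaler, and it is fitted on the training part only. The arrays are random stand-ins with the same layout as X and y.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# stand-in arrays with the layout (samples, time_steps, features);
# the values are random and only serve as an illustration
rng = np.random.default_rng(0)
X = rng.random((1000, 200, 2))
y = rng.random((1000, 2, 1))

# chronological split without shuffling (the 80/20 ratio is an assumption)
split = int(0.8 * X.shape[0])
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# MinMaxScaler expects 2-D input, so collapse (samples, time_steps) into rows,
# scale per feature, and restore the original 3-D shape afterwards
n_feat_X = X.shape[2]
scaler_X = MinMaxScaler()
X_train_s = scaler_X.fit_transform(X_train.reshape(-1, n_feat_X)).reshape(X_train.shape)
X_test_s = scaler_X.transform(X_test.reshape(-1, n_feat_X)).reshape(X_test.shape)

# same procedure for the targets, with a separate scaler
n_feat_y = y.shape[2]
scaler_y = MinMaxScaler()
y_train_s = scaler_y.fit_transform(y_train.reshape(-1, n_feat_y)).reshape(y_train.shape)
y_test_s = scaler_y.transform(y_test.reshape(-1, n_feat_y)).reshape(y_test.shape)

print(X_train_s.shape, X_test_s.shape)
```

Keeping a separate scaler for the targets makes it possible to map predictions back to the original units with scaler_y.inverse_transform.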

To increase efficiency (at least I think it would), I'm wondering how the tf.data API can be used. The data could easily be shifted with the 'window' function.
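
A minimal sketch of what this could look like, assuming TensorFlow 2.4 or newer (for tf.data.AUTOTUNE) and the same toy data and window sizes as in the NumPy code. The batch size of 32 is an arbitrary choice, and I have not measured whether this is actually faster than the NumPy version.

```python
import numpy as np
import tensorflow as tf

time_step_back = 200
time_step_forward = 2
nr_features_y = 1
total = time_step_back + time_step_forward

# same toy data as in the NumPy example
f_1 = np.linspace(1.4, 2000, num=2000)
f_2 = np.linspace(10, 20000, num=2000)
d = np.vstack((f_1, f_2)).T.astype("float32")

ds = tf.data.Dataset.from_tensor_slices(d)
# every element becomes a small dataset of `total` consecutive rows
ds = ds.window(total, shift=1, drop_remainder=True)
# turn each window dataset into a single tensor of shape (total, features)
ds = ds.flat_map(lambda w: w.batch(total, drop_remainder=True))
# split each window into the input part and the target part
ds = ds.map(lambda w: (w[:time_step_back], w[time_step_back:, :nr_features_y]))
# batch size 32 is an assumption
ds = ds.batch(32).prefetch(tf.data.AUTOTUNE)

for X_batch, y_batch in ds.take(1):
    print(X_batch.shape)  # (32, 200, 2)
    print(y_batch.shape)  # (32, 2, 1)
```

A dataset built this way can be passed directly to model.fit; for a train/test split, the array d would be split chronologically before the datasets are created.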

Do you have any recommendations? Is it worth the effort?

Additional Questions:

Furthermore, I have some more general questions, as I do not have much experience with LSTM models:

  • Is it common / useful to include more time-related features such as month, year, hour, etc.?
  • Is it better to include relative values (the change of a parameter over time) rather than absolute values?
  • Are there general rules of thumb for the ratio of adjustable model parameters to the number of data points?
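
To make the first two questions concrete, this is roughly what I mean by a time-related feature and a relative value. It is only a sketch with a made-up hourly series; the column layout and the names values, delta and hour_of_day are assumptions for illustration.

```python
import numpy as np

# hypothetical hourly series, two weeks long (random values, for illustration only)
rng = np.random.default_rng(0)
n_hours = 24 * 14
values = rng.random(n_hours)

# time-related feature: hour of day (0..23), derived from the position in the series
hour_of_day = np.arange(n_hours) % 24

# relative values: change with respect to the previous hour
# (the first entry has no predecessor, so its change is set to 0 here)
delta = np.diff(values, prepend=values[0])

# columns: absolute value, change per hour, hour of day
features = np.column_stack((values, delta, hour_of_day))
print(features.shape)  # (336, 3)
```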
