tf.data performance for timeseries forecasting

I am training a time series forecasting model in TensorFlow. I create a tf.data.Dataset containing batches of data windows, following the approach presented in the example notebook https://www.tensorflow.org/tutorials/structured_data/time_series#4_create_tfdatadatasets.
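For reference, the relevant part of the tutorial's pipeline, paraphrased from its WindowGenerator.make_dataset (split_window slices each batch of windows into inputs and labels):

def make_dataset(self, data):
    data = np.array(data, dtype=np.float32)
    ds = tf.keras.utils.timeseries_dataset_from_array(
        data=data,
        targets=None,
        sequence_length=self.total_window_size,
        sequence_stride=1,
        shuffle=True,
        batch_size=32,
    )
    # The map step that the profiler flags as the bottleneck:
    ds = ds.map(self.split_window)
    return ds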

However, profiling shows that this results in an input pipeline bottleneck. The TensorBoard profiler suggests performing the map operation offline, but I have not found a way to do so. I tried changing the num_parallel_calls argument of map and applying the prefetch and cache transformations, without any improvement in training time.
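Concretely, what I tried corresponds to the standard tuning recipe below (split is the window-splitting function; AUTOTUNE lets tf.data pick the parallelism automatically):

ds = ds.map(split, num_parallel_calls=tf.data.AUTOTUNE)
ds = ds.cache()
ds = ds.prefetch(tf.data.AUTOTUNE)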

Finally, I decided to replace the Dataset.map() call with a simple for loop that does the splitting eagerly:

import tensorflow as tf

# Build batches of sliding windows (all batches contain 32 windows
# except, possibly, the last one).
dataset = tf.keras.utils.timeseries_dataset_from_array(
    data=data,
    targets=None,
    sequence_length=self.total_window_size,
    sequence_stride=1,
    shuffle=True,
    batch_size=32,
)

# Split every batch once, eagerly, instead of mapping split() lazily.
input_tensor_list = []
labels_tensor_list = []
for window_batch in dataset.as_numpy_iterator():
    input_tensor, labels_tensor = split(window=window_batch)
    input_tensor_list.append(input_tensor)
    labels_tensor_list.append(labels_tensor)

# tf.stack needs every batch to have the same shape, so the last
# (partial) batch has to be dropped beforehand.
result_dataset = tf.data.Dataset.from_tensor_slices(
    (tf.stack(input_tensor_list), tf.stack(labels_tensor_list)))

This improves my training time (a 25% reduction). However, it requires the last batch to have the same size as all the others, so I have to drop it, which the map-based approach did not require.
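For context, the constraint comes from tf.stack, which requires all of its inputs to have identical shapes. A minimal illustration with hypothetical shapes (windows of 10 timesteps with 3 features):

import tensorflow as tf

full = tf.zeros([32, 10, 3])    # a full batch: 32 windows
partial = tf.zeros([7, 10, 3])  # the leftover partial batch
tf.stack([full, partial])       # fails: stacked tensors must all have the same shape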

I would like to know 1) whether there is a way to speed up map as used in the example notebook, or 2) how I can modify my code to avoid dropping the last batch of data. Thank you.
