What is best way to convert time series data (parquet format) into sequences using petastorm?


Pardon me if I use the terms in the wrong sense. I am still grappling with many Spark and distributed-computing concepts.

Here is my use case and I am not able to get a complete picture of the implementation.

I have time-series data of 40 columns and 100 timesteps saved in parquet format.

I have learned that to do distributed training on big data, we can use petastorm for data ingestion and Horovod for training. But it is unclear to me how the data needs to be partitioned (one partition per ID? what are row groups?) and how to convert the data into the sequences that an LSTM expects.

Any pointers in this direction will be of great help. Thanks!

1 Answer

Answered by mirko:

I can think of two ways to load time-series data using petastorm. The first is to group by the id column and then aggregate the features into arrays using, e.g., the SQL function collect_list (make sure the arrays are sorted by time). This will give you a table that looks something like this:

id  |         time         |      feature_1       |
---------------------------------------------------
1   | [t11, t12, t13, ...] | [f11, f12, f13, ...] |
2   | [t21, t22, t23, ...] | [f21, f22, f23, ...] |

When you save the data like this, you should not need to worry about Parquet row groups, because each row contains all the data for exactly one time series.
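To make that aggregation concrete, here is a minimal plain-Python sketch of the same logic on a toy dataset (in Spark you would typically do the equivalent with groupBy plus collect_list over a (time, feature) struct, sorted by time; the function and column names below are illustrative, not petastorm's API):

```python
from collections import defaultdict

def aggregate_series(rows):
    """Group (id, time, feature) rows by id and collect
    time-sorted arrays, mimicking a groupBy + collect_list."""
    grouped = defaultdict(list)
    for row_id, t, feature in rows:
        grouped[row_id].append((t, feature))
    result = {}
    for row_id, pairs in grouped.items():
        pairs.sort(key=lambda p: p[0])  # ensure arrays are sorted by time
        result[row_id] = {
            "time": [t for t, _ in pairs],
            "feature_1": [f for _, f in pairs],
        }
    return result

# rows arrive unsorted, as they might from a distributed read
rows = [(1, 2, "f12"), (1, 1, "f11"), (2, 2, "f22"), (2, 1, "f21")]
print(aggregate_series(rows))
```

Each entry of the result corresponds to one row of the aggregated table above: one id, one sorted time array, one feature array.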

The other option is to use n-grams to load the unaggregated data. N-grams allow you to load rows in a specific order. There are some examples of this in the petastorm API docs under petastorm.ngram.NGram. Note that if you follow this approach, you do need to worry about Parquet row groups, because n-grams do not span row groups (see the example described in the API docs). I am not sure whether partitioning by, e.g., id will always ensure that all the data for one time series ends up in one row group. You may also need to set the row-group size to a value that depends on the size of your time series.
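To illustrate what an n-gram over time-series rows gives you: it effectively yields windows of consecutive timesteps, which is also the shape an LSTM input sequence takes. Below is a plain-Python sketch of that windowing (this is just the concept, not petastorm's API; the window length is made up):

```python
def sliding_windows(series, length):
    """Yield every window of `length` consecutive timesteps,
    which is what an n-gram over time-sorted rows amounts to."""
    for start in range(len(series) - length + 1):
        yield series[start:start + length]

# one time series of 5 timesteps, window length 3
timesteps = ["t1", "t2", "t3", "t4", "t5"]
for window in sliding_windows(timesteps, 3):
    print(window)  # each window would become one LSTM input sequence
```

The row-group caveat above then becomes intuitive: if a window's rows are split across two row groups, petastorm cannot assemble that n-gram, so sequences near row-group boundaries are lost unless the row-group size is chosen to fit whole time series.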