Loading columnar-structured time-series data faster into NumPy arrays


Hi! Are there any ways to load large, (ideally) compressed, columnar-structured data faster into NumPy arrays in Python? Having considered common solutions such as Pandas, Apache Parquet/Feather, and HDF5, I am struggling to find a suitable approach for my (time-series) problem.

As expected, representing my data as a NumPy array yields by far the fastest execution time for search problems such as binary search, significantly outperforming the same analysis on a Pandas DataFrame. On the other hand, when I store my data as .npz files, loading the .npz directly into NumPy arrays takes much longer than loading the same data into a DataFrame with the fastparquet engine from columnar .parquet storage. The Parquet route, however, requires me to call .to_numpy() on the resulting DataFrame, which again causes heavy delays before I can access the underlying NumPy representation.
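For concreteness, the kind of search I mean is a plain binary search (np.searchsorted) on a sorted key column; a minimal sketch with made-up column names and sizes:

```python
import numpy as np
import pandas as pd

# Made-up example: a sorted timestamp key column plus one value column.
n = 1_000_000
timestamps = np.arange(n, dtype="int64")
values = np.random.default_rng(0).random(n)
df = pd.DataFrame({"ts": timestamps, "val": values})

# Binary search directly on the NumPy array.
idx = np.searchsorted(timestamps, 500_000)

# The same lookup via the DataFrame needs a NumPy view of the column first.
idx_df = np.searchsorted(df["ts"].to_numpy(), 500_000)
```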

As mentioned above, one alternative I tried was to store the data in a format that can be loaded into a NumPy array without any intermediate conversion steps. However, loading is much slower when the data is stored as an .npz file (a table with >10M records and >10 columns) than when the same data is stored as a .parquet file.
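To make the comparison concrete, here is a minimal timing sketch of the two loading paths I am contrasting (file names and column layout are made up, not my actual data):

```python
import time

import numpy as np
import pandas as pd
from fastparquet import ParquetFile, write

# Made-up table, scaled down here; the real one has >10M rows and >10 columns.
n = 1_000_000
data = {f"col{i}": np.random.default_rng(i).random(n) for i in range(10)}
df = pd.DataFrame(data)

# Write both representations once.
np.savez_compressed("table.npz", **data)
write("table.parquet", df)

# Path 1: compressed .npz straight into NumPy arrays.
t0 = time.perf_counter()
with np.load("table.npz") as npz:
    arrays_npz = {k: npz[k] for k in npz.files}   # decompression happens here
print("npz    :", time.perf_counter() - t0)

# Path 2: .parquet -> DataFrame (fastparquet) -> NumPy via .to_numpy().
t0 = time.perf_counter()
loaded = ParquetFile("table.parquet").to_pandas()
arrays_pq = {c: loaded[c].to_numpy() for c in loaded.columns}
print("parquet:", time.perf_counter() - t0)
```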


1 Answer

Answer by mdurant (accepted):

Actually, fastparquet supports loading your data into a dictionary of NumPy arrays, if you set these up beforehand; this is a "hidden" feature. If you give details of the dtype and size of the data you wish to load, this answer can be edited accordingly.
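One possible sketch of this, assuming the pre-allocation hook is the assign= argument that to_pandas uses internally; this is an internal, version-dependent detail of fastparquet, so the exact calls and names may need adjusting:

```python
import numpy as np
from fastparquet import ParquetFile

pf = ParquetFile("table.parquet")                      # hypothetical file name

# Total row count and per-column dtypes must be known (or derived) up front.
n_rows = sum(rg.num_rows for rg in pf.row_groups)
out = {name: np.empty(n_rows, dtype=dtype) for name, dtype in pf.dtypes.items()}

# Read each row group directly into a slice of the pre-allocated arrays.
start = 0
for rg in pf.row_groups:
    chunk = {name: arr[start:start + rg.num_rows] for name, arr in out.items()}
    pf.read_row_group_file(rg, pf.columns, None, assign=chunk)  # internal API
    start += rg.num_rows

# `out` is now a plain dict of NumPy arrays; no DataFrame was materialized.
```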

"to call .to_numpy() on the resulting DataFrame, which again causes heavy delays"

This is very surprising; it should normally be a copy-free view of the same underlying data.
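A quick way to check whether a copy actually happens (an illustrative snippet, not part of the original answer): on a frame with a single numeric dtype, .to_numpy() normally returns a view of the existing data, while mixed dtypes force an upcast and a real copy.

```python
import numpy as np
import pandas as pd

# Homogeneous dtypes: to_numpy() can usually return a view, not a copy.
df = pd.DataFrame({"a": np.arange(5, dtype="f8"), "b": np.zeros(5)})
print(np.shares_memory(df["a"].to_numpy(), df.to_numpy()))   # typically True

# Mixed dtypes: the whole frame is upcast (here to object), which copies
# and is what makes .to_numpy() slow on a large heterogeneous table.
mixed = pd.DataFrame({"a": np.arange(5), "b": ["x"] * 5})
print(mixed.to_numpy().dtype)                                # object
```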