I have a directory of timeseries data stored as CSV files, one file per day. How do I load and process it efficiently with Dask DataFrame?
Disclaimer: I maintain Dask. This question occurs often enough in other channels that I decided to add a question here on StackOverflow to which I can point people in the future.
Simple Solution
If you just want to get something quickly then simple use of
dask.dataframe.read_csv
using a globstring for the path should suffice:Keyword arguments
The
dask.dataframe.read_csv
function supports most of thepandas.read_csv
keyword arguments, so you might want to tweak things a bit.Set the index
Many operations like groupbys, joins, index lookup, etc. can be more efficient if the target column is the index. For example if the timestamp column is made to be the index then you can quickly look up the values for a particular range easily, or you can join efficiently with another dataframe along time. The savings here can easily be 10x.
The naive way to do this is to use the
set_index
methodHowever if you know that your new index column is sorted then you can make this much faster by passing the
sorted=True
keyword argumentDivisions
In the above case we still pass through the data once to find good breakpoints. However if your data is already nicely segmented (such as one file per day) then you can give these division values to set_index to avoid this initial pass (which can be costly for a large amount of CSV data.
This solution correctly and cheaply sets the timestamp column as the index (allowing for efficient computations in the future).
Convert to another format
CSV is a pervasive and convenient format. However it is also very slow. Other formats like Parquet may be of interest to you. They can easily be 10x to 100x faster.