Read Directory of Timeseries CSV data efficiently with Dask DataFrame and Pandas

Question

Asked by MRocklin on 24 December 2016 at 16:38 (1.6k views)

I have a directory of timeseries data stored as CSV files, one file per day. How do I load and process it efficiently with Dask DataFrame?

Disclaimer: I maintain Dask. This question occurs often enough in other channels that I decided to add a question here on StackOverflow to which I can point people in the future.

There is 1 answer

**MRocklin** · Answer 1 · 2016-12-24T16:38:10+00:00

Simple Solution

If you just want to get something working quickly, a call to dask.dataframe.read_csv with a globstring for the path should suffice:

import dask.dataframe as dd
df = dd.read_csv('2000-*.csv')

Keyword arguments

The dask.dataframe.read_csv function supports most of the pandas.read_csv keyword arguments, so you might want to tweak things a bit, for example to parse the timestamp column as datetimes:

df = dd.read_csv('2000-*.csv', parse_dates=['timestamp'])

Set the index

Many operations like groupbys, joins, and index lookups can be more efficient if the target column is the index. For example, if the timestamp column is the index, then you can quickly look up the values for a particular time range, or join efficiently with another dataframe along time. The savings here can easily be 10x.

The naive way to do this is to use the set_index method:

df2 = df.set_index('timestamp')

However, if you know that your new index column is already sorted, then you can make this much faster by passing the sorted=True keyword argument:

df2 = df.set_index('timestamp', sorted=True)

Divisions

In the above case we still pass through the data once to find good breakpoints. However, if your data is already nicely segmented (such as one file per day), then you can give these division values to set_index to avoid this initial pass, which can be costly for a large amount of CSV data. The divisions must contain one more value than there are partitions: the lower bound of each partition plus the upper bound of the last.

import pandas as pd
divisions = tuple(pd.date_range(start='2000', end='2001', freq='1D'))
df2 = df.set_index('timestamp', sorted=True, divisions=divisions)

This solution correctly and cheaply sets the timestamp column as the index (allowing for efficient computations in the future).

Convert to another format

CSV is a pervasive and convenient format. However, it is also very slow to parse. Other formats like Parquet may be of interest to you. They can easily be 10x to 100x faster to read.
