I have a large dataset of time series data (350 million rows, 15 GB) with date-times at half-hourly resolution.

I am therefore using dask to handle the data and parallelise as much as possible.

I'm stuck on what should be a trivial task. I have a list of dates that are holidays, created with the holidays package:

NSWholidays = holidays.Australia(years=[2010, 2011, 2012, 2013, 2014], state='NSW')
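
For what it's worth, the holidays object behaves like a dict keyed by datetime.date, so plain membership tests work as expected (a quick sanity check, assuming the NSWholidays object above):

import datetime
datetime.date(2011, 1, 26) in NSWholidays   # True: Australia Day
datetime.date(2011, 1, 27) in NSWholidays   # False: an ordinary Thursday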

And I have a 'date' column in my dask dataframe.

I want to add a new column called 'IsWorkDay', where 1 marks days that fall on Monday to Friday and are not holidays, and 0 marks weekends or holidays.

I've tried dozens of combinations trying to find the syntax dask requires to parallelise this, but the only solution I've managed to get working uses .apply, which is frustratingly slow for the task (multiple hours). In short, the line below works but is too slow:

SGSCData['IsWorkDay'] = SGSCData.apply(lambda row: int(row.weekday<6 and not row.Date in NSWholidays), axis=1, meta=(None, 'int64'))

How can I make this faster?

Thanks in advance

1 Answer

user32185:

First of all, I'm not quite sure about your logic for weekends. Anyway, here is what I was able to do without using apply or map_partitions. My dask.__version__ is 1.1.1. As you can see, using exactly the same vectorised solution you would use in pandas, the computation takes less than 15 s on an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz:

import pandas as pd
import holidays
import dask.dataframe as dd

NSWholidays = holidays.Australia(years=list(range(2010, 2021)), state='NSW')

# Build a synthetic frame: 10 years at 10-second resolution,
# which is slightly more than 31.5M rows
dates = pd.date_range(start='2010-01-01', end='2020-01-01', freq="10S")
df = pd.DataFrame({"Date": dates})
ddf = dd.from_pandas(df, npartitions=10)

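# Vectorised mask, same as in plain pandas: weekday >= 5 is Saturday/Sunday,
# and isin flags dates present in the holiday dict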
ddf["IsWorkDay"] = (~((ddf["Date"].dt.weekday>=5) | 
                      (ddf["Date"].dt.date.isin(NSWholidays)))).astype(int)

%%time
ddf = ddf.compute()

CPU times: user 1.07 s, sys: 1.48 s, total: 2.55 s
Wall time: 13.9 s
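
If you ever do need per-partition control (for example, a lookup too complex to vectorise directly), the same logic can also be wrapped in map_partitions. This is a minimal sketch, assuming the dask dataframe ddf and the NSWholidays object from above (before the final .compute()); it should perform similarly, since each partition still runs the same vectorised pandas code:

def add_is_work_day(pdf):
    # pdf is a plain pandas DataFrame holding one partition
    is_weekend = pdf["Date"].dt.weekday >= 5
    is_holiday = pdf["Date"].dt.date.isin(NSWholidays)
    return pdf.assign(IsWorkDay=(~(is_weekend | is_holiday)).astype("int64"))

ddf = ddf.map_partitions(add_is_work_day)

The direct column assignment is simpler here, but the function form keeps the whole transformation in one named, testable unit.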