I have a large time series data set (350 million rows, 15 GB) with date-times at half-hourly resolution.
I am therefore using dask to handle and parallelize as much as possible.
I'm stuck on what should be a trivial task. I have a list of dates that are holidays, created using the holidays package:
NSWholidays = holidays.Australia(years=[2010, 2011, 2012, 2013, 2014], state='NSW')
And I have a 'date' column in my dask dataframe.
I want to add a new column called 'IsWorkDay', where 1 marks days that are not holidays and fall on Monday to Friday, and 0 marks weekends or holidays.
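To make the target concrete, here is a minimal pandas sketch on made-up data showing the values I expect. The toy frame and the hard-coded `holiday_dates` (standing in for `NSWholidays`) are assumptions for illustration only, and it assumes Monday is 0 as in pandas' `dayofweek`. My real data is a dask dataframe, so this only shows the intended result, not necessarily how to get it in dask.

```python
import pandas as pd

# Hypothetical toy data: one week (Mon 21 Apr to Sun 27 Apr 2014),
# each timestamp at 00:30 to mimic half-hourly readings.
df = pd.DataFrame(
    {"date": pd.date_range("2014-04-21 00:30", periods=7, freq="D")}
)

# Assumed stand-in for NSWholidays: Easter Monday and Anzac Day 2014.
holiday_dates = pd.to_datetime(["2014-04-21", "2014-04-25"])

# Monday=0 ... Friday=4, so "< 5" keeps Monday to Friday.
is_weekday = df["date"].dt.dayofweek < 5

# normalize() drops the time of day so timestamps compare against whole dates.
is_holiday = df["date"].dt.normalize().isin(holiday_dates)

df["IsWorkDay"] = (is_weekday & ~is_holiday).astype("int64")
print(df)
```

For this week I would expect 0 on the two holidays (Monday and Friday) and on the weekend, and 1 on Tuesday to Thursday.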
I've tried dozens of combinations to find the syntax dask requires to parallelise this, but the only solution I've managed to get working uses .apply, which is frustratingly slow for the task (multiple hours). In short, the line below works but is too slow:
SGSCData['IsWorkDay'] = SGSCData.apply(lambda row: int(row.weekday < 6 and row.Date not in NSWholidays), axis=1, meta=(None, 'int64'))
How can I make this faster?
Thanks in advance