I have two datasets that look like this:
What I'd like to do is filter the non-trading days out of the "data" dataframe. I assume that means comparing each row's data.index.date against the index dates of trading_days and keeping the row only when there's a match; if there's no match, the day isn't a trading day and the row is dropped. This effectively filters the non-trading days out of the dataset.
However, checking the dates row by row with an apply() function seems inefficient - I feel like there must be a faster, vectorized way to do this, since I'll be running it on a 180M-row dataframe.
Is there some kind of "merge" or "join" like:
data.join(trading_days)
that will keep only the rows where data.index.date matches? I need to keep everything at the minute level (as shown in the "data" dataframe) and simply filter out the non-trading dates. Thanks for your help!
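For reference, here is a minimal sketch of the vectorized filter I have in mind, with toy stand-ins for both frames (the shape of trading_days is an assumption, since the real one comes from a CSV): normalize() floors each minute timestamp to midnight, so the whole index can be matched against the set of trading dates in one shot instead of row by row.

```python
import pandas as pd

# toy minute-level data with a DatetimeIndex, mirroring the question's layout
idx = pd.to_datetime([
    "2005-01-02 13:59", "2005-01-02 14:00",  # assumed non-trading day
    "2005-01-03 09:30", "2005-01-03 09:31",  # assumed trading day
])
data = pd.DataFrame({"close": [438.9, 438.7, 439.0, 439.1]}, index=idx)

# trading_days assumed here to be a DatetimeIndex of valid session dates
trading_days = pd.to_datetime(["2005-01-03"])

# normalize() drops the time-of-day component; isin() does a vectorized
# membership test against the trading dates - no apply() needed
mask = data.index.normalize().isin(trading_days)
filtered = data[mask]
```

On a large frame this avoids any Python-level per-row work; the membership test runs over the whole index at once.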
UPDATE to include values (please let me know if there's a better way to paste these):
In[5]: data.head(30).values
Out[6]:
array([[ 438.9, 438.9, 438.9, 438.9, 0. ],
[ 438.9, 438.9, 438.7, 438.7, 31. ],
[ 438.6, 438.6, 438.6, 438.6, 7. ],
[ 438.4, 438.7, 438.4, 438.4, 4. ],
[ 438.4, 438.4, 438.3, 438.3, 4. ],
[ 438.2, 438.2, 438.2, 438.2, 1. ],
[ 438.2, 438.2, 438.2, 438.2, 0. ],
[ 438.2, 438.2, 438.2, 438.2, 1. ],
[ 438.2, 438.2, 438.2, 438.2, 0. ],
[ 438.1, 438.1, 438.1, 438.1, 3. ],
[ 438. , 438. , 437.9, 438. , 6. ],
[ 438. , 438.2, 438. , 438. , 8. ],
[ 438.2, 438.2, 438.1, 438.1, 6. ],
[ 438.1, 438.1, 438.1, 438.1, 4. ],
[ 438.1, 438.1, 438.1, 438.1, 0. ],
[ 438.3, 438.3, 438.3, 438.3, 1. ],
[ 438.3, 438.3, 438.3, 438.3, 0. ],
[ 438.3, 438.3, 438.3, 438.3, 0. ],
[ 438.1, 438.1, 438.1, 438.1, 1. ],
[ 438. , 438. , 437.9, 437.9, 54. ],
[ 437.8, 437.8, 437.8, 437.8, 10. ],
[ 437.8, 437.8, 437.8, 437.8, 1. ],
[ 437.8, 437.8, 437.8, 437.8, 6. ],
[ 437.8, 437.8, 437.8, 437.8, 0. ],
[ 437.9, 438. , 437.9, 438. , 12. ],
[ 437.9, 438. , 437.9, 438. , 0. ],
[ 437.9, 438. , 437.9, 438. , 0. ],
[ 437.9, 438. , 437.9, 438. , 0. ],
[ 437.9, 437.9, 437.9, 437.9, 1. ],
[ 437.9, 437.9, 437.8, 437.8, 4. ]])
And here are the time stamps:
In[10]: data.head(30).index.values
Out[11]:
array(['2005-01-02T13:59:00.000000000-0500',
'2005-01-02T14:00:00.000000000-0500',
'2005-01-02T14:01:00.000000000-0500',
'2005-01-02T14:02:00.000000000-0500',
'2005-01-02T14:03:00.000000000-0500',
'2005-01-02T14:04:00.000000000-0500',
'2005-01-02T14:05:00.000000000-0500',
'2005-01-02T14:06:00.000000000-0500',
'2005-01-02T14:07:00.000000000-0500',
'2005-01-02T14:08:00.000000000-0500',
'2005-01-02T14:09:00.000000000-0500',
'2005-01-02T14:10:00.000000000-0500',
'2005-01-02T14:11:00.000000000-0500',
'2005-01-02T14:12:00.000000000-0500',
'2005-01-02T14:13:00.000000000-0500',
'2005-01-02T14:14:00.000000000-0500',
'2005-01-02T14:15:00.000000000-0500',
'2005-01-02T14:16:00.000000000-0500',
'2005-01-02T14:17:00.000000000-0500',
'2005-01-02T14:18:00.000000000-0500',
'2005-01-02T14:19:00.000000000-0500',
'2005-01-02T14:20:00.000000000-0500',
'2005-01-02T14:21:00.000000000-0500',
'2005-01-02T14:22:00.000000000-0500',
'2005-01-02T14:23:00.000000000-0500',
'2005-01-02T14:24:00.000000000-0500',
'2005-01-02T14:25:00.000000000-0500',
'2005-01-02T14:26:00.000000000-0500',
'2005-01-02T14:27:00.000000000-0500',
'2005-01-02T14:28:00.000000000-0500'], dtype='datetime64[ns]')
And the trading_days is a read.csv from here: http://pastebin.com/5N01Gi5V
SECOND UPDATE:
You can do the join the following way: add a days column to data which contains the day of the index, then use pd.merge(days, data, on='days'). This does an inner join by default, so only the rows from data with days which appear in the days frame will be in the result.
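A minimal sketch of that merge, with toy stand-ins for both frames (the column names days and ts are assumptions for illustration):

```python
import pandas as pd

# toy minute-level data, as in the question
idx = pd.to_datetime(["2005-01-02 13:59", "2005-01-03 09:30", "2005-01-03 09:31"])
data = pd.DataFrame({"close": [438.9, 439.0, 439.1]}, index=idx)

# assumed one-column frame of valid trading dates
days = pd.DataFrame({"days": pd.to_datetime(["2005-01-03"])})

# move the index into a regular column, then derive a 'days' column
# holding just the date part of each timestamp
data = data.reset_index().rename(columns={"index": "ts"})
data["days"] = data["ts"].dt.normalize()

# inner join by default: only rows whose day appears in `days` survive
result = pd.merge(days, data, on="days")
```

The minute-level timestamps are preserved in the ts column, so the result is still minute-resolution data, just restricted to trading days.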