I have a dataframe with a multi-index: "subject" and "datetime". Each row corresponds to a subject and a datetime, and columns of the dataframe correspond to various measurements.
The range of days differs per subject, and some days can be missing for a given subject (see example). Moreover, a subject can have one or several values for a given day.
I want to resample the dataframe so that:
- there is only one row per day per subject (I do not care about time of day),
- each column value is the last non-NaN value of the day (and NaN if there is no value for that day),
- days with no values on any column are not created or kept.
For instance, the following dataframe example:
                                a       b
subject  datetime
patient1 2018-01-01 00:00:00  2.0    high
         2018-01-01 01:00:00  NaN  medium
         2018-01-01 02:00:00  6.0     NaN
         2018-01-01 03:00:00  NaN     NaN
         2018-01-02 00:00:00  4.3     low
patient2 2018-01-01 00:00:00  NaN  medium
         2018-01-01 02:00:00  NaN     NaN
         2018-01-01 03:00:00  5.0     NaN
         2018-01-03 00:00:00  9.0     NaN
         2018-01-04 02:00:00  NaN     NaN
should return:
                                a       b
subject  datetime
patient1 2018-01-01 00:00:00  6.0  medium
         2018-01-02 00:00:00  4.3     low
patient2 2018-01-01 00:00:00  5.0  medium
         2018-01-03 00:00:00  9.0     NaN
I have spent a lot of time trying to obtain this using resample with the 'pad' option, but I always get either errors or a result that is not what I want. Can anybody help?
Note: Here is the code to create the example dataframe:
import pandas as pd
import numpy as np
index = pd.MultiIndex.from_product([['patient1', 'patient2'],
                                    pd.date_range('20180101', periods=4, freq='h')])
df = pd.DataFrame({'a': [2, np.nan, 6, np.nan, np.nan, np.nan, np.nan, 5],
                   'b': ['high', 'medium', np.nan, np.nan, 'medium', 'low', np.nan, np.nan]},
                  index=index)
df.index.names = ['subject', 'datetime']
df = df.drop(df.index[5])
df.at[('patient2', '2018-01-03 00:00:00'), 'a'] = 9
df.at[('patient2', '2018-01-04 02:00:00'), 'a'] = None
df.at[('patient1', '2018-01-02 00:00:00'), 'a'] = 4.3
df.at[('patient1', '2018-01-02 00:00:00'), 'b'] = 'low'
df = df.sort_index(level=['subject', 'datetime'])
Let's floor the datetime on daily frequency, then groupby the dataframe on subject + floored timestamp and agg using last, finally drop the rows having all NaN's:
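A minimal sketch of these steps, not necessarily the exact code the answer had in mind. It assumes the index levels are named subject and datetime as in the question; the example frame is rebuilt here from explicit rows (rather than with the question's setup code) so the snippet is self-contained, and the names rows, subjects, days and result are only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question, row by row.
rows = [
    ("patient1", "2018-01-01 00:00:00", 2.0, "high"),
    ("patient1", "2018-01-01 01:00:00", np.nan, "medium"),
    ("patient1", "2018-01-01 02:00:00", 6.0, np.nan),
    ("patient1", "2018-01-01 03:00:00", np.nan, np.nan),
    ("patient1", "2018-01-02 00:00:00", 4.3, "low"),
    ("patient2", "2018-01-01 00:00:00", np.nan, "medium"),
    ("patient2", "2018-01-01 02:00:00", np.nan, np.nan),
    ("patient2", "2018-01-01 03:00:00", 5.0, np.nan),
    ("patient2", "2018-01-03 00:00:00", 9.0, np.nan),
    ("patient2", "2018-01-04 02:00:00", np.nan, np.nan),
]
df = pd.DataFrame(rows, columns=["subject", "datetime", "a", "b"])
df["datetime"] = pd.to_datetime(df["datetime"])
df = df.set_index(["subject", "datetime"])

# Group keys: the subject level, and the datetime level floored to midnight.
subjects = df.index.get_level_values("subject")
days = df.index.get_level_values("datetime").floor("D")

# GroupBy.last() takes the last non-null value per column within each group;
# dropna(how="all") then removes days where every column is NaN.
result = df.groupby([subjects, days]).last().dropna(how="all")
print(result)
```

Because last() skips nulls column by column, a and b for the same day may come from different rows, which is what the question asks for. No resample is involved, so missing days are never created in the first place.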