pandas: resample a multi-index dataframe

Question

Asked by Stéphane Deparis on 12 October 2020 at 13:00

I have a dataframe with a multi-index: "subject" and "datetime". Each row corresponds to a subject and a datetime, and columns of the dataframe correspond to various measurements.

The range of days differs per subject, and some days can be missing for a given subject (see example). Moreover, a subject can have one or several values for a given day.

I want to resample the dataframe so that:

there is only one row per day per subject (I do not care about time of day),
each column value is the last non-NaN of the day (and NaN if there is no value for that day),
days with no values on any column are not created or kept.

For instance, the following dataframe example:

                                a       b
subject  datetime                        
patient1 2018-01-01 00:00:00  2.0    high
         2018-01-01 01:00:00  NaN  medium
         2018-01-01 02:00:00  6.0     NaN
         2018-01-01 03:00:00  NaN     NaN
         2018-01-02 00:00:00  4.3     low
patient2 2018-01-01 00:00:00  NaN  medium
         2018-01-01 02:00:00  NaN     NaN
         2018-01-01 03:00:00  5.0     NaN
         2018-01-03 00:00:00  9.0     NaN
         2018-01-04 02:00:00  NaN     NaN

should return:

                                a       b
subject  datetime                        
patient1 2018-01-01 00:00:00  6.0  medium
         2018-01-02 00:00:00  4.3     low
patient2 2018-01-01 00:00:00  5.0  medium
         2018-01-03 00:00:00  9.0     NaN

I spent too much time trying to obtain this using resample with the 'pad' option, but I always get errors or not the result I want. Can anybody help?

Note: Here is the code to create the example dataframe:

import pandas as pd
import numpy as np

index = pd.MultiIndex.from_product([['patient1', 'patient2'], pd.date_range('20180101', periods=4,
                                      freq='h')])

df = pd.DataFrame({'a': [2, np.nan, 6, np.nan, np.nan, np.nan, np.nan, 5], 'b': ['high', 'medium', np.nan, np.nan, 'medium', 'low', np.nan, np.nan]},
                  index=index)
df.index.names = ['subject', 'datetime']

df = df.drop(df.index[5])
df.at[('patient2', '2018-01-03 00:00:00'), 'a'] = 9
df.at[('patient2', '2018-01-04 02:00:00'), 'a'] = None
df.at[('patient1', '2018-01-02 00:00:00'), 'a'] = 4.3
df.at[('patient1', '2018-01-02 00:00:00'), 'b'] = 'low'

df = df.sort_index(level=['subject', 'datetime'])
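
The construction above relies on adding rows through df.at with string timestamps. As a hedged alternative, the same example can be sketched from explicit records, which keeps the datetime level a proper datetime64 index. The name `records` is an illustrative assumption, and the values are copied from the example table in the question:

```python
import numpy as np
import pandas as pd

# (subject, timestamp, a, b) records copied from the example table
records = [
    ("patient1", "2018-01-01 00:00:00", 2.0, "high"),
    ("patient1", "2018-01-01 01:00:00", np.nan, "medium"),
    ("patient1", "2018-01-01 02:00:00", 6.0, np.nan),
    ("patient1", "2018-01-01 03:00:00", np.nan, np.nan),
    ("patient1", "2018-01-02 00:00:00", 4.3, "low"),
    ("patient2", "2018-01-01 00:00:00", np.nan, "medium"),
    ("patient2", "2018-01-01 02:00:00", np.nan, np.nan),
    ("patient2", "2018-01-01 03:00:00", 5.0, np.nan),
    ("patient2", "2018-01-03 00:00:00", 9.0, np.nan),
    ("patient2", "2018-01-04 02:00:00", np.nan, np.nan),
]

df = pd.DataFrame(records, columns=["subject", "datetime", "a", "b"])
# Parse the strings so the index level holds real timestamps
df["datetime"] = pd.to_datetime(df["datetime"])
df = df.set_index(["subject", "datetime"]).sort_index()
print(df)
```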

There are 3 answers

Alex On 12 October 2020 at 14:18

# Drop the rows where both a and b are NaN; they carry no information
df = df.reset_index().dropna(subset=["a", "b"], how="all")

# Add a day column, used to find the last value of each day
df["dt_day"] = df["datetime"].dt.date

# d1 is the result dataframe: one row per subject and day (first timestamp of the day)
d1 = df.drop_duplicates(subset=["subject", "dt_day"]).loc[:, ["subject", "datetime", "dt_day"]].reset_index(drop=True)

# Add a and b: last non-NaN value per subject and day, matched on the keys
# (a positional assignment would misalign when a day has no value for a column)
for col in ["a", "b"]:
    last = (df.dropna(subset=[col])
              .drop_duplicates(subset=["subject", "dt_day"], keep="last")
              .loc[:, ["subject", "dt_day", col]])
    d1 = d1.merge(last, on=["subject", "dt_day"], how="left")

d1 = d1.drop(columns="dt_day")

Wall time: 24 ms

For comparison, @Shubham Sharma's code: Wall time: 2.94 ms

    subject   datetime    a       b
0  patient1 2018-01-01  6.0  medium
1  patient1 2018-01-02  4.3     low
2  patient2 2018-01-01  5.0  medium
3  patient2 2018-01-03  9.0     NaN

thanks for your question :)

Lukas S On 12 October 2020 at 13:53

This should do the job:

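from datetime import datetime

def day_agg(series_):
    # Last non-NaN value of the group, or NaN if the group has none
    try:
        return series_.dropna().iloc[-1]
    except IndexError:
        return float("nan")

# Sort by time so that "last" means latest, then group by subject and calendar day
df = df.reset_index().sort_values("datetime")
df.groupby([df["subject"],df.datetime.map(lambda x:datetime(year=x.year,month=x.month,day=x.day))])\
    .agg({"a":day_agg, "b":day_agg})\
    .dropna(how="all")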
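
To make the behaviour of the aggregation helper concrete, here is a minimal sketch that calls day_agg on small standalone Series. The sample values are assumptions chosen to mirror one day of the question's example, not output from the original answer:

```python
import math
import pandas as pd

def day_agg(series_):
    # Last non-NaN value, or NaN when every value is missing
    try:
        return series_.dropna().iloc[-1]
    except IndexError:
        return float("nan")

# One day of numeric measurements: trailing NaN values are skipped
numeric_day = pd.Series([2.0, float("nan"), 6.0, float("nan")])
# One day of labels stored as objects, with missing entries
label_day = pd.Series(["high", "medium", None, None], dtype=object)
# A day with no usable value at all
empty_day = pd.Series([float("nan"), float("nan")])

print(day_agg(numeric_day))
print(day_agg(label_day))
print(day_agg(empty_day))
```

The all-NaN case is what lets the final dropna(how="all") remove days that have no value in any column.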

Shubham Sharma (Accepted Answer) On 12 October 2020 at 14:34

Floor the datetime level to daily frequency, group the dataframe by subject and the floored timestamp, aggregate with last (which returns the last non-NaN value of each column within a group), and finally drop the rows where every column is NaN:

i = pd.to_datetime(df.index.get_level_values(1)).floor('d')
df1 = df.groupby(['subject', i]).agg('last').dropna(how='all')

                       a       b
subject  datetime               
patient1 2018-01-01  6.0  medium
         2018-01-02  4.3     low
patient2 2018-01-01  5.0  medium
         2018-01-03  9.0     NaN
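
A variant that stays closer to the resample idea from the question is to bin the datetime level with pd.Grouper instead of flooring it by hand. This is a sketch under assumptions, not part of the accepted answer: the frame is rebuilt from explicit records (the name `records` is illustrative) so that the datetime level is a real datetime64 index.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the table in the question
records = [
    ("patient1", "2018-01-01 00:00:00", 2.0, "high"),
    ("patient1", "2018-01-01 01:00:00", np.nan, "medium"),
    ("patient1", "2018-01-01 02:00:00", 6.0, np.nan),
    ("patient1", "2018-01-01 03:00:00", np.nan, np.nan),
    ("patient1", "2018-01-02 00:00:00", 4.3, "low"),
    ("patient2", "2018-01-01 00:00:00", np.nan, "medium"),
    ("patient2", "2018-01-01 02:00:00", np.nan, np.nan),
    ("patient2", "2018-01-01 03:00:00", 5.0, np.nan),
    ("patient2", "2018-01-03 00:00:00", 9.0, np.nan),
    ("patient2", "2018-01-04 02:00:00", np.nan, np.nan),
]
df = pd.DataFrame(records, columns=["subject", "datetime", "a", "b"])
df["datetime"] = pd.to_datetime(df["datetime"])
df = df.set_index(["subject", "datetime"]).sort_index()

# Group by subject and by daily bins of the datetime level,
# keep the last non-NaN value per column, drop days with no value at all
daily = (
    df.groupby(["subject", pd.Grouper(level="datetime", freq="D")])
      .last()
      .dropna(how="all")
)
print(daily)
```

Because the final dropna(how="all") removes any group whose columns are all NaN, days such as 2018-01-04 for patient2 do not appear in the result.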
