I'm trying to calculate a moving average on a bitemporal dataset. Each row has a data date and an effective date (the date on which the data became available). The data for a given data date can be restated several times in the future (same data date, different effective date). I need to calculate a moving average over the past 4 quarters using only the data that was valid as of the effective date of the row being calculated.
The dataset looks like this:
| id | datadate | effdate | value |
|---|---|---|---|
| 1 | 2005-03-31 | 2005-04-15 | 10 |
| 1 | 2005-03-31 | 2005-05-30 | 11 |
| 1 | 2005-06-30 | 2005-07-15 | 9 |
| 1 | 2005-06-30 | 2005-08-20 | 9.5 |
| 1 | 2005-06-30 | 2005-10-15 | 9.6 |
| 1 | 2005-09-30 | 2005-10-15 | 10.5 |
| 1 | 2005-09-30 | 2005-11-10 | 11 |
| 1 | 2005-09-30 | 2006-02-20 | 10.75 |
| 1 | 2005-12-31 | 2006-02-13 | 12 |
| 1 | 2005-12-31 | 2006-02-20 | 11.5 |
| 1 | 2005-12-31 | 2006-05-10 | 11 |
| 1 | 2006-03-31 | 2006-04-20 | 8 |
| 1 | 2006-03-31 | 2006-05-10 | 8.25 |
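For reference, the sample frame can be built like this (a minimal sketch; `df` is the name the code further down assumes):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1] * 13,
    "datadate": pd.to_datetime(
        ["2005-03-31"] * 2 + ["2005-06-30"] * 3 + ["2005-09-30"] * 3
        + ["2005-12-31"] * 3 + ["2006-03-31"] * 2
    ),
    "effdate": pd.to_datetime([
        "2005-04-15", "2005-05-30", "2005-07-15", "2005-08-20", "2005-10-15",
        "2005-10-15", "2005-11-10", "2006-02-20", "2006-02-13", "2006-02-20",
        "2006-05-10", "2006-04-20", "2006-05-10",
    ]),
    "value": [10, 11, 9, 9.5, 9.6, 10.5, 11, 10.75, 12, 11.5, 11, 8, 8.25],
})
```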
The result should be:
| | id | datadate | effdate | value | MAvg |
|---|---|---|---|---|---|
| 0 | 1 | 2005-03-31 | 2005-04-15 | 10 | 10 |
| 1 | 1 | 2005-03-31 | 2005-05-30 | 11 | 11 |
| 2 | 1 | 2005-06-30 | 2005-07-15 | 9 | 10 |
| 3 | 1 | 2005-06-30 | 2005-08-20 | 9.5 | 10.25 |
| 4 | 1 | 2005-06-30 | 2005-10-15 | 9.6 | 10.30 |
| 5 | 1 | 2005-09-30 | 2005-10-15 | 10.5 | 10.37 |
| 6 | 1 | 2005-09-30 | 2005-11-10 | 11 | 10.53 |
| 7 | 1 | 2005-09-30 | 2006-02-20 | 10.75 | 10.45 |
| 8 | 1 | 2005-12-31 | 2006-02-13 | 12 | 10.9 |
| 9 | 1 | 2005-12-31 | 2006-02-20 | 11.5 | 10.71 |
| 10 | 1 | 2005-12-31 | 2006-05-10 | 11 | 10.59 |
| 11 | 1 | 2006-03-31 | 2006-04-20 | 8 | 9.96 |
| 12 | 1 | 2006-03-31 | 2006-05-10 | 8.25 | 9.9 |
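For example, row 9 is computed as of its effdate, 2006-02-20: the values valid on that date for the four quarters ending 2005-12-31 are 11, 9.6, 10.75 and 11.5, giving (11 + 9.6 + 10.75 + 11.5) / 4 ≈ 10.71.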
I'm doing this in Python using pandas. My approach is to join the dataframe with itself on id, restricted to the previous 4 quarters, and compute new effdates for each period from the effdates of those 4 quarters; then I join once again on id, datadate and effdate and calculate the average:
keys["id"]
calc_df = df1.merge(df2, on=keys, how='left')
calc_df = calc_df.loc[
(calc_df["datadate_x"] >= calc_df["datadate_y"])
& (calc_df["datadate_y"] >= calc_df["datadate_x"] - pd.tseries.offsets.MonthEnd(n=9))
& (calc_df["effdate_x"] <= calc_df["thrudate_y"])
& (calc_df["thrudate_x"] >= calc_df["effdate_y"])
]
calc_df = calc_df.drop_duplicates().reset_index(drop=True)
grp_keys = keys + ["datadate_x"]
calc_df["effdate"] = calc_df[["effdate_x", "effdate_y"]].max(axis=1)
calc_df = calc_df.sort_values(grp_keys + ["effdate"]).drop_duplicates(
subset=grp_keys + ["effdate"], keep="first"
)
calc_df = calc_df['id', 'datadate_x', 'effdate', 'value']
calc_df = calc_df.merge(df1, on=["id"], how="left")
calc_df = calc_df.loc[
(calc_df["datadate_x"] >= calc_df["datadate"])
& (
calc_df["datadate"]
>= calc_df["datadate_x"] - pd.tseries.offsets.MonthEnd(n=9)
)
& (calc_df["effdate_x"] <= calc_df["thrudate_y"])
& (calc_df["thrudate_x"] >= calc_df["effdate_y"])
]
calc_df["MAvg"] = calc_df.groupby(["id", "datadate_x", "effdate_x"])["value"].transform(
lambda s: s.mean(skipna=False)
)
This works, but it's very slow on the full dataset, which has around 2000 different ids with datadates from 2000 to the most recent quarter (around 500K rows), and I have to calculate the moving averages on several fields for different windows. So I want to see if there is a more efficient way of doing this.
It's unclear what would constitute an answer to this question, since you've only asked for something "better" and you've not indicated which axis you'd like this improvement to be on.
So I'll provide you something that is better in terms of readability. This might help others find improvements in other directions.
Given the csv `data.csv` shown below, you can achieve the same outcome by doing the following:
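The csv is just the question's sample data; the code is a minimal, readability-first sketch (the helper name `point_in_time_mavg` is an illustrative choice, and the row-wise loop deliberately trades speed for clarity):

```
id,datadate,effdate,value
1,2005-03-31,2005-04-15,10
1,2005-03-31,2005-05-30,11
1,2005-06-30,2005-07-15,9
1,2005-06-30,2005-08-20,9.5
1,2005-06-30,2005-10-15,9.6
1,2005-09-30,2005-10-15,10.5
1,2005-09-30,2005-11-10,11
1,2005-09-30,2006-02-20,10.75
1,2005-12-31,2006-02-13,12
1,2005-12-31,2006-02-20,11.5
1,2005-12-31,2006-05-10,11
1,2006-03-31,2006-04-20,8
1,2006-03-31,2006-05-10,8.25
```

```python
import pandas as pd

df = pd.read_csv("data.csv", parse_dates=["datadate", "effdate"])

def point_in_time_mavg(df: pd.DataFrame, quarters: int = 4) -> pd.Series:
    """Trailing mean over `quarters` quarters, as of each row's effdate."""
    result = []
    for _, row in df.iterrows():
        # First quarter of the trailing window ending at this row's datadate.
        window_start = row["datadate"] - pd.tseries.offsets.MonthEnd(3 * (quarters - 1))
        known = df[
            (df["id"] == row["id"])
            & (df["datadate"] >= window_start)
            & (df["datadate"] <= row["datadate"])
            & (df["effdate"] <= row["effdate"])  # only data known at effdate
        ]
        # The latest restatement per datadate wins; then average the window.
        latest = known.sort_values("effdate").groupby("datadate")["value"].last()
        result.append(latest.mean())
    return pd.Series(result, index=df.index)

df["MAvg"] = point_in_time_mavg(df).round(2)
print(df)
```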
This provides:
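```
    id   datadate    effdate  value   MAvg
0    1 2005-03-31 2005-04-15  10.00  10.00
1    1 2005-03-31 2005-05-30  11.00  11.00
2    1 2005-06-30 2005-07-15   9.00  10.00
3    1 2005-06-30 2005-08-20   9.50  10.25
4    1 2005-06-30 2005-10-15   9.60  10.30
5    1 2005-09-30 2005-10-15  10.50  10.37
6    1 2005-09-30 2005-11-10  11.00  10.53
7    1 2005-09-30 2006-02-20  10.75  10.45
8    1 2005-12-31 2006-02-13  12.00  10.90
9    1 2005-12-31 2006-02-20  11.50  10.71
10   1 2005-12-31 2006-05-10  11.00  10.59
11   1 2006-03-31 2006-04-20   8.00   9.96
12   1 2006-03-31 2006-05-10   8.25   9.90
```

which matches the MAvg column expected in the question (rounded to two decimal places).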