What rows are included in Polars rolling calculations when `window_size` is "2d" or `timedelta(days=2)`

78 views Asked by At

When using rolling_mean() to calculate the mean of a DataFrame column, you can set the window by either specifying "2d" | "2i" | timedelta(days=2).

If you have a DataFrame that consists of financial time series data, there are dates that are missing (holidays/weekends), there is not a clear depiction of how each variation of window will change the calculation window.

If the window_size = "2d" and there is a a missing date between two dates, the mean() calculation will only include 1 day in the mean calculation, and will not include the other day (that is not present). This should be able to be replaced with the last observation that the time series has. Furthermore, if there is one observation in the new month, and the window_size = "1m" then the calculation will just use the first value of the month in the calc, and won't include the lookback period of one month. A timedelta(days=2) argument, does not equal the same values as a "2i window size.

TL;DR

It is not clear what data is included during the calculation when window_size = timedelta(days=2) vs window_size = "2d". What is thedifference between the two, and what is the best logic to compute a rolling_mean() of close prices with a "1m" window?

If the last date is Mar 1st 2024, I would want the calculation to include data from Feb 1st 2024.

I have tried a few testing methods to see the difference and what logic i expect to occur.

Dataframe

df = pl.read_csv(b"""
date,open,high,low,close,volume,symbol
2023-01-03T00:00:00.000000000,129.39,130.01,123.32,124.22,112117500,AAPL
2023-01-04T00:00:00.000000000,126.02,127.78,124.23,125.5,89113600,AAPL
2023-01-05T00:00:00.000000000,126.26,126.9,123.91,124.17,80962700,AAPL
2023-01-06T00:00:00.000000000,125.15,129.4,124.04,128.74,87754700,AAPL
2023-01-09T00:00:00.000000000,129.58,132.5,129.0,129.26,70790800,AAPL
2023-01-10T00:00:00.000000000,129.37,130.36,127.25,129.84,63896200,AAPL
2023-01-11T00:00:00.000000000,130.35,132.6,129.57,132.58,69458900,AAPL
2023-01-12T00:00:00.000000000,132.97,133.34,130.54,132.5,71379600,AAPL
2023-01-13T00:00:00.000000000,131.13,134.0,130.76,133.84,57809700,AAPL
2023-01-17T00:00:00.000000000,133.91,136.35,133.21,135.01,63646600,AAPL
2023-01-18T00:00:00.000000000,135.89,137.66,134.11,134.29,69672800,AAPL
2023-01-19T00:00:00.000000000,133.16,135.32,132.86,134.35,58280400,AAPL
2023-01-20T00:00:00.000000000,134.36,137.08,133.3,136.93,80223600,AAPL
2023-01-23T00:00:00.000000000,137.18,142.34,136.96,140.15,81760300,AAPL
2023-01-24T00:00:00.000000000,139.35,142.18,139.34,141.56,66435100,AAPL
2023-01-25T00:00:00.000000000,139.93,141.46,137.86,140.89,65799300,AAPL
2023-01-26T00:00:00.000000000,142.19,143.27,140.93,142.98,54105100,AAPL
2023-01-27T00:00:00.000000000,142.18,146.23,142.1,144.93,70555800,AAPL
2023-01-30T00:00:00.000000000,143.97,144.56,141.87,142.02,64015300,AAPL
2023-01-31T00:00:00.000000000,141.73,143.35,141.31,143.31,65874500,AAPL
2023-02-01T00:00:00.000000000,142.99,145.61,140.36,144.44,77663600,AAPL
2023-02-02T00:00:00.000000000,147.88,150.15,147.16,149.79,118339000,AAPL
2023-02-03T00:00:00.000000000,147.02,156.31,146.82,153.45,154357300,AAPL
2023-02-06T00:00:00.000000000,151.53,152.05,149.75,150.69,69858300,AAPL
2023-02-07T00:00:00.000000000,149.61,154.17,149.61,153.59,83322600,AAPL
""".strip(), try_parse_dates=True)
import numpy as np
import datetime as dt
from openbb import obb


_column_name_returns: str = "close"


def annual_vol(data: pl.Series) -> pl.Series:
    return data.mean()


data = obb.equity.price.historical(
    "AAPL", start_date="2023-01-01", end_date="2021-02-07", provider="yfinance"
)
data = data.to_polars()

test = data.set_sorted("date").select(
    pl.col(_column_name_returns).rolling_map(
        annual_vol, window_size=21, min_periods=1
    )
)
test2 = (
    data.set_sorted("date")
    .rolling(index_column="date", period="21i")
    .agg(pl.col("close").map_elements(annual_vol))
)

test3 = data.set_sorted("date").with_columns(
    pl.col(_column_name_returns)
    .rolling_mean(window_size=dt.timedelta(days=31), by="date", min_periods=1, center=False, closed = "left")
    .alias("vol")
)

Goal:

if you have data that looks like this: Stock Data and the latest date is 2023-02-07, then a "1m" rolling statistic would calculate from 2023-01-07, or 2023-01-06 in the case of there being no data.

0

There are 0 answers