I have a time-series containing ID values. The series consists of chunks over which the series value stays constant. I want to efficiently extract, for each value, the list of time intervals over which the time-series does not change.
My time-series is represented as a polars.DataFrame with two columns, date and value, cf. the example below:
import datetime

import polars as pl

dates = pl.date_range(
    start=datetime.date.fromisoformat("2011-01-01"),
    end=datetime.date.fromisoformat("2011-01-10"),
    interval="1d",
    eager=True,
)
frame = pl.DataFrame({
    "date": dates,
    "value": ["A", "B", "A", "B", "B", "A", "A", "A", "B", "B"]
})
As a result I want a frame where, for each value, I get the list of all intervals (tuple[date, date]) over which the above series was constant, e.g. the following frame or similar:
>>> shape: (2, 2)
┌───────┬───────────────────────────────────┐
│ value ┆ intervals │
│ --- ┆ --- │
│ str ┆ list[struct[2]] │
╞═══════╪═══════════════════════════════════╡
│ A ┆ [{2011-01-01,2011-01-01}, {2011-… │
│ B ┆ [{2011-01-02,2011-01-02}, {2011-… │
└───────┴───────────────────────────────────┘
My current solution consists of the following steps:
i) Indicate the (inclusive) start and end of a chunk as two additional boolean columns
def is_start(expr: pl.Expr) -> pl.Expr:
    # a row starts a chunk when its value differs from the previous row's
    return (expr.shift(1) != expr).fill_null(True).alias("is_start")

def is_end(expr: pl.Expr) -> pl.Expr:
    # a row ends a chunk when its value differs from the next row's
    return (expr.shift(-1) != expr).fill_null(True).alias("is_end")

x = frame.with_columns(pl.col("value").pipe(is_start), pl.col("value").pipe(is_end))
>>> shape: (10, 4)
┌────────────┬───────┬──────────┬────────┐
│ date ┆ value ┆ is_start ┆ is_end │
│ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ str ┆ bool ┆ bool │
╞════════════╪═══════╪══════════╪════════╡
│ 2011-01-01 ┆ A ┆ true ┆ true │
│ 2011-01-02 ┆ B ┆ true ┆ true │
│ 2011-01-03 ┆ A ┆ true ┆ true │
│ 2011-01-04 ┆ B ┆ true ┆ false │
│ … ┆ … ┆ … ┆ … │
│ 2011-01-07 ┆ A ┆ false ┆ false │
│ 2011-01-08 ┆ A ┆ false ┆ true │
│ 2011-01-09 ┆ B ┆ true ┆ false │
│ 2011-01-10 ┆ B ┆ false ┆ true │
└────────────┴───────┴──────────┴────────┘
ii) Group by value and assemble the start and end dates into lists
y = x.group_by("value").agg(
    pl.col("date").filter(pl.col("is_start")).alias("start_date"),
    pl.col("date").filter(pl.col("is_end")).alias("end_date")
)
>>> shape: (2, 3)
┌───────┬───────────────────────────────────┬───────────────────────────────────┐
│ value ┆ start_date ┆ end_date │
│ --- ┆ --- ┆ --- │
│ str ┆ list[date] ┆ list[date] │
╞═══════╪═══════════════════════════════════╪═══════════════════════════════════╡
│ A ┆ [2011-01-01, 2011-01-03, 2011-01… ┆ [2011-01-01, 2011-01-03, 2011-01… │
│ B ┆ [2011-01-02, 2011-01-04, 2011-01… ┆ [2011-01-02, 2011-01-05, 2011-01… │
└───────┴───────────────────────────────────┴───────────────────────────────────┘
iii) Explode the lists into rows
z = y.explode("start_date", "end_date")
>>> shape: (6, 3)
┌───────┬────────────┬────────────┐
│ value ┆ start_date ┆ end_date │
│ --- ┆ --- ┆ --- │
│ str ┆ date ┆ date │
╞═══════╪════════════╪════════════╡
│ A ┆ 2011-01-01 ┆ 2011-01-01 │
│ A ┆ 2011-01-03 ┆ 2011-01-03 │
│ A ┆ 2011-01-06 ┆ 2011-01-08 │
│ B ┆ 2011-01-02 ┆ 2011-01-02 │
│ B ┆ 2011-01-04 ┆ 2011-01-05 │
│ B ┆ 2011-01-09 ┆ 2011-01-10 │
└───────┴────────────┴────────────┘
iv) Group by "value" (again) and aggregate into a list of intervals (structs)
u = z.group_by("value").agg(
    pl.struct(pl.col("start_date"), pl.col("end_date")).alias("intervals")
)
>>> shape: (2, 2)
┌───────┬───────────────────────────────────┐
│ value ┆ intervals │
│ --- ┆ --- │
│ str ┆ list[struct[2]] │
╞═══════╪═══════════════════════════════════╡
│ A ┆ [{2011-01-01,2011-01-01}, {2011-… │
│ B ┆ [{2011-01-02,2011-01-02}, {2011-… │
└───────┴───────────────────────────────────┘
I am wondering if there is a more concise (and efficient) way to express this transformation using polars. Especially the fact that I have to group_by "value" twice, and that y already contains all start and end dates, albeit not as 'interval' tuples, makes me wonder if this could be improved.
In the end I will need to apply this transformation not to one but to many (long) time-series, where I know a priori that the number of 'change points' in each series is rather small.
I'd be glad for any suggestions.
I'm not sure how to avoid the double group_by (I don't think you can), but this avoids the filters and explodes. The trick is to first create a helper column which represents each continuous sequence of value. You used to have to do that manually, but now there's .rle_id() to do it for you. Once you do that, you can use the built-in .first() and .last() to construct the intervals struct. You then need a second group_by to put those intervals into a list by value.
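A minimal sketch of those steps, reusing frame from the question (the maintain_order=True arguments and the "chunk" alias are my own choices, there only to avoid a key-name clash and keep the printed row order deterministic):

result = (
    frame
    # .rle_id() labels each run of consecutive equal values with the same integer
    .group_by(pl.col("value").rle_id().alias("chunk"), "value", maintain_order=True)
    # within a run, the first and last dates are the inclusive interval bounds
    .agg(
        pl.col("date").first().alias("start_date"),
        pl.col("date").last().alias("end_date"),
    )
    # the second group_by gathers each value's intervals into a list of structs
    .group_by("value", maintain_order=True)
    .agg(pl.struct("start_date", "end_date").alias("intervals"))
)

This should yield the same shape (2, 2) frame as step iv) above, without the intermediate is_start/is_end columns.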
Incidentally, if you want to print a wider dataframe that isn't abbreviated, you can set pl.Config.set_fmt_str_lengths() to a high number.
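For example (100 is an arbitrary choice):

pl.Config.set_fmt_str_lengths(100)  # allow up to 100 characters per value when printing
print(result)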