Aggregate one df column into list of structs of first/last pairs by an ID value

I have a time-series containing ID values. The series consists of chunks over which the value stays constant. For each value, I want to efficiently extract the corresponding list of time-intervals over which the time-series does not change.

My time-series is represented as a polars.DataFrame with two columns, date and value, cf. the example below

import datetime

import polars as pl


dates = pl.date_range(
    start=datetime.date.fromisoformat("2011-01-01"), 
    end=datetime.date.fromisoformat("2011-01-10"),
    interval="1d",
    eager=True
)

frame = pl.DataFrame({
    "date": dates,
    "value": ["A", "B", "A", "B", "B", "A", "A", "A", "B", "B"]
})

As a result, I want a frame where, for each value, I get the list of all intervals (tuple[date, date]) over which the above series was constant, e.g. the following frame or similar:

>>> shape: (2, 2)
┌───────┬───────────────────────────────────┐
│ value ┆ intervals                         │
│ ---   ┆ ---                               │
│ str   ┆ list[struct[2]]                   │
╞═══════╪═══════════════════════════════════╡
│ A     ┆ [{2011-01-01,2011-01-01}, {2011-… │
│ B     ┆ [{2011-01-02,2011-01-02}, {2011-… │
└───────┴───────────────────────────────────┘

My current solution consists of the following steps:

i) Indicate the (inclusive) start and end of a chunk as two additional boolean columns

def is_start(expr: pl.Expr) -> pl.Expr:    
    return (expr.shift(1) != expr).fill_null(True).alias("is_start")


def is_end(expr: pl.Expr) -> pl.Expr:
    return (expr.shift(-1) != expr).fill_null(True).alias("is_end")


x = frame.with_columns(pl.col("value").pipe(is_start), pl.col("value").pipe(is_end))
>>> shape: (10, 4)
┌────────────┬───────┬──────────┬────────┐
│ date       ┆ value ┆ is_start ┆ is_end │
│ ---        ┆ ---   ┆ ---      ┆ ---    │
│ date       ┆ str   ┆ bool     ┆ bool   │
╞════════════╪═══════╪══════════╪════════╡
│ 2011-01-01 ┆ A     ┆ true     ┆ true   │
│ 2011-01-02 ┆ B     ┆ true     ┆ true   │
│ 2011-01-03 ┆ A     ┆ true     ┆ true   │
│ 2011-01-04 ┆ B     ┆ true     ┆ false  │
│ …          ┆ …     ┆ …        ┆ …      │
│ 2011-01-07 ┆ A     ┆ false    ┆ false  │
│ 2011-01-08 ┆ A     ┆ false    ┆ true   │
│ 2011-01-09 ┆ B     ┆ true     ┆ false  │
│ 2011-01-10 ┆ B     ┆ false    ┆ true   │
└────────────┴───────┴──────────┴────────┘

ii) Group by value and assemble the start and end dates into lists

y = x.group_by("value").agg(
    pl.col("date").filter(pl.col("is_start")).alias("start_date"),
    pl.col("date").filter(pl.col("is_end")).alias("end_date")
)
>>> shape: (2, 3)
┌───────┬───────────────────────────────────┬───────────────────────────────────┐
│ value ┆ start_date                        ┆ end_date                          │
│ ---   ┆ ---                               ┆ ---                               │
│ str   ┆ list[date]                        ┆ list[date]                        │
╞═══════╪═══════════════════════════════════╪═══════════════════════════════════╡
│ A     ┆ [2011-01-01, 2011-01-03, 2011-01… ┆ [2011-01-01, 2011-01-03, 2011-01… │
│ B     ┆ [2011-01-02, 2011-01-04, 2011-01… ┆ [2011-01-02, 2011-01-05, 2011-01… │
└───────┴───────────────────────────────────┴───────────────────────────────────┘


iii) Explode the lists into rows

z = y.explode("start_date", "end_date")
>>> shape: (6, 3)
┌───────┬────────────┬────────────┐
│ value ┆ start_date ┆ end_date   │
│ ---   ┆ ---        ┆ ---        │
│ str   ┆ date       ┆ date       │
╞═══════╪════════════╪════════════╡
│ A     ┆ 2011-01-01 ┆ 2011-01-01 │
│ A     ┆ 2011-01-03 ┆ 2011-01-03 │
│ A     ┆ 2011-01-06 ┆ 2011-01-08 │
│ B     ┆ 2011-01-02 ┆ 2011-01-02 │
│ B     ┆ 2011-01-04 ┆ 2011-01-05 │
│ B     ┆ 2011-01-09 ┆ 2011-01-10 │
└───────┴────────────┴────────────┘

iv) Group by "value" (again) and aggregate into lists of intervals (structs)

u = z.group_by("value").agg(
    pl.struct(pl.col("start_date"), pl.col("end_date")).alias("intervals")
)
>>> shape: (2, 2)
┌───────┬───────────────────────────────────┐
│ value ┆ intervals                         │
│ ---   ┆ ---                               │
│ str   ┆ list[struct[2]]                   │
╞═══════╪═══════════════════════════════════╡
│ A     ┆ [{2011-01-01,2011-01-01}, {2011-… │
│ B     ┆ [{2011-01-02,2011-01-02}, {2011-… │
└───────┴───────────────────────────────────┘


I am wondering if there is a more concise (and efficient) way to express this transformation using polars. In particular, the fact that I have to group by "value" twice, and that y already contains all start and end dates (albeit not as 'interval' tuples), makes me wonder whether this could be improved.

In the end I will need to apply this transformation not to one but to many (long) time-series, where I know a priori that the number of 'change points' in each series is rather small.
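
For reference, here is a minimal sketch of how the same steps could be keyed by a series identifier, assuming the many series are stacked in one frame with a hypothetical series_id column; the shift-based markers then need .over() so that chunks do not bleed across series, and the group-bys gain the extra key.

# "series_id" is a hypothetical column identifying each stacked time-series;
# pl.lit(0) is just a placeholder standing in for real IDs.
stacked = frame.with_columns(series_id=pl.lit(0))

x = stacked.with_columns(
    pl.col("value").pipe(is_start).over("series_id"),
    pl.col("value").pipe(is_end).over("series_id"),
)
y = x.group_by("series_id", "value").agg(
    pl.col("date").filter(pl.col("is_start")).alias("start_date"),
    pl.col("date").filter(pl.col("is_end")).alias("end_date"),
)
# steps iii) and iv) proceed as before, with "series_id" added to the group_by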

I'd be glad for any suggestions.

There are 2 answers

Dean MacGregor

I'm not sure how to avoid the double group_by (I don't think you can), but this avoids the filters and explodes. The trick is to first create a helper column which represents each continuous run of value. You used to have to do that manually, but now there's .rle_id() to do it for you. Once you have that, you can use the built-in .first() and .last() to construct the intervals struct. You then need a second group_by to put those intervals into a list by value.
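
To see what the helper column looks like, here is a quick illustration on the example frame from the question (the IDs shown assume that data):

# each run of consecutive equal values gets its own integer ID:
# A B A B B A A A B B  ->  0 1 2 3 3 4 4 4 5 5
frame.with_columns(pl.col('value').rle_id().alias('i'))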

(
    frame
    .group_by(
        pl.col('value').rle_id().alias('i'),
        maintain_order=True,
    )
    .agg(
        value=pl.col('value').first(),
        intervals=pl.struct(
            start_date=pl.col('date').first(),
            end_date=pl.col('date').last(),
        ),
    )
    .group_by('value', maintain_order=True)
    .agg(pl.col('intervals'))
)
shape: (2, 2)
┌───────┬─────────────────────────────────────────────────────────────────────────────┐
│ value ┆ intervals                                                                   │
│ ---   ┆ ---                                                                         │
│ str   ┆ list[struct[2]]                                                             │
╞═══════╪═════════════════════════════════════════════════════════════════════════════╡
│ A     ┆ [{2011-01-01,2011-01-01}, {2011-01-03,2011-01-03}, {2011-01-06,2011-01-08}] │
│ B     ┆ [{2011-01-02,2011-01-02}, {2011-01-04,2011-01-05}, {2011-01-09,2011-01-10}] │
└───────┴─────────────────────────────────────────────────────────────────────────────┘

Incidentally, if you want to print a wider DataFrame that isn't abbreviated, you can pass a high number to pl.Config.set_fmt_str_lengths().
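
For example (100 is an arbitrary choice):

import polars as pl

# allow up to 100 characters per string cell before the output is truncated
pl.Config.set_fmt_str_lengths(100)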

jqurious

Just with regards to your current approach:

Creating booleans and using Expr.filter() seems like a "manual implementation" of .when().then():

def find_date(periods=1):
    return (
        pl.when(pl.col("value").ne_missing(pl.col("value").shift(periods)))
          .then(pl.col("date"))
    )

(frame.with_columns(start_date=find_date().forward_fill(), end_date=find_date(-1))
   .drop_nulls()
   .group_by("value")
   .agg(intervals=pl.struct("start_date", "end_date"))
)
shape: (2, 2)
┌───────┬─────────────────────────────────────────────────────────────────────────────┐
│ value ┆ intervals                                                                   │
│ ---   ┆ ---                                                                         │
│ str   ┆ list[struct[2]]                                                             │
╞═══════╪═════════════════════════════════════════════════════════════════════════════╡
│ A     ┆ [{2011-01-01,2011-01-01}, {2011-01-03,2011-01-03}, {2011-01-06,2011-01-08}] │
│ B     ┆ [{2011-01-02,2011-01-02}, {2011-01-04,2011-01-05}, {2011-01-09,2011-01-10}] │
└───────┴─────────────────────────────────────────────────────────────────────────────┘
  • .ne_missing() is used instead of != to handle the first/last row nulls that .shift() produces.
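
A quick illustration of the difference on a toy frame (not from the question):

import polars as pl

tiny = pl.DataFrame({"value": ["A", "A", "B"]})
tiny.select(
    neq=pl.col("value") != pl.col("value").shift(1),                   # [null, false, true]
    neq_missing=pl.col("value").ne_missing(pl.col("value").shift(1)),  # [true, false, true]
)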