Explode a polars DataFrame column without duplicating other column values

Question

Explode a polars DataFrame column without duplicating other column values

90 views Asked by Igor Lazarev At 11 March 2024 at 20:47

As a minimum example, let's say we have next polars.DataFrame:

df = pl.DataFrame({"sub_id": [1,2,3], "engagement": ["one:one,two:two", "one:two,two:one", "one:one"], "total_duration": [123, 456, 789]})

sub_id	engagement	total_duration
1	one:one,two:two	123
2	one:two,two:one	456
3	one:one	789

then, we explode "engagement" column

df = df.with_columns(pl.col("engagement").str.split(",")).explode("engagement")

and receive:

sub_id	engagement	total_duration
1	one:one	123
1	two:two	123
2	one:two	456
2	two:one	456
3	one:one	789

For visualization I use Plotly, and code would be following:

import plotly.express as px
fig = px.bar(df, x="sub_id", y="total_duration", color="engagement")
fig.show()

Resulting plot:

Resulting plot

Now it basically means that subscribers 1 and 2 have their total_duration (total watched time) doubled. How could I remain total_duration per sub, but leaving engagement groups as shown on the plot legend?

Original Q&A

There are 2 answers

Roman Pekar On 12 March 2024 at 08:26

Assuming you want to have your total_duration to be split equally between engagement within sub_id, you could adjust it before exploding:

(
    df.with_columns(pl.col('engagement').str.split(','))
    .with_columns(
        pl.col('total_duration') / pl.col('engagement').list.len(),
    ).explode('engagement')
)

┌────────┬────────────┬────────────────┐
│ sub_id ┆ engagement ┆ total_duration │
│ ---    ┆ ---        ┆ ---            │
│ i64    ┆ str        ┆ f64            │
╞════════╪════════════╪════════════════╡
│ 1      ┆ one:one    ┆ 61.5           │
│ 1      ┆ two:two    ┆ 61.5           │
│ 2      ┆ one:two    ┆ 228.0          │
│ 2      ┆ two:one    ┆ 228.0          │
│ 3      ┆ one:one    ┆ 789.0          │
└────────┴────────────┴────────────────┘

**Hericks** · Accepted Answer · 2024-03-11T22:05:09+00:00

An option to handle this in polars would be to split total_duration equally between engagement rows within sub_id. For this, we simply divide total_duration by the number of rows of the given sub_id.

(
    df
    .with_columns(
        pl.col("engagement").str.split(",")
    )
    .explode("engagement")
    .with_columns(
        pl.col("total_duration") / pl.len().over("sub_id")
    )
)

shape: (5, 3)
┌────────┬────────────┬────────────────┐
│ sub_id ┆ engagement ┆ total_duration │
│ ---    ┆ ---        ┆ ---            │
│ i64    ┆ str        ┆ f64            │
╞════════╪════════════╪════════════════╡
│ 1      ┆ one:one    ┆ 61.5           │
│ 1      ┆ two:two    ┆ 61.5           │
│ 2      ┆ one:two    ┆ 228.0          │
│ 2      ┆ two:one    ┆ 228.0          │
│ 3      ┆ one:one    ┆ 789.0          │
└────────┴────────────┴────────────────┘

TechQA.

Explode a polars DataFrame column without duplicating other column values

There are 2 answers

Related Questions in PYTHON

Related Questions in DATAFRAME

Related Questions in PLOTLY

Related Questions in PLOTLY-PYTHON

Related Questions in PYTHON-POLARS

Popular Questions

Trending Questions