Polars: Expand dataframe so that each id vars have the same num of rows

391 views Asked by At

I have a dataframe that has id and week. I want to expand the dataframe so that each id have the same number of rows or four weeks.

import pandas as pd

data = {
    'id': ['a', 'a', 'b', 'c', 'c', 'c'],
    'week': ['1', '2', '3', '4', '3', '1'],
    'num1': [1, 3, 5, 4, 3, 6],
    'num2': [4, 5, 3, 4, 6, 6]
}
df = pd.DataFrame(data)
  id week  num1  num2
0  a    1     1     4
1  a    2     3     5
2  b    3     5     3
3  c    4     4     4
4  c    3     3     6
5  c    1     6     6

In pandas, I can just do:

df = (
    df.set_index(['id', 'week'])
      .unstack().stack(dropna=False)
      .reset_index()
)
   id week  num1  num2
0   a    1   1.0   4.0
1   a    2   3.0   5.0
2   a    3   NaN   NaN
3   a    4   NaN   NaN
4   b    1   NaN   NaN
5   b    2   NaN   NaN
6   b    3   5.0   3.0
7   b    4   NaN   NaN
8   c    1   6.0   6.0
9   c    2   NaN   NaN
10  c    3   3.0   6.0
11  c    4   4.0   4.0

How do you do this with polars?

2

There are 2 answers

7
jqurious On BEST ANSWER

It kind of looks like an .upsample but week is a str in this case which wont work.

It also looks like a cross join of unique id -> week values joined with the original df.

(df.select("id").unique()
   .join(df.select("week").unique(), how="cross")
   .join(
      df,
      on=["id", "week"],
      how="left"
   )
)   
shape: (12, 4)
┌─────┬──────┬──────┬──────┐
│ id  ┆ week ┆ num1 ┆ num2 │
│ --- ┆ ---  ┆ ---  ┆ ---  │
│ str ┆ str  ┆ i64  ┆ i64  │
╞═════╪══════╪══════╪══════╡
│ a   ┆ 1    ┆ 1    ┆ 4    │
│ a   ┆ 2    ┆ 3    ┆ 5    │
│ a   ┆ 3    ┆ null ┆ null │
│ a   ┆ 4    ┆ null ┆ null │
│ …   ┆ …    ┆ …    ┆ …    │
│ c   ┆ 1    ┆ 6    ┆ 6    │
│ c   ┆ 2    ┆ null ┆ null │
│ c   ┆ 3    ┆ 3    ┆ 6    │
│ c   ┆ 4    ┆ 4    ┆ 4    │
└─────┴──────┴──────┴──────┘

There's also .to_dummies() which could potentially be of use depending on what it is you're actually doing.

We make a copy of the original column first:

df.with_columns(_week = pl.col("week")).to_dummies("week")
shape: (6, 8)
┌─────┬────────┬────────┬────────┬────────┬──────┬──────┬───────┐
│ id  ┆ week_1 ┆ week_2 ┆ week_3 ┆ week_4 ┆ num1 ┆ num2 ┆ _week │
│ --- ┆ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---  ┆ ---  ┆ ---   │
│ str ┆ u8     ┆ u8     ┆ u8     ┆ u8     ┆ i64  ┆ i64  ┆ str   │
╞═════╪════════╪════════╪════════╪════════╪══════╪══════╪═══════╡
│ a   ┆ 1      ┆ 0      ┆ 0      ┆ 0      ┆ 1    ┆ 4    ┆ 1     │
│ a   ┆ 0      ┆ 1      ┆ 0      ┆ 0      ┆ 3    ┆ 5    ┆ 2     │
│ b   ┆ 0      ┆ 0      ┆ 1      ┆ 0      ┆ 5    ┆ 3    ┆ 3     │
│ c   ┆ 0      ┆ 0      ┆ 0      ┆ 1      ┆ 4    ┆ 4    ┆ 4     │
│ c   ┆ 0      ┆ 0      ┆ 1      ┆ 0      ┆ 3    ┆ 6    ┆ 3     │
│ c   ┆ 1      ┆ 0      ┆ 0      ┆ 0      ┆ 6    ┆ 6    ┆ 1     │
└─────┴────────┴────────┴────────┴────────┴──────┴──────┴───────┘

You can then do things like:

(df.with_columns(_week = pl.col("week"))
   .to_dummies("week")
   .with_columns(
      pl.when(pl.col("^week_\d+$") == 0)
        .then(pl.col("^week_\d+$"))
        .fill_null(pl.col("_week"))
   )
)
shape: (6, 8)
┌─────┬────────┬────────┬────────┬────────┬──────┬──────┬───────┐
│ id  ┆ week_1 ┆ week_2 ┆ week_3 ┆ week_4 ┆ num1 ┆ num2 ┆ _week │
│ --- ┆ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---  ┆ ---  ┆ ---   │
│ str ┆ str    ┆ str    ┆ str    ┆ str    ┆ i64  ┆ i64  ┆ str   │
╞═════╪════════╪════════╪════════╪════════╪══════╪══════╪═══════╡
│ a   ┆ 1      ┆ 0      ┆ 0      ┆ 0      ┆ 1    ┆ 4    ┆ 1     │
│ a   ┆ 0      ┆ 2      ┆ 0      ┆ 0      ┆ 3    ┆ 5    ┆ 2     │
│ b   ┆ 0      ┆ 0      ┆ 3      ┆ 0      ┆ 5    ┆ 3    ┆ 3     │
│ c   ┆ 0      ┆ 0      ┆ 0      ┆ 4      ┆ 4    ┆ 4    ┆ 4     │
│ c   ┆ 0      ┆ 0      ┆ 3      ┆ 0      ┆ 3    ┆ 6    ┆ 3     │
│ c   ┆ 1      ┆ 0      ┆ 0      ┆ 0      ┆ 6    ┆ 6    ┆ 1     │
└─────┴────────┴────────┴────────┴────────┴──────┴──────┴───────┘
0
mishpat On

There is a panda-ish polars equivalent of your method, which extends my suggestion here to allow for multiple columns by melting as a first step. The basic structure is melt->pivot->melt (the last two steps being the approximate equivalents of unstack/stack in pandas.)

df.melt(id_vars=["id", "week"]).pivot(values="value", index="id", aggregate_function="first", columns="week").melt(id_vars="id", variable_name="week", value_name="value").sort(["id", "week"])

As usual with polars, the lack of an index means it is more explicit but also wordier than the equivalent pandas code. The order is also slightly different so I added a sort to get the exact row ordering you wanted.

┌─────┬──────┬───────┐
│ id  ┆ week ┆ value │
│ --- ┆ ---  ┆ ---   │
│ str ┆ str  ┆ i64   │
╞═════╪══════╪═══════╡
│ a   ┆ 1    ┆ 1     │
│ a   ┆ 2    ┆ 3     │
│ a   ┆ 3    ┆ null  │
│ a   ┆ 4    ┆ null  │
│ …   ┆ …    ┆ …     │
│ c   ┆ 1    ┆ 6     │
│ c   ┆ 2    ┆ null  │
│ c   ┆ 3    ┆ 3     │
│ c   ┆ 4    ┆ 4     │
└─────┴──────┴───────┘