Unstack columns by groups in parallel with polars

200 views Asked by At

Hy!

I need to unstack some columns in polars, but i want to group the columns by column 'a'

seq_ = df['a'].unique()
l = len(seq_)
dfs = []

for idx, a in enumerate(seq_):
    df_tmp = df.filter(pl.col('a') == a)
    steps = df_tmp.shape[0]
    df_tmp = df_tmp.unstack(step=steps, how="horizontal",columns=['b', 'c'])
    df_tmp = df_tmp.with_columns(pl.lit(a).alias('a'))
    dfs.append(df_tmp)

df_tmp = pl.concat(dfs, how='diagonal')

this code is too much slower. Is there any way to do this in polars more faster?

from this:

┌─────┬───────┬──────┐
│ a   ┆  b    ┆ c    │
│ --- ┆ ---   ┆ ---  │
│ i64 ┆ str   ┆ str  │
╞═════╪═══════╪══════╡
│ 1   ┆ a     ┆ f    │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ b     ┆ g    │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1   ┆ c     ┆ h    │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1   ┆ d     ┆ i    │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ e     ┆ j    │
└─────┴───────┴──────┘

to this:

┌─────┬───────┬──────┬─────┬──────┬──────┬──────┐
│ a   ┆  b_0  ┆  b_1 ┆ b_2 ┆ c_0  ┆ c_1  ┆ c_2  │
│ --- ┆ ---   ┆ ---  ┆ --- ┆ ---  ┆ ---  ┆ ---  │ 
│ i64 ┆ str   ┆ str  ┆ str ┆ str  ┆ str  ┆ str  │
╞═════╪═══════╪══════╪═════╪══════╪══════╪══════╡
│ 1   ┆ a     ┆ c    ┆ d   ┆ f    ┆ h    ┆  i   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ b     ┆ e    ┆null ┆ g    ┆ j    ┆ null │
└─────┴───────┴──────┴─────┴──────┴──────┴──────┘
2

There are 2 answers

2
Wayoshi On BEST ANSWER
  • A group_by with all aggregation will transform the b and c columns into lists.
  • From there, With list.to_struct(), you can name each element with a custom field name, and it will be padded out with nulls at this time per row. (The reason there's df = ... steps below instead of all in one step is due to the way fields works when passing in a late-binding lambda, run into duplicate field issues)
  • unnest makes each field of a struct column its own column.
df = df.group_by('a', maintain_order=True).all()

for col in 'bc':
    df = df.with_columns(
        pl.col(col).list.to_struct(
            n_field_strategy='max_width', fields=lambda i: f'{col}_{i}'
        )
    ).unnest(col)
shape: (2, 7)
┌─────┬─────┬─────┬──────┬─────┬─────┬──────┐
│ a   ┆ b_0 ┆ b_1 ┆ b_2  ┆ c_0 ┆ c_1 ┆ c_2  │
│ --- ┆ --- ┆ --- ┆ ---  ┆ --- ┆ --- ┆ ---  │
│ i64 ┆ str ┆ str ┆ str  ┆ str ┆ str ┆ str  │
╞═════╪═════╪═════╪══════╪═════╪═════╪══════╡
│ 1   ┆ a   ┆ c   ┆ d    ┆ f   ┆ h   ┆ i    │
│ 2   ┆ b   ┆ e   ┆ null ┆ g   ┆ j   ┆ null │
└─────┴─────┴─────┴──────┴─────┴─────┴──────┘

There might be an alternate, cleaner solution with pivot, but at a glance I don't think this quite fits the mold of a pivot.

0
Dean MacGregor On

An alternative approach using melt/pivot could look like this:

(
    df
    .with_columns(
        i=pl.int_range(0,pl.count()).over('a')
    )
    .melt(['a','i'])
    .with_columns(variable=pl.col('variable')+pl.lit("_")+pl.col('i').cast(pl.Utf8))
    .pivot('value', 'a','variable','first')
)
shape: (2, 7)
┌─────┬─────┬─────┬──────┬─────┬─────┬──────┐
│ a   ┆ b_0 ┆ b_1 ┆ b_2  ┆ c_0 ┆ c_1 ┆ c_2  │
│ --- ┆ --- ┆ --- ┆ ---  ┆ --- ┆ --- ┆ ---  │
│ i64 ┆ str ┆ str ┆ str  ┆ str ┆ str ┆ str  │
╞═════╪═════╪═════╪══════╪═════╪═════╪══════╡
│ 1   ┆ a   ┆ c   ┆ d    ┆ f   ┆ h   ┆ i    │
│ 2   ┆ b   ┆ e   ┆ null ┆ g   ┆ j   ┆ null │
└─────┴─────┴─────┴──────┴─────┴─────┴──────┘