How can I ungroup a polars dataframe in python?

283 views Asked by At

I have a polars dataframe that has a particular column with repeating patterns. I have grouped them by the patterns & adding a new column to this grouped dataframe. But now I have to unpack/ungroup this dataframe. How can I do it in polars?

My original dataframe looks like this:

file col1 col2
A cell 1 cell 2
B cell 3 cell 4
A cell 5 cell 6
B cell 7 cell 8

I performed groupby to group the dataframe by FILE & then I added my desired new column & I got below output.

file col1 col2 folder
A [cell 1, cell 5] [cell 2, cell 6] [file1, file2]
B [cell 3, cell 7] [cell 4, cell 8] [file1, file2]

Now I want to ungroup the above dataframe into original format while also including this new column. How can I do it? My actual dataframe is huge & has a lot of rows & columns, using iterations is not effective & quite slow. Is there any function that can be applied to the whole dataframe instead of iterating by columns?

Final desired output:

file header 1 header 2 folder
A cell 1 cell 2 file1
B cell 3 cell 4 file1
A cell 5 cell 6 file2
B cell 7 cell 8 file2

I have done the following:

dfg = df.groupby('FILE').agg(pl.all())             #to group them first time 
newdf =  dfg.with_columns(pl.repeat([file1,file2,file3], dfg.height)    #adding desired column

In what efficient ways can I get the desired output? Note that my dataframe is quite large, so using iterations by column is time consuming.

PS - Updated typo in the final table format. In the column "file" as the entries get repeated after few rows, they should be assigned a new "folder" name.

2

There are 2 answers

1
Wayoshi On

You can explode:

dfg.explode(pl.exclude('file'))

Your problem overall might be best solved by a join or some type of over expression, though:

df = pl.DataFrame(
    {
        'file': ['A', 'B'] * 2,
        'col1': [f'cell {i}' for i in range(1, 9, 2)],
        'col2': [f'cell {i}' for i in range(2, 9, 2)],
    }
)
df2 = pl.DataFrame({'file': ['A', 'B'], 'folder': ['file1', 'file2']})

df.join(df2, on='file')
shape: (4, 4)
┌──────┬────────┬────────┬────────┐
│ file ┆ col1   ┆ col2   ┆ folder │
│ ---  ┆ ---    ┆ ---    ┆ ---    │
│ str  ┆ str    ┆ str    ┆ str    │
╞══════╪════════╪════════╪════════╡
│ A    ┆ cell 1 ┆ cell 2 ┆ file1  │
│ B    ┆ cell 3 ┆ cell 4 ┆ file2  │
│ A    ┆ cell 5 ┆ cell 6 ┆ file1  │
│ B    ┆ cell 7 ┆ cell 8 ┆ file2  │
└──────┴────────┴────────┴────────┘
1
jqurious On

It looks like you're trying to "enumerate" each group.

You can use .cum_count() for that.

df = pl.from_repr("""
┌──────┬─────────┬──────────┐
│ file ┆ col1    ┆ col2     │
│ ---  ┆ ---     ┆ ---      │
│ str  ┆ str     ┆ str      │
╞══════╪═════════╪══════════╡
│ A    ┆ cell 1  ┆ cell 2   │
│ B    ┆ cell 3  ┆ cell 4   │
│ A    ┆ cell 5  ┆ cell 6   │
│ B    ┆ cell 7  ┆ cell 8   │
│ A    ┆ cell 9  ┆ cell 10  │
│ B    ┆ cell 11 ┆ cell 12  │
│ A    ┆ cell 13 ┆ cell 14  │
│ B    ┆ cell 15 ┆ cell 16  │
│ A    ┆ cell 17 ┆ cell 18  │
│ B    ┆ cell 19 ┆ cell 20  │
└──────┴─────────┴──────────┘
""")
df.with_columns(folder = 
   pl.col("file").cum_count().over("file")
)
shape: (10, 4)
┌──────┬─────────┬─────────┬────────┐
│ file ┆ col1    ┆ col2    ┆ folder │
│ ---  ┆ ---     ┆ ---     ┆ ---    │
│ str  ┆ str     ┆ str     ┆ u32    │
╞══════╪═════════╪═════════╪════════╡
│ A    ┆ cell 1  ┆ cell 2  ┆ 0      │
│ B    ┆ cell 3  ┆ cell 4  ┆ 0      │
│ A    ┆ cell 5  ┆ cell 6  ┆ 1      │
│ B    ┆ cell 7  ┆ cell 8  ┆ 1      │
│ …    ┆ …       ┆ …       ┆ …      │
│ A    ┆ cell 13 ┆ cell 14 ┆ 3      │
│ B    ┆ cell 15 ┆ cell 16 ┆ 3      │
│ A    ┆ cell 17 ┆ cell 18 ┆ 4      │
│ B    ┆ cell 19 ┆ cell 20 ┆ 4      │
└──────┴─────────┴─────────┴────────┘

You can turn it into a "repeating sequence" using modulo arithmetic.

df.with_columns(folder = 
   pl.col("file").cum_count().over("file").mod(3)
)
shape: (10, 4)
┌──────┬─────────┬─────────┬────────┐
│ file ┆ col1    ┆ col2    ┆ folder │
│ ---  ┆ ---     ┆ ---     ┆ ---    │
│ str  ┆ str     ┆ str     ┆ u32    │
╞══════╪═════════╪═════════╪════════╡
│ A    ┆ cell 1  ┆ cell 2  ┆ 0      │
│ B    ┆ cell 3  ┆ cell 4  ┆ 0      │
│ A    ┆ cell 5  ┆ cell 6  ┆ 1      │
│ B    ┆ cell 7  ┆ cell 8  ┆ 1      │
│ …    ┆ …       ┆ …       ┆ …      │
│ A    ┆ cell 13 ┆ cell 14 ┆ 0      │
│ B    ┆ cell 15 ┆ cell 16 ┆ 0      │
│ A    ┆ cell 17 ┆ cell 18 ┆ 1      │
│ B    ┆ cell 19 ┆ cell 20 ┆ 1      │
└──────┴─────────┴─────────┴────────┘

You can then .format() the string.

df.with_columns(folder = 
   pl.format("file{}", pl.col("file").cum_count().over("file").mod(3) + 1)
)
shape: (10, 4)
┌──────┬─────────┬─────────┬────────┐
│ file ┆ col1    ┆ col2    ┆ folder │
│ ---  ┆ ---     ┆ ---     ┆ ---    │
│ str  ┆ str     ┆ str     ┆ str    │
╞══════╪═════════╪═════════╪════════╡
│ A    ┆ cell 1  ┆ cell 2  ┆ file1  │
│ B    ┆ cell 3  ┆ cell 4  ┆ file1  │
│ A    ┆ cell 5  ┆ cell 6  ┆ file2  │
│ B    ┆ cell 7  ┆ cell 8  ┆ file2  │
│ …    ┆ …       ┆ …       ┆ …      │
│ A    ┆ cell 13 ┆ cell 14 ┆ file1  │
│ B    ┆ cell 15 ┆ cell 16 ┆ file1  │
│ A    ┆ cell 17 ┆ cell 18 ┆ file2  │
│ B    ┆ cell 19 ┆ cell 20 ┆ file2  │
└──────┴─────────┴─────────┴────────┘