Given a Polars Dataframe:
df = pl.DataFrame({'Col_1':['A','B','C','D'],'Col_2':['a','b','c','d'],'Col_3':[1,2,3,4]})
print(df)
shape: (4, 3)
┌───────┬───────┬───────┐
│ Col_1 ┆ Col_2 ┆ Col_3 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞═══════╪═══════╪═══════╡
│ A ┆ a ┆ 1 │
│ B ┆ b ┆ 2 │
│ C ┆ c ┆ 3 │
│ D ┆ d ┆ 4 │
└───────┴───────┴───────┘
and a list:
display_list = ['A','B','B','B','C','D','D','A']
I want the output to be in this format:
shape: (8, 4)
┌───────┬───────┬───────┬───────────────┐
│ Col_1 ┆ Col_2 ┆ Col_3 ┆ Occurrence_No │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 │
╞═══════╪═══════╪═══════╪═══════════════╡
│ A ┆ a ┆ 1 ┆ 1 │
│ B ┆ b ┆ 2 ┆ 1 │
│ B ┆ b ┆ 2 ┆ 2 │
│ B ┆ b ┆ 2 ┆ 3 │
│ C ┆ c ┆ 3 ┆ 1 │
│ D ┆ d ┆ 4 ┆ 1 │
│ D ┆ d ┆ 4 ┆ 2 │
│ A ┆ a ┆ 1 ┆ 1 │
└───────┴───────┴───────┴───────────────┘
I want to the rows to be duplicated based on the number of times the first column appears in the list. Also, I need an Occurrence_No column that acts as the counter of the occurrence of the first column in the dataframe. The order of the rows in the DataFrame does not matter.
I am able to get the result except the Occurrence_No using this code:
df = df.with_columns(pl.col('Col_1').apply(lambda x: display_list.count(x)).alias('occur'))
df = df.select(pl.exclude('occur').repeat_by('occur').explode())
The above codes creates the number of rows based on the number of times it occurs in the list.
How do I add the Occurrence_No column?
You could just do a
join
ofdisplay_list
(as apl.Series
->pl.DataFrame
). ForOccurence_no
, acumcount
over
therle_id
.