How to transform dataframe long to wide with "grouped" columns?

78 views Asked by At

When pivoting the following dataframe from long to wide, I would like to get "groups" of columns and mark them with a prefix or suffix.

  • The groups of elements can have different sizes, i.e. consist of one, two or more grouped elements/rows, I used pairs of two here to keep the example simple.
import pandas as pd

df = pd.DataFrame(
    [
        {'group': 'group-009297534',  'single_id': 'single-011900051',  'country': 'ESP',  'name': '00000911'},
        {'group': 'group-009297534',  'single_id': 'single-000000821',  'country': 'USA',  'name': '00001054'},
        {'group': 'group-009280053',  'single_id': 'single-000000002',  'country': 'HUN',  'name': '00000496'},
        {'group': 'group-009280053',  'single_id': 'single-000000014',  'country': 'HUN',  'name': '00000795'},
        {'group': 'group-009245039',  'single_id': 'single-000001258',  'country': 'NOR',  'name': '00000527'},
        {'group': 'group-009245039',  'single_id': 'single-000000669',  'country': 'TWN',  'name': '00000535'}
    ]
)

My approach of assigning an index to the elements to be grouped and then using this for the column designation is already going in the right direction, but still deviates from the expected view

df['idx'] = df.groupby('group').cumcount()
df.pivot(index='group', columns='idx')
group ('single_id', 0) ('single_id', 1) ('country', 0) ('country', 1) ('name', 0) ('name', 1)
group-009245039 single-000001258 single-000000669 NOR TWN 00000527 00000535
group-009280053 single-000000002 single-000000014 HUN HUN 00000496 00000795
group-009297534 single-011900051 single-000000821 ESP USA 00000911 00001054

However, the expected solution would look like this:

group single_id_1 country_1 name_1 single_id_2 country_2 name_2
0 group-009245039 single-000001258 NOR 00000527 single-000000669 TWN 00000535
1 group-009280053 single-000000002 HUN 00000496 single-000000014 HUN 00000795
2 group-009297534 single-011900051 ESP 00000911 single-000000821 USA 00001054

I'm not sure whether the approach with the multi-index, which would then have to be sorted and merged somehow, is the right one or whether there is a more elegant option.

3

There are 3 answers

1
Panda Kim On BEST ANSWER

Code

out = (df
       .assign(index=df.groupby('group').cumcount().add(1).astype('str'))
       .pivot(index='group', columns='index')
       .pipe(lambda x: x.set_axis(x.columns.map('_'.join), axis=1))
       .sort_index(axis=1, 
                   key=lambda x: x.str.extract(r'.+_(\d+$)')[0].astype('int'))
       .reset_index()
)

out

enter image description here

1
DataSciRookie On

Is it what you are looking for ?

import pandas as pd

df = pd.DataFrame([
    {'group': 'group-009297534',  'single_id': 'single-011900051',  'country': 'ESP',  'name': '00000911'},
    {'group': 'group-009297534',  'single_id': 'single-000000821',  'country': 'USA',  'name': '00001054'},
    {'group': 'group-009280053',  'single_id': 'single-000000002',  'country': 'HUN',  'name': '00000496'},
    {'group': 'group-009280053',  'single_id': 'single-000000014',  'country': 'HUN',  'name': '00000795'},
    {'group': 'group-009245039',  'single_id': 'single-000001258',  'country': 'NOR',  'name': '00000527'},
    {'group': 'group-009245039',  'single_id': 'single-000000669',  'country': 'TWN',  'name': '00000535'}
])

df['idx'] = (df.groupby('group').cumcount() + 1).astype(str)


# Pivot the DataFrame with the new 'idx' for differentiation
df_pivoted = df.pivot(index='group', columns='idx')

# Flatten the MultiIndex and format column names
df_pivoted.columns = [f'{x[0]}_{x[1]}' for x in df_pivoted.columns]

# Reset the index to bring 'group' back as a column
df_pivoted.reset_index(inplace=True)

# Optional: Reorder the columns according to your expected output
# This assumes you know the order and number of groups
expected_order = [
    'group', 
    'single_id_1', 'country_1', 'name_1',
    'single_id_2', 'country_2', 'name_2'
]
df_pivoted = df_pivoted[expected_order]

print(df_pivoted)

Output :

  group       single_id_1 country_1    name_1       single_id_2  \
0  group-009245039  single-000001258       NOR  00000527  single-000000669   
1  group-009280053  single-000000002       HUN  00000496  single-000000014   
2  group-009297534  single-011900051       ESP  00000911  single-000000821   

  country_2    name_2  
0       TWN  00000535  
1       HUN  00000795  
2       USA  00001054 
1
Onyambu On

You can use pivot_wider from janitor

import janitor
df['cols'] = df.groupby('group').cumcount() + 1
df.pivot_wider(index = 'group', names_from = 'cols')

            group      single_id_1      single_id_2 country_1 country_2 name_1 name_2
0 group-009245039 single-000001258 single-000000669       NOR       TWN    527    535
1 group-009280053 single-000000002 single-000000014       HUN       HUN    496    795
2 group-009297534 single-011900051 single-000000821       ESP       USA    911   1054

Equivalently you could use pivot_wider from siuba:

from siuba import _
from siuba.experimental.pivot import pivot_wider

df['cols'] = df.groupby('group').cumcount() + 1
pivot_wider(df, names_from = _.cols, 
                values_from = _['single_id', 'country', 'name'],
                names_vary = 'slowest')

            group      single_id_1 country_1 name_1      single_id_2 country_2 name_2
0 group-009245039 single-000001258       NOR    527 single-000000669       TWN    535
1 group-009280053 single-000000002       HUN    496 single-000000014       HUN    795
2 group-009297534 single-011900051       ESP    911 single-000000821       USA   1054