Find mismatch in all columns between two rows linked by another column in dataframe

Question

Find mismatch in all columns between two rows linked by another column in dataframe

570 views Asked by AudioBubble At 09 June 2022 at 19:15

I have a large df with many columns and rows, with usually two rows per certain identifier as df is used for reconciliation. Is there any way to streamline identification of non-identifier columns which cause mismatch?

import pandas as pd

df = pd.DataFrame({'col_1':       ['A', 'B', 'C', 'B', 'C', 'D', 'E'],
                    'identifier': [  1,   2,   3,   2,   3,   4,   4],
                    'col_3':      [ 10,  20,   30,  21, 31,  40,  41],
                    'col_4':      [  1,   1,    1,   1,  1,   1,   1]
                    })

In above df, it would be

col_1 for identifier 4 (D vs. E)
col_3 for identifier 2/3/4 (20 vs. 21, 30 vs. 31, 40 vs. 41)

Open to any representation that makes it easy to isolate the columns causing mismatch, their values and identifiers.

Original Q&A

There are 2 answers

BeRT2me On 09 June 2022 at 19:49

mismatch = df.groupby('identifier').agg(set).applymap(lambda x: x if len(x) > 1 else np.nan)
col_1_mismatch = mismatch[['col_1']].dropna()
col_3_mismatch = mismatch[['col_3']].dropna()
print(col_1_mismatch)
print(col_3_mismatch)

Output:

             col_1
identifier
4           {D, E}


               col_3
identifier
2           {20, 21}
3           {30, 31}
4           {40, 41}

**mozway** · Accepted Answer · 2022-06-09T19:49:40+00:00

IIUC, you can agregate the columns as sets and keep those with more than one element:

s = df.groupby('identifier').agg(set).stack()
out = s[s.str.len().gt(1)]

output:

identifier       
2           col_3    {20, 21}
3           col_3    {30, 31}
4           col_1      {D, E}
            col_3    {40, 41}
dtype: object

further aggregation:

out.reset_index(level=1)['level_1'].groupby(level=0).agg(list)

output:

identifier
2           [col_3]
3           [col_3]
4    [col_1, col_3]
Name: level_1, dtype: object

TechQA.

Find mismatch in all columns between two rows linked by another column in dataframe

There are 2 answers

Related Questions in PYTHON

Related Questions in PANDAS

Related Questions in COMPARE

Related Questions in RECONCILIATION

Popular Questions

Trending Questions