I have a large df with many columns and rows, with usually two rows per certain identifier as df is used for reconciliation. Is there any way to streamline identification of non-identifier columns which cause mismatch?
import pandas as pd
df = pd.DataFrame({'col_1': ['A', 'B', 'C', 'B', 'C', 'D', 'E'],
'identifier': [ 1, 2, 3, 2, 3, 4, 4],
'col_3': [ 10, 20, 30, 21, 31, 40, 41],
'col_4': [ 1, 1, 1, 1, 1, 1, 1]
})
In above df, it would be
- col_1 for identifier 4 (D vs. E)
- col_3 for identifier 2/3/4 (20 vs. 21, 30 vs. 31, 40 vs. 41)
Open to any representation that makes it easy to isolate the columns causing mismatch, their values and identifiers.
IIUC, you can agregate the columns as sets and keep those with more than one element:
output:
further aggregation:
output: