Python Pandas: Find the difference of two dataframe

54 views Asked by At
import pandas as pd

a = [0.0, 0.6900000000000001, 1.3800000000000001, 1.3800000000000001, 2.0700000000000003, 2.7600000000000002, 2.7600000000000002, 3.45, 4.1262, 4.140000000000001, 4.140000000000001, 4.83, 5.5200000000000005, 5.5200000000000005, 5.9754000000000005, 6.4308000000000005, 6.6654, 6.9]
b = [0.0, 0.6900000000000001, 1.3800000000000001, 1.3800000000000001, 2.0700000000000003, 2.7600000000000002, 2.7600000000000002, 3.45, 4.1262, 4.140000000000001, 4.140000000000001, 4.83, 5.5200000000000005, 5.5200000000000005, 5.9754000000000005, 6.4308000000000005, 6.4308000000000005, 6.6654, 6.9]
df1 = pd.DataFrame(a, columns=['x(m)'])
df2 = pd.DataFrame(b, columns=['x(m)'])

df1:

      x(m)
0   0.0000
1   0.6900
2   1.3800
3   1.3800
4   2.0700
5   2.7600
6   2.7600
7   3.4500
8   4.1262
9   4.1400
10  4.1400
11  4.8300
12  5.5200
13  5.5200
14  5.9754
15  6.4308
16  6.6654
17  6.9000

df2:

      x(m)
0   0.0000
1   0.6900
2   1.3800
3   1.3800
4   2.0700
5   2.7600
6   2.7600
7   3.4500
8   4.1262
9   4.1400
10  4.1400
11  4.8300
12  5.5200
13  5.5200
14  5.9754
15  6.4308
16  6.4308
17  6.6654
18  6.9000

The difference of the two dataframes is row 16 in df2 is not in df1. How can I get the difference? The output can be the row index 16 plus the value 6.4308 , or a combined dataframe with the third column specifying not in df1 or anything similar. Thanks.

1

There are 1 answers

1
Scott Boston On BEST ANSWER

IIUC, do you want something like this?

df1['seq'] = df1.groupby('x(m)').cumcount()
df2['seq'] = df2.groupby('x(m)').cumcount()
df1.merge(df2, how="outer", indicator="in_df")\
   .sort_values("x(m)")\
   .replace({"right_only": "not in df1", "left_only": "not in df2"})

Output:

      x(m)  seq       in_df
0   0.0000    0        both
1   0.6900    0        both
2   1.3800    0        both
3   1.3800    1        both
4   2.0700    0        both
5   2.7600    0        both
6   2.7600    1        both
7   3.4500    0        both
8   4.1262    0        both
10  4.1400    1        both
9   4.1400    0        both
11  4.8300    0        both
12  5.5200    0        both
13  5.5200    1        both
14  5.9754    0        both
15  6.4308    0        both
18  6.4308    1  not in df1
16  6.6654    0        both
17  6.9000    0        both

Note: pd.DataFrame.merge without the on defined will merge dataframes on all identical columns in the dataframe. If you want to keep the indexing in both dataframes, reset_index first.

And, if you only one the difference then add .query('in_df != "both"')