Python merge two dataframes (fuzzy matching, some columns exactly match one does not)

108 views Asked by At

I have two dataframes:

df1

district place year votes candidate
1        1     2000 25    bond james
1        1     2000 30    smith john peter
1        1     2000 10    caprio leonardo di
1        1     2001 5     bond james

df2

district place year money candidate
1        1     2000 500   bond james
1        1     2000 100   di caprio leonardo
1        1     2000 10    smith j.peter
1        1     2001 90    bond james

I want to match the two dataframes. There is exact match for the columns 'district', 'place' 'year' but not for the column 'candidate'. The problem is in column 'candidate', not always the name exactly matches. I tried the following code:

import pandas as pd
import fuzzy_pandas fpd

df1["district"] = df1["district"].astype(str)
df1["place"] = df1["place"].astype(str)
df1["year"] = df1["year"].astype(str)
df1["candidate"] = df1["candidate"].astype(str)
df2["district"] = df2["district"].astype(str)
df2["place"] = df2["place"].astype(str)
df2["year"] = df2["year"].astype(str)
df2["candidate"] = df2["candidate"].astype(str)

data= fpd.fuzzy_merge(df1, df2,
                        left_on=['district', 'place', 'year', 'candidate'],
                        right_on=['district', 'place', 'year', 'candidate'],
                        method='levenshtein',
                        threshold=0.6,
                        join='left-outer')

This is what I want to obtain:

district place year votes candidate           money
1        1     2000 25    bond james          500
1        1     2000 30    smith john peter    10
1        1     2000 10    caprio leonardo di  100
1        1     2001 5     bond james          90

But sometimes there is a wrong match because of columns 'district', 'place' or 'year'. How should I correct my code?

1

There are 1 answers

0
PaulS On

A possible solution, which uses difflib.get_close_matches to get the closest matches between columns candidate in the two dataframes and then merges the two dataframes:

import difflib

df2.candidate = df2.candidate.map(
    lambda x: difflib.get_close_matches(x, df1.candidate)[0])

df1.merge(df2, on= ['district', 'place', 'year', 'candidate'])

Output:

   district  place  year  votes           candidate  money
0         1      1  2000     25          bond james    500
1         1      1  2000     30    smith john peter     10
2         1      1  2000     10  caprio leonardo di    100
3         1      1  2001      5          bond james     90