I have two dataframes:
df1
district place year votes candidate
1 1 2000 25 bond james
1 1 2000 30 smith john peter
1 1 2000 10 caprio leonardo di
1 1 2001 5 bond james
df2
district place year money candidate
1 1 2000 500 bond james
1 1 2000 100 di caprio leonardo
1 1 2000 10 smith j.peter
1 1 2001 90 bond james
I want to match the two dataframes. There is exact match for the columns 'district', 'place' 'year' but not for the column 'candidate'. The problem is in column 'candidate', not always the name exactly matches. I tried the following code:
import pandas as pd
import fuzzy_pandas fpd
df1["district"] = df1["district"].astype(str)
df1["place"] = df1["place"].astype(str)
df1["year"] = df1["year"].astype(str)
df1["candidate"] = df1["candidate"].astype(str)
df2["district"] = df2["district"].astype(str)
df2["place"] = df2["place"].astype(str)
df2["year"] = df2["year"].astype(str)
df2["candidate"] = df2["candidate"].astype(str)
data= fpd.fuzzy_merge(df1, df2,
left_on=['district', 'place', 'year', 'candidate'],
right_on=['district', 'place', 'year', 'candidate'],
method='levenshtein',
threshold=0.6,
join='left-outer')
This is what I want to obtain:
district place year votes candidate money
1 1 2000 25 bond james 500
1 1 2000 30 smith john peter 10
1 1 2000 10 caprio leonardo di 100
1 1 2001 5 bond james 90
But sometimes there is a wrong match because of columns 'district', 'place' or 'year'. How should I correct my code?
A possible solution, which uses
difflib.get_close_matchesto get the closest matches between columnscandidatein the two dataframes and then merges the two dataframes:Output: