I have two data frames with the following format:
d = {'id2': ['1', '2'], 'name': ['paris city', 'london town']}
df1 = pd.DataFrame(data=d)
print(df1)
id2 name
0 1 paris city
1 1 london town
d = {'id2': ['3', '4'], 'name': ['parid cit', 'londoon town']}
df2 = pd.DataFrame(data=d)
print(df2)
id2 name
0 3 parid cit
1 4 londoon town
I am looking to do a fuzzy string merge based on the name columns. The end result would be the following:
id2 id2 name_x name_y match_level
0 3 1 parid cit paris city 0.91
1 4 2 londoon town london town 0.93
I have tried to use the fuzzy wuzzy library, but because both the tables I am merging have 100k+ rows, the code would take days to run.
I have also seen some 'fast' implementations like rapid fuzz or others that use tfidf or k nearest neighbors: However all of the code examples show how to find matches for a single string against a list, i've have not been able to find any other libraries/implemntation that do a fuzzy merge between two dataframes.
What is a fast and efficient way to fuzzy merge pandas dataframes?
This is the address:https://github.com/RobinL/fuzzymatcher