I have referred to this post but cannot get it to run for my particular case. I have two dataframes:
import pandas as pd
df1 = pd.DataFrame(
    {
        "ein": {0: 1001, 1: 1500, 2: 3000},
        "ein_name": {0: "H for Humanity", 1: "Labor Union", 2: "Something something"},
        "lname": {0: "Cooper", 1: "Cruise", 2: "Pitt"},
        "fname": {0: "Bradley", 1: "Thomas", 2: "Brad"},
    }
)
df2 = pd.DataFrame(
    {
        "lname": {0: "Couper", 1: "Cruise", 2: "Pit"},
        "fname": {0: "Brad", 1: "Tom", 2: "Brad"},
        "score": {0: 3, 1: 3.5, 2: 4},
    }
)
Then I do:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from itertools import product
N = 60
names = {
    tup: fuzz.ratio(*tup)
    for tup in product(df1["lname"].tolist(), df2["lname"].tolist())
}
s1 = pd.Series(names)
s1 = s1[s1 > N]
s1 = s1[s1.groupby(level=0).idxmax()]
degrees = {
    tup: fuzz.ratio(*tup)
    for tup in product(df1["fname"].tolist(), df2["fname"].tolist())
}
s2 = pd.Series(degrees)
s2 = s2[s2 > N]
s2 = s2[s2.groupby(level=0).idxmax()]
df2["lname"] = df2["lname"].map(s1).fillna(df2["lname"])
df2["fname"] = df2["fname"].map(s2).fillna(df2["fname"])
df = df1.merge(df2, on=["lname", "fname"], how="outer")
The result is not what I expect. Can you help me fix this code, please? Note that I have millions of rows in df1 and millions in df2, so I need some efficiency as well.
Basically, I need to match people from df1 to people in df2. In the above example, I am matching them on last name (lname) and first name (fname). I also have a third matching column, which I leave out here for brevity.
The expected result should look like:
    ein             ein_name   lname    fname  score
0  1001       H for Humanity  Cooper  Bradley      3
1  1500          Labor Union  Cruise   Thomas    3.5
2  3000  Something something    Pitt     Brad      4
You could try this:
And then:
Given the size of your dataframes, I suppose you have namesakes (people with identical first and last names), hence the use of the @cache decorator from the Python standard library (functools) to try to speed things up (but you can do without it).
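To illustrate the caching idea, here is a minimal sketch; it is not the code from the answer above. It makes several assumptions: `difflib.SequenceMatcher` stands in for `fuzz.ratio` so that only the standard library is needed, the names `similarity`, `best_match`, and `THRESHOLD` are hypothetical, and a candidate counts as a match when the mean of its last-name and first-name scores exceeds the threshold. The rows are plain dicts copied from the question's example rather than DataFrames.

```python
from difflib import SequenceMatcher
from functools import cache

THRESHOLD = 60  # assumed cut-off, same value as N in the question


@cache
def similarity(a: str, b: str) -> int:
    """Score two strings from 0 to 100; repeated pairs are answered from the cache."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_match(lname, fname, candidates):
    """Return the candidate with the highest mean name score above THRESHOLD, else None."""
    best, best_score = None, THRESHOLD
    for cand in candidates:
        score = (similarity(lname, cand["lname"]) + similarity(fname, cand["fname"])) / 2
        if score > best_score:
            best, best_score = cand, score
    return best


# Rows taken from the question's df1 (names only) and df2.
people = [("Cooper", "Bradley"), ("Cruise", "Thomas"), ("Pitt", "Brad")]
scored = [
    {"lname": "Couper", "fname": "Brad", "score": 3},
    {"lname": "Cruise", "fname": "Tom", "score": 3.5},
    {"lname": "Pit", "fname": "Brad", "score": 4},
]

matches = [best_match(lname, fname, scored) for lname, fname in people]
matched_scores = [m["score"] if m else None for m in matches]
print(matched_scores)
```

The cache only helps when the same pair of strings is compared more than once, as happens with namesakes. It does not reduce the number of pairs, so with millions of rows on each side you would probably also need to limit which candidates are compared at all.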