I am working on a name-matching problem: given customer names need to be compared against 2.5 million existing-customer records stored in a CSV file. Below is the code I tried; it takes 5-12 minutes for a single name match. Since this will be integrated as an API into an RPA process, please suggest another way to achieve the same result within one or two minutes.
from fuzzywuzzy import fuzz
import pandas as pd
import time
# names is the list passed to the program as parameter
names_with_sno = [[sno, name] for sno, name in enumerate(names, 1)]
# dataframe created for the given customer names
df1 = pd.DataFrame(names_with_sno, columns=['s_no','SDN_NAME_SERACH'])
# dataframe for customer database via csv
cust_2 = pd.read_csv(r'...\customer-database-extract\extract.CSV')
# .... preprocessing of both the dataframes
# .... which are not time consuming ones
### CROSS JOIN
#doing the cross join between the given names and customer database
#creating common key in the dataframe having the given names
df1["key"]=1
#creating common key in customer db dataset
cust_2["key"]=1
# dropping the common key column after creating the cross join
final_df = pd.merge(df1, cust_2, on="key").drop("key", axis=1)
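As a side note, on pandas 1.2+ the same cross join can be written directly with `how="cross"`, which avoids the dummy key column entirely. A minimal sketch with toy stand-ins for `df1` and `cust_2` (not the real extract):

```python
import pandas as pd

# toy stand-ins for the given names and the customer extract
df1 = pd.DataFrame({"s_no": [1, 2], "SDN_NAME_SERACH": ["John Doe", "Jane Roe"]})
cust_2 = pd.DataFrame({"FIRST_NAME": ["Jon Doe", "Janet Roe", "Max Moe"]})

# cross join: every given name paired with every customer row
final_df = pd.merge(df1, cust_2, how="cross")
print(final_df.shape)  # 2 names x 3 customers = 6 rows, 3 columns
```

This is only a syntactic simplification; the result still has `len(df1) * len(cust_2)` rows, which is the real cost at 2.5M records.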
def get_ratio(df):
    cust_name = df["FIRST_NAME"]
    hit_name = df["SDN_NAME_SERACH"]
    return fuzz.token_set_ratio(cust_name, hit_name)
st = time.mktime(time.localtime())
# applying the function for name matching and storing the scores in a Series
final_series = final_df.apply(get_ratio, axis=1)
print('\n\nt23 - df.apply(get_ratio) - ',secondsToText(time.mktime(time.localtime()) - st))
Here, df1 is the dataframe of the given names and cust_2 is the DB extract read from the CSV file. The print gives the time as:
t23 - df.apply(get_ratio) - 5.0 minutes, 42.0 seconds
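For scale: the cross join materialises `len(names) * 2.5M` rows before a single score is computed. A cheaper pattern I have been considering is to score each query name against the customer column directly, never building the cross product. The sketch below uses the standard library's `difflib.SequenceMatcher` as a stand-in scorer (not `fuzz.token_set_ratio`; C-backed libraries such as rapidfuzz would be much faster in practice), with toy data in place of the real extract:

```python
import difflib
import pandas as pd

# toy stand-in for the 2.5M-row customer extract
cust_2 = pd.DataFrame({"FIRST_NAME": ["Jon Doe", "Janet Roe", "Max Moe"]})

def score_name(query, choices):
    """Score one query name against a Series of names, 0-100 scale."""
    return choices.map(
        lambda c: int(100 * difflib.SequenceMatcher(None, query.lower(), c.lower()).ratio())
    )

scores = score_name("John Doe", cust_2["FIRST_NAME"])
best_match = cust_2.loc[scores.idxmax(), "FIRST_NAME"]
print(best_match)  # "Jon Doe"
```

This keeps memory at one copy of the extract per query name, and the per-name loop is trivially parallelisable.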