Python dynamic fuzzy logic join

I am trying to build a dynamic fuzzy join between two tables. By dynamic I mean that the columns used to join the two tables are passed in as arguments. The code below is a modified version of the static code from the following question: Python Pandas fuzzy merge/match with duplicates

I have compiled the dynamic code below:

import pandas as pd
import datetime
from fuzzywuzzy import fuzz
import difflib 

donors = pd.DataFrame({"name": pd.Series(["John Doe","John Doe","Tom Smith","Jane Doe","Jane Doe","Kat test"]), "Email": pd.Series(['[email protected]','[email protected]','[email protected]','[email protected]','[email protected]','[email protected]']),"Date": (["27/03/2013  10:00:00 AM","1/03/2013  10:39:00 AM","2/03/2013  10:39:00 AM","3/03/2013  10:39:00 AM","4/03/2013  10:39:00 AM","27/03/2013  10:39:00 AM"])})
fundraisers = pd.DataFrame({"name": pd.Series(["John Doe","John Doe","Kathy test","Tes Ester", "Jane Doe"]),"Email": pd.Series(['[email protected]','[email protected]','[email protected]','[email protected]','[email protected]']),"Date": pd.Series(["2/03/2013  10:39:00 AM","27/03/2013  11:39:00 AM","3/03/2013  10:39:00 AM","4/03/2013  10:40:00 AM","27/03/2013  10:39:00 AM"])})
donors["Date"] = pd.to_datetime(donors["Date"], dayfirst=True)
fundraisers["Date"] = pd.to_datetime(fundraisers["Date"], dayfirst=True)
donors["code"] = donors.apply(lambda row: str(row['name'])+' '+str(row['Email']), axis=1)
idx = donors.groupby('code')["Date"].transform(min) == donors['Date']
donors = donors[idx].reset_index().drop('index',1)

def get_donors_v1(fund_var,don_var, don_tab,row=None):
    d = don_tab.apply(lambda x: fuzz.ratio(x["%s" % don_var], 'row["%s" %fund_var]') * 2, axis=1)
    d = d[d >= 75]
    if len(d) == 0:
        v = ['']*3
    else:
        v = don_tab.ix[d.idxmax(), ["%s"% don_var ,'Email','Date']].values
    return pd.Series(v, index=['donor name', 'donor email', 'donor date'])

trial=pd.concat((fundraisers, fundraisers.apply(get_donors_v1(fund_var="name",don_var="name",don_tab=donors), axis=1)), axis=1)

I get the following error:

TypeError: get_donors_v1() takes exactly 4 arguments (3 given)

If I change the function signature to:

get_donors_v1(row=None,fund_var,don_var, don_tab)

then I get the following error:

TypeError: ("'NoneType' object has no attribute '__getitem__'", u'occurred at index 0')

Please help.

1 answer

user508402 (best answer)

In your code example, get_donors_v1() receives the value None for the argument 'row'. On the next line, you try to use row as a mapping (row["%s" % fund_var]) without first checking that the object exists, that is, that it is not None.
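To see the failure in isolation, here is a minimal sketch. The helper name lookup and the sample dictionary are hypothetical and only illustrate the missing check; note that Python 3 words the error as "'NoneType' object is not subscriptable" rather than the Python 2 message quoted in the question.

```python
def lookup(row, fund_var):
    """Return row[fund_var], or None when no row was supplied (hypothetical helper)."""
    if row is None:
        # Guard: None cannot be subscripted, so bail out early.
        return None
    return row[fund_var]


# Subscripting None directly raises TypeError.
row = None
try:
    row["name"]
except TypeError as exc:
    error = exc

# With the guard in place, a missing row no longer raises.
missing = lookup(None, "name")
found = lookup({"name": "John Doe"}, "name")
```

A guard like this only hides the symptom; the function still needs to be given a real row to do any matching.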

Indexing an object, as in row["%s" % fund_var], calls its __getitem__ method, which None does not have.
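One way to make sure the function gets a real row is to pass the function itself to DataFrame.apply and let pandas supply each row as the first argument. The question's line instead calls get_donors_v1(...) immediately, so no row is ever passed in. The following is a sketch under several assumptions, not a drop-in fix: the name get_donors and the helper similarity are hypothetical, difflib.SequenceMatcher stands in for fuzz.ratio so the example needs only pandas, .loc replaces .ix (which newer pandas versions no longer provide), and the sample names and emails are made up.

```python
import difflib

import pandas as pd


def similarity(a, b):
    """Return a 0-100 similarity score (stand-in for fuzz.ratio, an assumption)."""
    return 100 * difflib.SequenceMatcher(None, str(a), str(b)).ratio()


def get_donors(row, fund_var, don_var, don_tab, threshold=75):
    """Return the donor whose don_var column best matches row[fund_var]."""
    # row comes first so that DataFrame.apply can pass it positionally.
    scores = don_tab[don_var].apply(lambda value: similarity(value, row[fund_var]))
    scores = scores[scores >= threshold]
    if scores.empty:
        values = [""] * 3
    else:
        values = don_tab.loc[scores.idxmax(), [don_var, "Email", "Date"]].tolist()
    return pd.Series(values, index=["donor name", "donor email", "donor date"])


# Made-up sample data, much smaller than the tables in the question.
donors = pd.DataFrame({
    "name": ["John Doe", "Tom Smith", "Jane Doe"],
    "Email": ["john@example.com", "tom@example.com", "jane@example.com"],
    "Date": ["2013-03-01", "2013-03-02", "2013-03-03"],
})
fundraisers = pd.DataFrame({"name": ["Jon Doe", "Tes Ester", "Jane Doe"]})

# Pass the function itself; apply forwards the extra keyword arguments to it.
matches = fundraisers.apply(
    get_donors, axis=1, fund_var="name", don_var="name", don_tab=donors
)
trial = pd.concat([fundraisers, matches], axis=1)
```

Because the column names arrive as keyword arguments, the same function can join on other columns by changing fund_var and don_var at the call site. A lambda or functools.partial that binds the extra arguments would work as well.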