Searching one Python dataframe / dictionary for fuzzy matches in another dataframe

3.1k views Asked by At

I have the following pandas dataframe with 50,000 unique rows and 20 columns (included is a snippet of the relevant columns):

df1:

            PRODUCT_ID           PRODUCT_DESCRIPTION
0           165985858958         "Fish Burger with Lettuce"
1           185965653252         "Chicken Salad with Dressing"
2           165958565556         "Pork and Honey Rissoles"
3           655262522233         "Cheese, Ham and Tomato Sandwich"
4           857485966653         "Coleslaw with Yoghurt Dressing"
5           524156285551         "Lemon and Raspberry Cheesecake"

I also have the following dataframe (which I also have saved in dictionary form) which has 2 columns and 20,000 unique rows:

df2 (also saved as dict_2)

       PROD_ID   PROD_DESCRIPTION
0      548576    "Fish Burger"
1      156956    "Chckn Salad w/Ranch Dressing"
2      257848    "Rissoles - Lamb & Rosemary"
3      298770    "Lemn C-cake"
4      651452    "Potato Salad with Bacon"
5      100256    "Cheese Cake - Lemon Raspberry Coulis"

What I am wanting to do is compare the "PRODUCT_DESCRIPTION" field in df1 to the the "PROD_DESCRIPTION" field in df2 and find the closest match/matches to help with the heavy lifting part. I would then need to manually check the matches but it would be a lot quicker The ideal outcome would look like this, e.g. with one or more part matches noted:

     PRODUCT_ID      PRODUCT_DESCRIPTION               PROD_ID   PROD_DESCRIPTION
0    165985858958    "Fish Burger with Lettuce"        548576    "Fish Burger"
1    185965653252    "Chicken Salad with Dressing"     156956    "Chckn Salad w/Ranch Dressing"
2    165958565556    "Pork and Honey Rissoles"         257848    "Rissoles - Lamb & Rosemary"     
3    655262522233    "Cheese, Ham and Tomato Sandwich" NaN       NaN
4    857485966653    "Coleslaw with Yoghurt Dressing"  NaN       NaN
5    524156285551    "Lemon and Raspberry Cheesecake"  298770    "Lemn C-cake"
6    524156285551    "Lemon and Raspberry Cheesecake"  100256    "Cheese Cake - Lemon Raspberry Coulis"

I have already completed a join which has identified the exact matches. It's not important that the index is retained as the Product ID's in each df are unique. The results can also be saved into a new dataframe as this will then be applied to a third dataframe that has around 14 million rows.

I've used the following questions and answers (amongst others):

Is it possible to do fuzzy match merge with python pandas
Fuzzy merge match with duplicates including trying jellyfish module as suggested in one of the answers
Python fuzzy matching fuzzywuzzy keep only the best match
Fuzzy match items in a column of an array

and also various loops/functions/mapping etc. but have had no success, either getting the first "fuzzy match" which has a low score or no matches being detected.

I like the idea of a matching/distance score column being generated as per here as it would then allow me to speed up the manual checking process.

I'm using Python 2.7, pandas and have fuzzywuzzy installed.

3

There are 3 answers

6
piRSquared On BEST ANSWER

using fuzz.ratio as my distance metric, calculate my distance matrix like this

df3 = pd.DataFrame(index=df.index, columns=df2.index)

for i in df3.index:
    for j in df3.columns:
        vi = df.get_value(i, 'PRODUCT_DESCRIPTION')
        vj = df2.get_value(j, 'PROD_DESCRIPTION')
        df3.set_value(
            i, j, fuzz.ratio(vi, vj))

print(df3)

    0   1   2   3   4   5
0  63  15  24  23  34  27
1  26  84  19  21  52  32
2  18  31  33  12  35  34
3  10  31  35  10  41  42
4  29  52  32  10  42  12
5  15  28  21  49   8  55

Set a threshold for acceptable distance. I set 50
Find the index value (for df2) that has maximum value for every row.

threshold = df3.max(1) > 50
idxmax = df3.idxmax(1)

Make assignments

df['PROD_ID'] = np.where(threshold, df2.loc[idxmax, 'PROD_ID'].values, np.nan)
df['PROD_DESCRIPTION'] = np.where(threshold, df2.loc[idxmax, 'PROD_DESCRIPTION'].values, np.nan)
df

enter image description here

0
Rick On

You should be able to iterate over both dataframes and populate either a dict of a 3rd dataframe with your desired information:

d = {
    'df1_id': [],
    'df1_prod_desc': [],
    'df2_id': [],
    'df2_prod_desc': [],
    'fuzzywuzzy_sim': []
}
for _, df1_row in df1.iterrows():
    for _, df2_row in df2.iterrows():
        d['df1_id'] = df1_row['PRODUCT_ID']
        ...
df3 = pd.DataFrame.from_dict(d)
0
Pombaastic On

I don't have enough reputation to be able to comment on answer from @piRSquared. Hence this answer.

  • The definition of 'vi' and 'vj' didn't go through with an error (AttributeError: 'DataFrame' object has no attribute 'get_value'). It worked when I inserted an "underscore". E.g. vi = df._get_value(i, 'PRODUCT_DESCRIPTION')
  • Similar issue persisted for 'set_value' and the same solution worked there too. E.g. df3._set_value(i, j, fuzz.ratio(vi, vj))
  • Generating idxmax posed another error (TypeError: reduction operation 'argmax' not allowed for this dtype) which was because contents of df3 (the fuzzy ratios) were of type 'object'. I converted all of them to numeric just before defining threshold and it worked. E.g. df3 = df3.apply(pd.to_numeric)

A million thanks to @piRSquared for the solution. For a Python novice like me, it worked like a charm. I am posting this answer to make it easy for other newbies like me.