FuzzyWuzzy - Loop through list, match accepted values, and return a dataframe

5.9k views Asked by At

OBJECTIVE

  • Given an excel file (full of typos), use FuzzyWuzzy to compare and match the typos against an accepted list.
  • Correct the typo-filled excel file with the closest, accepted match.

APPROACH

  1. Import Excel file with pandas
  2. Push original, typo-filled excel file into a dataframe
  3. Create an accepted dataframe
  4. Compare typo dataframe with accepted dataframe using FuzzyWuzzy
  5. Return the original spelling, the accepted spelling, and the matching score
  6. Append the associated, accepted spelling to the original excel file/row for all spellings

CODE

#Load Excel File into dataframe
xl = pd.read_excel(open("/../data/expenses.xlsx",'rb'))

excel dataframe

#Let's clarify how many similar categories exist... 
q = """
    SELECT DISTINCT Expense 
    FROM xl
    ORDER BY Expense ASC

"""

expenses = sqldf(q)
print(expenses)

similar categories

#Let's add some acceptable categories and use fuzzywuzzy to match
accepted = ['Severance', 'Legal Fees', 'Import & Export Fees', 'I.T. Fees', 'Board Fees', 'Acquisition Fees']

#select from the list of accepted values and return the closest match
process.extractOne("Company Acquired",accepted,scorer=fuzz.token_set_ratio)

('Acquisition Fees', 38) not a high score, but high enough that it returns the expected output

!!!!!ISSUE!!!!!

#Time to loop through all the expenses and use FuzzyWuzzy to generate and return the closest matches.
def correct_expense(expense):
    for expense in expenses:
        return expense, process.extractOne(expense,accepted,scorer = fuzz.token_set_ratio)

correct_expense(expenses)

('Expense', ('Legal Fees', 47))

QUESTIONS

  1. As you can see, the process.extractOne runs correctly when tested on a case-by-case basis. However, when run in a loop, the returned value is unexpected. I believe I may be grabbing the first or last column, but even if that were the case, I would expect "Director Fees" or "Acquisition" to pop up (see original excel file).
2

There are 2 answers

0
shawnheide On

The way I've done this in the past is to just use the get_closest_matches function from the difflib module in Python. You can then create a function to get the closest match and apply it to the Expense column.

def correct_expense(row):
    accepted = ['Severance', 'Legal Fees', 'Import & Export Fees', 'I.T. Fees', 'Board Fees', 'Acquisition Fees']
    match = get_close_matches(row, accepted, n=1, cutoff=0.3)
    return match[0] if match else ''

df['Expense_match'] = df['Expense'].apply(correct_expense)

Here's the original Expense column with the values matched to the accepted list:

enter image description here

You will probably need to fine-tune the accepted list and the cutoff value for get_closest_matches (I found that 0.3 worked pretty well for your sample data).

Once you're satisfied with the results, you can change the function to overwrite the Expense column and save to Excel using the pandas DataFrame method to_excel.

0
iEriii On

This is called gazetteer deduplication.
You perform deduplication by matching messy data against canonical data (i.e. gazette).

pandas-dedupe can do exactly that.
Example:

import pandas as pd
import pandas_dedupe

clean_data = pd.DataFrame({'street': ['Onslow square', 'Sydney Mews', 'Summer Place', 'Bury Walk', 'sydney mews']})
messy_data = pd.DataFrame({'street_name':['Onslow sq', 'Sidney Mews', 'Summer pl', 'Onslow square', 'Bury walk', 'onslow sq', 'Bury Wall'],
                           'city' : ['London', 'London', 'London', 'London', 'London', 'London', 'London']})

dd = pandas_dedupe.gazetteer_dataframe(
    clean_data, 
    messy_data, 
    field_properties = 'street_name', 
    canonicalize=True,
    )

During the process, pandas-dedupe will ask you to label few examples as duplicate or distinct records. The library will then use this knowledge to find potential duplicate entries, match them against the clean data and return all relevant info, including the confidence in the results.