FuzzyWuzzy - Loop through list, match accepted values, and return a dataframe

Question

FuzzyWuzzy - Loop through list, match accepted values, and return a dataframe

5.8k views Asked by jonplaca At 01 January 2017 at 19:24

OBJECTIVE

Given an excel file (full of typos), use FuzzyWuzzy to compare and match the typos against an accepted list.
Correct the typo-filled excel file with the closest, accepted match.

APPROACH

Import Excel file with pandas
Push original, typo-filled excel file into a dataframe
Create an accepted dataframe
Compare typo dataframe with accepted dataframe using FuzzyWuzzy
Return the original spelling, the accepted spelling, and the matching score
Append the associated, accepted spelling to the original excel file/row for all spellings

CODE

#Load Excel File into dataframe
xl = pd.read_excel(open("/../data/expenses.xlsx",'rb'))

#Let's clarify how many similar categories exist... 
q = """
    SELECT DISTINCT Expense 
    FROM xl
    ORDER BY Expense ASC

"""

expenses = sqldf(q)
print(expenses)

#Let's add some acceptable categories and use fuzzywuzzy to match
accepted = ['Severance', 'Legal Fees', 'Import & Export Fees', 'I.T. Fees', 'Board Fees', 'Acquisition Fees']

#select from the list of accepted values and return the closest match
process.extractOne("Company Acquired",accepted,scorer=fuzz.token_set_ratio)

('Acquisition Fees', 38) not a high score, but high enough that it returns the expected output

!!!!!ISSUE!!!!!

#Time to loop through all the expenses and use FuzzyWuzzy to generate and return the closest matches.
def correct_expense(expense):
    for expense in expenses:
        return expense, process.extractOne(expense,accepted,scorer = fuzz.token_set_ratio)

correct_expense(expenses)

('Expense', ('Legal Fees', 47))

QUESTIONS

As you can see, the process.extractOne runs correctly when tested on a case-by-case basis. However, when run in a loop, the returned value is unexpected. I believe I may be grabbing the first or last column, but even if that were the case, I would expect "Director Fees" or "Acquisition" to pop up (see original excel file).

Original Q&A

There are 2 answers

**shawnheide** · Answer 1 · 2017-01-02T20:58:37+00:00

The way I've done this in the past is to just use the get_closest_matches function from the difflib module in Python. You can then create a function to get the closest match and apply it to the Expense column.

def correct_expense(row):
    accepted = ['Severance', 'Legal Fees', 'Import & Export Fees', 'I.T. Fees', 'Board Fees', 'Acquisition Fees']
    match = get_close_matches(row, accepted, n=1, cutoff=0.3)
    return match[0] if match else ''

df['Expense_match'] = df['Expense'].apply(correct_expense)

Here's the original Expense column with the values matched to the accepted list:

You will probably need to fine-tune the accepted list and the cutoff value for get_closest_matches (I found that 0.3 worked pretty well for your sample data).

Once you're satisfied with the results, you can change the function to overwrite the Expense column and save to Excel using the pandas DataFrame method to_excel.

**iEriii** · Answer 2 · 2020-12-19T12:11:57+00:00

This is called gazetteer deduplication.
You perform deduplication by matching messy data against canonical data (i.e. gazette).

pandas-dedupe can do exactly that.
Example:

import pandas as pd
import pandas_dedupe

clean_data = pd.DataFrame({'street': ['Onslow square', 'Sydney Mews', 'Summer Place', 'Bury Walk', 'sydney mews']})
messy_data = pd.DataFrame({'street_name':['Onslow sq', 'Sidney Mews', 'Summer pl', 'Onslow square', 'Bury walk', 'onslow sq', 'Bury Wall'],
                           'city' : ['London', 'London', 'London', 'London', 'London', 'London', 'London']})

dd = pandas_dedupe.gazetteer_dataframe(
    clean_data, 
    messy_data, 
    field_properties = 'street_name', 
    canonicalize=True,
    )

During the process, pandas-dedupe will ask you to label few examples as duplicate or distinct records. The library will then use this knowledge to find potential duplicate entries, match them against the clean data and return all relevant info, including the confidence in the results.

TechQA.

FuzzyWuzzy - Loop through list, match accepted values, and return a dataframe

There are 2 answers

Related Questions in PYTHON

Related Questions in PANDAS

Related Questions in FUZZYWUZZY

Popular Questions

Popular Tags

Trending Questions