OBJECTIVE
- Given an excel file (full of typos), use FuzzyWuzzy to compare and
match the typos against an
accepted
list. - Correct the typo-filled excel file with the closest,
accepted
match.
APPROACH
- Import Excel file with pandas
- Push original, typo-filled excel file into a dataframe
- Create an
accepted
dataframe - Compare typo dataframe with
accepted
dataframe using FuzzyWuzzy - Return the original spelling, the accepted spelling, and the matching score
- Append the associated, accepted spelling to the original excel file/row for all spellings
CODE
#Load Excel File into dataframe
xl = pd.read_excel(open("/../data/expenses.xlsx",'rb'))
#Let's clarify how many similar categories exist...
q = """
SELECT DISTINCT Expense
FROM xl
ORDER BY Expense ASC
"""
expenses = sqldf(q)
print(expenses)
#Let's add some acceptable categories and use fuzzywuzzy to match
accepted = ['Severance', 'Legal Fees', 'Import & Export Fees', 'I.T. Fees', 'Board Fees', 'Acquisition Fees']
#select from the list of accepted values and return the closest match
process.extractOne("Company Acquired",accepted,scorer=fuzz.token_set_ratio)
('Acquisition Fees', 38) not a high score, but high enough that it returns the expected output
!!!!!ISSUE!!!!!
#Time to loop through all the expenses and use FuzzyWuzzy to generate and return the closest matches.
def correct_expense(expense):
for expense in expenses:
return expense, process.extractOne(expense,accepted,scorer = fuzz.token_set_ratio)
correct_expense(expenses)
('Expense', ('Legal Fees', 47))
QUESTIONS
- As you can see, the process.extractOne runs correctly when tested on a case-by-case basis. However, when run in a loop, the returned value is unexpected. I believe I may be grabbing the first or last column, but even if that were the case, I would expect "Director Fees" or "Acquisition" to pop up (see original excel file).
The way I've done this in the past is to just use the
get_closest_matches
function from thedifflib
module in Python. You can then create a function to get the closest match and apply it to theExpense
column.Here's the original
Expense
column with the values matched to theaccepted
list:You will probably need to fine-tune the
accepted
list and thecutoff
value forget_closest_matches
(I found that 0.3 worked pretty well for your sample data).Once you're satisfied with the results, you can change the function to overwrite the
Expense
column and save to Excel using the pandas DataFrame methodto_excel
.