Standardizing Slightly Different Values Across 4 Dataframes


I am looking for an elegant (computationally inexpensive) solution to this:

I have 4 scraped dataframes containing football club names. Sometimes one or two of the four frames spell a name slightly differently:

import pandas as pd

df1 = pd.DataFrame({'Home Team': ['France', 'Italy', 'Spain', 'Palmeiras SE']})
df2 = pd.DataFrame({'Home Team': ['France Woman', 'Italy', 'Spain', 'Palmeiras']})
df3 = pd.DataFrame({'Home Team': ['France', 'Italy', 'Spain Woman', 'Palmeiras SE']})
df4 = pd.DataFrame({'Home Team': ['France', 'Italy', 'Spain Woman', 'Palmeiras']})

In reality, each dataframe has around 50 to 100 values and is updated daily. I have tried the fuzzywuzzy and difflib libraries, and even thought about bringing in master data as a point of reference for all future updates. All of these require a lot of computation, so perhaps someone has a better idea.
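To make the master-data idea concrete, here is a minimal sketch using only difflib from the standard library. The `MASTER` list, the helper names `canonical` and `standardize`, and the 0.6 cutoff are all assumptions for illustration, not something taken from the real data. Each distinct name is matched once and cached, so a daily update only pays for names that have not been seen before.

```python
import difflib
import pandas as pd

# Hypothetical master list of canonical names (an assumption, not real data).
MASTER = ['France', 'Italy', 'Spain', 'Palmeiras']

def canonical(name, choices=MASTER, cutoff=0.6):
    """Return the closest master name, or the input unchanged if nothing is close enough."""
    matches = difflib.get_close_matches(name, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else name

def standardize(df, column='Home Team', cache=None):
    """Map every name in `column` to its canonical form, matching each distinct name once."""
    cache = {} if cache is None else cache
    for name in df[column].unique():
        if name not in cache:
            cache[name] = canonical(name)
    out = df.copy()
    out[column] = out[column].map(cache)
    return out

df2 = pd.DataFrame({'Home Team': ['France Woman', 'Italy', 'Spain', 'Palmeiras']})
df3 = pd.DataFrame({'Home Team': ['France', 'Italy', 'Spain Woman', 'Palmeiras SE']})

shared_cache = {}  # reuse across frames and across daily runs
clean2 = standardize(df2, cache=shared_cache)
clean3 = standardize(df3, cache=shared_cache)
```

With 50 to 100 names per frame the matching cost should be small, and passing the same cache to every call keeps repeated names from being re-matched. The cutoff would need tuning on real names, since a loose cutoff can also merge teams that are genuinely different.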

Appreciate it. Kristof

A couple of things that somewhat worked for me:

Fuzzymatcher with 2 frames:

import fuzzymatcher

result = fuzzymatcher.fuzzy_left_join(whoscored, olbg_data, 'Home Team', 'Home Team')
output = result[['Home Team_left', 'Predicted Result_left', 'Predicted Result_right']]
output.columns = ['Home Team', 'Predicted Result (Whoscored)', 'Predicted Result (OLBG Data)']

This easily picked up things like:

Talleres | Talleres CA | Talleres de Cordoba
Germany | Germany | Germany Women

When I use 4 frames:

import fuzzymatcher

result = fuzzymatcher.fuzzy_left_join(whoscored, olbg_data, 'Home Team', 'Home Team')
output = result[['Home Team_left', 'Predicted Result_left', 'Predicted Result_right']]
output.columns = ['Home Team', 'Predicted Result (Whoscored)', 'Predicted Result (OLBG Data)']

result_predictz = fuzzymatcher.fuzzy_left_join(whoscored, predictz_output, 'Home Team', 'Home Team')
output = output.copy()
output.loc[:, 'Predicted Result (Predictz)'] = result_predictz['Predicted Result_right']

result_vitibet = fuzzymatcher.fuzzy_left_join(whoscored, vitibet, 'Home Team', 'Home Team')
output = output.copy()
output.loc[:, 'Predicted Result (Vitibet)'] = result_vitibet['Predicted Result_right']

It gives a somewhat OK result, but some values come back as NaN even though they are in the dataframe.

For example, row 7 of Vitibet['Predicted Result'] comes back as NaN, even though the team is present, just under a slightly different name.
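One possible, unconfirmed explanation for NaN values like this is pandas index alignment: assigning a column taken from one dataframe to another matches rows by index label, not by position, so a join result with a different index or row order can leave gaps. The sketch below shows that effect with plain pandas on made-up data (the frames `base` and `other` and their values are hypothetical), together with a key-based merge as an alternative. It does not reproduce fuzzymatcher's actual output.

```python
import pandas as pd

# Made-up data for illustration only.
base = pd.DataFrame({'Home Team': ['France', 'Italy', 'Spain'],
                     'Predicted Result': ['1', 'X', '2']})

# Hypothetical second source: same teams, different row order and index labels.
other = pd.DataFrame({'Home Team': ['Spain', 'France', 'Italy'],
                      'Predicted Result': ['2', '1', 'X']},
                     index=[10, 11, 12])

# Column assignment aligns on index labels (0, 1, 2 vs 10, 11, 12),
# so none of the values line up and the new column is entirely NaN.
misaligned = base.copy()
misaligned['Predicted Result (Other)'] = other['Predicted Result']

# Merging on the team name ignores the index and keeps the left frame's row order.
merged = base.merge(other, on='Home Team', how='left',
                    suffixes=(' (Base)', ' (Other)'))
```

If this is what is happening, standardizing the names first and then merging every frame on 'Home Team' would avoid relying on the row order of each join result.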
