I am looking for an elegant (computationally less expensive) solution to this:
I have 4 scraped dataframes containing football club names. Sometimes 1 or 2 of the 4 frames spell a name slightly differently:
import pandas as pd

# Each frame comes from a different site; note the differing spellings
df1 = pd.DataFrame({'Home Team': ['France', 'Italy', 'Spain', 'Palmeiras SE']})
df2 = pd.DataFrame({'Home Team': ['France Woman', 'Italy', 'Spain', 'Palmeiras']})
df3 = pd.DataFrame({'Home Team': ['France', 'Italy', 'Spain Woman', 'Palmeiras SE']})
df4 = pd.DataFrame({'Home Team': ['France', 'Italy', 'Spain Woman', 'Palmeiras']})
In reality, each dataframe has around 50 to 100 values, and they are updated daily. I have tried the fuzzywuzzy and difflib libraries and even thought about bringing in master data as a point of reference for all future updates. All of these require a lot of computation, and someone might have a better idea.
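The master-data idea could look roughly like the sketch below, using only difflib from the standard library. The `MASTER` list, the `canonical_name` helper and the 0.6 cutoff are illustrative assumptions, not part of my actual scrapers; the cutoff in particular would need tuning against real names.

```python
import difflib

# Hypothetical master list of canonical team names (illustrative only).
MASTER = ['France', 'Italy', 'Spain', 'Palmeiras']

def canonical_name(name, master=MASTER, cutoff=0.6):
    """Return the closest name in the master list.

    Falls back to the original name when nothing scores above the cutoff,
    so unknown teams are left untouched rather than mis-mapped.
    """
    matches = difflib.get_close_matches(name, master, n=1, cutoff=cutoff)
    return matches[0] if matches else name

# Normalise one scraped column before joining the frames on it.
scraped = ['France', 'Italy', 'Spain', 'Palmeiras SE']
normalised = [canonical_name(n) for n in scraped]
print(normalised)
```

With a pandas frame, the same helper could be applied via `df['Home Team'].map(canonical_name)`, after which the four frames can be joined with an ordinary exact merge on the normalised column.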
Appreciate it. Kristof
A couple of things that somewhat worked for me:
Fuzzymatcher with 2 frames:
import fuzzymatcher
result = fuzzymatcher.fuzzy_left_join(whoscored, olbg_data, 'Home Team', 'Home Team')
output = result[['Home Team_left', 'Predicted Result_left', 'Predicted Result_right']]
output.columns = ['Home Team', 'Predicted Result (Whoscored)', 'Predicted Result (OLBG Data)']
This easily picked up things like:
Talleres | Talleres CA | Talleres de Cordoba
Germany | Germany | Germany Women
When I use 4 frames:
import fuzzymatcher
result = fuzzymatcher.fuzzy_left_join(whoscored, olbg_data, 'Home Team', 'Home Team')
output = result[['Home Team_left', 'Predicted Result_left', 'Predicted Result_right']]
output.columns = ['Home Team', 'Predicted Result (Whoscored)', 'Predicted Result (OLBG Data)']
result_predictz = fuzzymatcher.fuzzy_left_join(whoscored, predictz_output, 'Home Team', 'Home Team')
output = output.copy()
output.loc[:, 'Predicted Result (Predictz)'] = result_predictz['Predicted Result_right']
result_vitibet = fuzzymatcher.fuzzy_left_join(whoscored, vitibet, 'Home Team', 'Home Team')
output = output.copy()
output.loc[:, 'Predicted Result (Vitibet)'] = result_vitibet['Predicted Result_right']
It gives a somewhat OK result, but some values come back as NaN even though they are present in the dataframe. Dataframe itself: Dataframe
As you can see, Vitibet['Predicted Result'] comes back as NaN in row 7, even though the team is present, just with a slightly different value: Proof