I have a case where I need to match a name from a given string to a database of names. Below is a very simple example of the issue I am running into, and I am unclear as to why one case works and the other does not. If I'm not mistaken, the default algorithm fuzzywuzzy uses for extractOne() is based on Levenshtein distance. Is it because the Clemens names provide the first two letters of the first name, as opposed to only one in the Gonzalez case?
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

s = ['Gonzalez, E. walked down the street.', 'Gonzalez, R. went to the market.', 'Clemens, Ko. reach the intersection; Clemens, Ka. did not.']
names = []
for i in s:
    name = []  # clear name
    for k in i.split():
        if k[0].isupper():
            name.append(k)
        else:
            break
    names.append(' '.join(name))
    if ';' in i:
        for each in i.split(';')[1:]:
            name = []  # clear name
            for k in each.split():
                if k[0].isupper():
                    name.append(k)
                else:
                    break
            names.append(' '.join(name))
print(names)

choices = ['Kody Clemens','Kacy Clemens','Gonzalez Ryan', 'Gonzalez Eddy']
for i in names:
    s = process.extractOne(i, choices)
    print(s, i)
OUTPUT:
['Gonzalez, E.', 'Gonzalez, R.', 'Clemens, Ko.', 'Clemens, Ka.']
('Gonzalez Ryan', 85) Gonzalez, E.
('Gonzalez Ryan', 85) Gonzalez, R.
('Kody Clemens', 86) Clemens, Ko.
('Kacy Clemens', 86) Clemens, Ka.
Although @Igle's comment does solve this specific problem, I want to stress that it is a narrow solution that won't necessarily work for everything. Fuzzywuzzy has multiple scorers that combine the Levenshtein distance algorithm with different logic to compare strings. The default scorer, fuzz.WRatio, compares the score from the straight Levenshtein distance algorithm (fuzz.ratio) with other variants and returns the best score among them. There's more to it than that, including additional logic that weights the scores from the different methods. If you're interested, I suggest looking at the source code for fuzz.WRatio.
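To make the weighting idea concrete, here is a minimal sketch of how I read the branch of fuzz.WRatio that applies when the two strings have similar lengths. The function name wratio_like, the 0.95 scale factor as a parameter, and the sub-scores passed in are my own illustration (the real function also has partial-match branches that are left out here), so treat this as an approximation rather than the library's implementation.

```python
def wratio_like(base, token_sort, token_set, scale=0.95):
    # Hypothetical helper: token-based scores are scaled down slightly,
    # then the best of the three candidates becomes the final score.
    return round(max(base, token_sort * scale, token_set * scale))

# Illustrative sub-scores, not measured output.
# A high token_set score dominates regardless of token_sort:
print(wratio_like(base=75, token_sort=70, token_set=89))
print(wratio_like(base=75, token_sort=87, token_set=89))
```

Both calls give the same final score, because the scaled token_set value is the maximum in each case. If two choices share the same token_set score, a clearer difference in token_sort never reaches the result.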
To see what's happening in your case, you can compare the scores for all the choices across scorers by slightly adapting the last lines of your code, once for token_set_ratio and once for token_sort_ratio.
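With fuzzywuzzy itself, this comparison amounts to calling process.extract(i, choices, scorer=fuzz.token_set_ratio) and then the same with fuzz.token_sort_ratio. The sketch below imitates the two scorers using only the standard library, so it runs without fuzzywuzzy installed. The helpers clean, ratio, token_sort_ratio and token_set_ratio are simplified stand-ins I wrote for illustration, built on difflib.SequenceMatcher; the real scorers may give slightly different numbers, especially with python-Levenshtein installed.

```python
import re
from difflib import SequenceMatcher

def clean(text):
    # Lowercase and turn punctuation into spaces
    # (rough stand-in for fuzzywuzzy's preprocessing).
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def ratio(a, b):
    # Plain similarity on a 0-100 scale.
    return round(100 * SequenceMatcher(None, a, b).ratio())

def token_sort_ratio(a, b):
    # Sort the tokens alphabetically, then compare the rejoined strings.
    return ratio(" ".join(sorted(clean(a).split())),
                 " ".join(sorted(clean(b).split())))

def token_set_ratio(a, b):
    # Compare the shared tokens against "shared + leftover" for each side
    # and keep the best of the three scores.
    ta, tb = set(clean(a).split()), set(clean(b).split())
    common = " ".join(sorted(ta & tb))
    comb_a = (common + " " + " ".join(sorted(ta - tb))).strip()
    comb_b = (common + " " + " ".join(sorted(tb - ta))).strip()
    return max(ratio(common, comb_a), ratio(common, comb_b), ratio(comb_a, comb_b))

choices = ['Kody Clemens', 'Kacy Clemens', 'Gonzalez Ryan', 'Gonzalez Eddy']
query = 'Gonzalez, E.'
for scorer in (token_set_ratio, token_sort_ratio):
    print(scorer.__name__, [(c, scorer(query, c)) for c in choices])
```

With these stand-ins, 'Gonzalez Ryan' and 'Gonzalez Eddy' get the same token_set_ratio for 'Gonzalez, E.', because the best of the three comparisons is the shared token 'gonzalez' against 'gonzalez e', which does not involve the choice's first name at all. The token_sort_ratio stand-in separates them, with 'Gonzalez Eddy' scoring higher.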
Although token_sort_ratio shows a clear winning match, token_set_ratio returns higher scores, and those higher scores are what fuzz.WRatio uses to pick the result it returns. Another major issue is that when you have such similar queries and choices, the order in which they are compared starts to matter. For example, when I run the exact same code as above but reverse the order of the choices list, I get 'Gonzalez Eddy' for both Gonzalez queries.
I'm guessing that the correct match actually has a higher score, but 'Eddy' and 'Ryan' are close enough to both round to the same final score.
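The order dependence can be reproduced without fuzzywuzzy. The sketch below assumes that extractOne keeps the first of several equally scored choices, which is how Python's built-in max() behaves. The helper pick_best and the tied score of 85 for 'Gonzalez Eddy' are hypothetical, chosen only to mirror the situation described above.

```python
def pick_best(scored):
    # max() returns the first item among equal scores,
    # so the order of the list decides ties.
    return max(scored, key=lambda pair: pair[1])

# Assumed tie: both choices end up with the same final score.
scored = [('Gonzalez Ryan', 85), ('Gonzalez Eddy', 85)]

print(pick_best(scored))        # first entry wins the tie
print(pick_best(scored[::-1]))  # reversed list, the other entry wins
```

When two choices end up with the same final score, whichever comes first in the list is returned, so a single best match says nothing about how close the runner-up was.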
Ways I've dealt with similar issues in the past: