Python fuzzy matching of names with only first initials

3.6k views Asked by At

I have a case where I need to match a name from a given string to a database of names. Below I have given a very simple example of the issue that I am running into, and I am unclear as to why one case works over the other? If I'm not mistaken, the Python default algorithm for extractOne() is the Levenshtein distance algorithm. Is it because the Clemens' names provide the first two initials, opposed to only one in the Gonzalez's case?

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

s = ['Gonzalez, E. walked down the street.', 'Gonzalez, R. went to the market.', 'Clemens, Ko. reach the intersection; Clemens, Ka. did not.']

names = []

for i in s:

    name = [] #clear name
    for k in i.split():
        if k[0].isupper(): name.append(k)
        else: break
    names.append(' '.join(name))

    if ';' in i:
        for each in i.split(';')[1:]:
            name = [] #clear name
            for k in each.split():
                if k[0].isupper(): name.append(k)
                else: break
            names.append(' '.join(name))

print(names)

choices = ['Kody Clemens','Kacy Clemens','Gonzalez Ryan', 'Gonzalez Eddy']

for i in names:
    s = process.extractOne(i, choices)
    print(s, i)

OUTPUT:

['Gonzalez, E.', 'Gonzalez, R.', 'Clemens, Ko.', 'Clemens, Ka.']
('Gonzalez Ryan', 85) Gonzalez, E.
('Gonzalez Ryan', 85) Gonzalez, R.
('Kody Clemens', 86) Clemens, Ko.
('Kacy Clemens', 86) Clemens, Ka.
1

There are 1 answers

2
YG14 On BEST ANSWER

Although @Igle's commment does solve this specific problem, I want to stress that this is a narrow solution that won't necessarily work for everything. Fuzzywuzzy has multiple scorers that use the Levenshtein distance algorithm combined with different logic to compare strings. The default scorer, fuzz.WRatio, compares the matching score of the straight Levenshtein distance algorithm (fuzz.ratio) with other variants, and returns the best match from all of the scorers. There's more to it than just that, including additional logic around weighting the score from different methods, if you're interested I suggest looking at the source code for fuzz.WRatio.

To see what's happening in your case, you can compare the scores for all the choices across scorers by slightly adapting the last lines of your code:

For token_set_ratio:

for i in names:
   s = process.extract(i, choices,scorer=fuzz.token_set_ratio)
   print(s, i)

[('Gonzalez Ryan', 89), ('Gonzalez Eddy', 89), ('Kody Clemens', 27), ('Kacy Clemens', 27)] Gonzalez, E.
[('Gonzalez Ryan', 89), ('Gonzalez Eddy', 89), ('Kody Clemens', 27), ('Kacy Clemens', 27)] Gonzalez, R.
[('Kody Clemens', 91), ('Kacy Clemens', 82), ('Gonzalez Ryan', 26), ('Gonzalez Eddy', 26)] Clemens, Ko.
[('Kacy Clemens', 91), ('Kody Clemens', 82), ('Gonzalez Ryan', 35), ('Gonzalez Eddy', 26)] Clemens, Ka.

For token_sort_ratio:

for i in names:
   s = process.extract(i, choices,scorer=fuzz.token_sort_ratio)
   print(s, i)

[('Gonzalez Eddy', 87), ('Gonzalez Ryan', 70), ('Kody Clemens', 27), ('Kacy Clemens', 27)] Gonzalez, E.
[('Gonzalez Ryan', 87), ('Gonzalez Eddy', 70), ('Kody Clemens', 27), ('Kacy Clemens', 27)] Gonzalez, R.
[('Kody Clemens', 91), ('Kacy Clemens', 82), ('Gonzalez Ryan', 26), ('Gonzalez Eddy', 26)] Clemens, Ko.
[('Kacy Clemens', 91), ('Kody Clemens', 82), ('Gonzalez Ryan', 35), ('Gonzalez Eddy', 26)] Clemens, Ka.

Although token_sort_ratio shows a clear winning match, token_set_ratio returns higher scores which is how fuzz.WRatio picks what result it returns. Another major issue is that when you have such similar queries and choices, the order in which they are compared starts to matter. For example, when I run the exact same code as above, but reverse the order of the choices list we get 'Gonzalez Eddy' for both:

for i in names:
   s = process.extract(i, choices[::-1],scorer=fuzz.token_set_ratio)
   print(s, i)
[('Gonzalez Eddy', 89), ('Gonzalez Ryan', 89), ('Kacy Clemens', 27), ('Kody Clemens', 27)] Gonzalez, E.
[('Gonzalez Eddy', 89), ('Gonzalez Ryan', 89), ('Kacy Clemens', 27), ('Kody Clemens', 27)] Gonzalez, R.
[('Kody Clemens', 91), ('Kacy Clemens', 82), ('Gonzalez Eddy', 26), ('Gonzalez Ryan', 26)] Clemens, Ko.
[('Kacy Clemens', 91), ('Kody Clemens', 82), ('Gonzalez Ryan', 35), ('Gonzalez Eddy', 26)] Clemens, Ka.

I'm guessing that the correct match actually has a higher score, but 'Eddy' and 'Ryan' are close enough to both round to the same final score.

Ways I've dealt with similar issues in the past:

  1. Use extract instead of extractOne (like I did in the examples above)
  2. Process the same queries/choices with multiple scorers (ratio, token_set_ratio, token_sort_ratio) and use a weighted average of these scores to pick your best match.
  3. Adjust the fuzzywuzzy source code to incorporate custom weighting or remove the rounding.