Why is the token set ratio so low using fuzzywuzzy?


I am using fuzzywuzzy and rapidfuzz to find names mentioned in comments. I read through the documentation of the "token_set_ratio" function but I still don't understand the following:

# I preprocessed the comments to remove stop words and commonly mentioned other words

from rapidfuzz import fuzz  # fuzzywuzzy exposes the same name: from fuzzywuzzy import fuzz

fuzz.token_set_ratio("reporting michael anders sven straumann guy called jonatjan smith partners","jonathan smith")

# returns 52.6

"Jonathan Smith" contains only one spelling mistake, so why is the ratio so low?

Also, is there a way around this so that "Jonathan Smith" receives a higher score?

Thanks for your help, Michael

Answer by maxbachmann:

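fuzz.token_set_ratio is not really the right scorer for your problem: it splits both strings into words and sorts them, while you would like to keep the first and last name paired. Because "jonatjan" is misspelled, the only word the two strings share exactly is "smith", so the score effectively comes from comparing "smith" with "smith jonathan", which gives 2 * 5 / 19 ≈ 52.6. You could use fuzz.partial_ratio instead, which compares the shorter string only against the best-matching substring of the longer string.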

from rapidfuzz import fuzz

fuzz.partial_ratio(
    "reporting michael anders sven straumann guy called jonatjan smith partners",
    "jonathan smith",
)
# returns 92.85714285714286
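
To see where the two numbers come from, here is a rough sketch of both ideas using only the standard library. It is not the actual fuzzywuzzy or rapidfuzz code: `simple_ratio`, `token_set_sketch` and `partial_sketch` are hypothetical helper names, and `difflib.SequenceMatcher` stands in for the libraries' own similarity routine as an assumption. For this particular input the sketch happens to land on the same values as above.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> float:
    """Similarity in [0, 100]: 2 * matched characters / combined length."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def token_set_sketch(s1: str, s2: str) -> float:
    """Illustrative token-set comparison (hypothetical helper, not library code)."""
    tokens1, tokens2 = set(s1.split()), set(s2.split())
    common = " ".join(sorted(tokens1 & tokens2))
    rest1 = " ".join(sorted(tokens1 - tokens2))
    rest2 = " ".join(sorted(tokens2 - tokens1))
    combined1 = f"{common} {rest1}".strip()
    combined2 = f"{common} {rest2}".strip()
    # Best of: shared words vs. each full word set, and the two sets vs. each other.
    return max(
        simple_ratio(common, combined1),
        simple_ratio(common, combined2),
        simple_ratio(combined1, combined2),
    )


def partial_sketch(s1: str, s2: str) -> float:
    """Illustrative partial match: slide the shorter string over the longer one."""
    shorter, longer = sorted((s1, s2), key=len)
    width = len(shorter)
    return max(
        simple_ratio(shorter, longer[start:start + width])
        for start in range(len(longer) - width + 1)
    )


comment = "reporting michael anders sven straumann guy called jonatjan smith partners"
name = "jonathan smith"

token_set_score = token_set_sketch(comment, name)
partial_score = partial_sketch(comment, name)
print(round(token_set_score, 1))  # 52.6
print(round(partial_score, 1))    # 92.9
```

In the token-set case the misspelled "jonatjan" is treated as a completely different word, so only "smith" counts as shared and the best comparison is "smith" against "smith jonathan". In the partial case the window "jonatjan smith" is compared character by character with "jonathan smith", so a single wrong letter costs very little.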