I am using fuzzywuzzy and rapidfuzz to find names mentioned in comments. I read through the documentation of the "token_set_ratio" function but I still don't understand the following:
from rapidfuzz import fuzz

# I preprocessed the comments to remove stop words and other commonly mentioned words
fuzz.token_set_ratio("reporting michael anders sven straumann guy called jonatjan smith partners", "jonathan smith")
# returns 52.6
Jonathan Smith has only one spelling mistake, so why is the ratio so low?
Moreover, is there a way to work around this so that Jonathan receives a higher score?
Thanks for your help, Michael
fuzz.token_set_ratio is not really the right ratio for your problem, since it sorts the words, while you would like to keep the pairing of first and second name. You could use fuzz.partial_ratio to compare only the best matching substring of the longer string to the shorter string.
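To illustrate the idea of matching against only the best substring, here is a minimal sketch that uses just the standard library. It is not how rapidfuzz implements partial_ratio: similarity and best_window are hypothetical helper names, difflib stands in for the rapidfuzz scorer, and the window is measured in words rather than characters, so the numbers rapidfuzz returns may differ. As far as I can tell, it also shows where the 52.6 comes from: "smith" is the only token the two strings share exactly, and comparing "smith" with "jonathan smith" gives that score.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity from 0 to 100, based on difflib (an assumption, not rapidfuzz's scorer)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def best_window(comment, name):
    """Hypothetical helper: score every run of consecutive words that is as long as the name."""
    words = comment.split()
    size = len(name.split())
    last_start = max(1, len(words) - size + 1)
    windows = [" ".join(words[i:i + size]) for i in range(last_start)]
    # Keep the window whose text is most similar to the name.
    return max(((w, similarity(w, name)) for w in windows), key=lambda pair: pair[1])


comment = "reporting michael anders sven straumann guy called jonatjan smith partners"
name = "jonathan smith"

# The only exactly shared token compared with the full name.
print(round(similarity("smith", name), 1))  # 52.6

match, score = best_window(comment, name)
print(match, round(score, 1))  # jonatjan smith 92.9
```

Because the window keeps the words in their original order, the first name and last name stay paired, which is what you want for name matching.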