I would like to know the mathematical logic and formula behind calculating the fuzzy matching score between two strings.
Let's say I have two strings s1 and s2 and I want to use fuzzy matching in Python. I am aware that Python libraries like fuzzywuzzy can do the trick, but I would like to know the exact mathematics and logic behind the fuzzy matching approach and the ratio calculation.
Fuzzy string matching, also called approximate string matching, is the process of finding strings that approximately match a given pattern. The closeness of a match is often measured in terms of edit distance, which is the number of primitive operations necessary to convert the string into an exact match. The primitive operations are usually insertion (inserting a new character at a given position), deletion (deleting a particular character) and substitution (replacing a character with a new one).
Fuzzy search works by using mathematical formulae that calculate the distance (or the similarity) between two words. One commonly used measure is the Levenshtein distance, which is the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other.
Here you can find the formula.
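For reference, the Levenshtein distance is usually defined by a recurrence. Writing lev(i, j) for the distance between the first i characters of s1 and the first j characters of s2: lev(i, 0) = i, lev(0, j) = j, and otherwise lev(i, j) = min(lev(i-1, j) + 1, lev(i, j-1) + 1, lev(i-1, j-1) + cost), where cost is 0 if s1[i] equals s2[j] and 1 otherwise. A minimal Python sketch of that recurrence follows. The function names and the normalisation used in similarity_ratio are my own assumptions for illustration; libraries such as fuzzywuzzy may compute their ratio differently.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(s1) < len(s2):
        s1, s2 = s2, s1

    # previous[j] holds lev(i-1, j); row 0 is lev(0, j) = j.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # lev(i, 0) = i
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # delete c1 from s1
                current[j - 1] + 1,      # insert c2 into s1
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def similarity_ratio(s1: str, s2: str) -> float:
    """Hypothetical normalisation: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(s1, s2) / longest


print(levenshtein("kitten", "sitting"))       # 3
print(similarity_ratio("kitten", "sitting"))  # 1 - 3/7
```

Dividing by the length of the longer string is only one of several possible ways to turn the distance into a score between 0 and 1.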
An alternative to the Levenshtein distance is cosine similarity, which represents each string as a vector (for example, of word or character n-gram counts) and measures the angle between the two vectors. The real advantage of cosine similarity is that you can apply dimensionality reduction to those vectors, which lets you do fuzzy comparison of very large documents efficiently. It also allows you to build efficient data structures for finding similar strings, among other things.
Here you can find the formula.
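For two vectors A and B, cosine similarity is the dot product of A and B divided by the product of their lengths: cos(A, B) = (A . B) / (|A| * |B|). A minimal sketch for strings is below. It assumes each string is represented by its character bigram counts, which is only one possible choice of vector; the helper names char_ngrams and cosine_similarity are hypothetical, and no dimensionality reduction is applied here.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 2) -> Counter:
    """Count overlapping character n-grams (assumed representation)."""
    text = text.lower()
    if len(text) < n:
        # Too short for a full n-gram: use the whole string as one feature.
        return Counter([text]) if text else Counter()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(s1: str, s2: str, n: int = 2) -> float:
    """Cosine of the angle between the n-gram count vectors of s1 and s2."""
    v1, v2 = char_ngrams(s1, n), char_ngrams(s2, n)
    # Only n-grams present in both strings contribute to the dot product.
    dot = sum(v1[g] * v2[g] for g in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty string has no direction to compare
    return dot / (norm1 * norm2)


print(cosine_similarity("night", "nacht"))  # the two words share only the bigram "ht"
```

Unlike edit distance, this score ignores where in the string the n-grams occur, so it is better suited to longer texts than to short words.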