How to calculate score in fuzzy string matching?

6.5k views Asked by At

I would like to know the mathematical logic and formula behind calculating the fuzzy matching score between two strings.

Let's say I have two strings s1 and s2 and I want to use fuzzy matching in python. I am aware of the fact that python libraries like fuzzywuzzy can do the trick. But I would like to know the exact mathematics and logic behind the fuzzy matching approach and the ratio calculation.

3

There are 3 answers

0
Elidor00 On BEST ANSWER

Fuzzy String Matching, also called Approximate String Matching, is the process of finding strings that approximatively match a given pattern. The closeness of a match is often measured in terms of edit distance, which is the number of primitive operations necessary to convert the string into an exact match. Primitive operations are usually: insertion (to insert a new character at a given position), deletion (to delete a particular character) and substitution (to replace a character with a new one).

Fuzzy search works by using mathematical formulae that calculate the distance (or similarity between) two words. One such commonly used method is called the Levenshtein distance.

Here you can find the formula.

An alternative to the Levenshtein distance is to use cosine similarity. The real advantage of cosine distance is that you can perform dimensionality reduction. This allows you to work with very large documents efficiently and fuzzy. It also allows you to create efficient data structures for finding similar strings and much more.

Here you can find the formula.

0
Neeraj Thapliyal On

The fuzzy string matching for a name consists of a letter followed by three numerical digits: the letter is the first letter of the name, and the digits encode the remaining consonants. Consonants at a similar place of articulation share the same digit so, for example, the labial consonants B, F, P, and V are each encoded as the number 1.

The correct value can be found as follows:

Retain the first letter of the name and drop all other occurrences of a, e, i, o, u, y, h, w. Replace consonants with digits as follows (after the first letter):

b, f, p, v → 1

c, g, j, k, q, s, x, z → 2

d, t → 3

l → 4

m, n → 5

r → 6

If two or more letters with the same number are adjacent in the original name (before step 1), only retain the first letter; also two letters with the same number separated by 'h' or 'w' are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter. If you have too few letters in your word that you can't assign three numbers, append with zeros until there are three numbers. If you have four or more numbers, retain only the first three.

Using this algorithm, both "Robert" and "Rupert" return the same string "R163" while "Rubin" yields "R150". "Ashcraft" and "Ashcroft" both yield "A261". "Tymczak" yields "T522" not "T520" (the chars 'z' and 'k' in the name are coded as 2 twice since a vowel lies in between them). "Pfister" yields "P236" not "P123" (the first two letters have the same number and are coded once as 'P'), and "Honeyman" yields "H555".

You can found that in detail here: https://en.wikipedia.org/wiki/Soundex#:~:text=Soundex%20is%20a%20phonetic%20algorithm,despite%20minor%20differences%20in%20spelling.

0
Varghese PK On

The above answers have given an in-depth explanation of two commonly used algorithms for fuzzy string matching namely the Levenshtein distance and the Soundex algorithm. I will try and explain the use of Cosine Similarity in fuzzy string matching.

Cosine similarity between two non-zero vectors is simply the cosine of the angle between these vectors. The core idea is to find the corresponding vector representation of each string. This can be done by using either the Bag-of-words feature extractor or the TF-IDF feature extractor. The similarity between your strings is proportional to the cosine similarity metric.

For example: Consider the following names: Name1 : athaarv Name2: atharv

Step 1) 2gram representation of these strings is

Name1_2gram = ["at","th","ha","aa","ar","rv"]

Name2_2gram = ["at","th","ha","ar","rv"]

Step2) Bag of Words Representation:

Name1_BOW = [1,1,1,1,1,1]

Name2_BOW = [1,1,1,0,1,1]

Step3) Compute the Cosine Similarity:

Cosine similarity = ([1 1 1 1 1 1].[1 1 1 0 1 1])/(sqrt(5)*sqrt(6)) = 0.91

NOTE: String representations using TF-IDF are usually better than BOW.Also the Nanonets blog has several resources on this particular topic which might be helpful.