Python: How to correct misspelled names

6.4k views Asked by At

I have a list with city names, which some of them are misspelled:

['bercelona', 'emstrdam', 'Praga']

And a list with all possible city names well spelled:

['New York', 'Amsterdam', 'Barcelona', 'Berlin', 'Prague']

I'm looking for an algorithm able to find the closest match between the names of the first and second list, and returns the first list with its well spelled names. So it should return the following list:

['Barcelona', 'Amsterdam', 'Prague']
3

There are 3 answers

6
pivanchy On BEST ANSWER

You may use built-in Ratcliff and Obershelp algorithm:

def is_similar(first, second, ratio):
    return difflib.SequenceMatcher(None, first, second).ratio() > ratio


first = ['bercelona', 'emstrdam', 'Praga']
second = ['New York', 'Amsterdam', 'Barcelona', 'Berlin', 'Prague']

result = [s for f in first for s in second if is_similar(f,s, 0.7)]
print result
['Barcelona', 'Amsterdam', 'Prague']

Where 0.7 is coefficient of similarity. It may do some tests for your case and set this value. It shows how similar are both of strings(1 - it's the same string, 0 - very different strings)

5
ebeneditos On

First, you should use Levenshtein distances between strings, I found a link with the following Levenshtein Distance Algorithm for Python:

# Define Levenshtein distance function (from the mentioned link)
def levenshtein(s1, s2):
    if len(s1) < len(s2):
        return levenshtein(s2, s1)

    if len(s2) == 0:
        return len(s1)

    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1 
            deletions = current_row[j] + 1  
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row

    return previous_row[-1]

Once you got this, you should make a function able to find the closest match between a given string and the list with the well spelled names.

names_list = ['bercelona', 'emstrdam', 'Praga']
good_names = ['New York', 'Amsterdam', 'Barcelona', 'Berlin', 'Prague']

# Define a function that returns the best match
def get_closest_match(name, real_names):
    levdist = [levenshtein(name, real_name) for real_name in real_names]
    for i in range(len(levdist)):
        if levdist[i] == min(levdist):
            return real_names[i]

# Loops the first list
final_list = [ get_closest_match(name, good_names) for name in names_list ]

Finally you just have to loop the first list with this function. The outcome is the following:

>>> print final_list
['Barcelona', 'Amsterdam', 'Prague']
0
datawrestler On

This could be a good use case of a great package called fuzzywuzzy.

from fuzzywuzzy import fuzz
import numpy as np

bad = ['bercelona', 'emstrdam', 'Praga']

good = ['New York', 'Amsterdam', 'Barcelona', 'Berlin', 'Prague']

# you can even set custom threshold and only return matches if above certain
# matching threshold
def correctspell(word, spellcorrect, thresh = 70):
    mtchs = map(lambda x: fuzz.ratio(x, word) if fuzz.ratio(x, word) > thresh else None, spellcorrect)
    max = np.max(mtchs)
    if max is not None:
        return spellcorrect[mtchs.index(max)]
    else:
        return None

# get correct spelling
map(lambda x: correctspell(x, good, thresh = 70), bad) # ['Barcelona', 'Amsterdam', 'Prague']