Python NLP: How do I autocorrect and tokenize a body of text only to a set list of words?


Example:

token_list = ['Allen Bradley', 'Haas', 'Fanuc']

input_string = 'I use Alln Brdly machins but dont no how to use Has ones.'

output_tokens = ['Allen Bradley', 'Haas']

There is 1 answer

MGR-3301 On

The textdistance package could help here: it measures the distance between two words, for example with the Hamming distance.

import textdistance as td

#keywords to match against; the names avoid shadowing the built-ins list and str
keywords = ['Allen', 'Bradley', 'Haas', 'Fanuc']

text = 'I use Alln Brdly machins but dont no how to use Has ones.'

#weight function estimating how close two words are,
#using the normalized Hamming similarity and distance
def word_correlation(word1: str, word2: str):
    sim_norm = td.hamming.normalized_similarity(word1, word2)
    dist_norm = td.hamming.normalized_distance(word1, word2)

    return {"similarity": sim_norm,
            "distance": dist_norm
            }

#splitting the sentence into single words
words = text.split(" ")

#calculating the Hamming distances and similarities for each word of the sentence
#with each of the chosen keywords
statistics = []
for i, keyword in enumerate(keywords):
    statistics.append({"check": keyword,
                       "with": {"words": [],
                                "cor": []
                                }
                       }
                      )
    for word in words:
        statistics[i]["with"]["words"].append(word)
        statistics[i]["with"]["cor"].append(word_correlation(word, keyword))


#keeping only the results with high similarities
result = []
for res in statistics:
    correction = res["check"]

    for word, cor in zip(res["with"]["words"], res["with"]["cor"]):

        #filtering the proposed corrections by the normalized Hamming
        #similarity
        if cor["similarity"] > 0.25:
            result.append({"correction": correction,
                           "word": word,
                           "likelyhood": cor["similarity"]
                           }
                          )


print(result)

This will return:

[{'correction': 'Allen', 'word': 'Alln', 'likelyhood': 0.6}, {'correction': 'Bradley', 'word': 'Brdly', 'likelyhood': 0.2857142857142857}, {'correction': 'Haas', 'word': 'Has', 'likelyhood': 0.5}]

You should definitely look into how the metric between two words is defined: my solution uses the Hamming distance, which can give misleading results for words of different lengths. Strictly speaking, the Hamming distance is only defined for words of the same length.
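To see where the numbers in the output come from, here is a minimal pure-Python sketch of a normalized Hamming similarity. It assumes that positions past the end of the shorter word count as mismatches, which is consistent with the values printed above. hamming_similarity is a hypothetical helper written for illustration, not textdistance's own code.

```python
from itertools import zip_longest


def hamming_similarity(a: str, b: str) -> float:
    """Share of positions at which a and b hold the same character.

    Assumption: positions beyond the end of the shorter word are mismatches.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    # zip_longest pads the shorter word with None, which never equals a character
    matches = sum(x == y for x, y in zip_longest(a, b))
    return matches / longest


print(hamming_similarity('Alln', 'Allen'))      # 0.6
print(hamming_similarity('Has', 'Haas'))        # 0.5
print(hamming_similarity('radley', 'Bradley'))  # 0.0
```

The last call shows the weakness: dropping only the first letter shifts every remaining character by one position, so no position matches any more.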

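My example uses the Hamming distance on the assumption that the misspelled word and the keyword are almost the same length: in most cases a typo changes the length by only ±1. Under that assumption, the Hamming distance or the Hamming similarity as implemented in textdistance should work in simple cases. Keep in mind that the comparison is positional, so a letter dropped early in a word shifts every later character, which is why 'Brdly' scores only about 0.29 against 'Bradley' in the output above.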
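If the length assumption does not hold, a similarity that tolerates inserted and deleted letters may be a better fit. The following is a sketch of one alternative, not part of the solution above: it swaps the Hamming similarity for difflib.get_close_matches from the standard library and compares each token with every run of consecutive words of the same word count, so multi-word tokens such as 'Allen Bradley' from the question can be matched as a whole. The helper find_tokens and the cutoff of 0.8 are assumptions chosen for this example.

```python
import difflib
import string

token_list = ['Allen Bradley', 'Haas', 'Fanuc']
input_string = 'I use Alln Brdly machins but dont no how to use Has ones.'


def find_tokens(text, tokens, cutoff=0.8):
    """Return the tokens that approximately occur in text (hypothetical helper)."""
    # strip surrounding punctuation and drop words that become empty
    words = [w.strip(string.punctuation) for w in text.split()]
    words = [w for w in words if w]

    found = []
    for token in tokens:
        n = len(token.split())
        # every run of n consecutive words is a candidate spelling of the token
        candidates = [' '.join(words[i:i + n])
                      for i in range(len(words) - n + 1)]
        # keep the token if at least one candidate is similar enough
        if difflib.get_close_matches(token, candidates, n=1, cutoff=cutoff):
            found.append(token)
    return found


print(find_tokens(input_string, token_list))
```

For the example sentence this prints ['Allen Bradley', 'Haas']. The cutoff is a trade-off: a lower value accepts worse misspellings but also produces more false matches, so it should be tuned on real data.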