Correcting mistakes in strings based on array of similar strings

237 views Asked by At

Can somebody point to the right direction how to do this task: I have a lot of array of strings with such buckets:

  • Somebody did that two years ago
  • Somebody did th two years ago
  • Somebody did that two years a
  • Somedody d that two years ago
  • Somebody that two years ago

And I need to get this: Somebody did that two years ago

Any links to algorithms or libs will be awesome. These strings came from OCR and sometimes OCR make mistakes in letter/words but I have 2-5 different spelling of same string.

Update Based on suggestions from @alec_djinn I`ve found the python lib that can create "median" string based on Levenshtein distance. https://rawgit.com/ztane/python-Levenshtein/master/docs/Levenshtein.html#Levenshtein-median

1

There are 1 answers

3
alec_djinn On BEST ANSWER

You can use a sequence alignment algorithm and then find the consensus on the aligned sequences.

There are tons of libraries and software available but they are often suited for biological sequences only (DNA, RNA, proteins). One python library for general-string alignments is https://pypi.python.org/pypi/alignment/

Once you have the sequences aligned, you can use the following (very basic) way of computing a consensus.

def compute_consensus(sequences):
    consensus = ''
    for i in range(len(sequences[0])):
        char_count = Counter()
        for seq in sequences:
            char_count.update(seq[i])
        consensus += char_count.most_common()[0][0]

    return consensus.replace('-','') #assuming '-' represent deleted letters

Where sequences is a list of aligned sequences. All aligned sequences should be of the same length.