Correcting mistakes in strings based on array of similar strings

Question

Correcting mistakes in strings based on array of similar strings

227 views Asked by Gaploid At 11 September 2017 at 12:12

Can somebody point to the right direction how to do this task: I have a lot of array of strings with such buckets:

Somebody did that two years ago
Somebody did th two years ago
Somebody did that two years a
Somedody d that two years ago
Somebody that two years ago

And I need to get this: Somebody did that two years ago

Any links to algorithms or libs will be awesome. These strings came from OCR and sometimes OCR make mistakes in letter/words but I have 2-5 different spelling of same string.

Update Based on suggestions from @alec_djinn I`ve found the python lib that can create "median" string based on Levenshtein distance. https://rawgit.com/ztane/python-Levenshtein/master/docs/Levenshtein.html#Levenshtein-median

Original Q&A

There are 1 answers

**alec_djinn** · Accepted Answer · 2017-09-11T12:22:05+00:00

You can use a sequence alignment algorithm and then find the consensus on the aligned sequences.

There are tons of libraries and software available but they are often suited for biological sequences only (DNA, RNA, proteins). One python library for general-string alignments is https://pypi.python.org/pypi/alignment/

Once you have the sequences aligned, you can use the following (very basic) way of computing a consensus.

def compute_consensus(sequences):
    consensus = ''
    for i in range(len(sequences[0])):
        char_count = Counter()
        for seq in sequences:
            char_count.update(seq[i])
        consensus += char_count.most_common()[0][0]

    return consensus.replace('-','') #assuming '-' represent deleted letters

Where sequences is a list of aligned sequences. All aligned sequences should be of the same length.

TechQA.

Correcting mistakes in strings based on array of similar strings

There are 1 answers

Related Questions in TEXT

Related Questions in GROUPING

Related Questions in FUZZY

Popular Questions

Popular Tags

Trending Questions