Can somebody point to the right direction how to do this task: I have a lot of array of strings with such buckets:
- Somebody did that two years ago
- Somebody did th two years ago
- Somebody did that two years a
- Somedody d that two years ago
- Somebody that two years ago
And I need to get this: Somebody did that two years ago
Any links to algorithms or libs will be awesome. These strings came from OCR and sometimes OCR make mistakes in letter/words but I have 2-5 different spelling of same string.
Update Based on suggestions from @alec_djinn I`ve found the python lib that can create "median" string based on Levenshtein distance. https://rawgit.com/ztane/python-Levenshtein/master/docs/Levenshtein.html#Levenshtein-median
You can use a sequence alignment algorithm and then find the consensus on the aligned sequences.
There are tons of libraries and software available but they are often suited for biological sequences only (DNA, RNA, proteins). One python library for general-string alignments is https://pypi.python.org/pypi/alignment/
Once you have the sequences aligned, you can use the following (very basic) way of computing a consensus.
Where
sequences
is a list of aligned sequences. All aligned sequences should be of the same length.