Find the most similar subsequence in another sequence

1.1k views Asked by At

I need to write an algorithm, that finds the most similar substring in S1 to another string S2 (substring in S1 that has minimum Hamming distance with S2, in other words) in N log(N), where N = len(S1) + len(S2), len(S2)<=len(S1).

For instance:
S1 = AGTCAGTC
S2 = GTC
Answer: GTC (distance 0)

S1 = AAGGTTCC
S2 = TCAA
Answer: TTCC (distance 3)

Time complexity may not exceed O(N Log(N)). Space complexity doesn't matter.

LCS (Longest Common Subsequence) doesn't work in my case. For example:


    S1 = GAATCCAGTCTGTCT
    S2 = AATATATAT

    LCS answer: AATCCAGTC
    GAATCCAGTCTGTCT
     |||
     AATATATAT

    right answer: AGTCTGTCT
    GAATCCAGTCTGTCT
          | | | | |
          AATATATAT

1

There are 1 answers

1
AtlasMeh-ed On

I think you are trying to solve the longest common subsequence problem. This problem deals with trying to find least amount of modifications necessary to transform one string into another.

You said you are trying to write an algorithm to do this. Take a look at the LCS problem and try googling it if you want to roll you own algorithm or you could take advantage of the command line utility diff.

Just like the longest common subsequence problem, diff takes two files and finds a sequence of additions and deletions that results in the least modifications to transform file1 into file2.Diff is very efficient and I imagine it will be fast enough for your purposes. I am pretty sure most diffs have space and time complexity of O(Nlog(N)) or less but you might want to verify this yourself. More on diff here http://en.wikipedia.org/wiki/Diff_utility

I wrote a small python program that uses diff to find the longest contiguous common subsequence. This works on unix but I am sure whatever platform you are using there is a diff utility.

The files are expected to have one character per line. You might have to write a program to perform that transformation on your files.

import sys
import subprocess
import os
import re

def getLinesFromShellCommand(command):
        p = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, cwd=os.getcwd())
        lines = []
        for curLine in p.stdout.readlines():
                lines.append(curLine)
        retVal = p.wait()
        return (retVal, lines)

#We need to find the longest sequence of lines that start with a space( a space meant the line was in common).
#We could use some kind of while loop to detect the start and end of a group of lines that start with spaces. 
#However there is a much simpler method. Create a string by concatenating the first char from each line and use regex
#to find all the subsequences that start with spaces. After that just take the longest subsequence.
def findLongestCommonSubsequence(diffOutput):
    outputFirstCharStr = reduce(lambda x, y: x+y[:1], diffOutput, "")
    commonSubsequences = [(m.start(0), m.end(0)) for m in re.finditer(" +", outputFirstCharStr)]
    longestCommonSubsequence = max(commonSubsequences, key=lambda (start,end) : end - start)
    return longestCommonSubsequence

def main():
    if len(sys.argv) != 3:
        sys.stderr.write("usage: python LongestCommonSubsequence.py <file1> <file2>\n")
        sys.exit(1)
    commandOutput = getLinesFromShellCommand("diff -u {0} {1}".format(sys.argv[1], sys.argv[2]))
    if commandOutput[0] != 1: # apparently diff returns 1 if its successful
        sys.stderr.write("diff failed with input files.\n")
        sys.exit(1)
    diffOutput = commandOutput[1]
    diffOutput = diffOutput[3:] # the first three lines are not needed
    longestCommonSubsequence = findLongestCommonSubsequence(diffOutput)
    print "Indices:",longestCommonSubsequence
    for i in range(longestCommonSubsequence[0], longestCommonSubsequence[1]):
        print diffOutput[i],

if __name__ == "__main__":
    main()

usage

python LongestCommonSubsequence.py f1.txt f2.txt

Output if f1.txt and f2.txt were the second example you gave:

Indices: (5, 7)
 T
 C

Edit: I saw your comments about why the above won't work. You might be interested in this article though: https://cs.stackexchange.com/questions/2519/efficiently-calculating-minimum-edit-distance-of-a-smaller-string-at-each-positi