Can someone explain this "Longest Common Subsequence" algorithm?

659 views Asked by At

The Longest Common Subsequence (LCS) problem is: given two sequences A and B, find the longest subsequence that is found both in A and in B. For example, given A = "peterparker" and B = "spiderman", the longest common subsequence is "pera".

Can someone explain this Longest Common Subsequence algorithm?

def longestCommonSubsequence(A: List, B: List) -> int:
    # n = len(A)
    # m = len(B)
    
    indeces_A = collections.defaultdict(list)
    
    # O(n)
    for i, a in enumerate(A):
        indeces_A[a].append(i)
    
    # O(n)
    for indeces_a in indeces_A.values():
        indeces_a.reverse()
    
    # O(m)
    indeces_A_filtered = []
    for b in B:
        indeces_A_filtered.extend(indeces_A[b])
    
    # The length of indeces_A_filtered is at most n*m, but in practice it's more like O(m) or O(n) as far as I can tell.
    iAs = []
    # O(m log m) in practice as far as I can tell.
    for iA in indeces_A_filtered:
        j = bisect.bisect_left(iAs, iA)
        if j == len(iAs):
            iAs.append(iA)
        else:
            iAs[j] = iA
    return len(iAs)

The algorithm as written finds the length of the longest common subsequence, but can be modified to find the longest common subsequence outright.

I found this algorithm as I was looking at the fastest python solutions to an equivalent problem on leetcode link. This algorithm was the fastest python solution (40 ms) for the problem and it also seems to have O(m log m) time complexity, which is much better than the O(m*n) time complexity of most other solutions.

I do not fully understand why it works and tried looking all over for known algorithms to the Longest Common Subsequence problem to find other mentions of it, but couldn't find anything like it. The closest thing I could find was the Hunt–Szymanski algorithm link which is also said to have O(m log m) in practice, but does not seem to be the same algorithm.

What I kind of understand:

  1. indeces_a are reversed so that in the iAs for loop, the smaller index is kept (this is more apparent when doing the walkthrough below.)
  2. As far as I can tell, the iAs for loop finds the longest increasing subsequence of indeces_A_filtered.

Thanks!


Here's a walkthrough of the algorithm for example A = "peterparker" and B = "spiderman"

     01234567890
A = "peterparker"
B = "spiderman"

indeces_A = {'p':[0,5], 'e':[1,3,9], 't':[2], 'r':[4,7,10], 'a':[6], 'k':[8]}

# after reverse
indeces_A = {'p':[5,0], 'e':[9,3,1], 't':[2], 'r':[10,7,4], 'a':[6], 'k':[8]}

#                     -p-  --e--  ---r--  a
indeces_A_filtered = [5,0, 9,3,1, 10,7,4, 6]

# the `iAs` loop

iA = 5
j = 0
iAs = [5]

iA = 0
j = 0
iAs = [0]

iA = 9
j = 1
iAs = [0,9]

iA = 3
j = 1
iAs = [0,3]

iA = 1
j = 1
iAs = [0,1]

iA = 10
j = 2
iAs = [0,1,10]

iA = 7
j = 2
iAs = [0,1,7]

iA = 4
j = 2
iAs = [0,1,4]

iA = 6
j = 3
iAs = [0,1,4,6] # corresponds to indices of A that spell out "pera", the LCS

return len(iAs) # 4, the length of the LCS
2

There are 2 answers

2
Tim Peters On BEST ANSWER

The missing bit here is "patience sorting", whose connection to longest increasing subsequence (LIS) is a bit subtle but well known. The final loop in the code is a bare bones implementation of patience sorting with "the greedy strategy". It does not, in general, compute a LIS directly, but rather the length of a LIS.

An easy-enough correctness proof, which includes a sketch of what's needed to reliably compute a LIS too (not just its length), can be found as Lemma 1 early in

"Longest Increasing Subsequences: From Patience Sorting to the Baik-Deift-Johansson Theorem" David Aldous and Persi Diaconis

0
joseville On

Understanding the LIS algorithm

def lenLIS(self, nums):
    lis = []
    for num in nums:
        i = bisect.bisect_left(lis, num)
        if i == len(lis):
            lis.append(num) # Append
        else:
            lis[i] = num # Overwrite
    return len(lis)

The algorithm above gives us the length of the Longest Increasing Subsequence (LIS) of nums, but it doesn't necessarily give us a LIS of nums. To understand why the algorithm above is correct as well as how to modify it to get a LIS of nums read on.

I explain by way of examples.


Example 1

nums = [7,2,8,1,3,4,10,6,9,5]
lenLIS(nums) == 5

The algorithm tells us the length of a LIS of nums is 5, but how do we get a LIS of nums.

We represent the history of lis as follows (this is explained below):

7
2, 8
1, 3, 4, 10
          6, 9
          5

The way we represent the history of lis is meaningful. To begin with, picture an empty table of rows and colums. We are initially at the top row. In each iteration of the for num in nums: loop, we either Append Overwrite depending the value of num and the values of lis:

  • Append: We represent this in the table by writing the appended value (num) in the next column over, i.e. at the ith column of the current row. The appended value is always greater than all the values in the current row.
  • Overwrite: If lis[i] is already equal to num, we don't do anything to the table. Otherwise, we represent this in the table by moving down to the next row and writing the new value (num) at the ith column of the new row. The new value is always less than all the other values in the colum.

Observations:

  1. The table can be sparse.
  2. Values are inserted in left to right, top to bottom order. Thus, whenever we move up or left in the table, we are moving to earlier elements of nums.
  3. As we go across a row, the values increase. Thus, a move to the left brings us to a lesser value.
  4. Suppose there is a value v at (r, i), but not at (r, i-1). This can only happen as the result of an Overwrite. Consider the state of lis before the Overwrite. There is a value v which must be placed in lis and we compute this place as i = bisect_left(lis, v). i will be computed s.t. lis[i-1] < v < lis[i]. From v at (r, i), we can reach lis[i-1] in the table by moving left once (to the empty (r, i-1)) and then moving up one ore more times until we encounter a value. That value will be lis[i-1].
  5. Putting 2. and 3. together, we have shown that within the table, we can always move left once and then up zero or more times to reach a lesser value. Denote one application of this movement as prev. Furthermore, 1. tells us that the lesser value we encounter by performing prev is a value which occurs earlier in nums.

We use 4. to get LIS(nums) from the table. Start at one of the rightmost values of the table, then perform prev repeatedly to encouter the other values of LIS(nums) in reverse.

Doing this procedure on the example, we start at 9. Applying prev once, we get 6. A second time, we get 4, then 3 then 1. Indeed [1,3,4,6,9] is one of the LIS of nums.


Example 2:

nums = [2,7,10,14,25,5,6,5,10,20,1,22,4,12,7,11,9,25]
lenLIS(nums)

The history of lis:

2, 7, 10, 14, 25
   5,  6  10, 20
1                22
   4          12
       7      11
           9     25

So lenLIS(nums) == 6 is len(LIS(nums). Let's find LIS(nums):

Again, start at one of the rightmost values on the table: 22. Applying prev once, we get 20. A second time, we get 10, then 6, then 5, then 2. So [2,5,6,10,20,22] is an LIS of nums.

We could have started at the other rightmost values: 25. Applying prev once, we get 11. A second time, we get 10, then 6, then 5, then 2. So [2,5,6,10,11,25] is another valid LIS of nums.


This explanation is similar to the one in https://www.stat.berkeley.edu/~aldous/Papers/me86.pdf, but I found it easier to understand for myself and just wanted to share it in case it was useful to someone else.