Needleman algorithm not working when matrix values are the same

110 views Asked by At

I am trying to solve the "Longest Common Subsequence" question using needleman.

Example: Input: text1 = "abcde", text2 = "ace" Output: 3
Explanation: The longest common subsequence is "ace" and its length is 3.

I am very confused on how the algorithm should work for the case where text1 ="ezu" and text2= "ubm".

The Needleman matrix is (assuming the cost of mismatch is -3, cost of gap is -4 and match is 1):

x - u b m
- 0 -4 -8 -12
e -4 -3 -7 -11
z -8 -7 -6 -10
u -12 -7 -10 -9

Now the algorithm states to traceback from the bottom corner of the matrix and:

  • if text[i] == text[j] => move diagonal up
  • else move to the maximum value between up and down.

Thus starting in -9, i have to choose between -10 and -10 (should i move up or down?) and regardless of the decision taken i will never hit the case [i,j] = [3,1]

My Code:

int longestCommonSubsequence(string text1, string text2) {
        
        int m = text1.length();
        int n = text2.length();
        
        vector<vector<int>> vec(m+1, vector(n+1,0));
        int mismatch  = -3;
        int gap = -4;
        int match = 1;
        
        for (int i = 1; i <= m; i++) {
            vec[i][0] = vec[i-1][0] + gap;
        }
        
        for (int i = 1; i <= n; i++) {
            vec[0][i] = vec[0][i-1] + gap;
        }
        
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                int cost = 0;
                if (text1[i-1] != text2[j-1]) {
                    cost = max(vec[i-1][j-1]+ mismatch, min(vec[i-1][j]+gap, vec[i][j-1]+gap));
                }  else {
                    cost = match  + vec[i-1][j-1];
                }
                vec[i][j] = cost;
            }
        }
        
        int matchLen = 0;
        
        int i = m;
        int j = n;
        
        while (i >0 && j >0) {
            
            if (text1[i-1] == text2[j-1]) {
                matchLen++;
                i--;
                j--;
            } else if (i > 0 && j > 0) {
                
                if (vec[i-1][j] > vec[i][j-1]) {
                    i--;
                } else {
                    
                    j--;
                }
            } else {
                break;
            }
            
        }
        
        return matchLen;
        
    }

Thanks!

1

There are 1 answers

2
Amir Sk On BEST ANSWER

Each cell of the Needle matrix is actually showing which action is considered best when aligning two sequence:

  • Insert a letter from both sequence and thus moving diagonally (Match or Mismatch)
  • Insert a letter from one sequence and a gap instead of the other one and thus moving vertical or horizontally

based on penalty you assigned to mismatch(-3) and gap (-4) and the score you gave to match (1),your only way to get a match (length =1):

e z u _ _

_ _ u b m    

will get: gap(-4) + gap(-4) + match(1) + gap(-4) + gap(-4) = -15

and the algorithm will select the highest score alignment (if your code was right which is not right now because you didn't implement mismatch selection and already assigning scores considering it) which is a total mismatch:

e z u

u b m

will get: mismatch(-3) + mismatch(-3) + mismatch(-3) = -9

If you want to have that match to happen in the selected sequence you should consider balancing your score system.