Needleman-Wunsch implementation gives different alignments in cogent and in skbio

525 views Asked by At

The implementation in skbio gives an odd result compared to the result that you would get from the implementation in pycogent.

from cogent.align.algorithm import nw_align as nw_align_cogent
from skbio.alignment import global_pairwise_align_nucleotide as nw_align_scikit

seq_1 = 'ATCGATCGATCG'
seq_2 = 'ATCGATATCGATCG'

print "Sequences: "
print "     %s" % seq_1
print "     %s" % seq_2
print

alignment = nw_align_scikit(seq_1, seq_2)
al_1, al_2 = [alignment.get_seq(_id).__str__() for _id in alignment.ids()]

print "    nw alignment using scikit:"
print "        %s" % al_1
print "        %s" % al_2
print

al_1, al_2 = nw_align_cogent(seq_1, seq_2)

print "    nw alignment using cogent:"
print "        %s" % al_1
print "        %s" % al_2
print

The output looks like this:

nw alignment using scikit:
    ------ATCGATCGATCG
    ATCGATATCGATCG----

nw alignment using cogent:
    ATCGAT--CGATCG
    ATCGATATCGATCG
1

There are 1 answers

0
gregcaporaso On BEST ANSWER

This is a good question, and has to do with differences in how alignments are scored in scikit-bio and PyCogent. By default, in scikit-bio, terminal gaps are not penalized as that can lead to some very strange alignments. This issue was briefly discussed here and illustrated here (see the last cell of the notebook).

If you want to achieve a solution more similar to the one in PyCogent, you can pass penalize_terminal_gaps=True to global_pairwise_align_nucleotide as follows:

alignment = nw_align_scikit(seq_1, seq_2, penalize_terminal_gaps=True)
al_1, al_2 = [alignment.get_seq(_id).__str__() for _id in alignment.ids()]

print "    nw alignment using scikit:"
print "        %s" % al_1
print "        %s" % al_2

output:

nw alignment using scikit:
        ATCG--ATCGATCG
        ATCGATATCGATCG

You'll notice that the alignment is still not identical to the one you get from PyCogent, but this is a minor implementation difference: the two resulting alignments have the same score (the difference is in whether the -- aligns to the first AT or the second AT in the ATAT repeat), and the two implementations make a different choice in how they break that tie.

If you go back to the alignment that you posted (the default from scikit-bio), what you'll notice is that the alignment that is returned is very good - in fact, it's the optimally scoring alignment if not penalizing the terminal gaps (by definition, because the optimal scoring alignment is what it returns). However, you're right that it's strange, because the alignment that scikit-bio returns in this specific case is probably not the most biologically relevant alignment. If you know that your sequences all start at the same position and end at the same position, you could pass penalize_terminal_gaps=True. However, yours is a toy example and in most cases with real sequences, I think the most biologically relevant alignment would be returned with the default parameters.