The implementation in skbio gives an odd result compared to the result that you would get from the implementation in pycogent.
from cogent.align.algorithm import nw_align as nw_align_cogent
from skbio.alignment import global_pairwise_align_nucleotide as nw_align_scikit
seq_1 = 'ATCGATCGATCG'
seq_2 = 'ATCGATATCGATCG'
print "Sequences: "
print " %s" % seq_1
print " %s" % seq_2
print
alignment = nw_align_scikit(seq_1, seq_2)
al_1, al_2 = [alignment.get_seq(_id).__str__() for _id in alignment.ids()]
print " nw alignment using scikit:"
print " %s" % al_1
print " %s" % al_2
print
al_1, al_2 = nw_align_cogent(seq_1, seq_2)
print " nw alignment using cogent:"
print " %s" % al_1
print " %s" % al_2
print
The output looks like this:
nw alignment using scikit:
------ATCGATCGATCG
ATCGATATCGATCG----
nw alignment using cogent:
ATCGAT--CGATCG
ATCGATATCGATCG
This is a good question, and has to do with differences in how alignments are scored in scikit-bio and PyCogent. By default, in scikit-bio, terminal gaps are not penalized as that can lead to some very strange alignments. This issue was briefly discussed here and illustrated here (see the last cell of the notebook).
If you want to achieve a solution more similar to the one in PyCogent, you can pass
penalize_terminal_gaps=True
toglobal_pairwise_align_nucleotide
as follows:output:
You'll notice that the alignment is still not identical to the one you get from PyCogent, but this is a minor implementation difference: the two resulting alignments have the same score (the difference is in whether the
--
aligns to the firstAT
or the secondAT
in theATAT
repeat), and the two implementations make a different choice in how they break that tie.If you go back to the alignment that you posted (the default from scikit-bio), what you'll notice is that the alignment that is returned is very good - in fact, it's the optimally scoring alignment if not penalizing the terminal gaps (by definition, because the optimal scoring alignment is what it returns). However, you're right that it's strange, because the alignment that scikit-bio returns in this specific case is probably not the most biologically relevant alignment. If you know that your sequences all start at the same position and end at the same position, you could pass
penalize_terminal_gaps=True
. However, yours is a toy example and in most cases with real sequences, I think the most biologically relevant alignment would be returned with the default parameters.