Total Mismatches between two strings

Question

Asked by Hunter Gibbons on 24 November 2014 at 16:26

I am looking for a way to find the total number of mismatches between two strings in Python. My input is a list that looks like this:

['sequence=AGATGG', 'sequence=AGCTAG', 'sequence=TGCTAG',
 'sequence=AGGTAG', 'sequence=AGCTAG', 'sequence=AGAGAG']

For each string, I want to see how many differences it has from the sequence "sequence=AGATAA". So if the input were element [0] of the list above, the output would read like this:

sequence=AGATGG, 2

I cannot figure out whether to split each of the letters into individual lists or if I should try and compare the whole string somehow. Any help is useful, thanks

There are 3 answers

ch3ka On 24 November 2014 at 16:32

You can easily compute the total number of pairwise mismatches between two strings using sum and zip:

>>> s1='AGATGG'
>>> s2='AGATAA'
>>> sum(c1!=c2 for c1,c2 in zip(s1,s2))
2

If you have to deal with strings that are not the same length, you might prefer zip_longest (from itertools import zip_longest) instead of zip, since zip stops at the end of the shorter string.
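As a rough sketch of how this could be applied to the list from the question: the helper name count_mismatches and the variable names below are made up for illustration. Because every entry starts with the same "sequence=" prefix, the full strings can be compared directly without stripping it.

```python
from itertools import zip_longest

def count_mismatches(a, b):
    """Count positions where a and b differ.

    zip_longest pads the shorter string with None, so any extra
    trailing characters in the longer string count as mismatches.
    """
    return sum(c1 != c2 for c1, c2 in zip_longest(a, b))

records = ['sequence=AGATGG', 'sequence=AGCTAG', 'sequence=TGCTAG',
           'sequence=AGGTAG', 'sequence=AGCTAG', 'sequence=AGAGAG']
reference = 'sequence=AGATAA'

# Pair each record with its mismatch count against the reference.
results = [(rec, count_mismatches(rec, reference)) for rec in records]

for rec, n in results:
    print("{}, {}".format(rec, n))  # first line printed: sequence=AGATGG, 2
```

Note that this only counts position-by-position differences; it does not account for insertions or deletions that shift the rest of the sequence.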

GeneralBecos On 24 November 2014 at 16:33

See Levenshtein distance: http://en.wikipedia.org/wiki/Levenshtein_distance.

You'll find a large number of Python libraries that implement this algorithm efficiently.

I believe it is more appropriate for comparing such gene sequences (since it also handles inserts and deletions well).
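For illustration only, here is a minimal pure-Python sketch of the standard dynamic-programming approach to Levenshtein distance. The function name levenshtein is chosen here and is not taken from any particular library; a dedicated library would likely be faster on long sequences.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    # previous[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b; start with the empty prefix of a.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

print(levenshtein("AGATGG", "AGATAA"))  # 2, same as the positional count
print(levenshtein("AGATAA", "GATAAC"))  # 2: drop the leading A, append C
```

The second call shows the difference from a positional comparison: the two strings disagree at five of six positions, but only two edits separate them because the sequence is merely shifted by one base.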

**xbello** · Accepted Answer · 2014-11-24T18:19:17+00:00

First of all, I think your safest bet is to use Levenshtein distance with some library. But since you are tagging with Biopython, you can use pairwise2:

First you want to get rid of the "sequence=". You can slice each string or split it on "=":

seqs = [x.split("=")[1] for x in ['sequence=AGATGG',
                                  'sequence=AGCTAG',
                                  'sequence=TGCTAG',
                                  'sequence=AGGTAG',
                                  'sequence=AGCTAG',
                                  'sequence=AGAGAG']]

Now define the reference sequence:
```
ref_seq = "AGATAA"
```

Using pairwise2, you can then calculate the alignments:

from Bio import pairwise2

for seq in seqs:
    print(pairwise2.align.globalxx(ref_seq, seq))

I'm using pairwise2.align.globalxx, which performs a global alignment where each match scores 1 and mismatches and gaps carry no penalty. Other functions accept different values for matches and gaps. Check them at http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html.
