I have millions of protein sequences to compare and output the total mutation count number. The sequences are aligned but contains '-' and 'X' and as guessed, I don't want to count these letters. I tried basic python coding for this but this seems pretty slow. Is there any fastest way to do this?
e.g. -
seq1='GYPX-AM'
seq2='GNPXXCM'
These should return a mutation value of 2.
I tried python zip function and basic for loop while the sequences are in a list. I also tried Levenshtein distance module for this purpose which is faster, but I didn't find any ignore or omit option for those '-' and 'X'.
You can use Numpy to speed things up. First up: a simple loop implementation.
Comparing two 10 million character strings takes about 8 seconds on my machine. Could be worse, but we can speed it up significantly by using Numpy.
The Numpy implementation is about
60x160x times faster in my tests, taking about140 ms50 ms for 10 million characters. You can probably make it even faster by using Numba as well, or Cython, or writing your own function in C or Rust or whatever. But if you want something that "just works" and is easy to implement, straight-up Numpy might be a good solution.Edit: I am certain you can do things to improve the Numpy implementation too. This version is still much much slower than a naive implementation in Rust, so I'm clearly doing something wrong. Improvement suggestions are welcome. I've never done anything with strings in Numpy.
Edit again: Instead of doing a sum on the boolean array we can do
count_nonzero, which is marginally faster. We can also construct the entire mask in one go so we don't have to modify the original arrays. That speeds it up quite a bit. Still slower than the Rust version though. Here is the original implementation:Hopefully final edit:
Here is the complete Numpy implementation. All you need to do is
import numpy as npand call it with two strings of equal length.I also looked up how to get Numpy to treat the string as a buffer of uint8, which made it a bit faster. This version is about 450x faster than the native Python loop implementation, taking about 10 ms to compare two ten-million character protein strings.