Mutation/mismatch counts among sequences/strings in Python

70 views Asked by At

I have millions of protein sequences to compare and output the total mutation count number. The sequences are aligned but contains '-' and 'X' and as guessed, I don't want to count these letters. I tried basic python coding for this but this seems pretty slow. Is there any fastest way to do this?

e.g. -

seq1='GYPX-AM'
seq2='GNPXXCM'

These should return a mutation value of 2.

I tried python zip function and basic for loop while the sequences are in a list. I also tried Levenshtein distance module for this purpose which is faster, but I didn't find any ignore or omit option for those '-' and 'X'.

1

There are 1 answers

5
Simon Lundberg On

You can use Numpy to speed things up. First up: a simple loop implementation.

# Here's a simple "normal loop" implementation.
# It's not very fast, but it seems to work.
def compare(a: str, b: str) -> int:
    if a in ("-", "X") or b in ("-", "X"):
        return 0
    
    if a == b:
        return 0
    
    return 1

def compare_proteins(a: str, b: str) -> int:
    score = 0
    for i in range(len(a)):
        score += compare(a[i], b[i])
    return score

Comparing two 10 million character strings takes about 8 seconds on my machine. Could be worse, but we can speed it up significantly by using Numpy.

# This assumes 'proteins_a' and 'proteins_b' are arrays of strings,
# not strings
a = np.array(proteins_a)
b = np.array(proteins_b)


# We can construct a boolean array of the same shape as a and b
# by using the bitwise or operator, checking if either a or b
# is a gap or a match.
equal = (a == '-') | (a == 'X') | (b == '-') | (b == 'X') | (a == b)

# Then we can count the number of "nonzero" values in the inverted
# array, which will be the number of differences.
diffs = np.count_nonzero(~equal)

The Numpy implementation is about 60x 160x times faster in my tests, taking about 140 ms 50 ms for 10 million characters. You can probably make it even faster by using Numba as well, or Cython, or writing your own function in C or Rust or whatever. But if you want something that "just works" and is easy to implement, straight-up Numpy might be a good solution.

Edit: I am certain you can do things to improve the Numpy implementation too. This version is still much much slower than a naive implementation in Rust, so I'm clearly doing something wrong. Improvement suggestions are welcome. I've never done anything with strings in Numpy.

Edit again: Instead of doing a sum on the boolean array we can do count_nonzero, which is marginally faster. We can also construct the entire mask in one go so we don't have to modify the original arrays. That speeds it up quite a bit. Still slower than the Rust version though. Here is the original implementation:

# We can ignore '-' and 'X' by setting them to the same value
mask = (a == '-') | (a == 'X') | (b == '-') | (b == 'X')
a[mask] = '-'
b[mask] = '-'

# Then we can count the total number of differences via np.sum.
# For a boolean array, True is 1 and False is 0, so this works
# and is very efficient.
diffs = np.sum(a!=b)

Hopefully final edit:

Here is the complete Numpy implementation. All you need to do is import numpy as np and call it with two strings of equal length.

def compare_proteins_np(protein_1: str, protein_2: str) -> int:
    # Check that the proteins have the same length. Numpy will complain if we
    # try to compare arrays of different lengths. That's one weakness of this
    # approach. If you want to compare proteins of different lengths, you'll
    # need to trucate the longer one and then decide what to do with the remainder.
    if len(protein_1) != len(protein_2):
        raise ValueError("Proteins must have the same length.")

    # Convert the proteins to numpy arrays of unsigned 8-bit integers. We assume
    # that the proteins are ASCII-encoded, so each character is represented by
    # a single byte. If the string passed in is a unicode string with complex
    # characters (such as "ä" or "ツ"), this will raise a UnicodeEncodeError. 
    a = np.frombuffer(protein_1.encode("ascii"), dtype=np.uint8)
    b = np.frombuffer(protein_2.encode("ascii"), dtype=np.uint8)

    # We want to special-case '-' and 'X', but we need them to be uint8s as well,
    # otherwise Numpy can't do the comparison.
    uint_dash = ord('-')
    uint_X = ord('X')

    # Create a boolean array that is True if a == b or either is a dash or X, and False otherwise.
    not_equal = (a != uint_dash) & (a != uint_X) & (b != uint_dash) & (b != uint_X) & (a != b)

    # np.count_nonzero returns the number of elements in the array that are not zero.
    # Since 'not_equal' is a boolean array, it counts the number of `True` in the array.
    diffs = np.count_nonzero(not_equal)

    return diffs

I also looked up how to get Numpy to treat the string as a buffer of uint8, which made it a bit faster. This version is about 450x faster than the native Python loop implementation, taking about 10 ms to compare two ten-million character protein strings.