Mutation/mismatch counts among sequences/strings in Python

Question

Mutation/mismatch counts among sequences/strings in Python

70 views Asked by Rubayetul Alam At 13 January 2024 at 17:50

I have millions of protein sequences to compare and output the total mutation count number. The sequences are aligned but contains '-' and 'X' and as guessed, I don't want to count these letters. I tried basic python coding for this but this seems pretty slow. Is there any fastest way to do this?

e.g. -

seq1='GYPX-AM'
seq2='GNPXXCM'

These should return a mutation value of 2.

I tried python zip function and basic for loop while the sequences are in a list. I also tried Levenshtein distance module for this purpose which is faster, but I didn't find any ignore or omit option for those '-' and 'X'.

Original Q&A

There are 1 answers

**Simon Lundberg** · Answer 1 · 2024-01-13T18:50:46+00:00

You can use Numpy to speed things up. First up: a simple loop implementation.

# Here's a simple "normal loop" implementation.
# It's not very fast, but it seems to work.
def compare(a: str, b: str) -> int:
    if a in ("-", "X") or b in ("-", "X"):
        return 0
    
    if a == b:
        return 0
    
    return 1

def compare_proteins(a: str, b: str) -> int:
    score = 0
    for i in range(len(a)):
        score += compare(a[i], b[i])
    return score

Comparing two 10 million character strings takes about 8 seconds on my machine. Could be worse, but we can speed it up significantly by using Numpy.

# This assumes 'proteins_a' and 'proteins_b' are arrays of strings,
# not strings
a = np.array(proteins_a)
b = np.array(proteins_b)


# We can construct a boolean array of the same shape as a and b
# by using the bitwise or operator, checking if either a or b
# is a gap or a match.
equal = (a == '-') | (a == 'X') | (b == '-') | (b == 'X') | (a == b)

# Then we can count the number of "nonzero" values in the inverted
# array, which will be the number of differences.
diffs = np.count_nonzero(~equal)

The Numpy implementation is about ~~60x~~ 160x times faster in my tests, taking about ~~140 ms~~ 50 ms for 10 million characters. You can probably make it even faster by using Numba as well, or Cython, or writing your own function in C or Rust or whatever. But if you want something that "just works" and is easy to implement, straight-up Numpy might be a good solution.

Edit: I am certain you can do things to improve the Numpy implementation too. This version is still much much slower than a naive implementation in Rust, so I'm clearly doing something wrong. Improvement suggestions are welcome. I've never done anything with strings in Numpy.

Edit again: Instead of doing a sum on the boolean array we can do count_nonzero, which is marginally faster. We can also construct the entire mask in one go so we don't have to modify the original arrays. That speeds it up quite a bit. Still slower than the Rust version though. Here is the original implementation:

# We can ignore '-' and 'X' by setting them to the same value
mask = (a == '-') | (a == 'X') | (b == '-') | (b == 'X')
a[mask] = '-'
b[mask] = '-'

# Then we can count the total number of differences via np.sum.
# For a boolean array, True is 1 and False is 0, so this works
# and is very efficient.
diffs = np.sum(a!=b)

Hopefully final edit:

Here is the complete Numpy implementation. All you need to do is import numpy as np and call it with two strings of equal length.

def compare_proteins_np(protein_1: str, protein_2: str) -> int:
    # Check that the proteins have the same length. Numpy will complain if we
    # try to compare arrays of different lengths. That's one weakness of this
    # approach. If you want to compare proteins of different lengths, you'll
    # need to trucate the longer one and then decide what to do with the remainder.
    if len(protein_1) != len(protein_2):
        raise ValueError("Proteins must have the same length.")

    # Convert the proteins to numpy arrays of unsigned 8-bit integers. We assume
    # that the proteins are ASCII-encoded, so each character is represented by
    # a single byte. If the string passed in is a unicode string with complex
    # characters (such as "ä" or "ツ"), this will raise a UnicodeEncodeError. 
    a = np.frombuffer(protein_1.encode("ascii"), dtype=np.uint8)
    b = np.frombuffer(protein_2.encode("ascii"), dtype=np.uint8)

    # We want to special-case '-' and 'X', but we need them to be uint8s as well,
    # otherwise Numpy can't do the comparison.
    uint_dash = ord('-')
    uint_X = ord('X')

    # Create a boolean array that is True if a == b or either is a dash or X, and False otherwise.
    not_equal = (a != uint_dash) & (a != uint_X) & (b != uint_dash) & (b != uint_X) & (a != b)

    # np.count_nonzero returns the number of elements in the array that are not zero.
    # Since 'not_equal' is a boolean array, it counts the number of `True` in the array.
    diffs = np.count_nonzero(not_equal)

    return diffs

I also looked up how to get Numpy to treat the string as a buffer of uint8, which made it a bit faster. This version is about 450x faster than the native Python loop implementation, taking about 10 ms to compare two ten-million character protein strings.

TechQA.

Mutation/mismatch counts among sequences/strings in Python

There are 1 answers

Related Questions in PYTHON

Related Questions in STRING

Related Questions in SEQUENCE

Related Questions in LEVENSHTEIN-DISTANCE

Related Questions in MISMATCH

Popular Questions

Trending Questions