choosing reads with Hamming distance zero

304 views Asked by At

I have a fastq files, say reads.fastq. I have a list of 7-mer strings. For each read in reads.fastq, I want to check if it contains at least one of the 7-mer strings in the list. The condition is that, if a match is found (hamming distance ==0) the read is written into an array chosen_reads and the next read from the fastq file is matched. If a match is not found the loop continues till a match is found. The output array consists of unique reads, since the matching loop terminates once the first match is found. I wrote the following code but the reads in the output array are not unique since all matches with hamming distance zero are reported. Please suggest edits:

def hamming(s1, s2):
    #Return the Hamming distance between equal-length sequences
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")

    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

for x in Bio.SeqIO.parse("reads.fastq","fastq"):
        reads_array.append(x)

nmer = 7
l_chosen = ['gttattt','attattt','tgctagt']

chosen_reads = []
for x in reads_array:
    s2 = str(x.seq)
    for s in [s2[i:i+nmer] for i in range(len(s2)-nmer-1)]:
        for ds in l_chosen:    
            dist = hamming(ds,s)
            if dist == 0:
                print s2, s,ds,dist       
                chosen_reads.append(x)
2

There are 2 answers

2
Anand S Kumar On BEST ANSWER

You current code does not break out of the loop to read the next read from reads.fastq when it has found a string with hamming distance 0 , you should use flags to decide when to break out , and assign that flag the True value when you need to break out -

def hamming(s1, s2):
    #Return the Hamming distance between equal-length sequences
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

for x in Bio.SeqIO.parse("reads.fastq","fastq"):
        reads_array.append(x)

nmer = 7

l_chosen = ['gttattt','attattt','tgctagt']
chosen_reads = []

for x in reads_array:
        s2 = str(x.seq)
        breakFlag = False
        for s in [s2[i:i+nmer] for i in range(len(s2)-nmer-1)]:
                for ds in l_chosen:
                        dist = hamming(ds,s)
                        if dist == 0:
                                print s2, s,ds,dist
                                chosen_reads.append(x)
                                breakFlag = True
                                break;
                if breakFlag:
                        break;

And are you sure you want to be appending x into chosen_reads , seems wrong to be, to get the unique matches maybe you should be appending the s2 string and the matched ds instead right? If that is what you want , you can append a tuple to chosen_reads like below instead of your current appending logic -

chosen_reads.append((ds, s2))
0
Phil Cooper On

If I understand what you are asking, the hamming distance is trying to find at least one of the 3 "chosen" strings exactly. Iterating as you are doing is slow and trying to break out can be ugly.

I'd maybe suggest the a regex is what would help here. You can create your match string automatically:

import re
chosen_re = re.compile('|'.join(l_chosen))

chosen_reads = [x for x in reads_array if chosen_re.search(str(s.seq))]

You will be very hard pressed to beat the spead of a regex engine