I have a fastq files, say reads.fastq
. I have a list of 7-mer
strings. For each read in reads.fastq
, I want to check if it contains at least one of the 7-mer
strings in the list. The condition is that, if a match is found (hamming distance ==0
) the read is written into an array chosen_reads
and the next read from the fastq file is matched. If a match is not found the loop continues till a match is found. The output array consists of unique reads, since the matching loop terminates once the first match is found. I wrote the following code but the reads in the output array are not unique since all matches with hamming distance zero are reported. Please suggest edits:
def hamming(s1, s2):
#Return the Hamming distance between equal-length sequences
if len(s1) != len(s2):
raise ValueError("Undefined for sequences of unequal length")
return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))
for x in Bio.SeqIO.parse("reads.fastq","fastq"):
reads_array.append(x)
nmer = 7
l_chosen = ['gttattt','attattt','tgctagt']
chosen_reads = []
for x in reads_array:
s2 = str(x.seq)
for s in [s2[i:i+nmer] for i in range(len(s2)-nmer-1)]:
for ds in l_chosen:
dist = hamming(ds,s)
if dist == 0:
print s2, s,ds,dist
chosen_reads.append(x)
You current code does not break out of the loop to read the next
read
fromreads.fastq
when it has found a string with hamming distance 0 , you should use flags to decide when to break out , and assign that flag the True value when you need to break out -And are you sure you want to be appending
x
intochosen_reads
, seems wrong to be, to get the unique matches maybe you should be appending thes2
string and the matchedds
instead right? If that is what you want , you can append a tuple tochosen_reads
like below instead of your current appending logic -