Searching for sequences in FASTA format


I am trying to find multiple specific sequences within a DNA sequence in FASTA format and print them out. For simplicity, I made a short string to show my problem.

import re
seq = "QPPLSK"
find_in_seq = re.search(r"[^P](P|K|R|H|W)", seq)
print(find_in_seq.group())

I only get one match, "QP", when there are two matches, "QP" and "SK". How do I show both matches instead of only the first?

Thanks

There is 1 answer

Wiktor Stribiżew On BEST ANSWER

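re.search returns only the first match. Use re.findall to get all non-overlapping matches, and change the regex so that it no longer has a capturing group (with a capturing group, re.findall returns only the group's text rather than the whole match) - [^P](?:P|K|R|H|W) or, more simply, [^P][PKRHW]: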

import re
seq = "QPPLSK"
find_in_seq = re.findall(r"[^P][PKRHW]", str(seq))
print(find_in_seq)
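
To make the difference concrete, here is a small sketch comparing the original pattern with the group-free one on the same string. The variable names (with_group, whole_match, hits) are only illustrative; re.finditer is shown as an option in case the match positions are needed as well.

```python
import re

seq = "QPPLSK"

# With a capturing group, findall returns only the text captured by the group.
with_group = re.findall(r"[^P](P|K|R|H|W)", seq)
print(with_group)

# With a character class (no group), findall returns each whole match.
whole_match = re.findall(r"[^P][PKRHW]", seq)
print(whole_match)

# finditer yields match objects, so the start offset of each match is available.
hits = [(m.start(), m.group()) for m in re.finditer(r"[^P][PKRHW]", seq)]
print(hits)
```

Note that both functions report non-overlapping matches only, scanning left to right.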


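Note that [^P] matches any character other than P, including digits, whitespace and line breaks. If you want to match only an uppercase letter other than P, you'd better use [A-OQ-Z].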
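
This matters for real FASTA input, where the sequence is usually wrapped over several lines. Below is a rough sketch, assuming a made-up two-line record (the header ">example" and the sequence are hypothetical) that is held in a string rather than read from a file. It shows [^P] picking up a line break, and one way to avoid that by joining the sequence lines first.

```python
import re

# Hypothetical FASTA record; the sequence wraps after "QPPLS".
fasta = ">example\nQPPLS\nKPA\n"

# On the raw text, [^P] can match the newline that precedes "K".
raw_hits = re.findall(r"[^P][PKRHW]", fasta)
print(raw_hits)

# Drop the ">" header line and join the remaining lines into one sequence.
seq = "".join(line for line in fasta.splitlines() if not line.startswith(">"))

# [A-OQ-Z] only accepts an uppercase letter other than P.
clean_hits = re.findall(r"[A-OQ-Z][PKRHW]", seq)
print(clean_hits)
```

Joining the lines also lets a match span what was a line break in the file, which is why "SK" is found in the second search.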