How do I find the longest poly-purine tract in a genome (consecutive As and Gs with no interspersed C or T, or vice versa)? This needs to be done on the E. coli genome. Is the approach to find all the polypurine tracts and then pick the longest one? Or do I need to splice the introns and exons out of the DNA first? Since the E. coli genome is 4.6 million bp long, I need some help breaking this down.
determine length of polypurine tract
241 views Asked by user3923728 At
There are 2 answers
There is now a method called find_features in (the development version of) scikit-bio, available on the BiologicalSequence class and its subclasses. For example:
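from skbio.sequence import DNASequence  # assumed import path for the 0.2.x-era API

my_seq = DNASequence(some_long_string)
# find_features yields each purine run of at least min_length bases
for run in my_seq.find_features('purine_run', min_length=10):
    print(run)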
or
my_seq = DNASequence(some_long_string)
all_runs = list(my_seq.find_features('purine_run', min_length=10))
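If the development version of scikit-bio isn't available, the same idea can be sketched with only the Python standard library. This is a minimal illustration, not the scikit-bio implementation: the names longest_run, RUN_PATTERNS and read_fasta_sequence are hypothetical helpers made up for this example, and the FASTA reader assumes a file containing a single record (as a complete E. coli chromosome typically would). A 4.6 Mbp string fits comfortably in memory, so one regular-expression pass over the whole sequence should be enough.

```python
import re

# Runs of purines (A/G) or pyrimidines (C/T). Case-insensitive so that
# lowercase (soft-masked) sequence is matched as well.
RUN_PATTERNS = {
    "purine": re.compile(r"[AG]+", re.IGNORECASE),
    "pyrimidine": re.compile(r"[CT]+", re.IGNORECASE),
}


def longest_run(sequence, kind="purine"):
    """Return (start, end, run) for the longest run of the given kind.

    start is 0-based and end is exclusive. Returns None when the sequence
    contains no matching base; ties go to the first run found.
    """
    best = None
    for match in RUN_PATTERNS[kind].finditer(sequence):
        if best is None or len(match.group()) > len(best.group()):
            best = match
    if best is None:
        return None
    return best.start(), best.end(), best.group()


def read_fasta_sequence(path):
    """Concatenate the sequence lines of a single-record FASTA file."""
    with open(path) as handle:
        return "".join(
            line.strip() for line in handle if not line.startswith(">")
        )


example = "CCTAGGAGAAGGCTTTTCCAGGA"
print(longest_run(example, "purine"))      # (3, 12, 'AGGAGAAGG')
print(longest_run(example, "pyrimidine"))  # (12, 19, 'CTTTTCC')
```

For a real genome, something like longest_run(read_fasta_sequence(path_to_fasta), "purine") would apply, where path_to_fasta is whatever file you downloaded. Note that a purine run on one strand is a pyrimidine run on the complementary strand, so checking both kinds on a single strand covers both strands.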
I agree that the methodological aspects of this question (e.g., whether introns/exons should be removed) are better suited for https://biology.stackexchange.com/, but briefly: that depends entirely on the biological question you're trying to answer. If you care about whether those stretches span intron/exon boundaries, then you should not split them first. However, I'm not sure that's relevant to E. coli sequences since (as far as I know) introns and exons are specific to eukaryotes.
To address the technical aspect of this question, here's some code that illustrates how you could do this using scikit-bio. (I also posted this as a scikit-bio cookbook recipe here.)