Modifying a reading-frame scan (window size of 3) to include frames that run past the end of the sequence. Python


I'm trying to find a way to scan for reading frames in a sequence. A reading frame starts with the sequence ATG and is then read in units of 3 until a STOP (STOP = TAG|TGA|TAA) is encountered.

For example, seq = 'ATGGGGTGAGGG' would be read as 'ATG', 'GGG', 'TGA' and then stop.
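To make the reading rule concrete, here is a minimal sketch of the codon walk described above. The names `STOP_CODONS` and `read_codons` are made up for illustration and are not part of the script in this question.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}

def read_codons(seq, start="ATG"):
    """Return the codons from the first start codon up to and including the first in-frame STOP."""
    begin = seq.find(start)
    if begin == -1:
        return []  # no start codon in the sequence
    codons = []
    # Step through the sequence three letters at a time, ignoring any trailing partial codon.
    for i in range(begin, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        codons.append(codon)
        if codon in STOP_CODONS:
            break
    return codons

print(read_codons("ATGGGGTGAGGG"))  # ['ATG', 'GGG', 'TGA']
```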

I have figured out how to do this, but I'm trying to work out how to account for reading frames that start inside the sequence but may end outside of it.

In the script I've written below, the sequence is read, and if it finds a reading frame (ATG, then at least 4 codons, then STOP) that is at least 15 letters long (not including the final STOP), that frame is included in the output.

How can I modify this so that it still finds these frames, but also returns potential frames at the very end of the sequence that are at least 15 letters long and whose STOP may lie beyond the end of the given sequence?

For example:

seq = 'AAATTTATGGGGTTTAAAGGGTGAGGATGGGGGGGAAATTTGG'
current_output = ['ATGGGGTTTAAAGGGTGA']
desired_output = ['ATGGGGTTTAAAGGGTGA','ATGGGGGGGAAATTTGG']
do_not_want = ['ATGGGGTTTAAAGGGTGA', 'ATGGGGGGGAAATTTG', 'ATGGGGTTTAAAGGGTGAGGATGGGGGGGAAATTTGG']

In the example above, my current output gives only the frames of the form ATG-xxx-STOP. I want to keep all of those, but if there is an ATG after the last complete ATG-xxx-STOP frame that starts a stretch of at least 15 characters with no in-frame STOP before the end of the sequence, I want that frame included as well. In do_not_want, the first frame was also extended to the very end of the sequence even though it contains a STOP signal.

import re

seq = 'AAATTTATGGGGTTTAAAGGGTGAGGATGGGGGGGAAATTTGG'

# ATG, then at least 4 non-STOP codons (matched lazily), then a STOP codon
frames = re.findall(r'ATG(?:(?!TAA|TAG|TGA)...){4,}?(?:TAA|TAG|TGA)', seq)
print(frames)
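One possible modification, offered as a sketch rather than a definitive answer: let the frame end either at a STOP codon or at the end of the sequence, with at most two leftover letters. The function name `find_frames` and the constant `FRAME_PATTERN` are made up for illustration. The sketch assumes that a trailing partial codon should be kept in the output, as in desired_output, and that matches should not overlap, which is how `re.findall` behaves.

```python
import re

# ATG, then at least 4 non-STOP codons (lazy), then either a STOP codon
# or the end of the sequence with 0-2 leftover letters (assumption: keep the leftovers).
FRAME_PATTERN = re.compile(
    r'ATG(?:(?!TAA|TAG|TGA)...){4,}?(?:TAA|TAG|TGA|.{0,2}$)'
)

def find_frames(seq):
    """Return complete frames plus a trailing frame that runs off the end of seq."""
    return FRAME_PATTERN.findall(seq)

seq = 'AAATTTATGGGGTTTAAAGGGTGAGGATGGGGGGGAAATTTGG'
print(find_frames(seq))
```

The negative lookahead still refuses to step over an in-frame STOP, so the first frame cannot be stretched to the end of the sequence the way the last entry of do_not_want is. Because the codon group is lazy, the end-of-sequence alternative is only reached once no further full codon remains, so a complete frame is always preferred when a STOP is present.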

I apologize if I did not explain this clearly. Please let me know if I need to reword or reorganize the question. This is an extension of "Scan Reading frame [3] Python".
