How to extract short sequence using window with specific step size?

955 views Asked by At

The code below extract short sequence in every sequence with the window size 4. How to shift the window by step size 2 and extract 4 base pairs?

Example code

from Bio import SeqIO

with open("testA_out.fasta","w") as f:
        for seq_record in SeqIO.parse("testA.fasta", "fasta"):
            i = 0
            while ((i+4) < len(seq_record.seq)) :
              f.write(">" + str(seq_record.id) + "\n")
              f.write(str(seq_record.seq[i:i+4]) + "\n")
              i += 2

Example Input of testA.fasta

>human1
ACCCGATTT

Example Output of testA_out

>human1
ACCC
>human1
CCGA
>human1
GATT

The problem with this output is that there are one T left out so in this case I hope to include it as well. How can I come out with this output? With a reverse extract as well to include base pairs that are probably left out when extract from start to end. Can anyone help me?


Expected output

>human1
ACCC
>human1
CCGA
>human1
GATT
>human1
ATTT
>human1
CGAT    
>human1
CCCG
3

There are 3 answers

0
tobias_k On BEST ANSWER

You can use a for loop with range, using the third step parameter for range. This way, it's a bit cleaner than using a while loop. If the data can not be divided by the chunk size, then the last chunk will be smaller.

data = "ACCCGATTT"
step = 2
chunk = 4
for i in range(0, len(data) - step, step):
    print(data[i:i+chunk])

Output is

ACCC
CCGA
GATT
TTT
0
tripleee On

Your particular example can be solved by moving to a step size of 1 instead. But your question seems to be asking, "how do I repeat with the same window size from the end of the sequence if there are not enough characters in the sequence". So an example where this would make a difference might be

AAAATTT

with a window size of 6 and a step size 2, where you want AAAATT from the "forward" direction, and AAATTT from the "reverse" direction, but no other subsequences.

Obviously, running the code once in the forward direction and once in the backwards direction would do that, but it introduces repetition, which is usually not a good thing. However, you can refactor the problem so that you divide the step into pairs of steps.

For a sequence of length x with a step of y, you can divide y into x%y and y-(x%y) and just move forward with these pairwise steps. (Skip the first member of the pair when x%y == 0.)

I'm posting just the string handling functions, as none of this is at all specific to gene sequences.

seq = "AAAATTT"
window = 6
step = 2

length = len(seq)
modulo = length % step

for i in range(0, length-window, step):
    if modulo > 0:
        print(seq[i:i+window])
    print(seq[i+modulo:i+modulo+window])
0
Mel On

For any window size and any step size:

fasta='ACCCGATTT'
windowSize=4
step=1
i=0
while (i+windowSize)<=len(fasta):
    currentWindow=fasta[i:i+windowSize]
    print(currentWindow)
    i+=step

Output with windowSize=4, step=2:

ACCC 
CCGA                       
GATT 

Output with windowSize=4, step=1:

ACCC                       
CCCG
CCGA
CGAT
GATT
ATTT

The last one is exactly as "Expected output", sorted differently.