How is this Python script copying into the final output file?


I wrote a script using the Biopython Entrez and SeqIO modules to fetch sequences from NCBI and write them to a file.

The problem is that one search term with about 60,000 results produced only 37,000 sequences in the final file. Also, when I check with ls -lth, the size of the final output file only changes in steps of about 1 megabyte. Why isn't it updated continuously? Or is it being updated continuously and the server is just not showing it?
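On the file-size part of the question, one likely contributor is write buffering: w.write() hands data to an in-memory buffer, and ls only sees what has already been flushed to the operating system. The sketch below illustrates this with a temporary file; the path and variable names are made up for the demonstration. Python's own default buffer is usually only a few kilobytes, so the 1 MB steps may also involve caching on the server's filesystem, which cannot be confirmed from the post.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.fasta")

with open(path, "w") as w:
    w.write(">seq1\nMKV\n")

    # The record is still sitting in Python's buffer, so the size
    # reported by the filesystem has not changed yet.
    size_before_flush = os.path.getsize(path)

    # flush() pushes the buffered data to the operating system,
    # after which tools such as ls can see it.
    w.flush()
    size_after_flush = os.path.getsize(path)

# Closing the file (end of the with block) also flushes.
final_size = os.path.getsize(path)

print(size_before_flush, size_after_flush, final_size)
```

Calling w.flush() after each record in the download loop would make the size grow visibly, at the cost of more frequent small writes.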

Edit: I should also note that this script worked for about 10 other searches, one with ~70,000 sequences and the rest with ~1,000 sequences each. The search term it did not work with is txid6200[Organism:exp] in the 'protein' database.

Here is the code (also, does anyone know if there's a more efficient way to do this? SeqIO.write overwrites the "current_seq.fasta" file each time, so I worked around this by concatenating everything into a separate output file):

# command line usage: python entrez.py database searchterm output.fasta

from Bio import Entrez, SeqIO
import sys
import os

dataBase = sys.argv[1]
searchTerm = sys.argv[2]
outFile = sys.argv[3]

Entrez.email = "[email protected]"
handle = Entrez.esearch(db = dataBase, retmax = 100000, term = searchTerm)
record = Entrez.read(handle)
handle.close()
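# Open the final output file once and append each fetched record to it.
with open(outFile, 'w') as w:
    for seq_id in record["IdList"]:
        # Fetch one record per request.
        fetch_handle = Entrez.efetch(db = dataBase, id = seq_id, rettype = "fasta", retmode="text")
        fetch_record = SeqIO.read(fetch_handle, "fasta")
        fetch_handle.close()
        # SeqIO.write overwrites current_seq.fasta on every iteration,
        # so copy its contents into the output file line by line.
        SeqIO.write(fetch_record, "current_seq.fasta", "fasta")
        with open('current_seq.fasta') as current:
            for line in current:
                w.write(line)
os.remove("current_seq.fasta")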
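On the efficiency question, one common approach is to request IDs in batches and stream each response straight into the already-open output file, which avoids both the one-request-per-ID loop and the temporary file. The following is a minimal sketch under stated assumptions, not a tested replacement for the script: chunked, copy_batches, fake_fetch and the batch size of 200 are hypothetical names and values, and the network call is replaced by a stand-in so the example runs offline.

```python
import io
from typing import Callable, Iterator, List, TextIO


def chunked(ids: List[str], size: int) -> Iterator[List[str]]:
    """Yield successive slices of ids, each at most size items long."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def copy_batches(ids: List[str],
                 fetch: Callable[[List[str]], TextIO],
                 out: TextIO,
                 batch_size: int = 200) -> int:
    """Fetch ids in batches and copy each response into out.

    Returns the number of batches requested.
    """
    batches = 0
    for batch in chunked(ids, batch_size):
        handle = fetch(batch)
        try:
            for line in handle:
                out.write(line)
        finally:
            handle.close()
        batches += 1
    return batches


# Stand-in for the network call so the sketch runs without Biopython.
# With Biopython, fetch could instead be something like (assumption):
#   lambda batch: Entrez.efetch(db=dataBase, id=",".join(batch),
#                               rettype="fasta", retmode="text")
def fake_fetch(batch: List[str]) -> TextIO:
    return io.StringIO("".join(f">{seq_id}\nMKV\n" for seq_id in batch))


out = io.StringIO()
n_batches = copy_batches([str(i) for i in range(5)], fake_fetch, out, batch_size=2)
n_records = out.getvalue().count(">")
print(n_batches, n_records)
```

In the real script, out would be the file opened from outFile. For the missing sequences, comparing int(record["Count"]) with len(record["IdList"]) after the esearch call would show whether the search itself returned every ID or whether records were lost during fetching.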