I wrote a script using the BioPython Entrez/SeqIO modules to grab sequences from NCBI and write them to a file.
The problem I am having is that one term with about 60,000 results only produced ~37,000 sequences in the final file. Also, the size of the output file only seems to grow in roughly 1 MB steps when I check it with ls -lth. Why isn't it being updated continuously? Or is it being updated continuously and the server just isn't showing it?
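If Python's write buffering is the cause, I suppose I could force a flush (and an OS-level sync) after each write, along these lines (a sketch, not something I've tested; the list of chunks is just a stand-in for the fetched FASTA text):

import os

with open("output.fasta", "w") as w:
    for chunk in [">seq1\nACGT\n"]:   # stand-in for the fetched FASTA text
        w.write(chunk)
        w.flush()                     # empty Python's internal buffer
        os.fsync(w.fileno())          # ask the OS to commit the bytes to disk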
Edit: I should also note that this script worked for ~10 other searches, one with ~70,000 sequences and the rest with ~1,000 sequences. The search term it did not work with is txid6200[Organism:exp] in the 'protein' database.
Here is the code. (Also, does anyone know a more efficient way to do this? SeqIO.write overwrites the "current_seq.fasta" file on each call, so I worked around that by copying its contents into the real output file every iteration; a sketch of the batch approach I'm considering is below the code.)
# command line usage: python entrez.py database searchterm output.fasta
from Bio import Entrez, SeqIO
import sys
import os

dataBase = sys.argv[1]
searchTerm = sys.argv[2]
outFile = sys.argv[3]

Entrez.email = "[email protected]"

# Search once to collect the matching IDs (up to retmax of them)
handle = Entrez.esearch(db=dataBase, retmax=100000, term=searchTerm)
record = Entrez.read(handle)
handle.close()

with open(outFile, 'w') as w:
    for id in record["IdList"]:
        # Fetch each record individually as FASTA text
        fetch_handle = Entrez.efetch(db=dataBase, id=id, rettype="fasta", retmode="text")
        fetch_record = SeqIO.read(fetch_handle, "fasta")
        fetch_handle.close()
        # SeqIO.write truncates its target file on every call, so write to a
        # temporary file and append its contents to the real output file
        SeqIO.write(fetch_record, "current_seq.fasta", "fasta")
        for line in open('current_seq.fasta'):
            w.write(line)
os.remove("current_seq.fasta")
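For the efficiency question: as far as I can tell, Entrez.efetch accepts a comma-separated list of IDs and returns plain FASTA text, which could be written straight to the output file with no temporary file at all. Here is a sketch of what I have in mind (untested; it reuses dataBase, outFile, and record from the script above, and the batch size of 500 is an arbitrary choice):

# sketch: fetch in batches and write the raw FASTA text directly,
# avoiding both the one-request-per-ID loop and the temporary file
batch_size = 500
ids = record["IdList"]
with open(outFile, 'w') as w:
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        fetch_handle = Entrez.efetch(db=dataBase, id=",".join(batch),
                                     rettype="fasta", retmode="text")
        w.write(fetch_handle.read())   # raw FASTA text, appended as-is
        fetch_handle.close()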