Retrieving and parsing protein sequences from GenBank using Entrez in BioPython

2.3k views Asked by At

As will soon be obvious, I am new to Python and coding in general. I have a list of Gene IDs stored as a text file and I want to use the Entrez functions to search the GenBank database and retrieve the protein sequences corresponding to the IDs. Ideally I want the end product to be a FASTA file as I am really only interested in the sequence at this point. Using the Biopython tutorial (http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec15), I came up with this:

from Bio import Entrez
from Bio import SeqIO
Entrez.email = "[email protected]"
id_list = set(open('test.txt', 'rU'))
handle = Entrez.efetch(db="protein", id=id_list, rettype="fasta", retmode="text")   
for seq_record in SeqIO.parse(handle, "fasta"):
    print ">" + seq_record.id, seq_record.description
print seq_record.seq
handle.close()

But when I run it, I get the error:

File "C:/Python27/Scripts/entrez_files.py", line 5, in <module>
  handle = Entrez.efetch(db="protein", id=id_list, rettype="fasta", retmode="text")
File "C:\Python27\lib\site-packages\Bio\Entrez\__init__.py", line 145, in efetch
  if ids.count(",") >= 200:
AttributeError: 'set' object has no attribute 'count'

I get a similar error every time I use rettype = 'fasta'. When I use rettype = 'gb' I don't get this error, but I really want to end up with a fasta file. Does anybody have some suggestions? Thank you!

EDIT: sorry I neglected to include what the input file is like. In a perfect world the code would accept an input format like this:

gi|285016822|ref|YP_003374533.1|
gi|285018887|ref|YP_003376598.1|
gi|285016823|ref|YP_003374534.1|
gi|285016824|ref|YP_003374535.1| 
....

But I have also tried using a simplified version with only the Gene IDs (GIs) like this:

285016822 
285018887 
285016823
285016824...
1

There are 1 answers

0
Paulo Almeida On BEST ANSWER

As you can see in efetch's source code, the id parameter must have a count method. Usually this will be a string with a single ID or a Python list with all the IDs. You are using a set, presumably to eliminate repeated values, so you can convert to a list like this:

id_list = set(open('test.txt', 'rU'))
id_list = list(id_list)