Retrieving and parsing protein sequences from GenBank using Entrez in BioPython

Question

Retrieving and parsing protein sequences from GenBank using Entrez in BioPython

2.3k views Asked by JayB At 30 December 2013 at 22:11

As will soon be obvious, I am new to Python and coding in general. I have a list of Gene IDs stored as a text file and I want to use the Entrez functions to search the GenBank database and retrieve the protein sequences corresponding to the IDs. Ideally I want the end product to be a FASTA file as I am really only interested in the sequence at this point. Using the Biopython tutorial (http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec15), I came up with this:

from Bio import Entrez
from Bio import SeqIO
Entrez.email = "[email protected]"
id_list = set(open('test.txt', 'rU'))
handle = Entrez.efetch(db="protein", id=id_list, rettype="fasta", retmode="text")   
for seq_record in SeqIO.parse(handle, "fasta"):
    print ">" + seq_record.id, seq_record.description
print seq_record.seq
handle.close()

But when I run it, I get the error:

File "C:/Python27/Scripts/entrez_files.py", line 5, in <module>
  handle = Entrez.efetch(db="protein", id=id_list, rettype="fasta", retmode="text")
File "C:\Python27\lib\site-packages\Bio\Entrez\__init__.py", line 145, in efetch
  if ids.count(",") >= 200:
AttributeError: 'set' object has no attribute 'count'

I get a similar error every time I use rettype = 'fasta'. When I use rettype = 'gb' I don't get this error, but I really want to end up with a fasta file. Does anybody have some suggestions? Thank you!

EDIT: sorry I neglected to include what the input file is like. In a perfect world the code would accept an input format like this:

gi|285016822|ref|YP_003374533.1|
gi|285018887|ref|YP_003376598.1|
gi|285016823|ref|YP_003374534.1|
gi|285016824|ref|YP_003374535.1| 
....

But I have also tried using a simplified version with only the Gene IDs (GIs) like this:

285016822 
285018887 
285016823
285016824...

Original Q&A

There are 1 answers

**Paulo Almeida** · Accepted Answer · 2013-12-30T22:30:00+00:00

As you can see in efetch's source code, the id parameter must have a count method. Usually this will be a string with a single ID or a Python list with all the IDs. You are using a set, presumably to eliminate repeated values, so you can convert to a list like this:

id_list = set(open('test.txt', 'rU'))
id_list = list(id_list)

TechQA.

Retrieving and parsing protein sequences from GenBank using Entrez in BioPython

There are 1 answers

Related Questions in PYTHON

Related Questions in PARSING

Related Questions in BIOPYTHON

Related Questions in FASTA

Related Questions in PROTEIN-DATABASE

Popular Questions

Popular Tags

Trending Questions