Downloading Protein Sequences of Multiple Organisms


I am attempting to use Biopython to download all of the proteins of a list of organisms sequenced by a specific institution. I have the organism names and the BioProject associated with each organism; specifically, I am looking to analyze the proteins found in some recent genome sequences. I'd like to download the protein files in bulk, in the friendliest manner possible, with efetch. My most recent attempt at downloading all of the protein FASTA sequences for an associated organism is as follows:

  from Bio import Entrez
  Entrez.email = "[email protected]"  # NCBI requires a contact address

  net_handle = Entrez.efetch(db="protein",
                             id=mydictionary["BioPROJECT"][i],
                             rettype="fasta")

There are roughly 3000-4500 proteins associated with each organism, so using esearch and trying to efetch each protein one at a time is not realistic. Plus, I'd like to have a single FASTA file for each organism that encompasses all of its proteins.

Unfortunately, when I run this line of code, I receive the following error: urllib2.HTTPError: HTTP Error 400: Bad Request.

It appears that, for all of the organisms I am interested in, I can't simply find their genome sequence in the Nucleotide database and download the "Protein encoding Sequences".

How may I obtain these protein sequences in a manner that won't overload the NCBI servers? I was hoping that I could replicate what I can do in NCBI's web interface: select the protein database, search for the BioProject number, and then save all of the found protein sequences into a single FASTA file (under the "Send to" drop-down menu).
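For reference, here is a rough sketch of what I imagine the programmatic equivalent would look like, using esearch with NCBI's history server and batched efetch calls; the [BioProject] search field, the batch size, and the function name are my own guesses:

from Bio import Entrez

Entrez.email = "[email protected]"  # NCBI requires a contact address

def fetch_bioproject_proteins(bioproject, out_path, batch_size=500):
    # search the protein database for records linked to the BioProject,
    # parking the result set on NCBI's history server
    search = Entrez.read(Entrez.esearch(db="protein",
                                        term=bioproject + "[BioProject]",
                                        usehistory="y"))
    count = int(search["Count"])
    with open(out_path, "w") as out:
        # pull the records down in batches instead of one at a time
        for start in range(0, count, batch_size):
            handle = Entrez.efetch(db="protein",
                                   rettype="fasta",
                                   retmode="text",
                                   retstart=start,
                                   retmax=batch_size,
                                   webenv=search["WebEnv"],
                                   query_key=search["QueryKey"])
            out.write(handle.read())
            handle.close()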


There are 2 answers

dgg32 (Best Answer)

Try downloading the sequences from PATRIC's FTP site, which is a gold mine: first, it is much better organized, and second, the data are a lot cleaner than NCBI's. PATRIC is backed by the NIH, by the way.

PATRIC contains some 15,000+ genomes and provides their DNA, proteins, the DNA of protein-coding regions, EC numbers, pathways, and GenBank records in separate files. Super convenient. Have a look for yourself:

ftp://ftp.patricbrc.org/patric2.

I suggest you download all the desired files from all organisms first and then pick out the ones you need once you have them all on your hard drive. The following Python script downloads the EC number annotation files provided by PATRIC in one go (if you are behind a proxy, you need to configure it in the commented section):

from ftplib import FTP
import sys

####### if you are behind a proxy, log in through it instead

#### fill in your proxy IP here
#site = FTP('1.1.1.1')

#site.set_debuglevel(1)
#msg = site.login('[email protected]')

# anonymous login to the PATRIC FTP server
site = FTP("ftp.patricbrc.org")
site.login()
site.cwd('/patric2/current_release/ec/')

# grab the directory listing, one line per file
bacteria_list = []
site.retrlines('LIST', bacteria_list.append)

output = sys.argv[1]
if not output.endswith("/"):
    output += "/"

print("bacteria_list:", len(bacteria_list))


for c in bacteria_list:

    # the file name is the last whitespace-separated field of the listing line
    path_name = c.strip().split()[-1]

    if "PATRIC.ec" in path_name:

        filename = path_name.split("/")[-1]
        # binary transfer, so the local file must be opened in 'wb' mode
        with open(output + filename, 'wb') as out_file:
            site.retrbinary('RETR ' + path_name, out_file.write)
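Run it with the output directory as the only argument, for example: python patric_ec.py ./ec_files/ (name the script whatever you like).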
Mike Z

While I have no experience with Python, let alone Biopython, a quick Google search turned up a couple of things for you to look at:

urllib2 HTTP Error 400: Bad Request

urllib2 gives HTTP Error 400: Bad Request for certain urls, works for others