Return values from a Python Entrez dictionary of dictionaries


I want to scrape the Interactions table from the Entrez Gene page.

The Interactions table is populated from a web server and when I tried to use the XML package in R, I could get the Entrez gene page, but the Interactions table body was empty (it had not been populated by the web server).

Dealing with the web server issue in R may be solvable (and I'd love to see how), but it seemed Biopython was an easier path.

I put together the following, which gives me what I want for an example gene:

# Pull the Entrez Gene record for MAP1B using Biopython

from Bio import Entrez

Entrez.email = "you@example.com"  # placeholder; NCBI asks for your own address
handle = Entrez.efetch(db="gene", id="4131", retmode="xml")
record = Entrez.read(handle)
handle.close()

PPI_Entrez = []
PPI_Sym = []

# Find the dictionary that contains the Interactions table
for comment in record[0]["Entrezgene_comments"]:
    if comment.get("Gene-commentary_heading") == "Interactions":
        # Each entry under the Interactions heading is one row of the table
        for interaction in comment["Gene-commentary_comment"]:
            # Index 1 of the nested comments holds the interacting partner in this record
            source = interaction["Gene-commentary_comment"][1]["Gene-commentary_source"][0]
            EntrezID = source["Other-source_src"]["Dbtag"]["Dbtag_tag"]["Object-id"]["Object-id_id"]
            PPI_Entrez.append(EntrezID)
            Sym = source["Other-source_anchor"]
            PPI_Sym.append(Sym)

# Return the desired values: I want the Entrez ID and gene symbol for each interacting protein
PPI_Entrez  # List of Entrez IDs
PPI_Sym  # List of gene symbols

This code works and gives me what I want. But I think it's ugly, and I am concerned that it will break if the format of the Entrez Gene page changes even slightly. In particular, there must be a better way to extract the desired information than specifying the full path, as I do with:

record[0]["Entrezgene_comments"][x]['Gene-commentary_comment'][y]['Gene-commentary_comment'][1]['Gene-commentary_source'][0]['Other-source_anchor']

But I cannot figure out how to search through a dictionary of dictionaries without specifying each level I want to descend. When I try functions like find(), they operate on the next level down, but not all the way to the bottom.

Is there a wildcard symbol, a Python equivalent of "//", or a function I can use to get to ['Object-id_id'] without naming the full path? Other suggestions for cleaner code are also appreciated.
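One way to approximate "//" on the parsed record is a small recursive generator that walks every level and yields each value stored under a given key. This is only a sketch: `find_key` is a hypothetical helper (not part of Biopython), and it assumes the object returned by `Entrez.read` behaves like nested dicts and lists. The `sample` structure below is hand-written to mimic the nesting, not real Entrez output.

```python
def find_key(node, key):
    """Yield every value stored under `key` anywhere in nested dicts and lists."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # Keep descending: the key may also occur deeper inside this value
            yield from find_key(v, key)
    elif isinstance(node, (list, tuple)):
        for item in node:
            yield from find_key(item, key)


# Hand-written stand-in for a fragment of the parsed record (assumption)
sample = {
    "Gene-commentary_comment": [
        {
            "Other-source_anchor": "EED",
            "Other-source_src": {"Dbtag": {"Dbtag_tag": {"Object-id": {"Object-id_id": "8726"}}}},
        },
        {
            "Other-source_anchor": "FOS",
            "Other-source_src": {"Dbtag": {"Dbtag_tag": {"Object-id": {"Object-id_id": "2353"}}}},
        },
    ]
}

ids = list(find_key(sample, "Object-id_id"))
symbols = list(find_key(sample, "Other-source_anchor"))
```

Note that a search like this returns every match below the starting node, so on the real record it would probably be applied to the Interactions entry (or to one interaction at a time) rather than to `record[0]` as a whole.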

Answer by Chris S. (best answer)

I'm not sure about XPath in Python, but if the code works, I would not worry about removing the full paths or about whether the Entrez Gene XML will change. Since you first tried R, you could get the XML using a system call to Entrez Direct, as below, or with a package like rentrez.

doc <- xmlParse( system("efetch -db=gene -id=4131 -format xml", intern=TRUE) )

Next, get the nodes corresponding to rows in the table at http://www.ncbi.nlm.nih.gov/gene/4131#interactions

x <- getNodeSet(doc, "//Gene-commentary_heading[.='Interactions']/../Gene-commentary_comment/Gene-commentary" )

length(x)
[1] 64
x[1]
x[50]

Try the easy stuff first

xmlToDataFrame(x[1:4])

  Gene-commentary_type  Gene-commentary_text Gene-commentary_refs Gene-commentary_source                         Gene-commentary_comment
1                   18   Affinity Capture-MS             24457600   BioGRID110304BioGRID   255BioGRID110304255GeneID8726EEDBioGRID114265
2                   18 Reconstituted Complex             20195357   BioGRID110304BioGRID   255BioGRID110304255GeneID2353FOSBioGRID108636
3                   18 Reconstituted Complex             20195357   BioGRID110304BioGRID 255BioGRID110304255GeneID1936EEF1DBioGRID108256
4                   18   Affinity Capture-MS     2345592220562859   BioGRID110304BioGRID  255BioGRID110304255GeneID6789STK4BioGRID112665
  Gene-commentary_create-date Gene-commentary_update-date
1                  2014461120                201410513330
2                201312810490                201410513330
3                201312810490                201410513330
4                 20137710360                201410513330

Some tags like text, refs, source, and dates should be easy to parse

sapply(x, function(x) paste( xpathSApply(x, ".//PubMedId", xmlValue), collapse=", "))

I'm not sure about the comments, or how the Products, Interactants and Other Genes listed in the table are stored in the XML, but I get one or three symbols and three IDs for each node here.

sapply(x, function(x) paste( xpathSApply(x, ".//Gene-commentary_comment//Other-source_anchor", xmlValue), collapse=" + "))
sapply(x, function(x) paste( xpathSApply(x, ".//Gene-commentary_comment//Object-id_id", xmlValue), collapse=" + "))
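For readers who would rather stay in Python, a similar XPath-style descent can be sketched with the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset including `//`. This is an assumption-laden sketch, not a tested port of the R code: `interaction_partners` is a hypothetical helper, and `SAMPLE_XML` is a hand-written fragment that only mimics the nesting of the Entrez Gene XML (real records contain more elements and more partners per row).

```python
import xml.etree.ElementTree as ET

# Hand-written fragment that imitates the Entrez Gene nesting (assumption)
SAMPLE_XML = """
<Entrezgene><Entrezgene_comments><Gene-commentary>
  <Gene-commentary_heading>Interactions</Gene-commentary_heading>
  <Gene-commentary_comment>
    <Gene-commentary>
      <Gene-commentary_text>Affinity Capture-MS</Gene-commentary_text>
      <Gene-commentary_comment><Gene-commentary><Gene-commentary_source><Other-source>
        <Other-source_src><Dbtag><Dbtag_tag><Object-id>
          <Object-id_id>8726</Object-id_id>
        </Object-id></Dbtag_tag></Dbtag></Other-source_src>
        <Other-source_anchor>EED</Other-source_anchor>
      </Other-source></Gene-commentary_source></Gene-commentary></Gene-commentary_comment>
    </Gene-commentary>
    <Gene-commentary>
      <Gene-commentary_text>Reconstituted Complex</Gene-commentary_text>
      <Gene-commentary_comment><Gene-commentary><Gene-commentary_source><Other-source>
        <Other-source_src><Dbtag><Dbtag_tag><Object-id>
          <Object-id_id>2353</Object-id_id>
        </Object-id></Dbtag_tag></Dbtag></Other-source_src>
        <Other-source_anchor>FOS</Other-source_anchor>
      </Other-source></Gene-commentary_source></Gene-commentary></Gene-commentary_comment>
    </Gene-commentary>
  </Gene-commentary_comment>
</Gene-commentary></Entrezgene_comments></Entrezgene>
"""


def interaction_partners(root):
    """Return one (ids, symbols) pair per row under the 'Interactions' heading."""
    partners = []
    for section in root.iter("Gene-commentary"):
        # Only the commentary whose direct heading is 'Interactions' holds the table
        if section.findtext("Gene-commentary_heading") != "Interactions":
            continue
        for row in section.findall("Gene-commentary_comment/Gene-commentary"):
            ids = [e.text for e in row.findall(".//Gene-commentary_comment//Object-id_id")]
            symbols = [e.text for e in row.findall(".//Gene-commentary_comment//Other-source_anchor")]
            partners.append((" + ".join(ids), " + ".join(symbols)))
    return partners


root = ET.fromstring(SAMPLE_XML)
partners = interaction_partners(root)
```

On a real record the XML text would come from `Entrez.efetch(...).read()` instead of `SAMPLE_XML`, and each row may yield several IDs and symbols joined by " + ", as in the R output.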

Finally, since I think Entrez Gene just copies IntAct and BioGRID, you could try those sites too. BioGRID has a really simple REST service, but you have to register for a key.

url <- "http://webservice.thebiogrid.org/interactions?geneList=MAP1B&taxId=9606&includeHeader=TRUE&accesskey=[ your ACCESSKEY ]"

biogrid <- read.delim(url)
dim(biogrid)
[1] 58 24

head(biogrid[, c(8:9,12)])
  Official.Symbol.Interactor.A Official.Symbol.Interactor.B      Experimental.System
1                       ANP32A                        MAP1B               Two-hybrid
2                        MAP1B                       ANP32A               Two-hybrid
3                       RASSF1                        MAP1B Affinity Capture-Western
4                       RASSF1                        MAP1B               Two-hybrid
5                       ANP32A                        MAP1B Affinity Capture-Western
6                          GAN                        MAP1B Affinity Capture-Western
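The same request could be sketched in Python with only the standard library. The query parameters are the ones in the URL above; the helper names `build_biogrid_url` and `parse_interactions` are hypothetical, and the column headers are an assumption inferred from the R output (R replaces spaces in headers with dots). The sample text is hand-written so the parsing step can be tried without an access key.

```python
import csv
import io
from urllib.parse import urlencode

BASE_URL = "http://webservice.thebiogrid.org/interactions"


def build_biogrid_url(gene, tax_id, access_key):
    """Build the request URL using the parameters shown in the R example."""
    query = urlencode({
        "geneList": gene,
        "taxId": tax_id,
        "includeHeader": "TRUE",
        "accesskey": access_key,
    })
    return f"{BASE_URL}?{query}"


def parse_interactions(text):
    """Parse tab-delimited text into (interactor A, interactor B, system) tuples."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        (
            row["Official Symbol Interactor A"],  # assumed header name
            row["Official Symbol Interactor B"],  # assumed header name
            row["Experimental System"],           # assumed header name
        )
        for row in reader
    ]


# Hand-written sample in the assumed layout, not a real service response
SAMPLE = (
    "Official Symbol Interactor A\tOfficial Symbol Interactor B\tExperimental System\n"
    "ANP32A\tMAP1B\tTwo-hybrid\n"
    "RASSF1\tMAP1B\tAffinity Capture-Western\n"
)

url = build_biogrid_url("MAP1B", 9606, "YOUR_ACCESSKEY")
rows = parse_interactions(SAMPLE)
```

With a registered key, the response body could be fetched with `urllib.request.urlopen(url)` and decoded before being passed to `parse_interactions`.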