Extracting affiliation information from PubMed search string in R


I need some help extracting affiliation information from a PubMed search string in R. I have already successfully extracted affiliation information from the XML of a single PubMed ID, but now I have a search string made up of multiple terms, and I need to extract the affiliation information for every article it returns. The goal is a data frame with columns such as PMID, author, country, state, etc.

This is my code so far:

library(easyPubMed)

my_query <- "PubMed search string"  # placeholder for the full search string
my_entrez_id <- get_pubmed_ids(my_query)
my_abstracts_txt <- fetch_pubmed_data(my_entrez_id, format = "abstract")

The PubMed search string is very long, which is why I haven't included it here. The main aim is to produce a data frame from this search string: a table that clearly shows the affiliation and other general information for each PubMed article.

Any help would be greatly appreciated!

Answer by ciakovx (best answer):

Have you tried the pubmedR package? https://cran.rstudio.com/web/packages/pubmedR/index.html

library(pubmedR)
library(purrr)
library(tidyr)

my_query <- '(((("diabetes mellitus"[MeSH Major Topic]) AND ("english"[Language])) AND (("2020/01/01"[Date - Create] : "3000"[Date - Create]))) AND ("coronavirus"[MeSH Major Topic])'

my_request <- pmApiRequest(query = my_query,
                            limit = 5)

You can use the built-in function my_pm_df <- pmApi2df(my_request), but this will not provide affiliations for all authors.

You can use a combination of pluck() and map() from purrr to extract what you need into a tibble.

auth <- pluck(my_request, "data") %>% {
  tibble(
    pmid = map_chr(., pluck, "MedlineCitation", "PMID", "text"),
    author_list = map(., pluck, "MedlineCitation", "Article", "AuthorList")
  )
  }

All author data is contained in that nested list, in the Author$AffiliationInfo list (note it is a list because one author can have multiple affiliations).
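To illustrate the pluck() and map() pattern on that kind of nested list, here is a minimal sketch. The author_list below is hand-written and its shape (LastName, ForeName, and an AffiliationInfo list holding an Affiliation) is an assumption for illustration, not real output from pmApiRequest(); check the structure of your own result with str() before relying on it.

```r
library(purrr)
library(tibble)

# Hypothetical author list; the shape is assumed, not taken from a real request.
author_list <- list(
  list(LastName = "Doe", ForeName = "Jane",
       AffiliationInfo = list(Affiliation = "Example University")),
  list(LastName = "Roe", ForeName = "Sam")  # no affiliation recorded
)

# pluck() walks down the nested names; .default fills in missing fields
# instead of failing, so authors without an affiliation become NA.
authors <- tibble(
  lastname  = map_chr(author_list, pluck, "LastName", .default = NA_character_),
  firstname = map_chr(author_list, pluck, "ForeName", .default = NA_character_),
  affil     = map_chr(author_list, pluck, "AffiliationInfo", "Affiliation",
                      .default = NA_character_)
)

print(authors)
```

The .default argument is what keeps the columns the same length when a field is absent for some authors.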

=================================================
EDIT based on comments:

First construct your request URLs, one per PMID. Make sure you set the email parameter to your own email address:

library(httr)
library(xml2)

mypmids <- c("32946812", "32921748", "32921727", "32921708", "32911500", 
             "32894970", "32883566", "32880294", "32873658", "32856805",
             "32856803", "32820143", "32810084", "32809963", "32798472")

# One efetch URL per PMID; the address below is a placeholder, replace it with your own
my_query <- paste0("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=",
                   mypmids,
                   "&retmode=xml&email=your_email@example.com")

I like to wrap my API requests in safely() to catch any errors, then use map() to loop through the my_query vector. Note that we Sys.sleep() for 5 seconds after each request to stay within PubMed's rate limit. You can probably cut this down considerably, perhaps to a second or even less; check the API documentation.

get_safely <- safely(GET)

my_req <- map(my_query, function(z) {
  print(z)
  req <- get_safely(url = z)
  Sys.sleep(5)
  return(req)
})
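As an alternative to one request per PMID, efetch also accepts a comma-separated list of IDs, which means fewer requests and less waiting. A minimal sketch of building such a URL follows; build_efetch_url() is a hypothetical helper name and the email address is a placeholder.

```r
# Hypothetical helper: build a single efetch URL for a batch of PMIDs.
build_efetch_url <- function(pmids, email) {
  paste0("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=",
         paste(pmids, collapse = ","),  # efetch takes comma-separated IDs
         "&retmode=xml&email=", email)
}

batch_url <- build_efetch_url(c("32946812", "32921748", "32921727"),
                              email = "your_email@example.com")  # placeholder
print(batch_url)
```

A batched response contains several PubmedArticle nodes, so any parsing code that assumes one article per response would need to loop over those nodes instead.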

Next we parse each response with content() inside read_xml(). Note that safely() returns a list with result and error elements, so we parse the result element:

my_resp <- map(my_req, function(z) {
  read_xml(content(z$result,
                   as = "text",
                   encoding = "UTF-8"))
})
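If a request fails, safely() leaves result as NULL and stores the condition in error, so parsing would break on that element. A minimal sketch of separating failures from successes is below; fake_get() is a hypothetical stand-in for GET() so the example runs offline, and the URLs are made up.

```r
library(purrr)

# Hypothetical stand-in for httr::GET(); errors on anything that is not https.
fake_get <- function(url) {
  if (!startsWith(url, "https://")) stop("bad URL: ", url)
  paste("response for", url)
}

safe_get <- safely(fake_get)
reqs <- map(c("https://example.org/a", "not-a-url"), safe_get)

# TRUE where the call raised an error (error is non-NULL in that case).
failed <- map_lgl(reqs, function(r) !is.null(r$error))

# Keep only the successful results for parsing.
ok_results <- map(reqs[!failed], "result")
print(failed)
```

The same check on my_req would let you retry or report the failed URLs before parsing the rest.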

This can probably be cleaned up some, but it works. Coerce the AuthorList to a list and use a combination of map(), pluck() and unnest(). Note that a given author might have more than one affiliation, but I am only plucking the first one.

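my_pm_list <- map(my_resp, function(z) {
  # PubmedArticleSet -> first PubmedArticle -> MedlineCitation
  my_xml <- xml_child(xml_child(z, 1), 1)
  # The first PMID under MedlineCitation is the article's own PMID
  pmid <- xml_text(xml_find_first(my_xml, ".//PMID"))
  # Convert the AuthorList node(s) to nested R lists
  authinfo <- as_list(xml_find_all(my_xml, ".//AuthorList"))
  return(list(pmid, authinfo))
})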

# Pull the author list (second element) out of each parsed record
myauthinfo <- map(my_pm_list, function(z) {
  auth <- z[[2]][[1]]
})

mytibble <- myauthinfo %>% {
  tibble(
    lastname = map_depth(., 2, pluck, "LastName", 1, .default = NA_character_),
    firstname = map_depth(., 2, pluck, "ForeName", 1, .default = NA_character_),
    affil = map_depth(., 2, pluck, "AffiliationInfo", "Affiliation", 1, .default = NA_character_)
  )
}

library(dplyr)  # for bind_cols()

my_unnested_tibble <- mytibble %>%
  bind_cols(pmid = map_chr(my_pm_list, pluck, 1)) %>%
  unnest(c(lastname, firstname, affil))
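If you want every affiliation rather than only the first, one option is to query the XML directly with XPath instead of converting to lists. This is a sketch, not a drop-in replacement for the code above: the record below is hand-written to resemble efetch XML (made-up PMID, names and affiliations), and extract_affiliations() is a hypothetical helper name.

```r
library(xml2)

# Hypothetical record shaped like PubMed efetch XML; not real data.
sample_xml <- '<PubmedArticleSet><PubmedArticle><MedlineCitation>
  <PMID>00000001</PMID>
  <Article><AuthorList>
    <Author><LastName>Doe</LastName><ForeName>Jane</ForeName>
      <AffiliationInfo><Affiliation>Example University</Affiliation></AffiliationInfo>
      <AffiliationInfo><Affiliation>Example Hospital</Affiliation></AffiliationInfo>
    </Author>
    <Author><LastName>Roe</LastName><ForeName>Sam</ForeName></Author>
  </AuthorList></Article>
</MedlineCitation></PubmedArticle></PubmedArticleSet>'

# One row per author/affiliation pair; authors without one get NA.
extract_affiliations <- function(doc) {
  articles <- xml_find_all(doc, ".//PubmedArticle")
  rows <- lapply(articles, function(art) {
    pmid <- xml_text(xml_find_first(art, "./MedlineCitation/PMID"))
    authors <- xml_find_all(art, ".//AuthorList/Author")
    lapply(authors, function(a) {
      affils <- xml_text(xml_find_all(a, "./AffiliationInfo/Affiliation"))
      if (length(affils) == 0) affils <- NA_character_
      data.frame(pmid = pmid,
                 lastname = xml_text(xml_find_first(a, "./LastName")),
                 firstname = xml_text(xml_find_first(a, "./ForeName")),
                 affil = affils,
                 stringsAsFactors = FALSE)
    })
  })
  # Flatten the per-article lists of data frames and stack them.
  do.call(rbind, unlist(rows, recursive = FALSE))
}

affil_df <- extract_affiliations(read_xml(sample_xml))
print(affil_df)
```

Because it loops over every PubmedArticle node, the same function should also cope with a response that contains several articles. Country and state are not separate fields in this XML, so they would still have to be parsed out of the affiliation text.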