I need some help extracting affiliation information from the results of a PubMed search in R. I have already extracted affiliation information from the XML for a single PubMed ID, but I now have a search string with multiple terms, and I need the affiliation information for every article it returns. The goal is a data frame with columns such as PMID, author, country, state, etc.
This is my code so far:
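library(easyPubMed)  # provides get_pubmed_ids() and fetch_pubmed_data()
my_query <- "PubMed search string"  # placeholder for the actual (long) query
my_entrez_id <- get_pubmed_ids(my_query)
my_abstracts_txt <- fetch_pubmed_data(my_entrez_id, format = "abstract")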
The PubMed search string is very long, so I haven't included it here. The main aim is to produce a data frame from this search: a table clearly showing affiliation and other general information for the PubMed articles.
Any help would be greatly appreciated!
Have you tried the pubmedR package? https://cran.rstudio.com/web/packages/pubmedR/index.html
You can use the built-in function
my_pm_df <- pmApi2df(my_request)
but this will not provide affiliations for all authors.
You can use a combination of pluck() and map() from purrr to extract what you need into a tibble.
All author data is contained in that nested list, in the Author$AffiliationInfo list (note it is a list because one author can have multiple affiliations).
=================================================
EDIT based on comments:
First construct your request URLs. Make sure you replace &email with your email address.
I like to wrap my API requests in safely to catch any errors. Then use map to loop through the my_query vector. Note we Sys.sleep for 5 seconds after each request to comply with PubMed's rate limit. You can probably cut this down considerably; check the API documentation for the exact limit.
Next we parse the request with content() in read_xml(). Note that we are parsing the result element of what safely returns.
This can probably be cleaned up some, but it works. Coerce the AuthorInfo to a list and use a combination of map(), pluck() and unnest(). Note that a given author might have more than one affiliation, but I am only plucking the first one.
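As a rough illustration of the parsing step, here is a minimal sketch that turns PubMed-style XML into one row per author with the first affiliation only. It is not the code from this answer: it swaps the map()/pluck()/unnest() pipeline for xml2 XPath queries plus base R, and it runs on a small hand-written XML string (sample_xml, made-up data) rather than a real API response. The names sample_xml, author_rows and affil_df are hypothetical.

```r
library(xml2)

# Hand-written stand-in for an efetch XML response (made-up data, not real articles).
sample_xml <- '<PubmedArticleSet>
  <PubmedArticle><MedlineCitation>
    <PMID>111</PMID>
    <Article><AuthorList>
      <Author>
        <LastName>Smith</LastName><ForeName>Ann</ForeName>
        <AffiliationInfo><Affiliation>Dept A, City, USA</Affiliation></AffiliationInfo>
        <AffiliationInfo><Affiliation>Institute B, City, USA</Affiliation></AffiliationInfo>
      </Author>
      <Author>
        <LastName>Lee</LastName><ForeName>Bo</ForeName>
      </Author>
    </AuthorList></Article>
  </MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation>
    <PMID>222</PMID>
    <Article><AuthorList>
      <Author>
        <LastName>Khan</LastName><ForeName>Cy</ForeName>
        <AffiliationInfo><Affiliation>Clinic C, Town, UK</Affiliation></AffiliationInfo>
      </Author>
    </AuthorList></Article>
  </MedlineCitation></PubmedArticle>
</PubmedArticleSet>'

doc      <- read_xml(sample_xml)
articles <- xml_find_all(doc, ".//PubmedArticle")

# Build one data frame row per author of a single article.
# xml_find_first() on a nodeset returns one (possibly missing) node per author,
# so authors without an affiliation get NA instead of being dropped.
author_rows <- function(article) {
  pmid    <- xml_text(xml_find_first(article, "./MedlineCitation/PMID"))
  authors <- xml_find_all(article, ".//AuthorList/Author")
  data.frame(
    pmid        = rep(pmid, length(authors)),
    last_name   = xml_text(xml_find_first(authors, "./LastName")),
    fore_name   = xml_text(xml_find_first(authors, "./ForeName")),
    # First affiliation only, as in the answer above.
    affiliation = xml_text(xml_find_first(authors, "./AffiliationInfo/Affiliation")),
    stringsAsFactors = FALSE
  )
}

# Stack the per-article data frames into a single table.
affil_df <- do.call(
  rbind,
  lapply(seq_along(articles), function(i) author_rows(articles[[i]]))
)
print(affil_df)
```

With a real response you would presumably pass the text of the successful request to read_xml() instead of sample_xml. To keep every affiliation rather than the first, one option is to collect them with xml_find_all() per author and paste them together; splitting out country or state would need extra string handling, since the affiliation is free text.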