How to extract the complete list of PMC article titles and abstracts using Bio.Entrez?

I'm trying to download the complete title/abstract data from PMC/PubMed. This is an age-old question, but none of the answers on Stack Overflow seem to answer it.

The general approach is to use the Bio.Entrez package, but that requires specifying search terms. There is also a limit on how many requests you can send in a given period.

from Bio import Entrez, Medline

Entrez.email = "[email protected]"

# Search PubMed and collect the matching IDs.
handle = Entrez.esearch(db="pubmed", term="orchid", retmax=463)
record = Entrez.read(handle)
handle.close()
idlist = record["IdList"]

# Fetch the matching records in MEDLINE text format and parse them.
handle = Entrez.efetch(db="pubmed", id=idlist, rettype="medline", retmode="text")
records = Medline.parse(handle)

for record in records:
    print("title:", record.get("TI", "?"))
    print("authors:", record.get("AU", "?"))
    print("source:", record.get("SO", "?"))
    print("")
handle.close()

Is there any way I can download all of the article and abstract data from PMC, using Python or directly from another source?

There is 1 answer

nelson quiñones

One way to approach this is to call esearch with a term that matches articles from the beginning of PubMed onward (for example, a publication-date range), and then retrieve the results iteratively by advancing the retstart parameter on each request.

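from Bio import Entrez

batch_size = 20
start = 0
pmids = []  # PubMed IDs collected across all batches

# Page through the first 1000 hits, batch_size IDs per request.
while start < 1000:
    handle = Entrez.esearch(db="pubmed", term="2015/3/1:2022/4/30[Publication Date]",
                            retmode="xml", retstart=start, retmax=batch_size)
    summaries = Entrez.read(handle)
    handle.close()
    pmids.extend(summaries["IdList"])  # keep the IDs instead of discarding them
    start += batch_size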
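
To go from IDs to titles and abstracts over a long period, the same idea can be applied to small date windows, fetching the records of each window in batches. The following is only a sketch under some assumptions: the helper names (date_windows, pdat_term, fetch_window), the window length, the batch size and the pause are all made up for illustration, and Entrez.email is assumed to be set as in the question. As far as I know, NCBI asks for no more than about 3 requests per second without an API key, and a single PubMed query only exposes roughly its first 10,000 hits, so each window should be short enough to stay under that.

```python
import time
from datetime import date, timedelta


def date_windows(start, end, days):
    """Yield (first_day, last_day) pairs that cover start..end inclusive."""
    step = timedelta(days=days)
    current = start
    while current <= end:
        last = min(current + step - timedelta(days=1), end)
        yield current, last
        current = last + timedelta(days=1)


def pdat_term(first, last):
    """Build a PubMed publication-date range term for one window."""
    return f"{first:%Y/%m/%d}:{last:%Y/%m/%d}[Publication Date]"


def fetch_window(term, batch_size=200, pause=0.4):
    """Yield Medline records (dicts) for every hit of `term`, batch by batch."""
    # Imported here so the pure helpers above work without Biopython installed.
    from Bio import Entrez, Medline

    # Store the result set on NCBI's history server; retmax=0 returns only the count.
    handle = Entrez.esearch(db="pubmed", term=term, usehistory="y", retmax=0)
    search = Entrez.read(handle)
    handle.close()

    total = int(search["Count"])
    for start in range(0, total, batch_size):
        handle = Entrez.efetch(db="pubmed", rettype="medline", retmode="text",
                               retstart=start, retmax=batch_size,
                               webenv=search["WebEnv"],
                               query_key=search["QueryKey"])
        yield from Medline.parse(handle)
        handle.close()
        time.sleep(pause)  # assumed pause to stay under the request limit
```

A caller would loop over date_windows(date(2015, 3, 1), date(2022, 4, 30), 7), pass pdat_term(first, last) to fetch_window, and read record.get("TI") and record.get("AB") from each record. If a window reports more hits than a single query can page through, shrinking the window is the simplest fix.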