I've been using the Sickle library in Python to access OAI-PMH records from the Directory of Open Access Journals (DOAJ). The code below counts the English-language articles among the first 4000 records returned by sickle.ListRecords(), and it gives a similar but slightly different count on every run (roughly 2500 to 2600 each time). In another script that retrieves and downloads full article texts, I also noticed that the set of articles changed from run to run. This suggests that Sickle is not retrieving the OAI records in the same order every time, which makes me wonder whether they are returned in a random(ish) order. I'm new to OAI-PMH, so I'm unsure whether this (seemingly) random ordering is a property of how OAI records are stored in general, of how DOAJ stores them, or of how Sickle fetches records before placing them in its OAIIterator object.
from sickle import Sickle
import time
from langdetect import detect


def get_time_estimate():
    sickle = Sickle('https://doaj.org/oai.article')
    records = sickle.ListRecords(metadataPrefix='oai_doaj')
    tot = 0
    num_eng = 0
    start_time = time.time()
    for rec in records:
        tot += 1
        metadata = rec.metadata
        if 'abstract' not in metadata:
            continue
        if 'fullTextUrl' not in metadata:
            continue
        abs = metadata['abstract'][0]
        full = metadata['fullTextUrl'][0]
        language = detect(abs)
        if language == 'en':
            num_eng += 1
        if tot == 4000:
            break
    print("Completed in %.2f seconds" % (time.time() - start_time))
    print("Number of English records: %s" % num_eng)
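Yes. The OAI-PMH specification does not define any ordering for the records in a list response. Its Flow Control section says that a harvester should not assume the members of an incomplete list conform to any selection criteria (for example, date ordering):
http://www.openarchives.org/OAI/openarchivesprotocol.html#FlowControl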
If you need a well-defined subset of records, you could use selective harvesting by datestamp or set: http://www.openarchives.org/OAI/openarchivesprotocol.html#SelectiveHarvesting
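As a rough sketch of what selective harvesting could look like with Sickle: the helper below builds the `from`/`until` arguments for a datestamp window, and a second function harvests that window and sorts the result by OAI identifier so that repeated runs process records in the same order. The function names, the example dates, and the client-side sort are illustrative assumptions, not something taken from the specification or from DOAJ's documentation. Because `from` is a reserved word in Python, the arguments are passed to `ListRecords` by dictionary unpacking.

```python
from datetime import date


def selective_harvest_params(start, end, metadata_prefix="oai_doaj", set_spec=None):
    """Build ListRecords arguments for a datestamp window (hypothetical helper)."""
    params = {
        "metadataPrefix": metadata_prefix,
        # OAI-PMH datestamps at day granularity use the YYYY-MM-DD form.
        "from": start.isoformat(),
        "until": end.isoformat(),
    }
    if set_spec is not None:
        # Optional set restriction; valid set names depend on the repository.
        params["set"] = set_spec
    return params


def harvest_sorted(endpoint, params):
    """Fetch every record in the window, then sort by OAI identifier.

    The sort is a client-side choice (an assumption of this sketch); the
    protocol itself promises no ordering.
    """
    # Imported here so the parameter helper above has no third-party dependency.
    from sickle import Sickle

    records = Sickle(endpoint).ListRecords(ignore_deleted=True, **params)
    return sorted(records, key=lambda rec: rec.header.identifier)
```

For example, `harvest_sorted('https://doaj.org/oai.article', selective_harvest_params(date(2020, 1, 1), date(2020, 1, 7)))` would request one week of records. The result is only as stable as the repository's data: a record that is updated later gets a new datestamp and may drop out of the window.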