Does Sickle access OAI records in random order?


I've been working with the Sickle library in Python to access OAI-PMH records from the Directory of Open Access Journals (DOAJ). I've noticed that the following code produces similar but slightly different counts of English articles among the first 4000 records returned by sickle.ListRecords() every time it's run (~2500-2600 each time, though). In another bit of code that I ran previously to retrieve and download full article texts, I noticed that the articles changed on each run. This suggests that Sickle is not grabbing the OAI records in the same order every time, which makes me wonder whether they are grabbed in a random(ish) order. I'm new to the OAI format, so I am unsure if this (seemingly) random ordering is something that's a property of how OAI records tend to be stored in general, a property of how DOAJ might be storing them, or a property of the way the Sickle library grabs OAI records before placing them in its OAIIterator object.

from itertools import islice
import time

from langdetect import detect
from sickle import Sickle

def get_time_estimate():
    sickle = Sickle('https://doaj.org/oai.article')
    records = sickle.ListRecords(metadataPrefix='oai_doaj')
    num_eng = 0
    start_time = time.time()
    # Look at the first 4000 records only. islice stops after 4000 items
    # even when the 4000th record is skipped by the checks below.
    for rec in islice(records, 4000):
        metadata = rec.metadata
        # Skip records that lack an abstract or a full-text link.
        if 'abstract' not in metadata or 'fullTextUrl' not in metadata:
            continue
        abstract = metadata['abstract'][0]
        # Count the record if its abstract is detected as English.
        if detect(abstract) == 'en':
            num_eng += 1
    print("Completed in %.2f seconds" % (time.time() - start_time))
    print("Number of English records: %s" % num_eng)

There is 1 answer

Answer by Mathias Loesch:

"I'm new to the OAI format, so I am unsure if this (seemingly) random ordering is something that's a property of how OAI records tend to be stored in general, ..."

Yes, to quote the OAI-PMH specification:

"The protocol does not define the semantics of incompleteness. Therefore, a harvester should not assume that the members in an incomplete list conform to some selection criteria (e.g., date ordering)."

http://www.openarchives.org/OAI/openarchivesprotocol.html#FlowControl
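One way to see this in practice is to harvest the first few identifiers twice and compare the two runs. The following is only a rough sketch: the helper names first_identifiers and compare_runs are made up for illustration, and it assumes the endpoint and metadata prefix from the question.

```python
from itertools import islice


def first_identifiers(endpoint, metadata_prefix, n):
    """Return the identifiers of the first n headers the repository sends."""
    # Imported here so compare_runs below can be used without Sickle installed.
    from sickle import Sickle
    sickle = Sickle(endpoint)
    headers = sickle.ListIdentifiers(metadataPrefix=metadata_prefix)
    return [header.identifier for header in islice(headers, n)]


def compare_runs(ids_a, ids_b):
    """Report whether two harvest runs agree in order and in membership."""
    return {
        'same_order': list(ids_a) == list(ids_b),
        'same_members': set(ids_a) == set(ids_b),
    }
```

Calling first_identifiers('https://doaj.org/oai.article', 'oai_doaj', 4000) twice and passing both lists to compare_runs would show whether the repository returned the same records, and in the same order, on both runs.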

You could use selective harvesting via datestamp or sets: http://www.openarchives.org/OAI/openarchivesprotocol.html#SelectiveHarvesting
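A minimal sketch of selective harvesting with Sickle, assuming the DOAJ endpoint from the question accepts from/until datestamps. The helper names harvest_params and harvest_window and any date values are hypothetical. Since from is a reserved word in Python, the arguments are passed as a dict. The protocol still gives no ordering guarantee inside the window, so the sketch sorts by identifier to make repeated runs comparable.

```python
def harvest_params(metadata_prefix, from_date=None, until_date=None, set_spec=None):
    """Build ListRecords arguments, leaving out optional ones that are unset."""
    params = {'metadataPrefix': metadata_prefix}
    if from_date:
        params['from'] = from_date  # 'from' is a Python keyword, hence the dict
    if until_date:
        params['until'] = until_date
    if set_spec:
        params['set'] = set_spec
    return params


def harvest_window(endpoint, **kwargs):
    """Fetch every record in the window, sorted by OAI identifier."""
    # Imported here so harvest_params can be used without Sickle installed.
    from sickle import Sickle
    sickle = Sickle(endpoint)
    records = sickle.ListRecords(ignore_deleted=True, **harvest_params(**kwargs))
    # Sorting requires loading the whole window into memory first.
    return sorted(records, key=lambda rec: rec.header.identifier)
```

For example, harvest_window('https://doaj.org/oai.article', metadata_prefix='oai_doaj', from_date='2020-01-01', until_date='2020-01-31') would fetch one month of records. A narrow window keeps the in-memory sort manageable.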