Is there an efficient way to get all the metadata of a paper using Python?

I am currently working on a project that involves collecting the latest papers along with comprehensive metadata encompassing authors, titles, affiliations, and summaries. How can I efficiently obtain this desired metadata using Python?

e.g.

ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky University of Toronto

Ilya Sutskever University of Toronto

Geoffrey E. Hinton University of Toronto

I've tried the arXiv API through a Python package, but it does not return the authors' affiliations. The scholarly package can return the full metadata, including affiliations, but it carries the risk of being blocked by Google Scholar. Other options such as SerpApi are not free. Can anyone suggest alternatives for obtaining the full metadata (including affiliations) in Python, given a limited volume of about 200 to 400 API calls per day?

There is 1 answer

Answered by Adam Tuft

You could try a metadata provider that is a Registration Agency (RA) of the DOI Foundation, for example Crossref.

These services provide metadata like author affiliations. I was able to find your paper's metadata on Crossref.

In general, if you have the DOI for a paper you can look up which RA provides the metadata by looking for their DOI prefix on doi.org and then querying their particular API for the paper in question.
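As a rough sketch of that lookup: the snippet below extracts the prefix from a DOI and asks doi.org which RA registered it. It assumes the `https://doi.org/doiRA/<doi>` endpoint and a JSON response shaped like `[{"DOI": "...", "RA": "Crossref"}]`; check the doi.org documentation before relying on either. The function names are my own, not part of any library.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: doi.org's "which RA?" service.
DOI_RA_ENDPOINT = "https://doi.org/doiRA/"


def doi_prefix(doi):
    """Return the registrant prefix of a DOI, e.g. '10.1000' for '10.1000/182'."""
    doi = doi.strip()
    for scheme in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(scheme):
            doi = doi[len(scheme):]
            break
    return doi.split("/", 1)[0]


def parse_ra_response(payload):
    """Pull the RA name out of the decoded JSON (assumed to be a list of dicts)."""
    if isinstance(payload, list) and payload:
        return payload[0].get("RA")
    return None


def lookup_registration_agency(doi):
    """Ask doi.org which Registration Agency holds the metadata for this DOI."""
    url = DOI_RA_ENDPOINT + urllib.parse.quote(doi)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return parse_ra_response(json.load(resp))
```

Calling `lookup_registration_agency("10.1000/182")` makes one network request, so it fits comfortably within a budget of a few hundred calls per day.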

Edit: To access these RAs you could query them directly in Python, or use a library. For examples of libraries implementing the Crossref API, see here. Other RAs may require their own libraries.
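If you want to query Crossref directly rather than through a library, something like the following should work. It is only a minimal sketch: it assumes you already have the DOI, uses the public `https://api.crossref.org/works/<doi>` route, and reads the `title`, `author`, `affiliation` and `abstract` fields of the response. Affiliations and abstracts are only present when the publisher deposited them, so expect empty values for some papers. The function names and the `User-Agent` string are placeholders of mine.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works/"


def summarize_work(message):
    """Reduce a Crossref 'message' dict to title, abstract and author affiliations."""
    titles = message.get("title") or [""]
    authors = []
    for author in message.get("author", []):
        parts = [author.get("given"), author.get("family")]
        # Group authors (e.g. collaborations) carry 'name' instead of given/family.
        name = " ".join(p for p in parts if p) or author.get("name", "")
        affiliations = [
            a["name"] for a in author.get("affiliation", []) if a.get("name")
        ]
        authors.append({"name": name, "affiliations": affiliations})
    return {
        "title": titles[0],
        "abstract": message.get("abstract"),  # may be None; often JATS-flavoured XML
        "authors": authors,
    }


def fetch_crossref(doi, mailto=None):
    """Fetch one work from Crossref. Passing 'mailto' identifies you to the API."""
    url = CROSSREF_WORKS + urllib.parse.quote(doi)
    if mailto:
        url += "?" + urllib.parse.urlencode({"mailto": mailto})
    request = urllib.request.Request(url, headers={"User-Agent": "metadata-sketch/0.1"})
    with urllib.request.urlopen(request, timeout=30) as resp:
        payload = json.load(resp)
    return summarize_work(payload["message"])
```

Keeping the parsing in `summarize_work` separate from the HTTP call means you can cache the raw JSON and re-parse it later without spending more API calls.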

Edit 2: In response to your question about arXiv submissions without a DOI, the arXiv API does expose author affiliation, where provided by the authors. The package you're using doesn't implement this due to an issue with the library it uses to parse the feeds from the arXiv API. You could get this data yourself using the urllib library (see here for an example). This depends on the authors adding affiliation data in the first place, which doesn't appear to be required ("For named collaborations, it is acceptable to only use the collaboration name within the metadata, ...").
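Here is a minimal sketch of reading affiliations straight from the arXiv API with urllib and the standard-library XML parser. It assumes the Atom feed returned by `http://export.arxiv.org/api/query` puts each affiliation in an `<arxiv:affiliation>` child of `<author>`, under the `http://arxiv.org/schemas/atom` namespace. The function names are mine, and the affiliation list will simply be empty when the authors did not supply one.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}


def parse_arxiv_feed(xml_text):
    """Turn an arXiv Atom feed into a list of dicts with title, summary and authors."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.findall("atom:entry", NS):
        title = entry.findtext("atom:title", default="", namespaces=NS)
        summary = entry.findtext("atom:summary", default="", namespaces=NS)
        authors = []
        for author in entry.findall("atom:author", NS):
            name = author.findtext("atom:name", default="", namespaces=NS)
            affiliations = [
                a.text.strip()
                for a in author.findall("arxiv:affiliation", NS)
                if a.text
            ]
            authors.append({"name": name.strip(), "affiliations": affiliations})
        papers.append({
            "title": " ".join(title.split()),      # collapse line breaks in titles
            "summary": " ".join(summary.split()),
            "authors": authors,
        })
    return papers


def fetch_arxiv(arxiv_id):
    """Fetch a single submission by arXiv identifier and parse its metadata."""
    url = ARXIV_API + "?" + urllib.parse.urlencode({"id_list": arxiv_id})
    with urllib.request.urlopen(url, timeout=30) as resp:
        return parse_arxiv_feed(resp.read().decode("utf-8"))
```

`id_list` accepts several comma-separated identifiers, so you can batch lookups into one request instead of spending one call per paper.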