I am currently working on a project that involves collecting the latest papers along with comprehensive metadata encompassing authors, titles, affiliations, and summaries. How can I efficiently obtain this desired metadata using Python?
For example:
ImageNet Classification with Deep Convolutional Neural Networks
Alex Krizhevsky University of Toronto
Ilya Sutskever University of Toronto
Geoffrey E. Hinton University of Toronto
I've tried the arXiv API, but it does not return the authors' affiliations. The scholarly package can return the full metadata, including affiliations, but it carries the risk of being blocked by Google Scholar. Other options such as SerpApi are not free. Can anyone suggest alternatives for obtaining the full metadata (including affiliations) in Python, at a limited rate of about 200 to 400 API calls per day?
You could try a metadata provider that is a Registration Agency (RA) of the DOI Foundation, for example Crossref.
These services provide metadata like author affiliations. I was able to find your paper's metadata on Crossref.
In general, if you have the DOI for a paper you can look up which RA provides the metadata by looking for their DOI prefix on doi.org and then querying their particular API for the paper in question.
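As a rough sketch of that lookup: doi.org offers a "Which RA?" service that, as far as I know, answers at `https://doi.org/doiRA/<doi>` with a small JSON list. The helper names below (`ra_lookup_url`, `parse_ra_response`, `find_registration_agency`) are made up for illustration, and the response shape is an assumption you should check against a real response.

```python
import json
import urllib.parse
import urllib.request


def ra_lookup_url(doi):
    """Build the lookup URL (assumed endpoint: https://doi.org/doiRA/<doi>)."""
    return "https://doi.org/doiRA/" + urllib.parse.quote(doi, safe="/")


def parse_ra_response(payload):
    """Map each DOI in the response to its RA name (None if no RA is given).

    Assumes the payload is a JSON list of objects with "DOI" and "RA" keys.
    """
    records = json.loads(payload)
    return {rec["DOI"]: rec.get("RA") for rec in records if "DOI" in rec}


def find_registration_agency(doi, timeout=10):
    """Return the RA name for one DOI, or None if it could not be resolved."""
    with urllib.request.urlopen(ra_lookup_url(doi), timeout=timeout) as resp:
        parsed = parse_ra_response(resp.read().decode("utf-8"))
    return next(iter(parsed.values()), None)
```

Calling `find_registration_agency("<your DOI>")` makes one network request, so it fits easily within a budget of a few hundred calls per day.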
Edit: To access these RAs you could query them directly in Python, or use a library. For examples of libraries implementing the Crossref API, see here. Other RAs may require their own libraries.
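If you'd rather query Crossref directly without a library, something like the following should work. It is only a sketch: it assumes the public REST endpoint `https://api.crossref.org/works/<doi>` and the usual `author` / `affiliation` fields in the returned `message`, and the function names `fetch_crossref_work` and `extract_authors` are my own. Affiliations come back empty when the publisher did not deposit them.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works/"


def extract_authors(work):
    """Pull author names and affiliation names out of a Crossref work record."""
    authors = []
    for a in work.get("author", []):
        # Person authors have given/family; organisations use a single "name".
        name = " ".join(p for p in (a.get("given"), a.get("family")) if p) or a.get("name", "")
        affiliations = [aff["name"] for aff in a.get("affiliation", []) if aff.get("name")]
        authors.append({"name": name, "affiliations": affiliations})
    return authors


def fetch_crossref_work(doi, mailto=None, timeout=10):
    """Fetch one work record; `mailto` optionally identifies you to Crossref."""
    url = CROSSREF_WORKS + urllib.parse.quote(doi, safe="/")
    if mailto:
        url += "?" + urllib.parse.urlencode({"mailto": mailto})
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)["message"]
```

Usage would be along the lines of `extract_authors(fetch_crossref_work("<your DOI>", mailto="you@example.org"))`, where the email address is a placeholder.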
Edit 2: In response to your question about arXiv submissions without a DOI, the arXiv API does expose author affiliation, where provided by the authors. The package you're using doesn't implement this due to an issue with the library it uses to parse the feeds from the arXiv API. You could get this data yourself using the urllib library (see here for an example). This depends on the authors adding affiliation data in the first place, which doesn't appear to be required ("For named collaborations, it is acceptable to only use the collaboration name within the metadata, ...").
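A minimal sketch of the urllib approach, using only the standard library. It assumes the query endpoint `http://export.arxiv.org/api/query` and that affiliations appear as `<arxiv:affiliation>` children of each Atom `<author>` element; the function names `parse_arxiv_feed` and `fetch_arxiv_papers` are hypothetical.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}


def _clean(text):
    """Collapse the line breaks and runs of whitespace found in feed text."""
    return " ".join((text or "").split())


def parse_arxiv_feed(xml_data):
    """Return title, summary and authors (with affiliations) for each entry."""
    root = ET.fromstring(xml_data)
    papers = []
    for entry in root.findall("atom:entry", NS):
        authors = []
        for author in entry.findall("atom:author", NS):
            affiliations = [
                _clean(el.text)
                for el in author.findall("arxiv:affiliation", NS)
                if el.text
            ]
            authors.append({
                "name": _clean(author.findtext("atom:name", default="", namespaces=NS)),
                "affiliations": affiliations,  # empty if the authors gave none
            })
        papers.append({
            "title": _clean(entry.findtext("atom:title", default="", namespaces=NS)),
            "summary": _clean(entry.findtext("atom:summary", default="", namespaces=NS)),
            "authors": authors,
        })
    return papers


def fetch_arxiv_papers(id_list, timeout=10):
    """Query the arXiv API for a list of arXiv identifiers and parse the feed."""
    query = urllib.parse.urlencode({"id_list": ",".join(id_list)})
    with urllib.request.urlopen(ARXIV_API + "?" + query, timeout=timeout) as resp:
        return parse_arxiv_feed(resp.read())
```

Expect many entries to come back with an empty `affiliations` list, since the field is optional for submitters.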