Scraping Authors based on tags from Google Scholar


I am working on a project where I hope to scrape data from Google Scholar. I want to scrape all authors tagged in a category (e.g. Anaphylaxis) and store their number of citations, h-index and i10-index in a CSV file. However, I am unsure how to do this given that Google Scholar has no API. I understand I can use a scraper like Beautiful Soup, but I am unsure how to scrape the data without being blocked.

So, my question is: how can I use bs4 to store all authors tagged as Anaphylaxis, along with each author's citations, h-index and i10-index, in a CSV file?


There are 2 answers

Kyle Pastor

All the scraper is doing is parsing some HTML pages. After a search, the authors are in the div with class "gs_a". If you use Beautiful Soup and look for this class, you will be able to find all of the authors. You can go page by page by updating the start parameter in the URL, i.e. start=30, then 40, etc.
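
To make the paging idea concrete, here is a minimal sketch that builds one URL per results page by stepping the start offset. The base URL, the q and hl parameters, and the page size of 10 are my assumptions for illustration; check them against the URL you actually see in your browser.

```python
from urllib.parse import urlencode

# Assumed search endpoint; not taken from the answer above.
BASE_URL = "https://scholar.google.com/scholar"


def page_urls(query, pages, page_size=10):
    """Yield one search URL per results page by stepping the start offset."""
    for page in range(pages):
        params = {"q": query, "hl": "en", "start": page * page_size}
        yield f"{BASE_URL}?{urlencode(params)}"
```

Calling list(page_urls("anaphylaxis", 3)) gives the URLs for the first three pages. This only builds the URLs; it does nothing about rate limits or blocking.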

Then you can loop over the author names based on the link path in the gs_a class tags.
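
A rough sketch of that parsing step with Beautiful Soup might look like the following. The function name author_lines and the shape of the returned dicts are hypothetical, and it assumes the page HTML has already been downloaded into a string; the only thing taken from the description above is the "gs_a" class.

```python
from bs4 import BeautifulSoup


def author_lines(html):
    """Return the text and any linked author names from each div.gs_a."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for div in soup.find_all("div", class_="gs_a"):
        # Authors with a profile appear as links; keep name and link path.
        links = [(a.get_text(strip=True), a.get("href")) for a in div.find_all("a")]
        rows.append({
            "text": div.get_text(" ", strip=True),
            "profile_links": links,
        })
    return rows
```

Each entry in profile_links pairs an author name with the link path, which is what you would follow to reach that author's page. Authors without a link only show up in the text field.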

Let me know if this helps!


Milos Djurdjevic

To get all the profiles for any "category" (label:query), or a "name" you could use a third party solution like SerpApi. It's a paid API with a free trial.

Example Python code (also available in other libraries):

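from serpapi import GoogleSearch

params = {
  "api_key": "SECRET_API_KEY",  # replace with your own SerpApi key
  "engine": "google_scholar_profiles",
  "q": "Coffee",
  "hl": "en",
  "mauthors": "label:anaphylaxis"  # label:<tag> selects profiles in that category
}

search = GoogleSearch(params)
results = search.get_dict()  # response parsed into a Python dict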

Example JSON output:

"profiles": [
  {
    "name": "Jerrold H Levy",
    "link": "",
    "serpapi_link": "",
    "author_id": "qnH5V28AAAAJ",
    "affiliations": "Professor of Anesthesiology and Surgery (Cardiothoracic)",
    "email": "Verified email at",
    "cited_by": 80353,
    "interests": [
      {
        "title": "bleeding",
        "serpapi_link": "",
        "link": ""
      },
      {
        "title": "anaphylaxis",
        "serpapi_link": "",
        "link": ""
      },
      {
        "title": "anticoagulation",
        "serpapi_link": "",
        "link": ""
      },
      {
        "title": "shock",
        "serpapi_link": "",
        "link": ""
      }
    ],
    "thumbnail": ""
  }
]
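
Since the question asks for a CSV file, here is a minimal sketch of writing those profiles out with the standard csv module. The function name write_profiles_csv and the column choice are hypothetical; the field names are the ones visible in the example output above. Note that the output shown includes cited_by but no h-index or i10-index, so those would have to come from a separate per-author lookup, which this sketch does not attempt.

```python
import csv

# Columns taken from the fields visible in the example profile output.
FIELDS = ["name", "author_id", "affiliations", "cited_by"]


def write_profiles_csv(results, path):
    """Write one CSV row per profile and return the number of rows written."""
    profiles = results.get("profiles", [])
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for profile in profiles:
            # Missing fields become empty cells instead of raising KeyError.
            writer.writerow({key: profile.get(key, "") for key in FIELDS})
    return len(profiles)
```

With the results dict from the request above, write_profiles_csv(results, "authors.csv") would produce one row per returned profile.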

You can check out the documentation for more details.

Disclaimer: I work at SerpApi.