Scraping Authors based on tags from Google Scholar


I am working on a project where I hope to scrape data from Google Scholar. I want to scrape all authors tagged in a category (e.g. Anaphylaxis) and store their number of citations, h-index and i10-index in a CSV file. However, I am unsure how to do this given that Google Scholar has no API. I understand I can use a scraper like Beautiful Soup, but I am unsure how to scrape the data without being blocked.

So, my question is: how can I use bs4 to store all authors tagged as Anaphylaxis, along with each author's citations, h-index and i10-index, in a CSV file?

There are 2 answers

Kyle Pastor

All the scraper does is parse some HTML pages. In the search results, the authors are in the div with class "gs_a". If you use Beautiful Soup and search for this class, you will be able to find all of the authors. You can go page by page by updating the start parameter in the URL, for example from

https://scholar.google.ca/scholar?start=20&q=polymer&hl=en&as_sdt=0,5 to https://scholar.google.ca/scholar?start=30&q=polymer&hl=en&as_sdt=0,5

That is, start increases by 10 for each page: start=30, then 40, and so on.
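
A minimal sketch of that pagination, using only the standard library. The function name scholar_page_urls and the page size of 10 are assumptions (the example URLs step start by 10); the host and the other query parameters are copied from the URLs above.

```python
from urllib.parse import urlencode

# Host and query parameters taken from the example URLs in this answer.
BASE_URL = "https://scholar.google.ca/scholar"


def scholar_page_urls(query, pages, page_size=10):
    """Yield one search URL per results page by stepping the `start` offset."""
    for page in range(pages):
        params = {
            "start": page * page_size,  # 0, 10, 20, ... (assumed page size)
            "q": query,
            "hl": "en",
            "as_sdt": "0,5",
        }
        # safe="," keeps the comma in as_sdt unescaped, as in the example URLs.
        yield f"{BASE_URL}?{urlencode(params, safe=',')}"


for url in scholar_page_urls("polymer", pages=3):
    print(url)
```

Fetching these URLs in quick succession is likely to get the client blocked, so any real loop would need to pause between requests.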

Then you can loop over the author names based on the link paths inside the gs_a tags.
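
As a rough sketch of that loop with Beautiful Soup: the HTML fragment below is a made-up sample shaped like one search result, not real Google Scholar output, and the function name extract_author_links is hypothetical. Only the gs_a class name comes from the answer; the live markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like one search result; real markup may differ.
SAMPLE_HTML = """
<div class="gs_ri">
  <div class="gs_a">
    <a href="/citations?user=AAAAAAAAAAAJ&amp;hl=en&amp;oi=sra">A Author</a>, B Writer
    - Example Journal, 2020 - example.org
  </div>
</div>
"""


def extract_author_links(html):
    """Return (name, href) pairs for every linked author inside div.gs_a."""
    soup = BeautifulSoup(html, "html.parser")
    authors = []
    for block in soup.find_all("div", class_="gs_a"):
        # Only authors that are rendered as links are captured here.
        for link in block.find_all("a", href=True):
            authors.append((link.get_text(strip=True), link["href"]))
    return authors


print(extract_author_links(SAMPLE_HTML))
```

Authors that appear as plain text in the gs_a div (like "B Writer" in the sample) are skipped, because there is no link path to follow for them.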

Let me know if this helps!

-Kyle

Milos Djurdjevic

To get all the profiles for a "category" (a label:query search) or for a name, you could use a third-party solution like SerpApi. It's a paid API with a free trial.

Example Python code (client libraries are available for other languages as well):

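from serpapi import GoogleSearch

params = {
  "api_key": "SECRET_API_KEY",            # replace with your own SerpApi key
  "engine": "google_scholar_profiles",    # profile search engine
  "hl": "en",
  "mauthors": "label:anaphylaxis"         # search authors by label
}

search = GoogleSearch(params)
results = search.get_dict()               # parsed JSON response as a dict

# Print the name and total citation count of each returned profile
for profile in results.get("profiles", []):
    print(profile["name"], profile.get("cited_by"))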

Example JSON output:

"profiles": [
  {
    "name": "Jerrold H Levy",
    "link": "https://scholar.google.com/citations?hl=en&user=qnH5V28AAAAJ",
    "serpapi_link": "https://serpapi.com/search.json?author_id=qnH5V28AAAAJ&engine=google_scholar_author&hl=en",
    "author_id": "qnH5V28AAAAJ",
    "affiliations": "Professor of Anesthesiology and Surgery (Cardiothoracic)",
    "email": "Verified email at duke.edu",
    "cited_by": 80353,
    "interests": [
      {
        "title": "bleeding",
        "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Ableeding",
        "link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:bleeding"
      },
      {
        "title": "anaphylaxis",
        "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Aanaphylaxis",
        "link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:anaphylaxis"
      },
      {
        "title": "anticoagulation",
        "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Aanticoagulation",
        "link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:anticoagulation"
      },
      {
        "title": "shock",
        "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Ashock",
        "link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:shock"
      }
    ],
    "thumbnail": "https://scholar.googleusercontent.com/citations?view_op=small_photo&user=qnH5V28AAAAJ&citpid=2"
  },
  ...
]
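
Since the question asks for a CSV file, here is a minimal sketch of writing such a result to CSV with the standard library. The function name profiles_to_csv and the column choice are assumptions; the field names (name, author_id, affiliations, cited_by) and the sample values are taken from the output above. This output does not include the h-index or i10-index, so those would have to come from a separate per-author lookup and are left out here.

```python
import csv
import io


def profiles_to_csv(results, out_file):
    """Write one CSV row per profile; missing fields become empty cells."""
    writer = csv.writer(out_file)
    writer.writerow(["name", "author_id", "affiliations", "cited_by"])
    for profile in results.get("profiles", []):
        writer.writerow([
            profile.get("name", ""),
            profile.get("author_id", ""),
            profile.get("affiliations", ""),
            profile.get("cited_by", ""),
        ])


# Sample dict shaped like the JSON output shown above.
sample = {
    "profiles": [
        {
            "name": "Jerrold H Levy",
            "author_id": "qnH5V28AAAAJ",
            "affiliations": "Professor of Anesthesiology and Surgery (Cardiothoracic)",
            "cited_by": 80353,
        }
    ]
}

buffer = io.StringIO()
profiles_to_csv(sample, buffer)
print(buffer.getvalue())
```

To write to disk instead of a string buffer, pass a file opened with open("authors.csv", "w", newline="") as out_file.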

You can check out the documentation for more details.

Disclaimer: I work at SerpApi.