Get year of first publication Google Scholar


I am scraping data from Google Scholar using bs4 and urllib, and I am trying to get the first year in which an article was published. For example, from this page I am trying to get the year 1996. The year can be read from the bar chart, but only after the bar chart is clicked. I have written the following code, but it prints the year that is visible before the bar chart is clicked.

from bs4 import BeautifulSoup
import urllib.request

url = 'https://scholar.google.com/citations?user=VGoSakQAAAAJ'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'lxml')
year = soup.find('span', {"class": "gsc_g_t"})
print(year)

There are 2 answers

eLRuLL On BEST ANSWER

The chart information is served by a different request (the profile URL with view_op=citations_histogram added). From that response you can get the year you want with the following XPath:

'//span[@class="gsc_g_t"][1]/text()'

or in soup:

soup.find('span', {"class": "gsc_g_t"}).text
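
A minimal sketch of that idea, using only the standard library so it runs without network access. It builds the histogram URL (the view_op=citations_histogram parameter is an assumption taken from the other answer on this page) and picks the smallest year among the gsc_g_t spans in a hand-written sample of the markup. The sample HTML and the helper names histogram_url, YearLabelParser and first_year are hypothetical, and the real response may be structured differently.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


def histogram_url(user_id):
    # Assumed URL shape: the profile URL plus view_op=citations_histogram.
    query = urlencode({"user": user_id, "hl": "en", "view_op": "citations_histogram"})
    return "https://scholar.google.com/citations?" + query


class YearLabelParser(HTMLParser):
    """Collects the numeric text of every <span class="gsc_g_t"> element."""

    def __init__(self):
        super().__init__()
        self.years = []
        self._in_label = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "gsc_g_t" in classes:
            self._in_label = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_label = False

    def handle_data(self, data):
        if self._in_label and data.strip().isdigit():
            self.years.append(int(data.strip()))


def first_year(html):
    parser = YearLabelParser()
    parser.feed(html)
    # min() avoids relying on the labels being in chronological order.
    return min(parser.years) if parser.years else None


# Hand-written sample standing in for the real response body.
sample_html = (
    '<div><span class="gsc_g_t">1996</span>'
    '<span class="gsc_g_t">1997</span>'
    '<span class="gsc_g_t">1998</span></div>'
)

url = histogram_url("VGoSakQAAAAJ")
earliest = first_year(sample_html)
print(url)
print(earliest)
```

To use it on live data, you would fetch `url` (with urllib or requests) and pass the response text to first_year instead of the sample string.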
Dmitriy Zub On

Make sure you're sending an up-to-date user-agent. An outdated user-agent signals to the website that the request might come from a bot. A current user-agent does not guarantee that every website will treat the request as a "real" user visit, though. Check what your user-agent is.

The code snippet below uses the parsel library, which is similar to bs4 but supports full XPath and translates every CSS selector query to XPath using the cssselect package.

Example code to integrate:

from collections import namedtuple

import requests
from parsel import Selector

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "user": "VGoSakQAAAAJ",
    "hl": "en",
    "view_op": "citations_histogram"
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}

html = requests.get("https://scholar.google.com/citations", params=params, headers=headers, timeout=30)
selector = Selector(html.text)

# Collect every year label from the chart; sorting puts the earliest year first.
years = sorted(selector.css(".gsc_g_t::text").getall())

Publications = namedtuple("Publications", "first_publication")
publications = Publications(years[0])

print(selector.css(".gsc_g_t::text").get())  # first label as it appears in the HTML
print(years[0])                               # earliest year after sorting
print(publications.first_publication)        # same value via the namedtuple


# output:
'''
1996
1996
1996
'''
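
If you would rather stay with urllib from the question, the same header can be attached through urllib.request.Request. This is only a sketch: the helper name build_request is hypothetical, the User-Agent string is copied from the example above, and no request is actually sent here.

```python
import urllib.request
from urllib.parse import urlencode

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36"
)


def build_request(user_id):
    # Same query parameters as the requests-based example.
    query = urlencode({"user": user_id, "hl": "en", "view_op": "citations_histogram"})
    url = "https://scholar.google.com/citations?" + query
    # Without an explicit header, urllib identifies itself as "Python-urllib/x.y".
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


request = build_request("VGoSakQAAAAJ")
print(request.full_url)
# To send it: urllib.request.urlopen(request, timeout=30)
```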

Alternatively, you can achieve the same thing by using the Google Scholar Author API from SerpApi. It's a paid API with a free plan.

The difference is that you don't have to work out how to parse the data, maintain the parser over time, scale it, or bypass blocks from a search engine such as Google Scholar.

Example code to integrate:


from serpapi import GoogleScholarSearch


params = {
  "api_key": "Your SerpApi API key",
  "engine": "google_scholar_author",
  "hl": "en",
  "author_id": "VGoSakQAAAAJ"
}

search = GoogleScholarSearch(params)
results = search.get_dict()

# already sorted data
first_publication = [year.get("year") for year in results.get("cited_by", {}).get("graph", [])][0]
print(first_publication)

# 1996
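
Indexing with [0] raises an IndexError when the graph is missing or empty. A more defensive sketch, run here against a hand-written dictionary rather than a real API response (the cited_by / graph / year shape is assumed from the code above, and first_publication_year is a hypothetical helper):

```python
def first_publication_year(results):
    # Walk the assumed results["cited_by"]["graph"] structure without raising on gaps.
    graph = results.get("cited_by", {}).get("graph", [])
    years = [entry["year"] for entry in graph if "year" in entry]
    # min() does not depend on the entries already being sorted.
    return min(years) if years else None


# Hand-written stand-in for search.get_dict(); the citation counts are made up.
sample_results = {
    "cited_by": {
        "graph": [
            {"year": 1996, "citations": 1},
            {"year": 1997, "citations": 2},
        ]
    }
}

print(first_publication_year(sample_results))
print(first_publication_year({}))
```

With a real response you would call first_publication_year(results) and handle the None case however suits your pipeline.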

If you want to scrape all profile results for a given query, or you have a list of author IDs, I have a dedicated blog post about it: "Scrape all Google Scholar Profile, Author Results to CSV".

Disclaimer: I work for SerpApi.