I'm trying to scrape the search results of Google Scholar (https://scholar.google.com/scholar?hl=en&as_sdt=40000005&sciodt=0%2C22&cites=5652101630448192864&scipsc=&as_ylo=2015&as_yhi=) with BeautifulSoup. I need the titles, journal names, and (potentially) abstracts of the papers in these search results. When I send an HTTP GET request to the URL, the response does seem to contain content. However, when I use BeautifulSoup's find_all method to extract the divs with class "gs_r gs_or gs_scl", I get zero results.
I'm not sure whether the issue is my IP address getting blocked or something else, but does anyone know how to resolve this? Here is my code:
import requests
from bs4 import BeautifulSoup
import csv
import time

# the URL from which our scraping starts
base_url = "https://scholar.google.com/scholar?hl=en&as_sdt=40000005&sciodt=0%2C22&cites=5652101630448192864&scipsc=&as_ylo=2015&as_yhi="

# headers to mask the requests as coming from a real browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}

# journals to keep (placeholder examples, not my real list;
# without some definition, the filter below raises a NameError)
desired_journals = {"The Journal of Finance", "Journal of Financial Economics"}

csv_filename = "GoyenkoHoldenTrzcinka_lit_review.csv"

with open(csv_filename, "w", newline="", encoding="utf-8") as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(["Title", "Journal", "Abstract"])

    page_number = 0
    while True:
        # build the URL for the current page (10 results per page)
        url = base_url + f"&start={page_number * 10}"
        # send an HTTP GET request to the URL
        response = requests.get(url, headers=headers)
        # parse the HTML of the page
        soup = BeautifulSoup(response.text, "html.parser")
        # find the search-result divs
        results = soup.find_all("div", class_="gs_r gs_or gs_scl")  # line with a potential issue
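        # if Google serves a CAPTCHA/consent page instead of results
        # (e.g., after detecting automated traffic), none of these divs
        # exist, so find_all returns an empty list and the loop exits
        # on the very first page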
        # stop when there are no more results
        if not results:
            break
        # loop through the current page's search results
        for result in results:
            title = result.find("h3", class_="gs_rt").text
            # the byline ("Authors - Venue, Year - Publisher") appears to
            # live in div.gs_a; div.gs_citi did not match anything for me
            journal_element = result.find("div", class_="gs_a")
            journal = journal_element.text.strip() if journal_element else ""
            # substring match, since gs_a mixes authors, venue, and year
            if any(j in journal for j in desired_journals):
                # attempt to extract the abstract
                abstract_element = result.find("div", class_="gs_rs")
                abstract = abstract_element.text if abstract_element else ""
                csv_writer.writerow([title, journal, abstract])
        # move on to the next page of results
        page_number += 1
        time.sleep(0.5)

print(f"Search results for desired papers were saved to {csv_filename}")
I looked at the HTML structure of Google Scholar's results page again, and my code does seem consistent with it. Still, when I run it, the while loop terminates immediately and the resulting CSV file is empty.
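To test the IP-blocking theory, I also ran this quick diagnostic against the same URL. The block-page marker strings are just my guesses about what a CAPTCHA or "unusual traffic" page might contain, not confirmed:

import requests

url = "https://scholar.google.com/scholar?hl=en&as_sdt=40000005&sciodt=0%2C22&cites=5652101630448192864&scipsc=&as_ylo=2015&as_yhi="
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}

response = requests.get(url, headers=headers)
print(response.status_code)  # 429 would indicate rate limiting; a block page can still return 200
print(len(response.text))    # an unusually short body suggests there is no real result list

# save the raw HTML so it can be opened in a browser and inspected;
# if the request is blocked, a CAPTCHA typically appears instead of results
with open("scholar_response.html", "w", encoding="utf-8") as f:
    f.write(response.text)

# crude check for block markers (these strings are guesses, not confirmed)
for marker in ("captcha", "unusual traffic", "sorry"):
    if marker in response.text.lower():
        print(f"possible block marker found: {marker!r}")

If the saved page turns out to be a CAPTCHA/consent page rather than a result list, that would explain why find_all never sees any "gs_r gs_or gs_scl" divs.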