Scraping Google Patents using BeautifulSoup

Question

389 views Asked by antopol At 17 August 2023 at 04:25

I would like to scrape titles, abstracts, claims, and inventor names from Google Patents and add them to an existing CSV file. Could you please help me with this? A sample of my code is as follows:

import requests
from bs4 import BeautifulSoup

# Create an empty list to store the extracted claims
claim_list = []

# Define a function to extract information (here, the claims) from a URL and add it to the lists
def add_info_to_lists(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract claims; fall back to "N/A" when no claim elements are found
    claims = [claim.get_text(strip=True) for claim in soup.select("li.claim, li.claim-dependent")]
    if claims:
        claim_text = " ".join(claims)
        claim_list.append(claim_text)
    else:
        claim_list.append("N/A")

A similar snippet seems to work for strings (e.g. application numbers), but it does not work for the other JSON elements.

Thank you in advance!

Original Q&A

There is 1 answer

**Saeed** · Answer 1 · 2023-08-17T23:17:01+00:00

I wasn't able to figure out how to parse the response object from the requests library, so this approach uses Selenium to launch a Chrome driver and parses the rendered page instead. You will need to repeat this for each page.

import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

options = Options()

url = 'https://patents.google.com/?q=(artificial+intelligence)&oq=artificial+intelligence'
# Selenium 4 takes the driver path through a Service object rather than executable_path
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.get(url)
time.sleep(3)  # give the JavaScript-rendered results time to load
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

# The first metadata line of each result is split on runs of three spaces
items_first_line = [x.text.replace('\n', ' ').split('   ') for x in soup.find_all('h4', attrs={'class': "metadata style-scope search-result-item"})]

# Note: this indexing assumes every result splits into at least four fields
locations = [x[0] for x in items_first_line]
patent_numbers = [x[1] for x in items_first_line]
patent_holders = [x[2] for x in items_first_line]
companies = [x[3] for x in items_first_line]

# Each entry in dates is a list of the date strings shown for that result
dates = [x.text.replace('\n', ' ').split('   ') for x in soup.find_all('h4', attrs={'class': "dates style-scope search-result-item"})]

df = pd.DataFrame({'locations': locations, 'patent_numbers': patent_numbers, 'patent_holders': patent_holders, 'companies': companies, 'dates': dates})
print(df)

Also, because this is the search results page, you can't get the entire abstracts from it. If you want all the information about each patent, including the full abstract, you probably want to navigate to each patent's own page and scrape the data there rather than from the search results. The hrefs for all of the patents are on the search results page, so visiting each one is straightforward.
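A minimal sketch of that second step, under several assumptions that are not confirmed by anything above and may break if the site's markup changes: that each search result refers to a path of the form `patent/<ID>/`, that a patent page exposes `DC.title`, `DC.description`, and `DC.contributor` (with `scheme="inventor"`) `<meta>` tags, and that the claims sit in a `<section itemprop="claims">` element. The helper names `extract_patent_ids`, `parse_patent_page`, and `append_rows`, and the `FIELDS` column list, are hypothetical.

```python
import csv
import os
import re

from bs4 import BeautifulSoup

# Hypothetical column layout; adjust it to match the existing CSV file.
FIELDS = ["patent_id", "title", "abstract", "inventors", "claims"]


def extract_patent_ids(search_html):
    """Return the unique patent IDs mentioned in search-results HTML, in page order."""
    # Assumption: each result refers to a path of the form patent/<ID>/<lang>.
    ids = re.findall(r"patent/([A-Z]{2}[0-9A-Z]+)/", search_html)
    return list(dict.fromkeys(ids))  # de-duplicate while keeping order


def parse_patent_page(html, patent_id=""):
    """Pull title, abstract, inventors and claims out of one patent page."""
    soup = BeautifulSoup(html, "html.parser")

    def meta_values(name, **extra):
        # Assumption: the page exposes Dublin Core <meta> tags such as DC.title.
        tags = soup.find_all("meta", attrs={"name": name, **extra})
        return [t["content"].strip() for t in tags if t.has_attr("content")]

    title = meta_values("DC.title")
    abstract = meta_values("DC.description")
    inventors = meta_values("DC.contributor", scheme="inventor")
    # Assumption: the claims live in <section itemprop="claims">.
    claims = soup.find("section", attrs={"itemprop": "claims"})
    return {
        "patent_id": patent_id,
        "title": title[0] if title else "N/A",
        "abstract": abstract[0] if abstract else "N/A",
        "inventors": "; ".join(inventors) if inventors else "N/A",
        "claims": claims.get_text(" ", strip=True) if claims else "N/A",
    }


def append_rows(csv_path, rows):
    """Append rows to csv_path, writing the header only if the file is new or empty."""
    needs_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if needs_header:
            writer.writeheader()
        writer.writerows(rows)
```

With a running Selenium driver, the idea would be to call `extract_patent_ids(driver.page_source)` on the search results, then for each ID load `https://patents.google.com/patent/<ID>/en`, pass `driver.page_source` to `parse_patent_page`, and hand the collected dictionaries to `append_rows`. Appending only makes sense if the existing CSV already uses the same columns as `FIELDS`; if the goal is to add new columns to existing rows, merging on the patent number with pandas is probably the better fit.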
