I would like to scrape titles, abstracts, claims, and inventor names from Google Patents and add them to an existing CSV file. Could you please help me with this? A sample of my code is as follows:
import requests
from bs4 import BeautifulSoup

# Create empty lists to store extracted information
claim_list = []

# Define a function to extract application number and claims from a URL and add them to the lists
def add_info_to_lists(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Extract claims
    claims = [claim.get_text(strip=True) for claim in soup.select("li.claim, li.claim-dependent")]
    if claims:
        claim_text = " ".join(claims)
        claim_list.append(claim_text)
    else:
        claim_list.append("N/A")
A similar snippet seems to work with strings (e.g. application numbers), but it does not work with other JSON elements.
Thank you in advance!
I wasn't able to figure out how to parse the response object from the requests library, so this uses Selenium and launches a Chrome driver instead. You will need to do this for each page.
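A minimal sketch of that Selenium approach could look like the following. It is not the exact code from this answer, and it assumes Selenium 4 is installed and Chrome is available locally. The helper names (make_driver, fetch_rendered_html, fetch_pages, scrape_pages) are made up for illustration, and the ready_css selector is a placeholder you would need to adapt to the markup you actually see in the browser.

```python
import time


def make_driver():
    """Start a headless Chrome session (assumes `pip install selenium` and Chrome)."""
    # Imported lazily so the helpers below can be imported without Selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    return webdriver.Chrome(options=options)


def fetch_rendered_html(driver, url, ready_css="body", timeout=10):
    """Load `url` and return the page source once `ready_css` is present.

    `ready_css` is an assumption: pass a selector for the element you need,
    e.g. the question's "li.claim" (untested against the live markup).
    """
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(url)
    try:
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ready_css))
        )
    except TimeoutException:
        pass  # Return whatever has rendered so far; the caller can check for "N/A".
    return driver.page_source


def fetch_pages(urls, fetch, delay_seconds=1.0):
    """Call `fetch(url)` for every URL, pausing between requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index:
            time.sleep(delay_seconds)
        pages[url] = fetch(url)
    return pages


def scrape_pages(urls, ready_css="body"):
    """Open one browser, fetch every page with it, and always close it afterwards."""
    driver = make_driver()
    try:
        return fetch_pages(urls, lambda url: fetch_rendered_html(driver, url, ready_css))
    finally:
        driver.quit()
```

scrape_pages returns a dict mapping each URL to its rendered HTML. Each HTML string can be handed to BeautifulSoup(html, 'html.parser') in place of response.content in the question's function, so the existing soup.select(...) logic stays the same.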
Also, since you are on the search results page, you can't get the full abstracts. If you want all the info about the patents, including the full abstracts, you probably want to navigate to each patent's page and scrape the data there rather than from the search results page. All the hrefs are on the search results page, so visiting each one is easy.
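Once each patent page has been visited, the scraped fields still have to be added to the existing CSV. One possible sketch, using only the standard library, keys the results by patent URL instead of relying on parallel lists such as claim_list staying in step with the CSV rows. The function name add_patent_columns, the column names, and the "url" key column are assumptions for illustration; adjust them to your file.

```python
import csv

# Hypothetical column names for the scraped fields.
NEW_COLUMNS = ["title", "abstract", "claims", "inventors"]


def add_patent_columns(in_path, out_path, scraped, key_column="url"):
    """Copy the CSV at `in_path` to `out_path`, adding the scraped fields to each row.

    `scraped` maps the value found in `key_column` to a dict holding any of
    NEW_COLUMNS. Missing patents or fields are written as "N/A", as in the
    question's code.
    """
    with open(in_path, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames or [])
        fieldnames += [c for c in NEW_COLUMNS if c not in fieldnames]
        rows = list(reader)

    with open(out_path, "w", newline="", encoding="utf-8") as dst:
        # extrasaction="ignore" skips stray cells from rows longer than the header.
        writer = csv.DictWriter(dst, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        for row in rows:
            info = scraped.get(row.get(key_column), {})
            for column in NEW_COLUMNS:
                row[column] = info.get(column) or "N/A"
            writer.writerow(row)
```

Writing to a separate out_path keeps the original file intact if a scrape fails halfway. Joining several inventor names into one cell (for example with "; ") before passing them in keeps each patent on a single row.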