How to get the "featured snippet" from Google Search?


How can I extract a featured snippet from a Google search results page?

There is 1 answer.

Answered by Denis Skopa

If you want to scrape Google search result snippets, you can use the BeautifulSoup web scraping library. With this approach, however, Google may start blocking you if you make a lot of requests.
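One common way to make fewer requests in a short time is to wait between retries, increasing the wait each time. The sketch below is only an illustration of that idea, not something Google documents: the helper name `backoff_delays` and the `base` and `cap` values are assumptions chosen for the example.

```python
import random

def backoff_delays(retries, base=1.0, cap=60.0, rng=random):
    """Yield one wait time in seconds per retry: exponential growth plus random jitter.

    Hypothetical helper; base and cap are illustrative values, not recommendations.
    """
    for attempt in range(retries):
        delay = min(cap, base * 2 ** attempt)    # 1, 2, 4, 8, ... capped at `cap`
        yield delay + rng.uniform(0, delay / 2)  # add up to 50% jitter

# Show the waits that four retries would use.
for wait in backoff_delays(4):
    print(f"would sleep for {wait:.2f} s")
```

In a real scraper you would call `time.sleep(wait)` before repeating a request that came back blocked.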

You can reduce the chance of being blocked by sending headers that specify a user-agent, so that Google treats the request as coming from a regular browser rather than from a bot:

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

An additional step could be to rotate user-agents.
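A minimal sketch of user-agent rotation, assuming you keep your own list of user-agent strings. The `USER_AGENTS` pool and the `random_headers` helper are hypothetical names introduced for this example, and the strings are only sample values:

```python
import random

# Hypothetical pool of user-agent strings; replace with values you maintain yourself.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
]

def random_headers(rng=random):
    """Return a headers dict with a user-agent picked at random from the pool."""
    return {"User-Agent": rng.choice(USER_AGENTS)}

headers = random_headers()
print(headers["User-Agent"])
```

You would then pass `headers=random_headers()` to each `requests.get` call instead of reusing one fixed dictionary.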

The code example below uses pagination to collect more results. You can walk through all pages with an infinite while loop. Pagination continues as long as the next button exists, which is determined by the presence of the CSS selector ".d6cvqb a[id=pnnext]" on the page. If the button is present, increase the value of params["start"] by 10 to access the next page; otherwise, exit the while loop:

if soup.select_one('.d6cvqb a[id=pnnext]'):
    params["start"] += 10
else:
    break
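To make the relation between the page number and the "start" value concrete, here is a small sketch that builds the URL for a given page. It assumes 10 results per page, matching the step of 10 used above; the helper names `start_for_page` and `search_url` are made up for this illustration:

```python
from urllib.parse import urlencode

RESULTS_PER_PAGE = 10  # assumption: matches the "+= 10" step used above

def start_for_page(page_num):
    """Map a 1-based page number to the "start" offset (page 1 -> 0, page 2 -> 10)."""
    return (page_num - 1) * RESULTS_PER_PAGE

def search_url(query, page_num):
    """Build the search URL for the given query and 1-based page number."""
    params = {"q": query, "hl": "en", "gl": "us", "start": start_for_page(page_num)}
    return "https://www.google.com/search?" + urlencode(params)

for page in (1, 2, 3):
    print(page, search_url("python", page))
```

With `requests`, you do not need to build the URL by hand, because passing the dictionary as `params=` does the same encoding.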

Full code:

from bs4 import BeautifulSoup
import requests, json, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "python",       # query example
    "hl": "en",          # language
    "gl": "us",          # country of the search, US -> USA
    "start": 0,          # result offset: 0 is the first page, 10 the second, and so on
    #"num": 100          # maximum number of results to return
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

page_num = 0

website_data = []

while True:
    page_num += 1
    print(f"page: {page_num}")
        
    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')
    
    for result in soup.select(".tF2Cxc"):
        title = result.select_one(".DKV0Md").text
        # Some results have no snippet element, so fall back to None.
        snippet_tag = result.select_one(".lEBKkf")
        snippet = snippet_tag.text if snippet_tag else None

        website_data.append({
            "title": title,
            "snippet": snippet
        })
      
    if soup.select_one('.d6cvqb a[id=pnnext]'):
        params["start"] += 10
    else:
        break

print(json.dumps(website_data, indent=2, ensure_ascii=False))

Example output:

[
  {
    "title": "Welcome to Python.org",
    "snippet": "The official home of the Python Programming Language."
  },
  {
    "title": "Python (programming language) - Wikipedia",
    "snippet": "Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation."
  },
  {
    "title": "Python Courses & Tutorials - Codecademy",
    "snippet": "Python is a general-purpose, versatile, and powerful programming language. It's a great first language because Python code is concise and easy to read."
  },
  {
    "title": "Python - GitHub",
    "snippet": "Repositories related to the Python Programming language - Python. ... Collection of library stubs for Python, with static types. Python 3.3k 1.4k."
  },
  {
    "title": "Learn Python - Free Interactive Python Tutorial",
    "snippet": "learnpython.org is a free interactive Python tutorial for people who want to learn Python, fast."
  },
  # ...
]

You can also use the Google Search Engine Results API from SerpApi. It's a paid API with a free plan. The difference is that it bypasses blocks from Google (including CAPTCHA), so you don't need to create a parser and maintain it.

Code example:

from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json, os

params = {
  "api_key": os.getenv("API_KEY"), # serpapi key
  "engine": "google",              # serpapi parser engine
  "q": "python",                   # search query
  "num": "100"                     # number of results per page (100 per page in this case)
  # other search parameters: https://serpapi.com/search-api#api-parameters
}

search = GoogleSearch(params)      # where data extraction happens

organic_results_data = []

while True:
    results = search.get_dict()    # JSON -> Python dictionary
    
    for result in results.get("organic_results", []):  # empty list if the key is missing
        organic_results_data.append({
            "title": result.get("title"),
            "snippet": result.get("snippet")   
        })
    
    # Follow "next_link" while SerpApi reports one, copying its query parameters into the search.
    pagination = results.get("serpapi_pagination", {})
    if "next_link" in pagination:
        search.params_dict.update(dict(parse_qsl(urlsplit(pagination["next_link"]).query)))
    else:
        break
    
print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))

The output has the same format as in the BeautifulSoup example.