Web scraping sale prices from a grocery store: am I on the right track or is there a simpler way?

I am new to all of this, and this is my first real coding project so forgive me if the answer is obvious :)

I am trying to extract sale items from [my grocery store] with BeautifulSoup, but the href I need is buried. Ultimately I want the simplest way to compare items on sale against my database of recipes to automate meal planning. I have spent days trying to learn how to scrape webpages, but almost every tutorial or question covers a site with a much simpler layout.

My initial approach was to just scrape the html with BeautifulSoup like most tutorials describe, using the following, but it couldn't access the <body>:

import requests

from bs4 import BeautifulSoup

page = requests.get('https://www.realcanadiansuperstore.ca/deals/all?sort=relevance&category=27985').text
soup = BeautifulSoup(page, 'html.parser')

print(soup.select("li.product-tile-group__list__item:nth-child(1) > div:nth-child(1) > div:nth-child(1) > div:nth-child(3) > div:nth-child(1) > h3:nth-child(1) > a:nth-child(1)"))

After some searching I gathered that the DOM tree needed to be loaded before I could access the portion of the HTML I needed, and that Selenium was my best bet. Now, after another few hours of troubleshooting, I've managed to get my code to navigate to the correct page (most of the time), and last night it even managed to scrape some HTML (although not the correct part; I think I've corrected that, but it hasn't run far enough to tell again...).

My current code looks like this:

import os
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.firefox.service import Service as FirefoxService
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.firefox.options import Options
from webdriver_manager.firefox import GeckoDriverManager

options = Options()
options.headless = True

service = FirefoxService(executable_path=GeckoDriverManager().install())
driver = webdriver.Firefox(service=service, options=options)
driver.maximize_window()
print("Headless=", options.headless)
driver.get("https://www.realcanadiansuperstore.ca/deals/all?sort=relevance&category=27985")
print("-Page launched")
print("Wait for page to load location selection and click Ontario")
ontarioButton = '/html/body/div[1]/div/div[6]/div[2]/div/div/ul/li[4]/button'
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, ontarioButton))).click()
print("-Ontario clicked")
print("Wait for page to load location entry and send city")
WebDriverWait(driver, 30).until(EC.invisibility_of_element_located((By.CLASS_NAME, 'region-selector--is-loading')))
WebDriverWait(driver, 20).until(
    EC.element_to_be_clickable((By.XPATH, '//*[@id="location-search__search__input"]'))).click()
WebDriverWait(driver, 20).until(
    EC.element_to_be_clickable((By.XPATH, '//*[@id="location-search__search__input"]'))).send_keys('Oshawa',
                                                                                                   Keys.RETURN)
print("-Sent Oshawa")
print("Wait until Gibb flyer is clickable")
privacyClose = '.lds__privacy-policy__btnClose'
privacyPolicy = WebDriverWait(driver, 200).until(EC.element_to_be_clickable((By.CSS_SELECTOR, privacyClose)))
if WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, '/html/body/div[2]/div/div/button'))):
    print("Closing privacy policy")
    driver.implicitly_wait(5)
    privacyPolicy.click()
    print("-PP closed")

storeFlyer = '/html/body/div[1]/div/div[2]/main/div/div/div/div/div[2]/div[1]/div[1]/div/div[2]/button'
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, storeFlyer))).click()
print("-Gibb clicked")

foodButton = '/html/body/div[1]/div/div[2]/main/div/div/div/div/div[2]/div/div[1]/div/div/div/div[1]/div/div/ul/li[1]/button'
WebDriverWait(driver, 200).until(EC.element_to_be_clickable((By.XPATH, foodButton))).click()

os.system('clear')

print('ALL DEALS:')
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

print(soup.find_all('a'))
driver.quit()

This works most of the time but sometimes gets hung up on:

Traceback (most recent call last):
  File "/mnt/1TB/PythonProjects/SuperstoreScraper/SuperstoreScraper0.04.py", line 40, in <module>
    WebDriverWait(driver, 20000000).until(EC.element_to_be_clickable((By.XPATH, storeFlyer))).click()
  File "/mnt/1TB/PythonProjects/SuperstoreScraper/venv/lib/python3.10/site-packages/selenium/webdriver/remote/webelement.py", line 81, in click
    self._execute(Command.CLICK_ELEMENT)
  File "/mnt/1TB/PythonProjects/SuperstoreScraper/venv/lib/python3.10/site-packages/selenium/webdriver/remote/webelement.py", line 740, in _execute
    return self._parent.execute(command, params)
  File "/mnt/1TB/PythonProjects/SuperstoreScraper/venv/lib/python3.10/site-packages/selenium/webdriver/remote/webdriver.py", line 430, in execute
    self.error_handler.check_response(response)
  File "/mnt/1TB/PythonProjects/SuperstoreScraper/venv/lib/python3.10/site-packages/selenium/webdriver/remote/errorhandler.py", line 247, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.ElementClickInterceptedException: Message: Element <button class="flyers-location-search-item__main__content__button"> is not clickable at point (483,666) because another element <div class="lds__privacy-policy__innerWrapper"> obscures it
Stacktrace:
WebDriverError@chrome://remote/content/shared/webdriver/Errors.jsm:183:5
ElementClickInterceptedError@chrome://remote/content/shared/webdriver/Errors.jsm:282:5
webdriverClickElement@chrome://remote/content/marionette/interaction.js:166:11
interaction.clickElement@chrome://remote/content/marionette/interaction.js:125:11
clickElement@chrome://remote/content/marionette/actors/MarionetteCommandsChild.jsm:203:24
receiveMessage@chrome://remote/content/marionette/actors/MarionetteCommandsChild.jsm:91:31

The privacy policy handling in my code was my attempt at addressing selenium.common.exceptions.ElementClickInterceptedException: Message: Element <button class="flyers-location-search-item__main__content__button"> is not clickable at point (483,666) because another element <div class="lds__privacy-policy__innerWrapper"> obscures it, which it threw 100% of the time otherwise. But the main issue I'm having now is:

  File "/mnt/1TB/PythonProjects/SuperstoreScraper/SuperstoreScraper0.04.py", line 36, in <module>
    privacyPolicy.click()
  File "/mnt/1TB/PythonProjects/SuperstoreScraper/venv/lib/python3.10/site-packages/selenium/webdriver/remote/webelement.py", line 81, in click
    self._execute(Command.CLICK_ELEMENT)
  File "/mnt/1TB/PythonProjects/SuperstoreScraper/venv/lib/python3.10/site-packages/selenium/webdriver/remote/webelement.py", line 740, in _execute
    return self._parent.execute(command, params)
  File "/mnt/1TB/PythonProjects/SuperstoreScraper/venv/lib/python3.10/site-packages/selenium/webdriver/remote/webdriver.py", line 430, in execute
    self.error_handler.check_response(response)
  File "/mnt/1TB/PythonProjects/SuperstoreScraper/venv/lib/python3.10/site-packages/selenium/webdriver/remote/errorhandler.py", line 247, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.ElementNotInteractableException: Message: Element <button class="lds__privacy-policy__btnClose" type="button"> could not be scrolled into view
Stacktrace:
WebDriverError@chrome://remote/content/shared/webdriver/Errors.jsm:183:5
ElementNotInteractableError@chrome://remote/content/shared/webdriver/Errors.jsm:293:5
webdriverClickElement@chrome://remote/content/marionette/interaction.js:156:11
interaction.clickElement@chrome://remote/content/marionette/interaction.js:125:11
clickElement@chrome://remote/content/marionette/actors/MarionetteCommandsChild.jsm:203:24
receiveMessage@chrome://remote/content/marionette/actors/MarionetteCommandsChild.jsm:91:31

I read somewhere that it needed to be clicked with java and I keep seeing variations of this:

WebElement element = driver.findElement(By.xpath("//a[@href='itemDetail.php?id=19']"));    
JavascriptExecutor js = (JavascriptExecutor) driver;  
js.executeScript("arguments[0].scrollIntoView();",element);
element.click();

but JavascriptExecutor isn't recognized, and I'm having a hard time finding more info on what to do next except for this here:

"Selenium supports javaScriptExecutor. There is no need for an extra plugin or add-on. You just need to import (org.openqa.selenium.JavascriptExecutor) in the script as to use JavaScriptExecutor."

but no variation of that seems to be able to get JavascriptExecutor to do anything...

I've put off asking any questions because I enjoy the challenge of figuring it out but I'm starting to get the feeling I'm missing something. Am I on the right track? Or is there a simpler way to approach this problem? Thanks in advance!

PS. Just before I hit post I changed the wait time in line 36 from 20 to 20000000 and it still gave the same error in the same amount of time. Am I using WebDriverWait wrong?

There are 2 answers

Antoine Veillette

I'm working on the same project at the moment. By inspecting my local grocery store's flyers page, I found a publicly accessible dictionary of the listed items with the price, discount, etc.

It's located in the browser's developer tools under: Network | XHR/Fetch | products.

I can access the file with a URL, but I fear that the access_token located in the URL might change for every request.
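As a rough sketch of what working with that dictionary could look like once it is downloaded: the sample data and the field names (`name`, `current_price`, `original_price`) below are made up for illustration, so they need to be swapped for whatever the real products response actually contains.

```python
import json

# Hypothetical sample of what a flyer "products" response might contain.
# The field names are assumptions; check the real response in the Network tab.
SAMPLE_RESPONSE = """
[
  {"name": "Chicken Thighs", "current_price": "4.99", "original_price": "7.99"},
  {"name": "Basmati Rice", "current_price": "12.49", "original_price": null},
  {"name": "Red Peppers", "current_price": "1.99", "original_price": "2.99"}
]
"""


def find_discounted(products):
    """Return (name, current, original) for items priced below their original price."""
    deals = []
    for item in products:
        original = item.get("original_price")
        current = item.get("current_price")
        if original is None or current is None:
            continue  # nothing to compare against, so skip the item
        if float(current) < float(original):
            deals.append((item["name"], float(current), float(original)))
    return deals


products = json.loads(SAMPLE_RESPONSE)
for name, current, original in find_discounted(products):
    print(f"{name}: {current:.2f} (was {original:.2f})")
```

The resulting names could then be compared against the ingredients in a recipe database, which avoids parsing any HTML at all.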

Hope this helps!

domigmr

I would suggest that you avoid using Selenium in this case. You could, as Antoine suggested, use the inspect element feature to examine if there is an API exposed.

I think what happens when you scroll to the bottom is that the web page fires off a request to the back end for more data. As Antoine suggested, you can mimic this request. Use Inspect Element on the webpage, open the Network tab, and look at the Response of each request. Scroll to the bottom of the page and let it load, and you'll see some new requests.
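A minimal sketch of mimicking that request from Python might look like the following. The endpoint URL and the `page` and `size` parameter names are placeholders, not the store's real API, and the sketch assumes each response is a JSON list of items; copy the real URL, parameters, and headers from the request shown in the Network tab. It uses only the standard library, and `requests.get(url, params=...).json()` would do the same job.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint; replace with the URL seen in the Network tab.
API_URL = "https://example.com/api/deals"


def fetch_page(page, page_size=48):
    """Request one page of results and return the decoded JSON body."""
    # "page" and "size" are assumed parameter names, not the site's real ones.
    query = urllib.parse.urlencode({"page": page, "size": page_size})
    request = urllib.request.Request(
        f"{API_URL}?{query}",
        headers={"User-Agent": "Mozilla/5.0", "Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def collect_all(fetch=fetch_page, max_pages=50):
    """Keep requesting pages until one comes back empty or max_pages is reached."""
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch(page)
        if not batch:
            break  # an empty page means there is nothing left to load
        items.extend(batch)
    return items
```

Once `API_URL` points at the real endpoint, `all_items = collect_all()` would gather every page. The `max_pages` cap is only a safety net so a misbehaving endpoint cannot cause an endless loop.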

I'd suggest John Watson Rooney's video from here on out: https://www.youtube.com/watch?v=DqtlR0y0suo