I am trying to build a scraper using selenium in python. Selenium webdriver opening window and trying to load the page but suddenly stop loading. I can access the same link in my local chrome browser.

Here are the error logs I'm getting from the webdriver:

{'level': 'SEVERE', 'message': 'https://shop.coles.com.au/a/a-nsw-metro-rouse-hill/everything/browse/baby/nappies-changing?pageNumber=1 - Failed to load resource: the server responded with a status of 429 (Too Many Requests)', 'source': 'network', 'timestamp': 1556997743637}

{'level': 'SEVERE', 'message': 'about:blank - Failed to load resource: net::ERR_UNKNOWN_URL_SCHEME', 'source': 'network', 'timestamp': 1556997745338}

{'level': 'SEVERE', 'message': 'https://shop.coles.com.au/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint - Failed to load resource: the server responded with a status of 404 (Not Found)', 'source': 'network', 'timestamp': 1556997748339}

My script:

from selenium import webdriver
import os

path = os.path.join(os.getcwd(), 'chromedriver')
driver = webdriver.Chrome(executable_path=path)

links = [
    "https://shop.coles.com.au/a/a-nsw-metro-rouse-hill/everything/browse/baby/nappies-changing?pageNumber=1",
    "https://shop.coles.com.au/a/a-nsw-metro-rouse-hill/everything/browse/baby/baby-accessories?pageNumber=1",
    "https://shop.coles.com.au/a/a-nsw-metro-rouse-hill/everything/browse/baby/food?pageNumber=1",
    "https://shop.coles.com.au/a/a-nsw-metro-rouse-hill/everything/browse/baby/formula?pageNumber=1",
]


for link in links:
    driver.get(link)

1 Answers

1
DebanjanB On Best Solutions

429 Too Many Requests

The HTTP 429 Too Many Requests response status code indicates that the user has sent too many requests in a given amount of time ("rate limiting"). The response representations SHOULD include details explaining the condition, and MAY include a Retry-After header indicating how long to wait before making a new request.

When a server is under attack or just receiving a very large number of requests from a single party, responding to each with a 429 status code will consume resources. Therefore, servers are not required to use the 429 status code; when limiting resource usage, it may be more appropriate to just drop connections, or take other steps.


404 Not Found

The HTTP 404 Not Found client error response code indicates that the server can not find requested resource. In the browser, this means the URL is not recognized. In an API, this can also mean that the endpoint is valid but the resource itself does not exist. Servers may also send this response instead of 403 to hide the existence of a resource from an unauthorized client. This response code is probably the most famous one due to its frequent occurence on the web.

A 404 status code does not indicate whether the resource is temporarily or permanently missing. But if a resource is permanently removed, a 410 (Gone) should be used instead of a 404 status. Additionally, 404 status code is used when the requested resource is not found, whether it doesn't exist or if there was a 401 or 403 that, for security reasons, the service wants to mask.


Analysis

When I tried your code block, I faced similar consequences. If you inspect the DOM Tree of the webpage you will find that quite a few tags are having the keyword dist. As an example:

  • <link rel="shortcut icon" type="image/x-icon" href="/wcsstore/ColesResponsiveStorefrontAssetStore/dist/30e70cfc76bf73d384beffa80ba6cbee/img/favicon.ico">
  • <link rel="stylesheet" href="/wcsstore/ColesResponsiveStorefrontAssetStore/dist/30e70cfc76bf73d384beffa80ba6cbee/css/google/fonts-Source-Sans-Pro.css" type="text/css" media="screen">
  • 'appDir': '/wcsstore/ColesResponsiveStorefrontAssetStore/dist/30e70cfc76bf73d384beffa80ba6cbee/app'

The presence of the term dist is a clear indication that the website is protected by Bot Management service provider Distil Networks and the navigation by ChromeDriver gets detected and subsequently blocked.


Distil

As per the article There Really Is Something About Distil.it...:

Distil protects sites against automatic content scraping bots by observing site behavior and identifying patterns peculiar to scrapers. When Distil identifies a malicious bot on one site, it creates a blacklisted behavioral profile that is deployed to all its customers. Something like a bot firewall, Distil detects patterns and reacts.

Further,

"One pattern with **Selenium** was automating the theft of Web content", Distil CEO Rami Essaid said in an interview last week. "Even though they can create new bots, we figured out a way to identify Selenium the a tool they're using, so we're blocking Selenium no matter how many times they iterate on that bot. We're doing that now with Python and a lot of different technologies. Once we see a pattern emerge from one type of bot, then we work to reverse engineer the technology they use and identify it as malicious".


Reference

You can find a couple of detailed discussion in: