I am running a Python script that scrapes a website. The site uses Imperva to detect automated scripts crawling its pages, and Imperva blocks my IP as soon as I run the script. I read a suggestion to add time.sleep(random.randint(a, b)) to the script to mimic human behaviour, but it didn't work, or perhaps it just doesn't work as a standalone measure. If it's the Chrome driver itself that they detect, then I guess it would be impossible to avoid. Does anyone have practical suggestions for things I could include in my script to get past this? Thanks in advance.
How do I avoid Imperva bot detection?
Asked by AudioBubble (2k views)
Introduction
A web scraper needs several adjustments before it stops being flagged as automated. I recommend first testing your current level of detection with the code below:
More than likely, you will fail most of those tests right off the bat. Fortunately, it's easy to configure a scraper that passes all of those tests and is far harder to detect.
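The test code from the original answer is not reproduced here, so the following is only a minimal sketch of what such a check could look like. It assumes https://bot.sannysoft.com as the fingerprint test page (my choice, not confirmed by the answer) and Selenium 4.6 or later, which locates the Chrome driver on its own. The helper names chrome_arguments and run_detection_test are hypothetical.

```python
import time

# Assumption: this public fingerprint test page stands in for whatever page
# the original answer used.
TEST_URL = "https://bot.sannysoft.com"


def chrome_arguments(headless: bool = True) -> list:
    """Return the Chrome command-line flags used for the test run."""
    args = ["--window-size=1920,1080"]
    if headless:
        args.append("--headless=new")
    return args


def run_detection_test(screenshot_path: str = "detection_test.png") -> str:
    """Open the test page, save a screenshot of the results, return its path."""
    # Imported here so the helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in chrome_arguments():
        options.add_argument(arg)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(TEST_URL)
        time.sleep(5)  # give the page's JavaScript checks time to finish
        driver.save_screenshot(screenshot_path)
    finally:
        driver.quit()  # always release the browser, even on errors
    return screenshot_path
```

Calling run_detection_test() and opening the saved screenshot shows which checks the browser passed or failed.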
Selenium-Stealth
selenium-stealth is a Python package that patches a Selenium-driven Chrome session to avoid detection. Simply...
and follow the configuration below:
Your web scraper should now pass all of the tests. Next, try this setup against the Imperva-protected site.
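The exact configuration from the original answer is not shown here. As a rough sketch, a selenium-stealth setup usually looks something like the following; the specific strings in STEALTH_SETTINGS are illustrative assumptions, and build_stealth_driver is a hypothetical helper name. It assumes the selenium and selenium-stealth packages are installed.

```python
# Values passed to selenium_stealth.stealth(). These particular strings are
# illustrative assumptions, not the original answer's settings.
STEALTH_SETTINGS = {
    "languages": ["en-US", "en"],
    "vendor": "Google Inc.",
    "platform": "Win32",
    "webgl_vendor": "Intel Inc.",
    "renderer": "Intel Iris OpenGL Engine",
    "fix_hairline": True,
}


def build_stealth_driver(user_agent=None):
    """Start Chrome and apply the selenium-stealth patches to it."""
    # Imported here so the settings above can be inspected without the
    # third-party packages installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium_stealth import stealth

    options = Options()
    options.add_argument("--headless=new")
    # Drop the switch that marks the browser as automation-controlled.
    options.add_experimental_option("excludeSwitches", ["enable-automation"])

    driver = webdriver.Chrome(options=options)
    # user_agent=None lets selenium-stealth keep the browser's own user agent.
    stealth(driver, user_agent=user_agent, **STEALTH_SETTINGS)
    return driver
```

A typical use would be driver = build_stealth_driver(), followed by driver.get(...) on the test page and driver.quit() when done.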
More information
If you are still getting blocked, I recommend looking into the random-user-agent library to rotate your user agent through the "user_agent" variable of the selenium-stealth configuration. Alternatively, you could pay for a proxy provider to disguise your IP. Keep in mind, though, that proxy networks currently do not have a Selenium configuration.
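To illustrate the user-agent rotation mentioned above, here is a hedged sketch that picks a random Chrome user agent with random-user-agent and hands it to selenium-stealth through its user_agent parameter. The helper names (stealth_kwargs, random_chrome_user_agent, build_driver_with_random_agent) and the fingerprint strings are assumptions for illustration; it assumes selenium, selenium-stealth and random-user-agent are installed.

```python
def stealth_kwargs(user_agent: str) -> dict:
    """Keyword arguments for selenium_stealth.stealth() with a chosen user agent."""
    # The fingerprint strings below are illustrative assumptions.
    return {
        "user_agent": user_agent,
        "languages": ["en-US", "en"],
        "vendor": "Google Inc.",
        "platform": "Win32",
        "webgl_vendor": "Intel Inc.",
        "renderer": "Intel Iris OpenGL Engine",
        "fix_hairline": True,
    }


def random_chrome_user_agent() -> str:
    """Pick a random Chrome user agent string via random-user-agent."""
    from random_user_agent.params import OperatingSystem, SoftwareName
    from random_user_agent.user_agent import UserAgent

    rotator = UserAgent(
        software_names=[SoftwareName.CHROME.value],
        operating_systems=[
            OperatingSystem.WINDOWS.value,
            OperatingSystem.LINUX.value,
        ],
        limit=100,  # size of the pool to draw from
    )
    return rotator.get_random_user_agent()


def build_driver_with_random_agent():
    """Start Chrome and apply selenium-stealth with a freshly drawn user agent."""
    from selenium import webdriver
    from selenium_stealth import stealth

    driver = webdriver.Chrome()
    stealth(driver, **stealth_kwargs(random_chrome_user_agent()))
    return driver
```

Drawing a new user agent per browser session, rather than per request, keeps each session internally consistent.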
Information on Proxy Network Selenium Configuration: Python Selenium Proxy Network
Information on Selenium Detectability in the Cloud: Python Selenium AWS Lambda Change WebGL Vendor/Renderer For Undetectable Headless Scraper