Not redirecting to Cookie enable page via Python3.8 + Selenium + BeautifulSoup


I am working with Python + Selenium + BeautifulSoup4 using the code below, but I am unable to open the website (https://check.spamhaus.org/) and scrape it.

My code is:

import requests
from bs4 import BeautifulSoup
url = "https://check.spamhaus.org/"

r = requests.get(url)
htmlContent = r.content
soup = BeautifulSoup(htmlContent, 'html.parser')
print(soup.prettify())

Please help me identify the gap.


There are 2 answers

Kamal Luthra (BEST ANSWER)

I included some of the information from @Nic's answer, used find_element_by_name(name) and find_element_by_id(id), and things started working for me. I also changed the index passed to driver.switch_to.window to 1.

Here is my code:

#!/usr/bin/python3.8
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

import time

url = "https://check.spamhaus.org/"

options = Options()
# Disable the AutomationControlled Blink feature to make it harder for the website to detect an automated tool (Selenium)
options.add_argument("--disable-blink-features=AutomationControlled")

with webdriver.Chrome(executable_path='/home/kkishore/Documents/doc/python-practise/updated-driver2/chromedriver', options=options) as driver:
    # This will open a new tab with the desired url
    driver.execute_script(f'''window.open("{url}","_blank");''')
    driver.switch_to.window(driver.window_handles[1])

    # Added by Kamal: give the loading/bot-detection page time to pass
    time.sleep(5)
    # Input the domain name to look up
    driver.find_element_by_name('ip-search').send_keys('example.com')
    # Click the Lookup button
    driver.find_element_by_id('ipLookup').click()
    time.sleep(10)
##    soup = BeautifulSoup(driver.page_source, 'html.parser')
##    print(soup.prettify())
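
Note: recent Selenium 4 releases removed the find_element_by_* helpers (and the executable_path argument, which was replaced by a Service object). If you are on Selenium 4, the equivalent element lookups would look roughly like this, using the By import already listed above:

from selenium.webdriver.common.by import By

# Selenium 4 equivalents of the removed find_element_by_* helpers
driver.find_element(By.NAME, 'ip-search').send_keys('example.com')
driver.find_element(By.ID, 'ipLookup').click()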
Nic Laforge

The page does some bot detection before allowing you access to the content. Because of this, the requests package won't be of any help, since r.content will return the content of the loading page.

You will need to use selenium to achieve what you want.
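
You can verify this quickly. Here is a minimal sketch that prints the <title> of whatever requests actually receives, which turns out to be the loading page rather than the lookup form you see in a browser:

import requests
from bs4 import BeautifulSoup

r = requests.get("https://check.spamhaus.org/")
soup = BeautifulSoup(r.content, 'html.parser')
# Prints the loading/interstitial page's title, not the real lookup page
print(soup.title.string if soup.title else "no <title> found")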

url = "https://check.spamhaus.org/"

options = Options()
# Disable the AutomationControlled Blink feature to make it harder for the website to detect an automated tool (Selenium)
options.add_argument("--disable-blink-features=AutomationControlled")

with webdriver.Chrome(options=options) as driver:
    # This will open a new tab with the desired url
    driver.execute_script(f'''window.open("{url}","_blank");''')
    # Switch to the active window
    driver.switch_to.window(driver.window_handles[-1])

    # Waiting until the page is loaded (Validating the ipLookup button)
    wait = WebDriverWait(driver, 15)
    wait.until(EC.presence_of_element_located((By.ID, 'ipLookup')))

    soup = BeautifulSoup(driver.page_source, 'html.parser')
    print(soup.prettify())

You will also need to add the following imports:

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

I have not used driver.get(url); instead I used window.open(), which opens a new tab with the desired URL. I cannot 100% explain why, but with driver.get() the page never got past the loading page.

As for your question in the comments, here's how you may perform a search:

search_button = wait.until(EC.presence_of_element_located((By.ID, 'ipLookup')))

search_text_field = driver.find_element(By.CSS_SELECTOR, '[name="ip-search"]')
search_text_field.send_keys('1.1.1.1')
search_button.click()
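
Once the lookup has been triggered you can hand the rendered page back to BeautifulSoup, as in the earlier snippet. A minimal sketch (the fixed sleep is a crude placeholder; I don't know a stable results-element ID to wait on, so adjust the wait to whatever the results page actually renders):

import time

# Crude wait for the results to render (assumption: no stable results
# element ID is known, so a fixed sleep stands in for an explicit wait)
time.sleep(10)

soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.prettify())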