HTTP Error 405 when trying to scrape data with Python 3.3


I want to scrape data from a website; however, I keep getting HTTP Error 405: Not Allowed. What am I doing wrong?

(I have looked at the documentation and tried their code with only my URL in place of the example's; I still get the same error.)

Here's the code:

import urllib.request

list_url = ["http://www.glassdoor.com/Reviews/WhiteWave-Reviews-E9768.htm"]

for url in list_url:
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    response = urllib.request.urlopen(req).read()

If I omit the User-Agent header, I get HTTP Error 403: Forbidden instead.

In the past, I have successfully scraped data (from another website) using the following:

from bs4 import BeautifulSoup
import urllib.request

for url in list_url:
    raw_html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(raw_html, "lxml")

Ideally, I would like to keep a similar structure, that is, pass the content of the fetched URL to BeautifulSoup. Thanks!


There are 2 answers

neverwalkaloner

I'm not sure about the exact reason for the issue, but try this code; it works for me:

import http.client

# Connect over HTTPS and request the page path directly
connection = http.client.HTTPSConnection("www.glassdoor.com")
connection.request("GET", "/Reviews/WhiteWave-Reviews-E9768.htm")

res = connection.getresponse()
data = res.read()
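
To verify whether this actually gets past the block, you can check the status line before using the body; a minimal sketch using the same host and path as above:

import http.client

connection = http.client.HTTPSConnection("www.glassdoor.com")
connection.request("GET", "/Reviews/WhiteWave-Reviews-E9768.htm")
res = connection.getresponse()

# 200 means the page was served; 403/405 would mean the
# anti-bot protection rejected the request
print(res.status, res.reason)
data = res.read()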
Jinesh Shah

The page you are getting back says "Pardon our Interruption... something about your browser made us think you were a bot". That implies scraping isn't permitted and they run anti-scraping protection on their pages.

Try faking a browser by sending a realistic User-Agent header (see: How to use Python requests to fake a browser visit?):

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
url = 'http://www.glassdoor.com/Reviews/WhiteWave-Reviews-E9768.htm'
web_page = requests.get(url, headers=headers)
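
From there you can pass the response body straight to BeautifulSoup, keeping the structure from the question; a minimal sketch, assuming beautifulsoup4 and lxml are installed:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
url = 'http://www.glassdoor.com/Reviews/WhiteWave-Reviews-E9768.htm'

# Fetch with browser-like headers, then parse the body as in the question
web_page = requests.get(url, headers=headers)
soup = BeautifulSoup(web_page.text, "lxml")
print(soup.title)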

I tried this and found that the page is loaded via JavaScript, so you might want to use a headless browser (Selenium / PhantomJS) and scrape the rendered HTML pages. Hope it helps.
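
For reference, a minimal Selenium sketch; it assumes you have selenium and a matching Chrome driver installed (PhantomJS works the same way via webdriver.PhantomJS()):

from selenium import webdriver
from bs4 import BeautifulSoup

# A real browser engine runs the page's JavaScript before we read the HTML
driver = webdriver.Chrome()
driver.get("http://www.glassdoor.com/Reviews/WhiteWave-Reviews-E9768.htm")

# page_source is the HTML after JavaScript rendering
soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()

print(soup.title)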