I have a script that takes a URL and returns the value of the page's <title>
tag. After a few hundred or so runs, I always get the same error:
File "/home/edmundspenser/Dropbox/projects/myfiles/titlegrab.py", line 202, in get_title
status, response = http.request(pageurl)
File "/usr/lib/python2.7/dist-packages/httplib2/__init__.py", line 1390, in _request
raise RedirectLimit("Redirected more times than rediection_limit allows.", response, content)
httplib2.RedirectLimit: Redirected more times than rediection_limit allows.
My function looks like:
import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer  # BeautifulSoup 3-style import, matching the parseOnlyThese keyword below

def get_title(pageurl):
    http = httplib2.Http()
    status, response = http.request(pageurl)
    # Parse only the <title> element out of the response body.
    x = BeautifulSoup(response, parseOnlyThese=SoupStrainer('title'))
    x = str(x)
    # Strip the literal "<title>" and "</title>" wrappers.
    y = x[7:-8]
    # Keep only the text before the first hyphen.
    z = y.split('-')[0]
    return z
Pretty straightforward. I used try and except with time.sleep(1) to give it time to maybe get unstuck if that was the issue, but so far nothing has worked. And I don't want to just pass on it. Maybe the website is rate-limiting me?
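For reference, a retry wrapper along those lines might look like the following sketch; the wrapper name and retry count are placeholders, not code from the question:

import time
import httplib2

def get_title_with_retry(pageurl, retries=3):
    # Hypothetical helper: catch the RedirectLimit error from the traceback
    # above, wait a second, and try again a few times before giving up.
    for attempt in range(retries):
        try:
            return get_title(pageurl)
        except httplib2.RedirectLimit:
            time.sleep(1)
    return None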
edit: As of right now the script doesn't work at all; it runs into said error on the first request.
I have a JSON file of over 80,000 URLs of www.wikiart.org painting pages. For each one I run my function to get the title. So:
print repr(get_title('http://www.wikiart.org/en/vincent-van-gogh/van-gogh-s-chair-1889'))
returns
"Van Gogh's Chair"
Try using the Requests library. On my end, there seems to be no rate-limiting that I've seen; I was able to retrieve 13 titles in 21.6s. See the sketch below.
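A minimal sketch of that approach, assuming the requests and bs4 (BeautifulSoup 4) packages are installed; the parsing mirrors the question's split('-')[0] step:

import requests
from bs4 import BeautifulSoup

def get_title(pageurl):
    # requests follows redirects for us and exposes the decoded body as .text
    response = requests.get(pageurl)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # Keep only the part of the <title> text before the first hyphen,
    # as in the original function.
    return soup.title.get_text().split('-')[0].strip()

print repr(get_title('http://www.wikiart.org/en/vincent-van-gogh/van-gogh-s-chair-1889'))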
However, out of personal ethics, I don't recommend hammering the site like this: with a fast connection you'll pull the data very fast. Letting the scrape sleep for a few seconds every 20 pages or so won't hurt.
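For example, a throttled loop (reusing the urls list and get_title from the sketches above) might look like this; the 20-page interval comes from the advice above, while the 3-second pause is arbitrary:

import time

for i, url in enumerate(urls, start=1):
    print repr(get_title(url))
    if i % 20 == 0:
        time.sleep(3)  # pause every 20 pages to go easy on the server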
EDIT: An even faster version, using grequests, which allows asynchronous requests to be made. This pulls the same data above in 2.6s, nearly 10 times faster. Again, limit your scrape speed out of respect for the site.
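A sketch of the asynchronous variant; the pool size of 5 is an arbitrary choice, and the urls list is assumed to be the same one as above:

import grequests
from bs4 import BeautifulSoup

# Build unsent requests, then send them concurrently; size caps how many
# run at once, which also keeps the scrape from hitting the site too hard.
pending = (grequests.get(u) for u in urls)
for response in grequests.map(pending, size=5):
    if response is None:
        continue  # that request failed; skip it
    soup = BeautifulSoup(response.text, 'html.parser')
    print repr(soup.title.get_text().split('-')[0].strip())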