Crawlera, cookies, sessions, rate limiting


I'm trying to use Scrapinghub to crawl a website that heavily rate-limits requests.

If I run the spider as-is, I start getting 429 responses pretty quickly.

If I enable Crawlera as per the standard instructions, the spider doesn't work anymore.
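For context, by "standard instructions" I mean the scrapy-crawlera middleware setup; a minimal settings.py sketch (the API key is a placeholder):

    # settings.py -- minimal scrapy-crawlera setup; API key is a placeholder
    DOWNLOADER_MIDDLEWARES = {
        "scrapy_crawlera.CrawleraMiddleware": 610,
    }
    CRAWLERA_ENABLED = True
    CRAWLERA_APIKEY = "<your Crawlera API key>"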

If I set headers = {"X-Crawlera-Cookies": "disable"}, the spider works again, but I still get 429s -- so I assume the site's rate limiter keys (at least partly) on the cookie.
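Roughly, I'm setting the header per request like this (spider name and URL are placeholders):

    # Sketch of how I pass the header; spider name and URL are placeholders
    import scrapy

    class MySpider(scrapy.Spider):
        name = "myspider"

        def start_requests(self):
            # Tell Crawlera not to manage cookies for this request
            yield scrapy.Request(
                "https://example.com/",
                headers={"X-Crawlera-Cookies": "disable"},
            )

        def parse(self, response):
            self.logger.info("Got %s", response.status)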

So what would a good approach be here?


1 Answer

Answered by Manualmsdos:

You can try rotating user agents with a RandomUserAgent middleware. If you don't want to write your own implementation, you can try this one:

https://github.com/cnu/scrapy-random-useragent
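If it helps, enabling it looks roughly like this in settings.py, based on that project's README (the user-agent file path is a placeholder):

    # settings.py -- sketch based on the scrapy-random-useragent README;
    # the RANDOM_UA_FILE path is a placeholder
    DOWNLOADER_MIDDLEWARES = {
        # Disable Scrapy's built-in user-agent middleware
        "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
        # Pick a random user agent per request from the file below
        "random_useragent.RandomUserAgentMiddleware": 400,
    }
    RANDOM_UA_FILE = "user_agents.txt"  # one user-agent string per line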