I am trying to scrape Craigslist. When I try to fetch https://tampa.craigslist.org/search/jjj?query=bookkeeper in the spider, I get the following error:
(extra newlines and white space added for readability)
[scrapy.downloadermiddlewares.retry] DEBUG:
Retrying <GET https://tampa.craigslist.org/search/jjj?query=bookkeeper> (failed 1 times):
[<twisted.python.failure.Failure twisted.internet.error.ConnectionLost:
Connection to the other side was lost in a non-clean fashion: Connection lost.>]
But when I fetch the same URL in scrapy shell, it is crawled successfully:
[scrapy.core.engine] DEBUG:
Crawled (200) <GET https://tampa.craigslist.org/search/jjj?query=bookkeeper>
(referer: None)
I don't know what I am doing wrong here. I have tried forcing TLSv1.2 but had no luck. I would really appreciate your help. Thanks!
I've asked for an MCVE in the comments, which means you should provide a Minimal, Complete, and Verifiable example. To help you out, this is what it's all about:
Now, this MCVE does everything you want to do in a nutshell:
This should be your starting point for debugging, removing all the unrelated boilerplate.
Please test the above and verify that it works. If it does, add more functionality in steps so you can figure out which part introduces the problem. If it doesn't, don't add anything else until you can figure out why.
UPDATE:
Adding a delay between requests can be done in two ways:
1. Globally for all spiders, in settings.py, by specifying for example DOWNLOAD_DELAY = 2 for a 2-second delay between each download.
2. Per spider, by defining a download_delay attribute on the spider class.
Documentation: https://doc.scrapy.org/en/latest/topics/settings.html#download-delay