Scrapy spider cannot crawl a URL but scrapy shell fetches it successfully

I am trying to scrape Craigslist. When I try to fetch https://tampa.craigslist.org/search/jjj?query=bookkeeper from the spider, I get the following error:

(extra newlines and white space added for readability)

[scrapy.downloadermiddlewares.retry] DEBUG:
    Retrying <GET https://tampa.craigslist.org/search/jjj?query=bookkeeper> (failed 1 times):
    [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost:
    Connection to the other side was lost in a non-clean fashion: Connection lost.>]

But when I fetch the same URL from scrapy shell, it is crawled successfully:

[scrapy.core.engine] DEBUG:
    Crawled (200) <GET https://tampa.craigslist.org/search/jjj?query=bookkeeper>
    (referer: None)

I don't know what I am doing wrong here. I have tried forcing TLSv1.2 but had no luck. I would really appreciate your help. Thanks!
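
(For reference, forcing TLSv1.2 in Scrapy is normally done through the DOWNLOADER_CLIENT_TLS_METHOD setting; what I tried was along these lines in settings.py:)

# settings.py -- pin the TLS version used by the default downloader handler
DOWNLOADER_CLIENT_TLS_METHOD = 'TLSv1.2'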

1 Answer

Answered by bosnjak:

I've asked for an MCVE in the comments, which means you should provide a Minimal, Complete, and Verifiable example. To help you out, this is what it's all about:

import scrapy


class CLSpider(scrapy.Spider):
    name = 'CL Spider'
    start_urls = ['https://tampa.craigslist.org/search/jjj?query=bookkeeper']

    def parse(self, response):
        for url in response.xpath('//a[@class="result-title hdrlnk"]/@href').extract():
            yield scrapy.Request(response.urljoin(url), self.parse_item)

    def parse_item(self, response):
        # TODO: scrape item details here
        return {
            'url': response.url,
            # ...
            # ...
        }

Now, this MCVE does everything you want to do in a nutshell:

  • visits one of the search pages
  • iterates through the results
  • visits each item for parsing

This should be your starting point for debugging, removing all the unrelated boilerplate.

Please test the above and verify whether it works. If it works, add more functionality in steps so you can figure out which part introduces the problem. If it doesn't work, don't add anything else until you can figure out why.
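
For example, assuming you save the spider above as cl_spider.py, you can run it on its own (no full project needed) and dump the scraped items to a file:

scrapy runspider cl_spider.py -o items.json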

UPDATE:

Adding a delay between requests can be done in two ways:

  1. Globally, for all spiders, by setting for example DOWNLOAD_DELAY = 2 in settings.py for a 2-second delay between each download (see the settings.py sketch at the end of this answer).

  2. Per-spider, by defining a download_delay attribute on the spider class, for example:

class CLSpider(scrapy.Spider):
    name = 'CL Spider'
    download_delay = 2

Documentation: https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
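
For completeness, the global approach (option 1) is just a line in your project's settings.py; a minimal sketch:

# settings.py
DOWNLOAD_DELAY = 2
# By default RANDOMIZE_DOWNLOAD_DELAY = True, so Scrapy actually waits
# between 0.5 * and 1.5 * DOWNLOAD_DELAY, which makes the crawl look less robotic.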