Scrapy spider cannot crawl a URL but scrapy shell fetches it successfully

I am trying to scrape Craigslist. When I try to fetch https://tampa.craigslist.org/search/jjj?query=bookkeeper from the spider, I get the following error:

(extra newlines and white space added for readability)

[scrapy.downloadermiddlewares.retry] DEBUG:
    Retrying <GET https://tampa.craigslist.org/search/jjj?query=bookkeeper> (failed 1 times):
    [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost:
    Connection to the other side was lost in a non-clean fashion: Connection lost.>]

But when I fetch the same URL from scrapy shell, it is crawled successfully:

[scrapy.core.engine] DEBUG:
    Crawled (200) <GET https://tampa.craigslist.org/search/jjj?query=bookkeeper>
    (referer: None)

I don't know what I am doing wrong here. I have tried forcing TLSv1.2 but had no luck. I would really appreciate your help. Thanks!
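
(For reference, forcing TLSv1.2 in Scrapy is normally done through the DOWNLOADER_CLIENT_TLS_METHOD setting; what I tried was along these lines in settings.py:)

# settings.py -- pin the TLS version used by the default downloader handler
DOWNLOADER_CLIENT_TLS_METHOD = 'TLSv1.2'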

1 Answer

Answered by bosnjak:

I've asked for an MCVE in the comments, which means you should provide a Minimal, Complete, and Verifiable example. To help you out, this is what it's all about:

import scrapy


class CLSpider(scrapy.Spider):
    name = 'CL Spider'
    start_urls = ['https://tampa.craigslist.org/search/jjj?query=bookkeeper']

    def parse(self, response):
        for url in response.xpath('//a[@class="result-title hdrlnk"]/@href').extract():
            yield scrapy.Request(response.urljoin(url), self.parse_item)

    def parse_item(self, response):
        # TODO: scrape item details here
        return {
            'url': response.url,
            # ...
            # ...
        }

Now, this MCVE does everything you want to do in a nutshell:

  • visits one of the search pages
  • iterates through the results
  • visits each item for parsing

This should be your starting point for debugging, removing all the unrelated boilerplate.

Please test the above and verify whether it works. If it works, add more functionality in steps so you can figure out which part introduces the problem. If it doesn't work, don't add anything else until you can figure out why.
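
For example, assuming you save the spider above as cl_spider.py, you can run it on its own (no full project needed) and dump the scraped items to a file:

scrapy runspider cl_spider.py -o items.json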

UPDATE:

Adding a delay between requests can be done in two ways:

  1. Globally, for all spiders, by setting for example DOWNLOAD_DELAY = 2 in settings.py for a 2-second delay between each download (see the settings.py sketch at the end of this answer).

  2. Per-spider, by defining a download_delay attribute on the spider class, for example:

class CLSpider(scrapy.Spider):
    name = 'CL Spider'
    download_delay = 2

Documentation: https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
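
For completeness, the global approach (option 1) is just a line in your project's settings.py; a minimal sketch:

# settings.py
DOWNLOAD_DELAY = 2
# By default RANDOMIZE_DOWNLOAD_DELAY = True, so Scrapy actually waits
# between 0.5 * and 1.5 * DOWNLOAD_DELAY, which makes the crawl look less robotic.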