How does Scrapy write to the log while running a spider?


While running a Scrapy spider, I see log messages at the DEBUG level such as:

1. DEBUG: Crawled (200) <GET http://www.example.com> (referer: None)
2. DEBUG: Scraped from <200 http://www.example.com>

I want to know:

1. What do "Crawled" and "Scraped from" mean?
2. Where do both of the URLs above come from (i.e., while scraping a page, which variable/argument holds those URLs)?


1 Answer

Frank Martin (Best Answer)

Let me try to explain, based on the sample code shown on the Scrapy website. I saved it in a file called scrapy_example.py.

from scrapy import Spider, Item, Field

class Post(Item):
    # container for one scraped blog post
    title = Field()

class BlogSpider(Spider):
    name = 'blogspider'
    start_urls = ['http://blog.scrapinghub.com']

    def parse(self, response):
        # extract the text of every <h2><a> element as a post title
        return [Post(title=e.extract()) for e in response.css("h2 a::text")]

Executing this with the command scrapy runspider scrapy_example.py produces output like the following:

(...)
DEBUG: Crawled (200) <GET http://blog.scrapinghub.com> (referer: None) ['partial']
DEBUG: Scraped from <200 http://blog.scrapinghub.com>
    {'title': u'Using git to manage vacations in a large distributed\xa0team'}
DEBUG: Scraped from <200 http://blog.scrapinghub.com>
    {'title': u'Gender Inequality Across Programming\xa0Languages'}
(...)

Crawled means: Scrapy has downloaded that webpage.

Scraped means: Scrapy has extracted some data (here, an item with a title field) from that webpage.
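These DEBUG lines are written by Scrapy's own logging machinery. If you want to add your own messages alongside them, every spider exposes a standard Python logger as self.logger. A minimal sketch (the spider name and message texts here are just for illustration):

from scrapy import Spider

class LoggingSpider(Spider):
    name = 'loggingspider'
    start_urls = ['http://blog.scrapinghub.com']

    def parse(self, response):
        # self.logger is a logger named after the spider; these lines
        # show up next to Scrapy's own "Crawled"/"Scraped from" output
        self.logger.debug('Parsing %s', response.url)
        self.logger.info('Found %d post titles', len(response.css('h2 a::text')))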

The URL is given in the script as the start_urls attribute.
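At runtime, that URL is available on the response object passed to your callback: response.url holds the page's own URL, and the referer (None for start requests) can be read from the request headers. A small sketch of a parse callback, assuming a spider like the one above (the comments are illustrative):

def parse(self, response):
    # response.url is the URL printed in the "Crawled"/"Scraped from" lines
    self.logger.debug('Current page: %s', response.url)
    # referer of the request that produced this response (None for start_urls)
    self.logger.debug('Referer: %s', response.request.headers.get('Referer'))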

Your output must have been generated by running a spider. Search the file where that spider is defined and you should be able to spot the place where the URL is defined.