Scrapy CrawlSpider not following links

I am trying to crawl some attributes from all (#123) detail pages listed on this category page - http://stinkybklyn.com/shop/cheese/ - but Scrapy is not following the link pattern I set. I checked the Scrapy documentation and some tutorials as well, but no luck!

Below is the code:

import scrapy

from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

class Stinkybklyn(CrawlSpider):
    name = "Stinkybklyn"
    allowed_domains = ["stinkybklyn.com"]
    start_urls = [
        "http://stinkybklyn.com/shop/cheese/chandoka",
    ]
    Rule(LinkExtractor(allow=r'\/shop\/cheese\/.*'),
         callback='parse_items', follow=True)


    def parse_items(self, response):
        print "response", response
        hxs= HtmlXPathSelector(response)
        title=hxs.select("//*[@id='content']/div/h4").extract()
        title="".join(title)
        title=title.strip().replace("\n","").lstrip()
        print "title is:",title

Can someone please advise what I am doing wrong here?

There are 2 answers

alecxe (accepted answer)

The key problem with your code is that you have not set the rules attribute for the CrawlSpider: a Rule is created, but it is never assigned to rules, so the spider has nothing to follow.
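
A minimal sketch of just that change (the rest of the class stays as you have it):

    # inside the Stinkybklyn class body, replacing the bare Rule(...) lines
    rules = [
        Rule(LinkExtractor(allow=r'\/shop\/cheese\/.*'),
             callback='parse_items', follow=True),
    ]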

Other improvements I would suggest:

  • there is no need to instantiate HtmlXPathSelector, you can use response directly
  • select() is deprecated now, use xpath()
  • get the text() of the title element so that you retrieve, for instance, Chandoka instead of <h4>Chandoka</h4> (see the short shell sketch after this list)
  • I think you meant to start with the cheese shop catalog page instead: http://stinkybklyn.com/shop/cheese
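
For instance, in a scrapy shell session opened on one of the detail pages (scrapy shell http://stinkybklyn.com/shop/cheese/chandoka), the difference looks roughly like this - the exact page markup is an assumption on my side:

>>> # whole <h4> element, tags included
>>> response.xpath("//*[@id='content']/div/h4").extract()
[u'<h4>Chandoka</h4>']
>>> # just the text node
>>> response.xpath("//*[@id='content']/div/h4/text()").extract()
[u'Chandoka']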

The complete code with the applied improvements:

from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule


class Stinkybklyn(CrawlSpider):
    name = "Stinkybklyn"
    allowed_domains = ["stinkybklyn.com"]

    start_urls = [
        "http://stinkybklyn.com/shop/cheese",
    ]

    rules = [
        Rule(LinkExtractor(allow=r'\/shop\/cheese\/.*'), callback='parse_items', follow=True)
    ]

    def parse_items(self, response):
        title = response.xpath("//*[@id='content']/div/h4/text()").extract()
        title = "".join(title)
        title = title.strip().replace("\n", "").lstrip()
        print "title is:", title

Jithin

It looks like you have some syntax errors. Try this:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector


class Stinkybklyn(CrawlSpider):
    name = "Stinkybklyn"
    allowed_domains = ["stinkybklyn.com"]
    start_urls = [
        "http://stinkybklyn.com/shop/cheese/",
    ]

    rules = (
        Rule(LinkExtractor(allow=(r'/shop/cheese/')), callback='parse_items'),
    )

    def parse_items(self, response):
        print "response", response