Scrapy CrawlSpider not following links

I am trying to crawl some attributes from all (#123) detail pages listed on this category page - http://stinkybklyn.com/shop/cheese/ - but Scrapy is not following the link pattern I set. I checked the Scrapy documentation and some tutorials as well, but no luck!

Below is the code:

import scrapy

from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

class Stinkybklyn(CrawlSpider):
    name = "Stinkybklyn"
    allowed_domains = ["stinkybklyn.com"]
    start_urls = [
        "http://stinkybklyn.com/shop/cheese/chandoka",
    ]
    Rule(LinkExtractor(allow=r'\/shop\/cheese\/.*'),
         callback='parse_items', follow=True)


    def parse_items(self, response):
        print "response", response
        hxs= HtmlXPathSelector(response)
        title=hxs.select("//*[@id='content']/div/h4").extract()
        title="".join(title)
        title=title.strip().replace("\n","").lstrip()
        print "title is:",title

Can someone please advise what I am doing wrong here?

There are 2 answers

alecxe (accepted answer)

The key problem with your code is that you have not set the rules attribute for the CrawlSpider: a Rule is created, but it is never assigned to rules, so the spider has nothing to follow.
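
A minimal sketch of just that change (the rest of the class stays as you have it):

    # inside the Stinkybklyn class body, replacing the bare Rule(...) lines
    rules = [
        Rule(LinkExtractor(allow=r'\/shop\/cheese\/.*'),
             callback='parse_items', follow=True),
    ]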

Other improvements I would suggest:

  • there is no need to instantiate HtmlXPathSelector, you can use response directly
  • select() is deprecated now, use xpath()
  • get the text() of the title element so that you retrieve, for instance, Chandoka instead of <h4>Chandoka</h4> (see the short shell sketch after this list)
  • I think you meant to start with the cheese shop catalog page instead: http://stinkybklyn.com/shop/cheese
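
For instance, in a scrapy shell session opened on one of the detail pages (scrapy shell http://stinkybklyn.com/shop/cheese/chandoka), the difference looks roughly like this - the exact page markup is an assumption on my side:

>>> # whole <h4> element, tags included
>>> response.xpath("//*[@id='content']/div/h4").extract()
[u'<h4>Chandoka</h4>']
>>> # just the text node
>>> response.xpath("//*[@id='content']/div/h4/text()").extract()
[u'Chandoka']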

The complete code with the applied improvements:

from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule


class Stinkybklyn(CrawlSpider):
    name = "Stinkybklyn"
    allowed_domains = ["stinkybklyn.com"]

    start_urls = [
        "http://stinkybklyn.com/shop/cheese",
    ]

    rules = [
        Rule(LinkExtractor(allow=r'\/shop\/cheese\/.*'), callback='parse_items', follow=True)
    ]

    def parse_items(self, response):
        title = response.xpath("//*[@id='content']/div/h4/text()").extract()
        title = "".join(title)
        title = title.strip().replace("\n", "").lstrip()
        print "title is:", title

Jithin

It looks like you have some syntax errors. Try this:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector


class Stinkybklyn(CrawlSpider):
    name = "Stinkybklyn"
    allowed_domains = ["stinkybklyn.com"]
    start_urls = [
        "http://stinkybklyn.com/shop/cheese/",
    ]

    rules = (
        Rule(LinkExtractor(allow=(r'/shop/cheese/')), callback='parse_items'),
    )

    def parse_items(self, response):
        print "response", response