Scrapy request/response (crawling to pages 2, 3, etc.)

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from asdf.items import AsdfItem
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import TakeFirst
from scrapy.http.request import Request
import scrapy

class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()

class MySpider(BaseSpider):
    name = "asdf"

    search_text = "midi key synth"

    allowed_domains = ["http://www.amazon.com"]
    start_urls = ["http://www.amazon.com/s?ie=UTF8&page=1&rh=i%3Aaps%2Ck%3A" + search_text]

    def parse(self, response):
        #title
        view = '//a[contains(@class, "a-link-normal s-access-detail-page  a-text-normal")]'
        nextPage = '//a[contains(@title, "Next Page")]'
        nextPageLink = 'http://www.amazon.com' + response.xpath(nextPage + '/@href').extract()[0]
        i = 0
        for sel in response.xpath(view):

            l = ItemLoader(item=AsdfItem(), selector=sel)
            l.add_xpath('title','.//@title')
            i+=1
            yield l.load_item()

        request = Request(nextPageLink, callback=self.parse_page2)
        request.meta['item'] = AsdfItem()
        yield request

    def parse_page2(self, reponse):
        #title
        view = '//a[contains(@class, "a-link-normal s-access-detail-page  a-text-normal")]'
        nextPage = '//a[contains(@title, "Next Page")]'
        nextPageLink = 'http://www.amazon.com' + response.xpath(nextPage + '/@href').extract()[0]
        i = 0
        for sel in response.xpath(view):

            l = ItemLoader(item=AsdfItem(), selector=sel)
            l.add_xpath('title','.//@title')
            i+=1
            yield l.load_item()

I have a Scrapy spider that crawls Amazon search results and collects the product titles. Why is the request/response not working for crawling the subsequent pages? I identify the next page by building a nextPageLink variable and passing it to a Request. Why does this not work, and how could I fix it?

Ideally, I would like to crawl all the subsequent pages.

1 answer

tegancp

Some things you should consider:

  1. Debugging: Scrapy has several ways to help you determine why your spider is not behaving the way you expect. Check out Debugging Spiders in the Scrapy docs; this may well be the most important page in the docs.

  2. Scrapy Shell: In particular, the Scrapy shell is invaluable for examining what is actually going on with your spider (as opposed to what you want to have happen). For example, if you run scrapy shell with the URL you want to start on and then call view(response), you can verify whether the spider is going to the page that you expect.

  3. Your code: A few specific observations from a quick look at your code:

    • remove the http:// from allowed_domains; it should hold bare domain names such as amazon.com, otherwise the offsite filter is likely to drop your follow-up requests
    • the spaces in search_text end up in the URL you give the spider, which is probably not going to get you what you want; URL-encode the search text first
    • if you want the spider to do basically the same thing on each page (i.e. collect information and follow the "next page" link), you are probably better off organizing your code with just one callback method (i.e. why do you need parse_page2?)
    • what is the variable i doing? It is incremented but never read
    • for what you are trying to accomplish, you may want to subclass CrawlSpider instead