```python
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from asdf.items import AsdfItem
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import TakeFirst
from scrapy.http.request import Request
import scrapy

class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()

class MySpider(BaseSpider):
    name = "asdf"
    search_text = "midi key synth"
    allowed_domains = ["http://www.amazon.com"]
    start_urls = ["http://www.amazon.com/s?ie=UTF8&page=1&rh=i%3Aaps%2Ck%3A" + search_text]

    def parse(self, response):
        #title
        view = '//a[contains(@class, "a-link-normal s-access-detail-page a-text-normal")]'
        nextPage = '//a[contains(@title, "Next Page")]'
        nextPageLink = 'http://www.amazon.com' + response.xpath(nextPage + '/@href').extract()[0]
        i = 0
        for sel in response.xpath(view):
            l = ItemLoader(item=AsdfItem(), selector=sel)
            l.add_xpath('title', './/@title')
            i += 1
            yield l.load_item()
        request = Request(nextPageLink, callback=self.parse_page2)
        request.meta['item'] = AsdfItem()
        yield request

    def parse_page2(self, reponse):
        #title
        view = '//a[contains(@class, "a-link-normal s-access-detail-page a-text-normal")]'
        nextPage = '//a[contains(@title, "Next Page")]'
        nextPageLink = 'http://www.amazon.com' + response.xpath(nextPage + '/@href').extract()[0]
        i = 0
        for sel in response.xpath(view):
            l = ItemLoader(item=AsdfItem(), selector=sel)
            l.add_xpath('title', './/@title')
            i += 1
            yield l.load_item()
```
I have a scrapy bot that crawls Amazon and looks for titles. Why is the request/response not working for crawling to the subsequent pages? I identify the next page by creating a `nextPageLink` variable and passing it into the request. Why does this not work, and how can I fix it?

Ideally, I would like to crawl all the subsequent pages.
Some things you should consider:
Debugging: Scrapy has several ways to help determine why your spider is not behaving the way you want/expect. Check out Debugging Spiders in the scrapy docs; this may well be the most important page in the docs.
Scrapy Shell: In particular, the scrapy shell is invaluable for examining what is actually going on with your spider (as opposed to what you want to have happen). For example, if you run `scrapy shell` with the url you want to start on and then call `view(response)`, you can verify whether the spider is going to the page that you expect it to.

Your code: A few specific observations from a quick look at your code:
- Remove the `http://` from your allowed domains; `allowed_domains` expects bare domain names (e.g. `www.amazon.com`), not URLs. With a scheme in there, the offsite middleware filters out your requests.
- In `parse_page2`, the argument is spelled `reponse` but the body uses `response` (typo?). That raises a `NameError` as soon as the callback runs, so the second page is never parsed.
- What is `i` doing? It is incremented but never used anywhere.
- Even if you fix the typo, `parse_page2` never yields a request for the page after it, so the crawl stops. Consider using `CrawlSpider` instead; it follows pagination links for you via rules.