scrapinghub/splash loses some data on rendering

161 views Asked by At

I'm trying to create web-scraper for dynamic website. For this purpose I'm using Scrapy 1.2.1 and scrapy-splash 0.7 library. The problem appears when using splash server, most of the times, it returns different data to scrapy. From log I can see that all pages are crawled. If I use scrapy.Request instead of SplashRequest, everything is OK (I get the same data each time).

My code:

import scrapy
import re
from scrapy_splash import SplashRequest
from scraper.items import ScraperRozetka


class RozetkaSpider(scrapy.Spider):
    name = "rozetka_laptops"
    start_urls = [
        'http://rozetka.com.ua/notebooks/c80004/filter/producer=dell;page=1/',
    ]

    def parse(self, response):
        last_page = response.xpath('//ul[@name="paginator"]//li[last()]//@id').extract_first()
        last_page_num = int(last_page[-1])

        i = 1
        while i <= last_page_num:
            url = re.sub(r'page=\d+', r'page={}'.format(i), response.url)
            i += 1
            yield SplashRequest(url, self.parse_results, endpoint='render.html', args={'wait': 0.5, 'timeout': 60})

    def parse_results(self, response):
        items = []
        records = response.css('div.g-i-tile-catalog')
        for record in records:
            item = ScraperRozetka()
            item['title'] = record.css('img::attr(title)').extract_first()
            item['price'] = record.css('div.g-price-uah::text').extract_first()
            item['link'] = record.css('div.g-i-tile-i-title a::attr(href)').extract_first()
            items.append(item)
        return items

Would be grateful if someone helped me. Thanks.

0

There are 0 answers