Need help with YellowPages spider

Question

162 views Asked by oscarQ At 01 July 2017 at 21:41

I'm new to Scrapy and have written a few spiders so far. I would like to write a spider that crawls Yellowpages and looks for listed websites that return a 404 response. The spider itself works, but the pagination does not. Any help will be much appreciated. Thanks in advance.

# -*- coding: utf-8 -*-
import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    #allowed_domains = ['www.yellowpages.com']
    start_urls = ['https://www.yellowpages.com/search?search_terms=handyman&geo_location_terms=Miami%2C+FL']

    def parse(self, response):
        for listing in response.css('div.search-results.organic div.srp-listing'):

            url = listing.css('a.track-visit-website::attr(href)').extract_first()

            yield scrapy.Request(url=url, callback=self.parse_details)

        # follow pagination links

        next_page_url = response.css('a.next.ajax-page::attr(href)').extract_first()
        next_page_url = response.urljoin(next_page_url)
        if next_page_url:
            yield scrapy.Request(url=next_page_url, callback=self.parse)

    def parse_details(self, response):
        yield {'Response': response}

There is 1 answer

**Adrien Blanquer** · Accepted Answer · 2017-07-02T10:47:11+00:00

I ran your code and found an error. In the first loop you don't check the value of url, and for listings without a website link it is None. Passing None to scrapy.Request raises an exception inside parse, which stops the method before it reaches the pagination code. That's why you thought the pagination didn't work.
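To make the guard concrete, here is a minimal standard-library sketch, not part of the original answer. The helper names `listing_urls` and `next_page` are hypothetical and only illustrate the two checks. The second one matters because `urljoin` returns the base URL unchanged when the link is missing, so a truthiness test done after joining always passes.

```python
from urllib.parse import urljoin


def listing_urls(hrefs):
    """Keep only the listings that actually have a website link.

    Hypothetical helper: `hrefs` stands in for the values returned by
    extract_first(), which is None when a listing has no link.
    """
    return [href for href in hrefs if href]


def next_page(base_url, href):
    """Return the absolute next-page URL, or None on the last page.

    Hypothetical helper: the href is tested before joining, because
    joining a missing href just gives back base_url.
    """
    return urljoin(base_url, href) if href else None
```

In the spider, the same two checks become `if url:` before yielding the listing request and `if next_page_url:` before calling `response.urljoin`.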

Here is a working version:

# -*- coding: utf-8 -*-
import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    #allowed_domains = ['www.yellowpages.com']
    start_urls = ['https://www.yellowpages.com/search?search_terms=handyman&geo_location_terms=Miami%2C+FL']

    def parse(self, response):
        for listing in response.css('div.search-results.organic div.srp-listing'):
            url = listing.css('a.track-visit-website::attr(href)').extract_first()
            # Some listings have no website link; skip them instead of requesting None.
            if url:
                yield scrapy.Request(url=url, callback=self.parse_details)

        # Follow the pagination link. Check it before urljoin():
        # response.urljoin(None) returns the current page URL, which is always truthy.
        next_page_url = response.css('a.next.ajax-page::attr(href)').extract_first()
        if next_page_url:
            yield scrapy.Request(url=response.urljoin(next_page_url), callback=self.parse)

    def parse_details(self, response):
        yield {'Response': response}
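
One more point on the stated goal of finding 404 responses, which the answer above does not cover. By default Scrapy's HttpErrorMiddleware drops non-2xx responses before they reach a callback, so parse_details would never see a 404. A minimal sketch of one way to let them through, assuming a standard Scrapy project with a settings.py and the default middleware enabled:

```python
# settings.py (project settings module, assumed to exist)
# Let 404 responses pass HttpErrorMiddleware so the callback receives them.
HTTPERROR_ALLOWED_CODES = [404]
```

With this in place, parse_details can inspect response.status to tell a 404 from a successful page. Sites that fail at the network level (DNS errors, timeouts) produce no response at all and would need an errback on the request instead.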
