Need help with YellowPages spider


I'm new to Scrapy and have written a few spiders so far. I'd like to write one that crawls Yellowpages and looks for listed websites that return a 404 response. The spider itself works, but the pagination does not. Any help would be much appreciated. Thanks in advance.

# -*- coding: utf-8 -*-
import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    #allowed_domains = ['www.yellowpages.com']
    start_urls = ['https://www.yellowpages.com/search?search_terms=handyman&geo_location_terms=Miami%2C+FL']

    def parse(self, response):
        for listing in response.css('div.search-results.organic div.srp-listing'):

            url = listing.css('a.track-visit-website::attr(href)').extract_first()

            yield scrapy.Request(url=url, callback=self.parse_details)

        # follow pagination links

        next_page_url = response.css('a.next.ajax-page::attr(href)').extract_first()
        next_page_url = response.urljoin(next_page_url)
        if next_page_url:
            yield scrapy.Request(url=next_page_url, callback=self.parse)

    def parse_details(self, response):
        yield {'Response': response}

Adrien Blanquer On BEST ANSWER

I ran your code and found the actual problem. In the loop you never check the value of url, and for listings without a website link it is None. Passing None to scrapy.Request raises a TypeError, which aborts the parse callback before it ever reaches the pagination code. That is why it looked as if the pagination didn't work.
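To make the failure mode concrete, here is a minimal sketch in plain Python with no Scrapy involved. `make_request` is a hypothetical stand-in for `scrapy.Request` that only mimics its rejection of a non-string URL, and the example URLs are made up. It shows that an exception raised inside a generator callback prevents everything after the loop, including the pagination request, from being yielded.

```python
def make_request(url):
    """Hypothetical stand-in for scrapy.Request: rejects a non-string URL."""
    if not isinstance(url, str):
        raise TypeError(f"Request url must be str, got {type(url).__name__}")
    return ("request", url)


def parse_unguarded(listing_urls, next_page):
    # Mirrors the question's callback: url is used without a check.
    for url in listing_urls:
        yield make_request(url)
    # Never reached if the loop above raised.
    yield make_request(next_page)


def parse_guarded(listing_urls, next_page):
    # Mirrors the fix: listings without a website link are skipped.
    for url in listing_urls:
        if url:
            yield make_request(url)
    yield make_request(next_page)


def drain(gen):
    """Collect yielded items until the generator finishes or raises TypeError."""
    results = []
    try:
        for item in gen:
            results.append(item)
    except TypeError as exc:
        results.append(("error", str(exc)))
    return results


# Made-up data: the second listing has no website link.
urls = ["https://a.example", None, "https://b.example"]
unguarded = drain(parse_unguarded(urls, "https://yp.example/page2"))
guarded = drain(parse_guarded(urls, "https://yp.example/page2"))
```

In the unguarded version the generator dies on the second listing, so the request for page 2 is never produced. The guarded version skips that listing and still yields the pagination request.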

Here is a working version:

# -*- coding: utf-8 -*-
import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    #allowed_domains = ['www.yellowpages.com']
    start_urls = ['https://www.yellowpages.com/search?search_terms=handyman&geo_location_terms=Miami%2C+FL']

    def parse(self, response):
        for listing in response.css('div.search-results.organic div.srp-listing'):
            url = listing.css('a.track-visit-website::attr(href)').extract_first()
            # Some listings have no website link; skip them instead of passing None to Request.
            if url:
                yield scrapy.Request(url=url, callback=self.parse_details)

        # Follow the pagination link. Check the raw href before joining:
        # response.urljoin(None) returns the current page URL, which is always truthy.
        next_page_url = response.css('a.next.ajax-page::attr(href)').extract_first()
        if next_page_url:
            yield scrapy.Request(url=response.urljoin(next_page_url), callback=self.parse)

    def parse_details(self, response):
        yield {'Response': response}
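A side note on the pagination check: joining a missing href does not give you None. The sketch below uses the standard library's `urllib.parse.urljoin`, on the assumption that `response.urljoin` behaves the same way for a missing href (as far as I know it is built on it). `next_page_url` is a hypothetical helper and the `/search?page=2` href is made up for illustration.

```python
from urllib.parse import urljoin


def next_page_url(page_url, href):
    """Return the absolute next-page URL, or None when the page has no next link."""
    if not href:
        return None
    return urljoin(page_url, href)


page = "https://www.yellowpages.com/search?search_terms=handyman&geo_location_terms=Miami%2C+FL"

# Joining first and testing afterwards cannot detect the last page:
# urljoin returns the base URL unchanged when the second argument is empty.
joined_without_check = urljoin(page, None)

# Testing the raw href first gives a usable "no more pages" signal.
last_page = next_page_url(page, None)
following_page = next_page_url(page, "/search?page=2")
```

So the href should be tested before joining. Otherwise the spider requests the current page again on the last page and relies on Scrapy's duplicate filter to drop that request.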