I am trying to scrape the underlying data from the table on the following pages: https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries
What I want to do is access the underlying link for each row, and capture:
- The ID tag (e.g. QDE001)
- The name
- The reason for listing / additional information
- Other linked entities
This is what I have, but it does not seem to be working. I keep getting NotImplementedError('{}.parse callback is not defined'.format(self.__class__.__name__)). I believe the XPaths I have defined are OK; I am not sure what I am missing.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class UNSCItem(scrapy.Item):
    name = scrapy.Field()
    uid = scrapy.Field()
    link = scrapy.Field()
    reason = scrapy.Field()
    add_info = scrapy.Field()


class UNSC(scrapy.Spider):
    name = "UNSC"

    start_urls = [
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=0',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=1',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=2',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=3',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=4',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=5',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=6',
    ]

    rules = Rule(LinkExtractor(allow=('/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries/',)), callback='data_extract')

    def data_extract(self, response):
        item = UNSCItem()
        name = response.xpath('//*[@id="content"]/article/div[3]/div//text()').extract()
        uid = response.xpath('//*[@id="content"]/article/div[2]/div/div//text()').extract()
        reason = response.xpath('//*[@id="content"]/article/div[6]/div[2]/div//text()').extract()
        add_info = response.xpath('//*[@id="content"]/article/div[7]//text()').extract()
        related = response.xpath('//*[@id="content"]/article/div[8]/div[2]//text()').extract()
        yield item
Try the below approach. Your spider subclasses scrapy.Spider, whose default parse callback just raises NotImplementedError, and a rules attribute is only processed by CrawlSpider, so your data_extract callback is never wired up; defining a parse method instead fixes the error. The spider below should fetch you all the ids and corresponding names from the listed pages. I suppose you can manage the rest of the fields yourself. Just run it as it is:
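A minimal sketch of that approach. The views-field-* class names are my assumption about how the listing table is marked up (it looks like a Drupal Views table), so verify them against the live page:

import scrapy


class UNSC(scrapy.Spider):
    name = "UNSC"

    # Build the paginated listing URLs (page=0 through page=6) instead of
    # writing them out by hand.
    start_urls = [
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page={}'.format(page)
        for page in range(7)
    ]

    # scrapy.Spider dispatches each response to a method named `parse`;
    # naming the callback something else without CrawlSpider rules is what
    # raised the NotImplementedError in your version.
    def parse(self, response):
        for row in response.xpath('//table//tbody/tr'):
            # Assumed cell classes; adjust them to the table's real markup.
            uid = ' '.join(row.xpath(
                './/td[contains(@class, "views-field-field-reference-number")]//text()').extract()).strip()
            name = ' '.join(row.xpath(
                './/td[contains(@class, "views-field-title")]//text()').extract()).strip()
            yield {'uid': uid, 'name': name}

Save it as unsc_spider.py and run it with scrapy runspider unsc_spider.py -o names.json to collect the output as JSON.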