I am trying to scrape the underlying data from the table on the following pages: https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries
What I want to do is access the underlying link for each row, and capture:
- The ID tag (e.g. QDE001)
- The name
- The reason for listing / additional information
- Other linked entities
This is what I have, but it does not seem to be working. I keep getting NotImplementedError('{}.parse callback is not defined'.format(self.__class__.__name__)). I believe the XPaths I have defined are OK; I am not sure what I am missing.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class UNSCItem(scrapy.Item):
    name = scrapy.Field()
    uid = scrapy.Field()
    link = scrapy.Field()
    reason = scrapy.Field()
    add_info = scrapy.Field()


class UNSC(scrapy.Spider):
    name = "UNSC"

    start_urls = [
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=0',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=1',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=2',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=3',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=4',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=5',
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page=6',
    ]

    rules = Rule(LinkExtractor(allow=('/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries/',)), callback='data_extract')

    def data_extract(self, response):
        item = UNSCItem()
        name = response.xpath('//*[@id="content"]/article/div[3]/div//text()').extract()
        uid = response.xpath('//*[@id="content"]/article/div[2]/div/div//text()').extract()
        reason = response.xpath('//*[@id="content"]/article/div[6]/div[2]/div//text()').extract()
        add_info = response.xpath('//*[@id="content"]/article/div[7]//text()').extract()
        related = response.xpath('//*[@id="content"]/article/div[8]/div[2]//text()').extract()
        yield item
Try the below approach. Your spider subclasses scrapy.Spider, whose default parse callback just raises NotImplementedError, and a rules attribute is only processed by CrawlSpider, so your data_extract callback is never wired up; defining a parse method instead fixes the error. The spider below should fetch you all the ids and corresponding names from the listed pages. I suppose you can manage the rest of the fields yourself. Just run it as it is:
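A minimal sketch of that approach. The views-field-* class names are my assumption about how the listing table is marked up (it looks like a Drupal Views table), so verify them against the live page:

import scrapy


class UNSC(scrapy.Spider):
    name = "UNSC"

    # Build the paginated listing URLs (page=0 through page=6) instead of
    # writing them out by hand.
    start_urls = [
        'https://www.un.org/sc/suborg/en/sanctions/1267/aq_sanctions_list/summaries?type=All&page={}'.format(page)
        for page in range(7)
    ]

    # scrapy.Spider dispatches each response to a method named `parse`;
    # naming the callback something else without CrawlSpider rules is what
    # raised the NotImplementedError in your version.
    def parse(self, response):
        for row in response.xpath('//table//tbody/tr'):
            # Assumed cell classes; adjust them to the table's real markup.
            uid = ' '.join(row.xpath(
                './/td[contains(@class, "views-field-field-reference-number")]//text()').extract()).strip()
            name = ' '.join(row.xpath(
                './/td[contains(@class, "views-field-title")]//text()').extract()).strip()
            yield {'uid': uid, 'name': name}

Save it as unsc_spider.py and run it with scrapy runspider unsc_spider.py -o names.json to collect the output as JSON.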