Export scraping data in multiple formats using scrapy


I'm scraping a website to export the data into a semantic format (n3). However, I also want to perform some data analysis on that data, so having it in a csv format is more convenient.

To get the data in both formats I can do

scrapy crawl myspider -t n3 -o data.n3
scrapy crawl myspider -t csv -o data.csv

However, this scrapes the data twice, which I cannot afford with large amounts of data.

Is there a way to export the same scraped data into multiple formats? (without downloading the data more than once)

I find it interesting to have an intermediate representation of the scraped data that could then be exported into different formats. But it seems there is no way to do this with Scrapy.
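Outside of Scrapy, the intermediate-representation idea is easy to sketch with the standard library alone: scrape once into a plain list of dicts, then serialize that same list into each target format. The item fields and file names below are illustrative, not from the original question:

```python
import csv
import json

# One scraping pass produces a single intermediate representation:
items = [
    {"name": "widget", "price": "9.99"},
    {"name": "gadget", "price": "19.99"},
]

# The same list is then written out once per desired format.
with open("data.json", "w") as f:
    json.dump(items, f, indent=2)

with open("data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(items)
```

The pipeline approach in the accepted answer below does essentially the same thing, but streams items to all exporters as they arrive instead of buffering them in memory first.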

Answer by alecxe (accepted):

From what I understand after exploring the source code and the documentation, the -t option refers to the FEED_FORMAT setting, which cannot have multiple values. Also, the built-in FeedExporter extension (source) works with a single exporter only.

Consider making a feature request at the Scrapy issue tracker.

As a workaround, define an item pipeline and write each item with multiple exporters. For example, here is how to export into both CSV and JSON formats:

from collections import defaultdict

from scrapy import signals
from scrapy.exporters import JsonItemExporter, CsvItemExporter


class MyExportPipeline(object):
    def __init__(self):
        self.files = defaultdict(list)

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        # run exporter setup/teardown when the spider opens and closes
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        # item exporters expect file objects opened in binary mode
        csv_file = open('%s_products.csv' % spider.name, 'w+b')
        json_file = open('%s_products.json' % spider.name, 'w+b')

        self.files[spider].append(csv_file)
        self.files[spider].append(json_file)

        self.exporters = [
            JsonItemExporter(json_file),
            CsvItemExporter(csv_file)
        ]

        for exporter in self.exporters:
            exporter.start_exporting()

    def spider_closed(self, spider):
        for exporter in self.exporters:
            exporter.finish_exporting()

        files = self.files.pop(spider)
        for file in files:
            file.close()

    def process_item(self, item, spider):
        for exporter in self.exporters:
            exporter.export_item(item)
        return item
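For the pipeline to actually run, it must be enabled in the project's settings.py. The module path below is an assumption; adjust it to wherever the pipeline class lives in your project:

```python
# settings.py
# 'myproject.pipelines' is a placeholder module path; the number sets the
# pipeline's order (lower values run earlier, valid range 0-1000).
ITEM_PIPELINES = {
    'myproject.pipelines.MyExportPipeline': 300,
}
```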