I am looking for a Scrapy Spider that, instead of getting URLs and crawling them, takes a WARC file as input (preferably from S3) and sends its content to the parse method.
I actually need to skip the download phase entirely, which means that from the start_requests method I would like to return a Response that is then passed to the parse method.
This is what I have so far:
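import gzip

import warc
from scrapy import Spider
from scrapy.http import Response


class WarcSpider(Spider):
    name = "warc_spider"

    def start_requests(self):
        f = warc.WARCFile(fileobj=gzip.open("file.war.gz"))
        for record in f:
            # Only "response" records carry a captured HTTP response.
            if record.type == "response":
                # The payload is the raw HTTP message: status line and
                # headers, a blank line, then the body.
                payload = record.payload.read()
                headers, body = payload.split('\r\n\r\n', 1)
                url = record['WARC-Target-URI']
                yield Response(url=url, status=200, body=body, headers=headers)

    def parse(self, response):
        # code that creates item
        pass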
Any ideas on what the Scrapy way of doing that is?
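A side note on the payload handling in the question's snippet: the part before the first blank line of an HTTP response contains the status line as well as the headers, so passing it along as-is mixes the two, and the status ends up hard-coded to 200. Below is a minimal, standard-library-only sketch of splitting such a payload into a status code, a header mapping, and a body. The helper name `split_http_payload` is hypothetical (it is not part of Scrapy or the warc library), and the sketch assumes the payload is bytes, ignores folded header lines, and does not undo chunked transfer or content encodings.

```python
def split_http_payload(payload):
    """Split a raw HTTP response (bytes) into (status, headers, body).

    Hypothetical helper for illustration; not part of Scrapy or warc.
    """
    # Everything before the first blank line is the status line + headers.
    head, _, body = payload.partition(b"\r\n\r\n")
    lines = head.split(b"\r\n")

    # The status line looks like b"HTTP/1.1 200 OK"; the code is field two.
    status = int(lines[0].split(b" ", 2)[1].decode("ascii"))

    # Collect headers as name -> list of values, so repeated headers
    # (for example Set-Cookie) are all kept.
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(b":")
        if sep:  # skip lines without a colon (e.g. folded continuations)
            headers.setdefault(name.strip(), []).append(value.strip())

    return status, headers, body


raw = (
    b"HTTP/1.1 404 Not Found\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html>missing</html>"
)
status, headers, body = split_http_payload(raw)
```

The resulting `status`, `headers`, and `body` could then be passed to the Response constructor in place of the hard-coded status and the raw header text, assuming the header mapping is in a form your Scrapy version accepts.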
What you want to do is something like this: