I'm new to programming and am trying to process a WARC file by splitting it into chunks and then storing each chunk in a dictionary.
Each chunk starts with the WARC/1.0 header, and chunks are separated by 3 empty lines. I would also like to remove the first 2 paragraphs:
WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2020-08-04T01:43:40Z
WARC-Record-ID: <urn:uuid:959ea654-33fd-466b-b1bf-f08aa8abe774>
Content-Length: 500
Content-Type: application/warc-fields
WARC-Filename: CC-MAIN-20200804014340-20200804044340-00045.warc.gz
isPartOf: CC-MAIN-2020-34
publisher: Common Crawl
description: Wide crawl of the web for August 2020
operator: Common Crawl Admin ([email protected])
hostname: ip-10-67-67-22.ec2.internal
software: Apache Nutch 1.17 (modified, https://github.com/commoncrawl/nutch/)
robots: checked via crawler-commons 1.2-SNAPSHOT (https://github.com/crawler-commons/crawler-commons)
format: WARC File Format 1.1
conformsTo: http://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/
#Keep everything from here down:
WARC/1.0
WARC-Type: request
WARC-Date: 2020-08-04T03:25:25Z
WARC-Record-ID: <urn:uuid:6c0b749a-4d02-4a77-ab93-9bc4ba094cdc>
Content-Length: 303
Content-Type: application/http; msgtype=request
WARC-Warcinfo-ID: <urn:uuid:959ea654-33fd-466b-b1bf-f08aa8abe774>
WARC-IP-Address: 104.254.66.40
WARC-Target-URI: http://00.auto.sohu.com/d/details?cityCode=450100&planId=1450&trimId=145372
I've tried using a generator to group the chunks, but it's returning one group (the whole file). Is there a simple way to separate these?
I can't import any libraries.
Any help would be greatly appreciated!!
By far the best way to do this is to use the warcio library, which knows how to parse WARC files into records properly.
Barring that, I would copy the warcio code into yours (its license is permissive).
WARC files are complicated, and using a fully tested, widely used library is the right way to parse them.
If you're downloading data from Common Crawl, I would also recommend checking out my Python package cdx_toolkit. It uses warcio under the hood and handles the downloading steps.
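If neither option is available to you, here is a rough sketch using only built-ins. It is not how warcio works: it assumes the file is uncompressed text, that every record begins with a line that is exactly WARC/1.0, and that no payload contains such a line. It also ignores Content-Length, which a real parser would rely on. The function name split_warc_records is made up for illustration.

```python
def split_warc_records(text):
    """Split WARC text into a dict of records, skipping warcinfo records.

    Assumption: a line that is exactly "WARC/1.0" starts a new record.
    """
    chunks = []
    current = None  # lines of the record currently being collected
    for line in text.splitlines():
        if line.strip() == "WARC/1.0":
            if current is not None:
                chunks.append(current)
            current = []
        if current is not None:
            current.append(line)
    if current is not None:
        chunks.append(current)

    records = {}
    for lines in chunks:
        # Read "Name: value" header lines until the first blank line.
        headers = {}
        for line in lines[1:]:
            if not line.strip():
                break
            name, sep, value = line.partition(":")
            if sep:
                headers[name.strip()] = value.strip()
        if headers.get("WARC-Type") == "warcinfo":
            continue  # drop the leading warcinfo record
        # Key by record ID, falling back to a running index.
        key = headers.get("WARC-Record-ID", str(len(records)))
        records[key] = "\n".join(lines).strip()
    return records
```

You would call it with the file contents as one string, for example the result of open(path).read() on an unpacked .warc file. Each value in the returned dictionary is the full text of one record, starting with its WARC/1.0 line.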