Get offset and length of a subset of a WAT archive from Common Crawl index server

1.4k views Asked by At

I would like to download a subset of a WAT archive segment from Amazon S3.

Background:

Searching the Common Crawl index at http://index.commoncrawl.org yields results with information about the location of WARC files on AWS S3. For example, searching for url=www.celebuzz.com/2017-01-04/*&output=json yields JSON-formatted results, one of which is

{ "urlkey":"com,celebuzz)/2017-01-04/watch-james-corden-george-michael-tribute", ... "filename":"crawl-data/CC-MAIN-2017-34/segments/1502886104631.25/warc/CC-MAIN-20170818082911-20170818102911-00023.warc.gz", ... "offset":"504411150", "length":"14169", ... }

The filename entry indicates which archive segment contains the WARC file for this particular page. This archive file is huge; but fortunately the entry also contains offset and length fields, which can be used to request the range of bytes containing the relevant subset of the archive segment (see, e.g., lines 22-30 in this gist).

My question:

Given the location of a WARC file segment, I know how to construct the name of the corresponding WAT archive segment (see, e.g., this tutorial). I only need a subset of the WAT file, so I would like to request a range of bytes. But how do I find the corresponding offset and length for the WAT archive segment?

I have checked the API documentation for the Common Crawl index server, and it isn't clear to me that this is even possible. But in case it is, I'm posting this question.

2

There are 2 answers

2
Sebastian Nagel On BEST ANSWER

The Common Crawl index does not contain offsets into WAT and WET files. So, the only way is to search the whole WAT/WET file for the desired record/URL. Eventually, it would be possible to estimate the offset because the record order in WARC and WAT/WET files is the same.

2
dlazesz On

After many trial and error I had managed to get a range from a warc file in python and boto3 the following way:

# You have this form the index
offset, length, filename = 2161478, 12350, "crawl-data/[...].warc.gz"

import boto3
from botocore import UNSIGNED
from botocore.client import Config

# Boto3 anonymous login to common crawl
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))

# Count the range
offset_end = offset + length - 1
byte_range = 'bytes={offset}-{end}'.format(offset=2161478, end=offset_end)
gzipped_text = s3.get_object(Bucket='commoncrawl', Key=filename, Range=byte_range)['Body'].read()

# The requested file in GZIP
with open("file.gz", 'w') as f:
  f.write(gzipped_text)

The rest is optimisation... Hope it helps! :)