How to open Commoncrawl.org WARC.GZ S3 Data in Spark


I want to access a Common Crawl file from the Amazon public datasets repository from the Spark shell. The files are in WARC.GZ format.

val filenameList = List("s3://<ID>:<SECRET>@aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2014-41/segments/1410657102753.15/warc/CC-MAIN-20140914011142-00000-ip-10-196-40-205.us-west-1.compute.internal.warc.gz")

// TODO: implement functionality to read the WARC.GZ file here
val loadedFiles = sc.parallelize(filenameList, filenameList.length).mapPartitions(i => i)
loadedFiles.foreach(f => f.take(1))

I would now implement a function to read the WARC.GZ format inside the mapPartitions function, along the lines of the sketch below. Is this a good approach? I ask because I am fairly new to the Spark platform and want to implement a small demo application using a small part of the Common Crawl corpus. I saw mapPartitions being used in a thread here.
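A minimal sketch of that approach, assuming each partition receives one file URL, opens the gzipped stream, and emits one value per WARC record. The targetUris helper and the plain-HTTPS form of the public object URL are illustrative assumptions, not tested code; a real application would use a proper WARC parser rather than this line scan.

import java.io.{BufferedReader, InputStreamReader}
import java.net.URL
import java.util.zip.GZIPInputStream

// Hypothetical helper: stream one gzipped WARC file over HTTP(S) and emit the
// WARC-Target-URI header of every record. This is a crude line scan, not a
// robust WARC parser -- payload bytes are decoded as ISO-8859-1 so that binary
// content cannot break the reader -- but it shows the mapPartitions shape.
def targetUris(fileUrl: String): Iterator[String] = {
  val in = new GZIPInputStream(new URL(fileUrl).openStream())
  val reader = new BufferedReader(new InputStreamReader(in, "ISO-8859-1"))
  Iterator.continually(reader.readLine())
    .takeWhile(_ != null)
    .filter(_.startsWith("WARC-Target-URI:"))
    .map(_.stripPrefix("WARC-Target-URI:").trim)
}

// Plain HTTPS form of the same public object (an assumption: the bucket was
// publicly readable over HTTP at the time).
val urls = List("https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2014-41/segments/1410657102753.15/warc/CC-MAIN-20140914011142-00000-ip-10-196-40-205.us-west-1.compute.internal.warc.gz")

val uris = sc.parallelize(urls, urls.length).mapPartitions(_.flatMap(targetUris))
uris.take(5).foreach(println)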

In a first attempt, I tried to open the file directly from my own computer using sc.textFile("s3://....").take(1), which resulted in an access-denied error. Are the Amazon S3 public repository files accessible only from EC2 instances?
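One common fix for such access-denied errors (a sketch, assuming the legacy Hadoop s3n:// connector of that era) is to supply the AWS credentials through the Hadoop configuration rather than embedding them in the URL:

// Sketch: set the s3n:// credential keys on the shared Hadoop configuration.
// Newer Hadoop versions use the fs.s3a.* properties instead.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<ID>")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "<SECRET>")

// textFile reads the gzipped bytes as lines; for binary WARC content this
// only confirms the object is reachable, it is not a useful parse.
sc.textFile("s3n://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-41/segments/1410657102753.15/warc/CC-MAIN-20140914011142-00000-ip-10-196-40-205.us-west-1.compute.internal.warc.gz").take(1)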

1 Answer

Answered by Smerity

There is example code from the "Analyzing Web Domain Vulnerabilities" analysis that shows how to access WARC files from Spark, since Spark supports the Hadoop InputFormat interface. The code itself is hosted on GitHub.
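A minimal sketch of the InputFormat route in the Spark shell. The WARCInputFormat and WARCWritable class names follow the open-source ept/warc-hadoop library as I recall them, so verify them against that project's README, and make sure its jar is on the Spark classpath (e.g. via --jars):

import org.apache.hadoop.io.LongWritable
import com.martinkl.warc.WARCWritable
import com.martinkl.warc.mapreduce.WARCInputFormat

// Read every record of a WARC.GZ file as a (LongWritable, WARCWritable) pair
// via the Hadoop InputFormat interface that Spark supports natively.
val warcRecords = sc.newAPIHadoopFile(
  "s3n://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-41/segments/1410657102753.15/warc/CC-MAIN-20140914011142-00000-ip-10-196-40-205.us-west-1.compute.internal.warc.gz",
  classOf[WARCInputFormat],
  classOf[LongWritable],
  classOf[WARCWritable])

// Print the target URI of the first few records (getTargetURI is the header
// accessor used by warc-hadoop, to the best of my recollection).
warcRecords.map { case (_, w) => w.getRecord.getHeader.getTargetURI }
  .take(5)
  .foreach(println)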

We're hoping to provide an example in the Common Crawl GitHub repository soon, as we do for Hadoop using Python and Java.