How to open Commoncrawl.org WARC.GZ S3 Data in Spark


I want to access a Common Crawl file from the Amazon public datasets repository from the Spark shell. The files are in WARC.GZ format.

val filenameList = List("s3://<ID>:<SECRET>@aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2014-41/segments/1410657102753.15/warc/CC-MAIN-20140914011142-00000-ip-10-196-40-205.us-west-1.compute.internal.warc.gz")

// TODO: implement functionality to read the WARC.GZ file here
val loadedFiles = sc.parallelize(filenameList, filenameList.length).mapPartitions(i => i)
loadedFiles.foreach(f => f.take(1))

I would now implement a function to read the WARC.GZ format inside the mapPartitions function, along the lines of the sketch below. Is this a good approach? I ask because I am fairly new to the Spark platform and want to implement a small demo application using a small part of the Common Crawl corpus. I saw mapPartitions being used in a thread here.
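A minimal sketch of that approach, assuming each partition receives one file URL, opens the gzipped stream, and emits one value per WARC record. The targetUris helper and the plain-HTTPS form of the public object URL are illustrative assumptions, not tested code; a real application would use a proper WARC parser rather than this line scan.

import java.io.{BufferedReader, InputStreamReader}
import java.net.URL
import java.util.zip.GZIPInputStream

// Hypothetical helper: stream one gzipped WARC file over HTTP(S) and emit the
// WARC-Target-URI header of every record. This is a crude line scan, not a
// robust WARC parser -- payload bytes are decoded as ISO-8859-1 so that binary
// content cannot break the reader -- but it shows the mapPartitions shape.
def targetUris(fileUrl: String): Iterator[String] = {
  val in = new GZIPInputStream(new URL(fileUrl).openStream())
  val reader = new BufferedReader(new InputStreamReader(in, "ISO-8859-1"))
  Iterator.continually(reader.readLine())
    .takeWhile(_ != null)
    .filter(_.startsWith("WARC-Target-URI:"))
    .map(_.stripPrefix("WARC-Target-URI:").trim)
}

// Plain HTTPS form of the same public object (an assumption: the bucket was
// publicly readable over HTTP at the time).
val urls = List("https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2014-41/segments/1410657102753.15/warc/CC-MAIN-20140914011142-00000-ip-10-196-40-205.us-west-1.compute.internal.warc.gz")

val uris = sc.parallelize(urls, urls.length).mapPartitions(_.flatMap(targetUris))
uris.take(5).foreach(println)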

In a first attempt, I tried to open the file directly from my own computer using sc.textFile("s3://....").take(1), which resulted in an access-denied error. Are the Amazon S3 public repository files accessible only from EC2 instances?
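One common fix for such access-denied errors (a sketch, assuming the legacy Hadoop s3n:// connector of that era) is to supply the AWS credentials through the Hadoop configuration rather than embedding them in the URL:

// Sketch: set the s3n:// credential keys on the shared Hadoop configuration.
// Newer Hadoop versions use the fs.s3a.* properties instead.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<ID>")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "<SECRET>")

// textFile reads the gzipped bytes as lines; for binary WARC content this
// only confirms the object is reachable, it is not a useful parse.
sc.textFile("s3n://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-41/segments/1410657102753.15/warc/CC-MAIN-20140914011142-00000-ip-10-196-40-205.us-west-1.compute.internal.warc.gz").take(1)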

1 Answer

Answered by Smerity

There is example code from the "Analyzing Web Domain Vulnerabilities" analysis that shows how to access WARC files from Spark, since Spark supports the Hadoop InputFormat interface. The code itself is hosted on GitHub.
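A minimal sketch of the InputFormat route in the Spark shell. The WARCInputFormat and WARCWritable class names follow the open-source ept/warc-hadoop library as I recall them, so verify them against that project's README, and make sure its jar is on the Spark classpath (e.g. via --jars):

import org.apache.hadoop.io.LongWritable
import com.martinkl.warc.WARCWritable
import com.martinkl.warc.mapreduce.WARCInputFormat

// Read every record of a WARC.GZ file as a (LongWritable, WARCWritable) pair
// via the Hadoop InputFormat interface that Spark supports natively.
val warcRecords = sc.newAPIHadoopFile(
  "s3n://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-41/segments/1410657102753.15/warc/CC-MAIN-20140914011142-00000-ip-10-196-40-205.us-west-1.compute.internal.warc.gz",
  classOf[WARCInputFormat],
  classOf[LongWritable],
  classOf[WARCWritable])

// Print the target URI of the first few records (getTargetURI is the header
// accessor used by warc-hadoop, to the best of my recollection).
warcRecords.map { case (_, w) => w.getRecord.getHeader.getTargetURI }
  .take(5)
  .foreach(println)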

We're hoping to provide an example in the Common Crawl GitHub repository soon, as we do for Hadoop using Python and Java.