I'm trying to extract WET files from the public CommonCrawl data hosted on S3 from my EMR cluster. CommonCrawl provides a cc-pyspark repo with examples and instructions for this, but I don't understand the instructions well enough to get things going. How do I deploy this repo to my cluster? Should this be part of my bootstrap script?
The end goal is to process the text in the WET files via a Spark job. So far I've been using the hosted notebooks to try to download WET files with boto3, without success.
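For reference, this is roughly the kind of boto3 download I've been attempting. The crawl ID below is just an example, and the helper names are my own; the real WET paths come from each crawl's `wet.paths.gz` listing:

```python
def wet_paths_key(crawl: str) -> str:
    """S3 key of the gzipped listing of all WET files for a given crawl."""
    return f"crawl-data/{crawl}/wet.paths.gz"


def download_wet(key: str, dest: str) -> None:
    """Fetch one object from the public commoncrawl bucket."""
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # The dataset is public, so unsigned (anonymous) requests should work.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    s3.download_file("commoncrawl", key, dest)


# e.g. download_wet(wet_paths_key("CC-MAIN-2023-50"), "wet.paths.gz")
```

Is this the right approach at all, or is downloading objects locally the wrong way to feed them into a Spark job?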
Here is the code I used to bootstrap EMR with the additional Python packages.