Deploying pyspark CommonCrawl repo to EMR


I'm trying to extract WET files from the public CommonCrawl data hosted on S3 from my EMR cluster. CommonCrawl provides a cc-pyspark repo with examples and instructions for this; however, I don't understand how to follow those instructions to get things going. How do I deploy this repo to my cluster? Should this be part of my bootstrap script?
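For context, one common way to wire this up is to have a bootstrap action install the repo's dependencies on every node and then run the job as an EMR step via `spark-submit`. The sketch below builds the relevant `run_job_flow` arguments with boto3; the bucket paths, script names, and release label are placeholders I made up, not anything from cc-pyspark itself.

```python
def emr_job_flow_config(bootstrap_s3_path, job_s3_path):
    """Build partial run_job_flow arguments that attach a bootstrap action
    (e.g. a shell script that clones cc-pyspark and pip-installs its
    requirements) plus a spark-submit step. All S3 paths are placeholders.
    """
    return {
        "Name": "cc-pyspark",
        "ReleaseLabel": "emr-6.3.0",  # illustrative; use your cluster's release
        "BootstrapActions": [
            {
                "Name": "install-cc-pyspark-deps",
                "ScriptBootstrapAction": {"Path": bootstrap_s3_path},
            }
        ],
        "Steps": [
            {
                "Name": "wet-processing-job",
                "ActionOnFailure": "CONTINUE",
                # command-runner.jar is EMR's standard way to invoke spark-submit
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster", job_s3_path],
                },
            }
        ],
    }

# On a machine with AWS credentials you would then do something like
# (untested sketch; Instances config omitted):
#
# import boto3
# emr = boto3.client("emr")
# emr.run_job_flow(Instances={...}, **emr_job_flow_config(
#     "s3://my-bucket/bootstrap.sh", "s3://my-bucket/my_wet_job.py"))
```

The split matters: the bootstrap action only prepares dependencies on each node; the actual Spark job runs as a step, not inside the bootstrap script.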

The end goal is to process the text in the WET files via a Spark job. So far I've been using the hosted notebooks to try to download WET files with boto3, with no success.
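For what it's worth, boto3 downloads from the public `commoncrawl` bucket often fail simply because the client tries to sign requests with missing credentials; anonymous (unsigned) access usually fixes that. Below is a minimal sketch of fetching one WET file that way, plus a deliberately simplified record splitter for illustration (real code would typically use a proper WARC parser such as warcio; the key name is a truncated placeholder):

```python
import gzip


def iter_wet_records(raw_bytes):
    """Split a decompressed WET file into (header, body) pairs.

    WET files are in WARC format: each record starts with a 'WARC/1.0'
    version line, and headers are separated from the body by a blank
    CRLF line. This is a simplified parser for illustration only.
    """
    text = raw_bytes.decode("utf-8", errors="replace")
    for part in text.split("WARC/1.0")[1:]:
        header, _, body = part.partition("\r\n\r\n")
        yield header.strip(), body


def fetch_wet(key):
    """Fetch and decompress one WET file from s3://commoncrawl/<key>.

    Config(signature_version=UNSIGNED) makes the request anonymous,
    which is the usual fix when downloads from the public bucket fail
    with credential errors.
    """
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    obj = s3.get_object(Bucket="commoncrawl", Key=key)
    return gzip.decompress(obj["Body"].read())
```

A usage sketch (key path is hypothetical): `for hdr, body in iter_wet_records(fetch_wet("crawl-data/CC-MAIN-.../wet/....warc.wet.gz")): ...`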

Here is the code I used to bootstrap EMR with the additional Python packages.
