I'm trying to extract WET files from the public CommonCrawl data hosted on S3 from my EMR cluster. CommonCrawl provides a cc-pyspark repo with examples and instructions for this, but I don't understand the instructions well enough to get things going. How do I deploy this repo to my cluster? Should this be part of my bootstrap script?
The end goal is to process the text in the WET files via a Spark job. So far I've been using the hosted notebooks to try to download WET files with boto3, without success.
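For reference, this is roughly the kind of boto3 download I've been attempting. The crawl ID below is just an example, and the helper names are my own; the real WET paths come from each crawl's `wet.paths.gz` listing:

```python
def wet_paths_key(crawl: str) -> str:
    """S3 key of the gzipped listing of all WET files for a given crawl."""
    return f"crawl-data/{crawl}/wet.paths.gz"


def download_wet(key: str, dest: str) -> None:
    """Fetch one object from the public commoncrawl bucket."""
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # The dataset is public, so unsigned (anonymous) requests should work.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    s3.download_file("commoncrawl", key, dest)


# e.g. download_wet(wet_paths_key("CC-MAIN-2023-50"), "wet.paths.gz")
```

Is this the right approach at all, or is downloading objects locally the wrong way to feed them into a Spark job?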
Here is the code I used to bootstrap EMR with the additional Python packages.