I am trying to set up a trivial EMR job to perform word counting of massive text files stored in s3://__mybucket__/input/. I am unable to correctly add the first of the two required streaming steps (the first step maps the input with wordSplitter.py and reduces with an IdentityReducer to temporary storage; the second step maps the contents of that temporary storage with /bin/wc and reduces with an IdentityReducer yet again).
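Roughly, the two steps I am trying to add, expressed as command-runner.jar arguments (the intermediate/ prefix is just a placeholder for the temporary storage), would look like this:
Step 1 arguments: hadoop-streaming -files s3://elasticmapreduce/samples/wordcount/wordSplitter.py -mapper wordSplitter.py -reducer org.apache.hadoop.mapred.lib.IdentityReducer -input s3://__mybucket__/input/ -output s3://__mybucket__/intermediate/
Step 2 arguments: hadoop-streaming -mapper /bin/wc -reducer org.apache.hadoop.mapred.lib.IdentityReducer -input s3://__mybucket__/intermediate/ -output s3://__mybucket__/output/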
This is the (failure) description of the first step:
Status:FAILED
Reason:S3 Service Error.
Log File:s3://aws-logs-209733341386-us-east-1/elasticmapreduce/j-2XC5AT2ZP48FJ/steps/s-1SML7U7CXRDT5/stderr.gz
Details:Exception in thread "main" com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: 7799087FCAE73457), S3 Extended Request ID: nQYTtW93TXvi1G8U4LLj73V1xyruzre+uSt4KN1zwuIQpwDwa+J8IujOeQMpV5vRHmbuKZLasgs=
JAR location: command-runner.jar
Main class: None
Arguments: hadoop-streaming -files s3://elasticmapreduce/samples/wordcount/wordSplitter.py -mapper wordSplitter.py -reducer org.apache.hadoop.mapred.lib.IdentityReducer -input s3://__mybucket__/input/ -output s3://__mybucket__/output/
Action on failure: Continue
This is the command being sent to the Hadoop cluster:
JAR location: command-runner.jar
Main class: None
Arguments: hadoop-streaming -mapper s3a://elasticmapreduce/samples/wordcount/wordSplitter.py -reducer aggregate -input s3a://__my_bucket__/input/ -output s3a://__my_bucket__/output/
I think the solution here is likely very easy.
Instead of s3://, use s3a:// as the scheme for your job when accessing the bucket. See here: the s3:// scheme is deprecated and requires the bucket in question to be exclusive to your Hadoop data.
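For example, the arguments for your first step would then become something like (same placeholder bucket as in your question):
Arguments: hadoop-streaming -files s3a://elasticmapreduce/samples/wordcount/wordSplitter.py -mapper wordSplitter.py -reducer org.apache.hadoop.mapred.lib.IdentityReducer -input s3a://__mybucket__/input/ -output s3a://__mybucket__/output/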
Quote from the above doc link: