AWS EMR Cluster Streaming Step: Bad Request

I am trying to set up a trivial EMR job to perform word counting of massive text files stored in s3://__mybucket__/input/. I am unable to correctly add the first of the two required streaming steps (the first maps the input with wordSplitter.py and reduces with an IdentityReducer to temporary storage; the second maps the contents of that temporary storage with /bin/wc and reduces with an IdentityReducer yet again).
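
Roughly, this is what I am trying to submit through the EMR API (a minimal boto3 sketch of the two steps described above; the cluster id is taken from the log path further down, while the intermediate prefix s3://__mybucket__/intermediate/ is a placeholder of my own):

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    steps = [
        {
            # Step 1: map the input with wordSplitter.py, identity-reduce to temporary storage.
            "Name": "Word split",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "hadoop-streaming",
                    "-files", "s3://elasticmapreduce/samples/wordcount/wordSplitter.py",
                    "-mapper", "wordSplitter.py",
                    "-reducer", "org.apache.hadoop.mapred.lib.IdentityReducer",
                    "-input", "s3://__mybucket__/input/",
                    "-output", "s3://__mybucket__/intermediate/",  # placeholder temporary storage
                ],
            },
        },
        {
            # Step 2: map the intermediate output with /bin/wc, identity-reduce again.
            "Name": "Word count",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "hadoop-streaming",
                    "-mapper", "/bin/wc",
                    "-reducer", "org.apache.hadoop.mapred.lib.IdentityReducer",
                    "-input", "s3://__mybucket__/intermediate/",
                    "-output", "s3://__mybucket__/output/",
                ],
            },
        },
    ]

    emr.add_job_flow_steps(JobFlowId="j-2XC5AT2ZP48FJ", Steps=steps)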

This is the (failure) description of the first step:

Status:FAILED
Reason:S3 Service Error.
Log File:s3://aws-logs-209733341386-us-east-1/elasticmapreduce/j-2XC5AT2ZP48FJ/steps/s-1SML7U7CXRDT5/stderr.gz
Details:Exception in thread "main" com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: 7799087FCAE73457), S3 Extended Request ID: nQYTtW93TXvi1G8U4LLj73V1xyruzre+uSt4KN1zwuIQpwDwa+J8IujOeQMpV5vRHmbuKZLasgs=
JAR location: command-runner.jar
Main class: None
Arguments: hadoop-streaming -files s3://elasticmapreduce/samples/wordcount/wordSplitter.py -mapper wordSplitter.py -reducer org.apache.hadoop.mapred.lib.IdentityReducer -input s3://__mybucket__/input/ -output s3://__mybucket__/output/
Action on failure: Continue
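
The exception in the Details line above can also be read in full from the stderr.gz at the Log File location; for completeness, a boto3 sketch of pulling it (bucket and key copied from the Log File path above):

    import gzip

    import boto3

    # Download and print the failing step's stderr log referenced above.
    s3 = boto3.client("s3")
    obj = s3.get_object(
        Bucket="aws-logs-209733341386-us-east-1",
        Key="elasticmapreduce/j-2XC5AT2ZP48FJ/steps/s-1SML7U7CXRDT5/stderr.gz",
    )
    print(gzip.decompress(obj["Body"].read()).decode("utf-8"))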

This is the command being sent to the Hadoop cluster:

JAR location : command-runner.jar
Main class : None
Arguments : hadoop-streaming -mapper s3a://elasticmapreduce/samples/wordcount/wordSplitter.py -reducer aggregate -input s3a://__my_bucket__/input/ -output s3a://__my_bucket__/output/

1 Answer

Answer by Armin Braun:

I think the solution here is likely very easy.

Instead of s3://, use s3a:// as the scheme for your job to access the bucket. As the Hadoop S3 filesystem documentation explains, the s3:// scheme is deprecated and requires the bucket in question to be dedicated exclusively to your Hadoop data. Quoting from that documentation:

This filesystem requires you to dedicate a bucket for the filesystem - you should not use an existing bucket containing files, or write other files to the same bucket. The files stored by this filesystem can be larger than 5GB, but they are not interoperable with other S3 tools.
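
Applied to the first step from your question, the arguments would then look something like this (same step, only the URI scheme changed):

    hadoop-streaming -files s3a://elasticmapreduce/samples/wordcount/wordSplitter.py -mapper wordSplitter.py -reducer org.apache.hadoop.mapred.lib.IdentityReducer -input s3a://__mybucket__/input/ -output s3a://__mybucket__/output/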