How to configure Stocator on Amazon EMR


I am trying to configure Stocator on an Amazon EMR cluster to access data on Amazon S3. I have found resources indicating that this should be possible, but very little detail on how to get it working.

When I start my EMR cluster I use the following config:

{
    "classification": "core-site",
    "properties": {
        "fs.stocator.scheme.list": "cos",
        "fs.cos.impl": "com.ibm.stocator.fs.ObjectStoreFileSystem",
        "fs.stocator.cos.impl": "com.ibm.stocator.fs.cos.COSAPIClient",
        "fs.stocator.cos.scheme":"cos"
    }
}

I then try to access a file using cos://mybucket.service/myfile

This yields an error due to missing credentials.

I add the credentials, in spark-shell, to the properties using:

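// Resolve credentials from the default AWS provider chain
val credentials = new com.amazonaws.auth.DefaultAWSCredentialsProviderChain().getCredentials
// Pass them to Stocator under the service name used in the URI ("service")
sc.hadoopConfiguration.set("fs.cos.service.access.key", credentials.getAWSAccessKeyId)
sc.hadoopConfiguration.set("fs.cos.service.secret.key", credentials.getAWSSecretKey)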
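
For reference, the URI cos://mybucket.service/myfile and the property names above are tied together by the service name: the part after the bucket in the URI authority is the same token that appears in fs.cos.service.*. The sketch below is a hypothetical helper, not part of Stocator, that makes this mapping explicit so the keys can be checked against the URI. Splitting on the last dot and the endpoint key are assumptions based on the naming pattern in the question; verify both against the Stocator documentation.

```scala
import java.net.URI

// Hypothetical helper (not part of Stocator): shows how a cos:// URI
// relates to the per-service Hadoop property names.
object CosUri {
  final case class CosLocation(bucket: String, service: String, key: String)

  // Split cos://<bucket>.<service>/<key> into its parts.
  // Assumption: the service name is whatever follows the LAST dot.
  def parseCos(uri: String): CosLocation = {
    val u = new URI(uri)
    require(u.getScheme == "cos", s"expected cos:// scheme, got: ${u.getScheme}")
    val authority = Option(u.getAuthority).getOrElse("")
    val dot = authority.lastIndexOf('.')
    require(dot > 0 && dot < authority.length - 1,
      s"expected <bucket>.<service> in authority, got: $authority")
    CosLocation(
      bucket = authority.substring(0, dot),
      service = authority.substring(dot + 1),
      key = Option(u.getPath).getOrElse("").stripPrefix("/"))
  }

  // Property names expected for a given service name.
  // Assumption: "endpoint" follows the same fs.cos.<service>.* pattern.
  def propertyKeys(service: String): Seq[String] =
    Seq("access.key", "secret.key", "endpoint").map(suffix => s"fs.cos.$service.$suffix")
}
```

In spark-shell, CosUri.propertyKeys(CosUri.parseCos("cos://mybucket.service/myfile").service) lists the keys to compare against what has been set on sc.hadoopConfiguration.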

Now when I try to access cos://mybucket.service/myfile I get the error: org.apache.spark.sql.AnalysisException: Path does not exist:.

Accessing the file using s3://mybucket/myfile works, since that path doesn't use Stocator. Accessing the file via the AWS CLI also works.

Are there any online resources detailing how to get Stocator working on AWS?

Has anyone successfully done this themselves, and can you share your configuration?

Answer by stevel:
  1. You may want to contact Gil Vernik directly and ask for advice. Do make sure that Stocator works with EMR's S3 consistency semantics; I believe it should.
  2. Hadoop 3.1 has its own high-performance committers, which are probably faster than Stocator (but I would say that, wouldn't I?).
  3. Part of the source for those committers came from the Netflix S3A committer.

I'd play with the Netflix one, as I'm confident it works well there.
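
As a rough illustration of what trying the Hadoop 3.1 committers could look like, the sketch below collects the Spark settings commonly used to select the S3A "directory" staging committer (the one derived from the Netflix code) and renders them as --conf flags. The object and method names are hypothetical. It assumes Hadoop 3.1 or later, the spark-hadoop-cloud module on the classpath, and output paths that use s3a:// rather than s3://; whether a given EMR release supports this needs to be checked.

```scala
// Hypothetical holder for committer settings; only the keys and values
// are real Spark/Hadoop option names, the object itself is illustrative.
object S3ACommitterSettings {
  val settings: Seq[(String, String)] = Seq(
    // Hadoop option, passed through Spark via the "spark.hadoop." prefix:
    // selects the staging committer that writes one directory tree.
    "spark.hadoop.fs.s3a.committer.name" -> "directory",
    // Spark-side bindings from the spark-hadoop-cloud module.
    "spark.sql.sources.commitProtocolClass" ->
      "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
    "spark.sql.parquet.output.committer.class" ->
      "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter"
  )

  // Render the settings as spark-submit / spark-shell flags.
  def asConfFlags: Seq[String] =
    settings.map { case (k, v) => s"--conf $k=$v" }
}
```

Running S3ACommitterSettings.asConfFlags.foreach(println) prints the flags to append to a spark-shell or spark-submit command line.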