Why is the Zeppelin notebook not able to connect to S3?

Question

3.7k views · Asked by rajnish on 17 June 2015 at 07:13

I have installed Zeppelin on my AWS EC2 machine to connect to my Spark cluster.

Spark Version: Standalone: spark-1.2.1-bin-hadoop1.tgz

I am able to connect to the Spark cluster, but I get the following error when trying to access a file in S3 in my use case.

Code:

    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey","YOUR_SEC_KEY")
    val file = "s3n://<bucket>/<key>"
    val data = sc.textFile(file)
    data.count


    file: String = s3n://<bucket>/<key>
    data: org.apache.spark.rdd.RDD[String] = s3n://<bucket>/<key> MappedRDD[1] at textFile at <console>:21
    java.lang.NoSuchMethodError: org.jets3t.service.impl.rest.httpclient.RestS3Service.<init>(Lorg/jets3t/service/security/AWSCredentials;)V
        at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:55)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)

I built Zeppelin with the following command:

    mvn clean package -Pspark-1.2.1 -Dhadoop.version=1.0.4 -DskipTests

When I try to build with the Hadoop profile "-Phadoop-1.0.4", Maven warns that the profile does not exist.

I also tried -Phadoop-1, which the Spark website lists as the profile for Hadoop 1.x to 2.1.x, but got the same error.

Please let me know what I am missing here.

There are 2 answers

**D. Müller** · Answer 1 · 2015-07-25T13:50:05+00:00

The following installation worked for me (I also spent many days on this problem):

1. Set up Spark 1.3.1, prebuilt for Hadoop 2.3, on the EC2 cluster.
2. Cloned Zeppelin: git clone https://github.com/apache/incubator-zeppelin.git (as of 25.07.2015).
3. Built Zeppelin with the following command (following the instructions at https://github.com/apache/incubator-zeppelin):

    mvn clean package -Pspark-1.3 -Dhadoop.version=2.3.0 -Phadoop-2.3 -DskipTests

4. Changed the Zeppelin port to 8082 in "conf/zeppelin-site.xml" (Spark already uses port 8080).

After these installation steps, my notebook worked with S3 files:

    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "xxx")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "xxx")
    val file = "s3n://<<bucket>>/<<file>>"
    val data = sc.textFile(file)
    data.first

I think the S3 problem is not completely resolved in Zeppelin version 0.5.0, so cloning the current Git repository did it for me.

Important: the job only worked for me with the Zeppelin Spark interpreter setting master=local[*] (instead of spark://master:7777).
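
A NoSuchMethodError like the one in the question generally means that the class found at run time does not have the method the caller was compiled against. Here that would suggest the jets3t version on the classpath differs from the one this Hadoop build expects. Below is a minimal sketch for checking where the two classes named in the stack trace are loaded from, assuming it is run in a notebook paragraph. `jarOf` is a hypothetical helper name, not part of Spark or Zeppelin.

```scala
import scala.util.Try

// Hypothetical helper: report which jar (or directory) a class was loaded from.
def jarOf(className: String): Option[String] =
  for {
    cls <- Try(Class.forName(className)).toOption         // None if the class is not on the classpath
    src <- Option(cls.getProtectionDomain.getCodeSource)  // may be null, e.g. for JDK classes
    loc <- Option(src.getLocation)                        // may also be null
  } yield loc.toString

// Class names taken from the stack trace in the question.
println(jarOf("org.jets3t.service.impl.rest.httpclient.RestS3Service"))
println(jarOf("org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore"))
```

If the two locations point at jars from different Hadoop or jets3t versions, that would be consistent with the error. The check only reports what the current class loader sees, so treat the result as a hint rather than proof.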

**Mohammad Adnan** · Answer 2 · 2016-12-05T12:07:42+00:00

For me it worked in two steps:

1. Create a sqlContext:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

2. Read the S3 files like this:

    val performanceFactor = sqlContext.read.parquet("s3n://<accessKey>:<secretKey>@mybucket/myfile/")

You need to supply your own access key and secret key. In step 2, I use the s3n protocol and put the access and secret keys in the path itself.
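
To keep the keys out of the notebook text, the path from step 2 could be assembled from environment variables. This is only a sketch under assumptions: `s3nPath` is a hypothetical helper, the variable names `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are assumed to be set in the interpreter's environment, and the check for "/" reflects reports that a slash in the secret key tends to break this URI form.

```scala
// Hypothetical helper, not part of Spark or Zeppelin: builds the s3n URI used in step 2.
def s3nPath(accessKey: String, secretKey: String, bucket: String, key: String): String = {
  // Assumption: a "/" in the secret key is reported to break this URI form, so fail early.
  require(!secretKey.contains("/"),
    "secret key contains '/'; consider setting fs.s3n.awsSecretAccessKey instead")
  s"s3n://$accessKey:$secretKey@$bucket/$key"
}

// Assumption: the keys are exported under these names in the interpreter's environment.
val path: Option[String] = for {
  ak <- sys.env.get("AWS_ACCESS_KEY_ID")
  sk <- sys.env.get("AWS_SECRET_ACCESS_KEY")
} yield s3nPath(ak, sk, "mybucket", "myfile/")

// With the sqlContext from step 1 available, the read would then be:
// path.foreach(p => sqlContext.read.parquet(p))
```

If either variable is missing, `path` is `None` and nothing is read, which avoids sending a half-built URI to S3.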
