SparkR 1.4.0: how to include jars


I'm trying to hook SparkR 1.4.0 up to Elasticsearch using the elasticsearch-hadoop-2.1.0.rc1.jar file (found here). It requires a bit of hacking together, calling the SparkR:::callJMethod function, and I need to get a jobj R object for a couple of Java classes. For some of the classes, this works:

SparkR:::callJStatic('java.lang.Class', 
                     'forName', 
                     'org.apache.hadoop.io.NullWritable')

But for others, it does not:

SparkR:::callJStatic('java.lang.Class', 
                     'forName', 
                     'org.elasticsearch.hadoop.mr.LinkedMapWritable')

Yielding the error:

java.lang.ClassNotFoundException:org.elasticsearch.hadoop.mr.EsInputFormat

It seems that Java isn't finding the org.elasticsearch.* classes, even though I've tried including them both with the command-line --jars argument and with the sparkR.init(sparkJars = ...) function (roughly as sketched below).
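For concreteness, here's roughly what those two attempts look like (the jar path and master URL below are placeholders, not my actual values):

# attempt 1: pass the jar when launching the SparkR shell from the command line
#   bin/sparkR --jars /path/to/elasticsearch-hadoop-2.1.0.rc1.jar

# attempt 2: pass the jar from within R when creating the Spark context
library(SparkR)
sc <- sparkR.init(master = "local[2]",
                  sparkJars = "/path/to/elasticsearch-hadoop-2.1.0.rc1.jar")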

Any help would be greatly appreciated. If this is a question that more appropriately belongs on the actual SparkR issue tracker, could someone please point me to it? I looked but was not able to find it. And if someone knows an alternative way to hook SparkR up to Elasticsearch, I'd be happy to hear that as well.

Thanks! Ben


1 Answer

Alex Ioannides

Here's how I've achieved it:

# environments, packages, etc ----
Sys.setenv(SPARK_HOME = "/applications/spark-1.4.1")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

library(SparkR)

# connecting Elasticsearch to Spark via ES-Hadoop-2.1 ----
spark_context <- sparkR.init(master = "local[2]",
                             sparkPackages = "org.elasticsearch:elasticsearch-spark_2.10:2.1.0")
spark_sql_context <- sparkRSQL.init(spark_context)
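# read the Elasticsearch resource as a Spark DataFrame -- 'index/type' stands
# for your own index name and document type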
spark_es <- read.df(spark_sql_context, path = "index/type", source = "org.elasticsearch.spark.sql")
printSchema(spark_es)

(Spark 1.4.1, Elasticsearch 1.5.1, ES-Hadoop 2.1 on OS X Yosemite)

The key idea is to link against the ES-Hadoop Spark package (via sparkPackages) rather than the jar file, and to use a Spark SQL context to read from Elasticsearch directly through the org.elasticsearch.spark.sql data source.
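If Elasticsearch isn't running on the default localhost:9200, connection settings can (I believe) be passed straight through read.df as extra data source options -- the host and port below are illustrative assumptions, not values from my setup:

spark_es <- read.df(spark_sql_context,
                    path = "index/type",
                    source = "org.elasticsearch.spark.sql",
                    es.nodes = "localhost",  # assumed Elasticsearch host
                    es.port = "9200")        # assumed Elasticsearch REST port

# the result is a regular SparkR DataFrame
head(spark_es)
count(spark_es)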