SparkR and Packages


How does one load packages into Spark so they can be used for data operations with R?

For example, I am trying to access my test.csv in HDFS as below:

Sys.setenv(SPARK_HOME="/opt/spark14")
library(SparkR)
sc <- sparkR.init(master="local")
sqlContext <- sparkRSQL.init(sc)
flights <- read.df(sqlContext, "hdfs://sandbox.hortonWorks.com:8020/user/root/test.csv", "com.databricks.spark.csv", header="true")

but I get the error below:

Caused by: java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv

I tried loading the CSV package with the option below:

Sys.setenv('SPARKR_SUBMIT_ARGS'='--packages com.databricks:spark-csv_2.10:1.0.3')

but I get the error below when loading the sqlContext:

Launching java with spark-submit command /opt/spark14/bin/spark-submit --packages com.databricks:spark-csv_2.10:1.0.3 /tmp/RtmpuvwOky/backend_port95332e5267b
Error: Cannot load main class from JAR file:/tmp/RtmpuvwOky/backend_port95332e5267b

Any help would be highly appreciated.

Answer from Holden (accepted):

So it looks like by setting SPARKR_SUBMIT_ARGS you are overriding the default value, which is sparkr-shell. You could probably do the same thing and just append sparkr-shell to the end of your SPARKR_SUBMIT_ARGS. This seems unnecessarily complex compared to depending on jars, so I've created a JIRA to track the issue (and I'll try to fix it if the SparkR people agree with me): https://issues.apache.org/jira/browse/SPARK-8506 .
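
For example, a minimal sketch adapting the code from the question; the only change is appending sparkr-shell to the submit args:

# Append sparkr-shell so SparkR still launches its backend after the --packages option
Sys.setenv('SPARKR_SUBMIT_ARGS'='--packages com.databricks:spark-csv_2.10:1.0.3 sparkr-shell')
library(SparkR)
sc <- sparkR.init(master="local")
sqlContext <- sparkRSQL.init(sc)
flights <- read.df(sqlContext, "hdfs://sandbox.hortonWorks.com:8020/user/root/test.csv", "com.databricks.spark.csv", header="true")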

Note: another option would be to use the sparkR command with --packages com.databricks:spark-csv_2.10:1.0.3, since that should work.
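
That would look something like the following (the launcher path here assumes the SPARK_HOME from the question):

/opt/spark14/bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3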