How does one call packages from Spark to be used for data operations with R?
For example, I am trying to access my test.csv in HDFS as below:
Sys.setenv(SPARK_HOME="/opt/spark14")
library(SparkR)
sc <- sparkR.init(master="local")
sqlContext <- sparkRSQL.init(sc)
flights <- read.df(sqlContext, "hdfs://sandbox.hortonworks.com:8020/user/root/test.csv", "com.databricks.spark.csv", header="true")
but I am getting the error below:
Caused by: java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv
I tried loading the CSV package with the option below:
Sys.setenv('SPARKR_SUBMIT_ARGS'='--packages com.databricks:spark-csv_2.10:1.0.3')
but I get the error below when initializing the sqlContext:
Launching java with spark-submit command /opt/spark14/bin/spark-submit --packages com.databricks:spark-csv_2.10:1.0.3 /tmp/RtmpuvwOky/backend_port95332e5267b
Error: Cannot load main class from JAR file:/tmp/RtmpuvwOky/backend_port95332e5267b
Any help would be highly appreciated.
So it looks like by setting SPARKR_SUBMIT_ARGS you are overriding the default value, which is sparkr-shell. You could probably do the same thing and just append sparkr-shell to the end of your SPARKR_SUBMIT_ARGS. This seems unnecessarily complex compared to depending on jars, so I've created a JIRA to track this issue (and I'll try to contribute a fix if the SparkR people agree with me): https://issues.apache.org/jira/browse/SPARK-8506
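Untested off the top of my head, but something along these lines should work; it's a minimal sketch reusing the paths and package version from your question, and the key point is setting the variable before sparkR.init is called:
Sys.setenv(SPARK_HOME="/opt/spark14")
# Append sparkr-shell after the --packages flag so the default launch mode is preserved
Sys.setenv(SPARKR_SUBMIT_ARGS='--packages com.databricks:spark-csv_2.10:1.0.3 sparkr-shell')
library(SparkR)
sc <- sparkR.init(master="local")
sqlContext <- sparkRSQL.init(sc)
# The spark-csv data source should now be on the classpath
flights <- read.df(sqlContext, "hdfs://sandbox.hortonworks.com:8020/user/root/test.csv", "com.databricks.spark.csv", header="true")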
Note: another option would be using the sparkR command with --packages com.databricks:spark-csv_2.10:1.0.3, since that should work.
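For instance, from the shell (assuming the usual bin/ layout under your SPARK_HOME):
/opt/spark14/bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3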