Spark-cloudant package 1.6.4 loaded by %AddJar does not get used by notebook


I'm trying to use the latest spark-cloudant package with a notebook:

%AddJar -f https://github.com/cloudant-labs/spark-cloudant/releases/download/v1.6.4/cloudant-spark-v1.6.4-167.jar

Which outputs:

Starting download from https://github.com/cloudant-labs/spark-cloudant/releases/download/v1.6.4/cloudant-spark-v1.6.4-167.jar
Finished download of cloudant-spark-v1.6.4-167.jar

Followed by:

val dfReader = sqlContext.read.format("com.cloudant.spark")
dfReader.option("cloudant.host", sourceDB.host)
if (sourceDB.username.isDefined && sourceDB.username.get.nonEmpty) dfReader.option("cloudant.username", sourceDB.username.get)
if (sourceDB.password.isDefined && sourceDB.password.get.nonEmpty) dfReader.option("cloudant.password", sourceDB.password.get)
val df = dfReader.load(sourceDB.database).cache()

Which outputs:

Use connectorVersion=1.6.3, dbName=ratingdb, indexName=null, viewName=null, jsonstore.rdd.partitions=5, jsonstore.rdd.maxInPartition=-1, jsonstore.rdd.minInPartition=10, jsonstore.rdd.requestTimeout=900000, bulkSize=20, schemaSampleSize=1

So the connector being used is still 1.6.3, not 1.6.4. My notebook is:

Scala 2.10 with Spark 1.6

I've tried restarting the kernel but that didn't help.

Other debug information:

Server Information:

You are using Jupyter notebook.

The version of the notebook server is 4.2.0 and is running on:
Python 2.7.11 (default, Jun 24 2016, 12:41:03) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)]

Current Kernel Information:

IBM Spark Kernel

Update

I tried the following:

import sys.process._

"test -d ~/data/libs/scala-2.10" #|| "mkdir -p ~/data/libs/scala-2.10" !
"wget -c -O ~/data/libs/scala-2.10/cloudant-spark-v1.6.4-167.jar https://github.com/cloudant-labs/spark-cloudant/releases/download/v1.6.4/cloudant-spark-v1.6.4-167.jar" !
"ls ~/data/libs/scala-2.10/" !

println("Now restart the kernel")

Unfortunately, this didn't work - 1.6.3 is still being used.

Update 2

It appears that the tilde was not being expanded to my HOME folder in the above code.
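For background, sys.process runs the command directly rather than through a shell, so a literal "~" is handed to the program instead of being expanded to $HOME. A minimal sketch of the difference (the path is only illustrative):

import sys.process._

// sys.process does not invoke a shell, so "~" is passed to ls literally
// instead of being expanded to $HOME.
"ls ~/data/libs".!                      // looks for a directory literally named "~"
Seq("bash", "-c", "ls ~/data/libs").!   // a shell expands the tilde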

See the answer for the working solution.


There are 2 answers

Chris Snow (best answer)

Running the following code from a Scala notebook worked for me:

import sys.process._

// Resolve HOME explicitly; sys.process does not expand "~".
val HOME = sys.env("HOME")
val DESTDIR = s"${HOME}/data/libs/scala-2.10"

// Create the user-specific Spark libs directory if it does not exist yet.
s"test -d ${DESTDIR}" #|| s"mkdir -p ${DESTDIR}" !

// Download the 1.6.4 connector jar into it, then list the directory contents.
s"wget -q -c -O ${DESTDIR}/cloudant-spark-v1.6.4-167.jar https://github.com/cloudant-labs/spark-cloudant/releases/download/v1.6.4/cloudant-spark-v1.6.4-167.jar" !
s"ls ${DESTDIR}/" !

I have also asked product management for the Spark service to officially upgrade this library.

Sven Hafeneger

Currently, DSX notebooks with Spark support version 1.6.3 of spark-cloudant out of the box. That means the jar for this package is provided on GPFS, and the path to the jar is added to various environment variables so that it is put on the runtime classpath when the kernel starts.
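To see where that path ends up, one option is to scan the environment from the Scala notebook for entries that mention the jar. A minimal sketch; the exact variable names depend on the DSX setup:

// Sketch: list environment variables whose value mentions "cloudant", to see
// which jar path was injected when the kernel started (variable names vary by
// setup, so this is only a heuristic).
sys.env
  .filter { case (_, value) => value.toLowerCase.contains("cloudant") }
  .foreach { case (name, value) => println(s"$name=$value") }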

When you use the %AddJar magic, it may not be able to override the older version, due to the implementation of the magic and the location of the download path; see https://github.com/ibm-et/spark-kernel/wiki/List-of-Current-Magics-for-the-Spark-Kernel.
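One way to confirm which jar actually won is to ask the JVM where the connector class was loaded from. A minimal sketch, assuming the data source class is com.cloudant.spark.DefaultSource (the exact class name may differ between releases):

// Sketch: print the jar the connector class was loaded from. Assumes the class
// name com.cloudant.spark.DefaultSource; adjust it if the release uses a
// different data source class.
val codeSource = Class.forName("com.cloudant.spark.DefaultSource")
  .getProtectionDomain
  .getCodeSource
println(Option(codeSource).map(_.getLocation).getOrElse("no CodeSource available"))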

The setup of the runtime environment (including for Spark) adds various jars at different steps, so if your goal is to use version 1.6.4 of spark-cloudant, you would have to find a location on GPFS to put the jar so that it is picked up at the right time (some guessing here, because I do not have the full picture of the setup!).

As a long-term solution, I would suggest that you reach out to support for the Spark service and submit a request to support the new version, so that it is provided out of the box.

As a short-term solution (it might stop working if the setup steps for the runtime environment change), you can do the following:

  1. Open a Python notebook in your DSX project.
  2. Find out your USERID with

    !whoami

  3. Check your user-specific directory for Spark libs (USERID = output from step 2):

    !ls /gpfs/fs01/user/USERID/data/libs

You will notice that the spark-cloudant jar is not present there.

  4. Dump the newer version of spark-cloudant into the user-specific directory for Spark libs (USERID = output from step 2):

    !wget https://github.com/cloudant-labs/spark-cloudant/releases/download/v1.6.4/cloudant-spark-v1.6.4-167.jar -P /gpfs/fs01/user/USERID/data/libs

  5. Check your user-specific directory for Spark libs again (see step 3).

You will notice that the spark-cloudant jar for version 1.6.4 is now present there.

  6. Restart the kernel of your Scala notebook and try your code again (a verification sketch follows this list).
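As a final check after the restart, re-running the read from the question should now log connectorVersion=1.6.4. A minimal sketch, assuming the same sourceDB settings as in the question:

// Sketch: repeat the read from the question; the connector prints its version
// on load, so the output should now report connectorVersion=1.6.4 instead of
// 1.6.3. Assumes sourceDB is defined as in the question.
val reader = sqlContext.read.format("com.cloudant.spark")
reader.option("cloudant.host", sourceDB.host)
sourceDB.username.filter(_.nonEmpty).foreach(u => reader.option("cloudant.username", u))
sourceDB.password.filter(_.nonEmpty).foreach(p => reader.option("cloudant.password", p))
val df = reader.load(sourceDB.database)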

This approach worked for me to patch the version of spark-cloudant, but it is only a short-term, temporary workaround!