How can I include additional jars when starting a Google DataProc cluster to use with Jupyter notebooks?

Question

Asked by seandavi on 07 September 2017 at 20:33 · 3.4k views

I am following these instructions for starting a Google Dataproc cluster with an initialization script that launches a Jupyter notebook:

https://cloud.google.com/blog/big-data/2017/02/google-cloud-platform-for-data-scientists-using-jupyter-notebooks-with-apache-spark-on-google-cloud

How can I include extra JAR files (spark-xml, for example) in the resulting SparkContext in Jupyter notebooks (particularly pyspark)?

There is 1 answer

**Angus Davis** · Accepted Answer · 2017-09-07T22:38:00+00:00

The answer depends slightly on which jars you're looking to load. For a package published to Maven, such as spark-xml, you can set the Spark property spark.jars.packages to its Maven coordinates when creating the cluster:

$ gcloud dataproc clusters create [cluster-name] \
    --zone [zone] \
    --initialization-actions \
       gs://dataproc-initialization-actions/jupyter/jupyter.sh \
    --properties spark:spark.jars.packages=com.databricks:spark-xml_2.11:0.4.1

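To specify multiple Maven coordinates, you need to change the separator that gcloud uses between entries of the --properties dictionary from ',' to another character ('#' in the example below), because the comma is already needed to separate the packages within the spark.jars.packages value: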

$ gcloud dataproc clusters create [cluster-name] \
    --zone [zone] \
    --initialization-actions \
       gs://dataproc-initialization-actions/jupyter/jupyter.sh \
    --properties=^#^spark:spark.jars.packages=artifact1,artifact2,artifact3

Details on how to change the separator character can be found in the gcloud help:

$ gcloud help topic escaping
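If the cluster is created from a script, the separator swap described above can be automated. The following is only a minimal sketch, not part of gcloud or Dataproc: the helper name `build_properties_flag`, its list of candidate delimiters, and the second Maven coordinate (`org.example:hypothetical-lib_2.11:1.0.0`) are made up for illustration. It picks a delimiter that does not occur in any property and emits the `^DELIM^` prefix form used in the command above.

```python
def build_properties_flag(properties, delimiters="#;|@%"):
    """Build a --properties flag using gcloud's ^DELIM^ separator syntax.

    `properties` maps names such as "spark:spark.jars.packages" to values.
    The helper name and the candidate delimiters are assumptions made for
    this sketch; only the ^DELIM^ prefix form comes from gcloud itself.
    """
    pairs = [f"{key}={value}" for key, value in properties.items()]
    joined = "".join(pairs)
    # Pick the first candidate that appears in no key or value, so the
    # commas inside a value are not mistaken for entry separators.
    for delim in delimiters:
        if delim not in joined:
            break
    else:
        raise ValueError("every candidate delimiter occurs in the properties")
    return f"--properties=^{delim}^{delim.join(pairs)}"


packages = [
    "com.databricks:spark-xml_2.11:0.4.1",
    "org.example:hypothetical-lib_2.11:1.0.0",  # hypothetical coordinate
]
flag = build_properties_flag({"spark:spark.jars.packages": ",".join(packages)})
print(flag)
```

The printed string can be passed to gcloud dataproc clusters create in place of the hand-written --properties argument. Quote it in the shell if the chosen delimiter is a shell metacharacter such as ';' or '|'.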
