I am following the instructions for starting a Google Cloud Dataproc cluster with an initialization script that starts a Jupyter notebook.
How can I include extra JAR files (spark-xml, for example) in the resulting SparkContext in Jupyter notebooks (particularly PySpark)?
The answer depends slightly on which jars you're looking to load. For example, you can use spark-xml by setting the Spark property spark.jars.packages to its Maven coordinate when creating a cluster.
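A minimal sketch of such a cluster-creation command follows. The cluster name, zone, and initialization-action path are placeholders, and the spark-xml coordinate (the `_2.11` Scala suffix and the `0.4.1` version) is an assumption that should be matched to the Spark/Scala build on your Dataproc image. The `spark:` prefix tells Dataproc to write the property into the Spark configuration.

```shell
# Placeholders: replace with your own cluster name and zone.
CLUSTER_NAME="my-cluster"
ZONE="us-central1-a"

# Assumed path; use the Jupyter init script from the instructions you followed.
INIT_ACTION="gs://dataproc-initialization-actions/jupyter/jupyter.sh"

# Assumed Maven coordinate (groupId:artifactId:version) for spark-xml.
SPARK_XML_COORD="com.databricks:spark-xml_2.11:0.4.1"

# "spark:" routes the key into the cluster's Spark defaults.
PROPERTIES="spark:spark.jars.packages=${SPARK_XML_COORD}"

# Wrapped in a function so that nothing is created until you call it.
create_cluster() {
  gcloud dataproc clusters create "${CLUSTER_NAME}" \
    --zone "${ZONE}" \
    --initialization-actions "${INIT_ACTION}" \
    --properties "${PROPERTIES}"
}

# Print the property value for review before creating anything.
echo "properties: ${PROPERTIES}"
```

Calling `create_cluster` after checking the printed value runs the actual `gcloud` command. Spark should then resolve the package from Maven when the notebook's SparkContext starts.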
To specify multiple Maven coordinates, you will need to swap the gcloud dictionary separator character from ',' to something else, because the comma is needed to separate the packages to install. Prefixing the flag value with the new separator between carets (for example, ^#^) does this.
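As a sketch of the separator swap, the snippet below builds a `--properties` value that uses `#` as the gcloud dictionary separator, so the commas inside the package list are left alone. The cluster name, zone, and initialization-action path are placeholders, and the second package (spark-avro) is only an illustrative assumption, not something from the question.

```shell
# Placeholders: replace with your own cluster name and zone.
CLUSTER_NAME="my-cluster"
ZONE="us-central1-a"

# Assumed path; use the Jupyter init script from the instructions you followed.
INIT_ACTION="gs://dataproc-initialization-actions/jupyter/jupyter.sh"

# Comma-separated Maven coordinates; both versions are assumptions.
PACKAGES="com.databricks:spark-xml_2.11:0.4.1,com.databricks:spark-avro_2.11:4.0.0"

# The leading ^#^ makes '#' the separator between properties,
# so the commas in PACKAGES are passed through to Spark unchanged.
PROPERTIES="^#^spark:spark.jars.packages=${PACKAGES}"

# Wrapped in a function so that nothing is created until you call it.
create_cluster() {
  gcloud dataproc clusters create "${CLUSTER_NAME}" \
    --zone "${ZONE}" \
    --initialization-actions "${INIT_ACTION}" \
    --properties "${PROPERTIES}"
}

# Print the property value for review before creating anything.
echo "properties: ${PROPERTIES}"
```

With this prefix in place, any additional properties in the same flag would be separated by `#` rather than by a comma.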
Details on how escape characters are changed can be found in gcloud's built-in help by running `gcloud topic escaping`.