I've tried using /databricks/spark/bin/spark-shell --packages com.crealytics:spark-excel_2.13:3.4.1_0.19.0 in my init script; however, I get the error:

Error: Could not find or load main class org.apache.spark.launcher.Main
/databricks/spark/bin/spark-class: line 101: CMD: bad array subscript
I also tried using .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.15.0")
in my SparkSession initialization as below, but it looks like the config is getting ignored.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession
    .builder
    .appName("oms-xml-streaming")
    .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.15.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.databricks.delta.autoCompact.enabled", True)
)

spark = configure_spark_with_delta_pip(builder).getOrCreate()
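(For context, spark.jars.packages is only resolved when the session builder actually launches the driver JVM, so on an already-running Databricks cluster the setting is silently ignored; in a plain local run the same builder does pull the package in. If your delta-spark version supports the extra_packages argument of configure_spark_with_delta_pip, the helper can append the coordinate for you; a minimal local sketch, using the coordinate from the question:)

from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Sketch of a local run: let the delta-spark helper append the extra Maven
# coordinate instead of setting spark.jars.packages by hand (assumes a
# delta-spark release that supports the extra_packages argument).
builder = (
    SparkSession
    .builder
    .appName("oms-xml-streaming")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)

# The coordinates are resolved when the driver JVM is launched by this builder,
# which only happens off-Databricks (e.g. locally); on an existing cluster the
# JVM is already running, so they are ignored.
spark = configure_spark_with_delta_pip(
    builder,
    extra_packages=["com.databricks:spark-xml_2.12:0.15.0"],
).getOrCreate()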
Workspace Libraries have been deprecated, so I cannot download the JAR to my workspace and copy it to /databricks/jars/ either.
Any ideas?
In Azure Data Factory, libraries for Databricks are specified at the activity level, not on the linked service. Create a Databricks activity (Notebook/Jar/Python), and you'll then be able to specify libraries for it in the "Settings" tab of the activity properties.
If your linked service connects to an existing cluster, you instead need to install the libraries on that cluster, either through the cluster's Libraries UI or via the Libraries REST API, as in the sketch below.
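If you prefer to script that rather than click through the UI, a Maven coordinate can be attached to a running cluster with the Databricks Libraries API. A minimal sketch; the workspace URL, token, and cluster ID are placeholders, and the coordinate is the one from the question:

import requests

# Install a Maven library on an existing cluster via the Databricks Libraries API.
DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder workspace URL
TOKEN = "dapi..."                      # placeholder personal access token
CLUSTER_ID = "0123-456789-abcdefgh"    # placeholder cluster ID

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/libraries/install",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "cluster_id": CLUSTER_ID,
        "libraries": [
            {"maven": {"coordinates": "com.crealytics:spark-excel_2.13:3.4.1_0.19.0"}}
        ],
    },
)
resp.raise_for_status()

# Installation is asynchronous; poll
# GET /api/2.0/libraries/cluster-status?cluster_id=<cluster_id> to see when it finishes.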