Databricks Connect with Azure Event Hubs


I'm facing issues while trying to run some Python code on Databricks using databricks-connect that depends on a Maven-installed extension (in this case com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.17, referenced in the official Databricks documentation for integration with Azure Event Hubs).

The databricks-connect connection itself is set up correctly (databricks-connect test reports "All tests passed"), and the Maven package com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.17 shows as "Installed" in the Libraries section of my cluster.

The faulty code is this simple one-liner:

encrypted_string = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(to_be_encrypted_string)

Producing the following error stack:

INFO - Receiving data from EventHub using Databricks' PySpark...
20/09/29 17:50:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/09/29 17:50:59 WARN MetricsSystem: Using default name SparkStatusTracker for source because neither spark.metrics.namespace nor spark.app.id is set.
Traceback (most recent call last):
  File "C:\Users\my_user\Desktop\projectABC\src\my_folder\my_project\cli.py", line 86, in <module>
    connector()
  File "C:\Users\my_user\Desktop\projectABC\.venv\lib\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\my_user\Desktop\projectABC\.venv\lib\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "C:\Users\my_user\Desktop\projectABC\.venv\lib\site-packages\click\core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Users\my_user\Desktop\projectABC\.venv\lib\site-packages\click\core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Users\my_user\Desktop\projectABC\.venv\lib\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\my_user\Desktop\projectABC\.venv\lib\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "C:\Users\my_user\Desktop\projectABC\src\my_folder\my_project\cli.py", line 43, in test_data_process
    prediction_connector.process_upstream_data()
  File "c:\users\my_user\Desktop\projectABC\src\my_folder\my_project\command.py", line 224, in process_upstream_data
    df = eventhub_consumer.receive_data_with_pyspark()
  File "c:\users\my_user\Desktop\projectABC\src\my_folder\my_project\command.py", line 406, in receive_data_with_pyspark
    eventhub_config = self._populate_pyspark_eventhub_config_file(spark_context=sc)
  File "c:\users\my_user\Desktop\projectABC\src\my_folder\my_project\command.py", line 428, in _populate_pyspark_eventhub_config_file
    eventhub_config = {'eventhubs.connectionString': spark_context._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(self.config.connection_string)} 
TypeError: 'JavaPackage' object is not callable

Am I missing something obvious here about the Maven package installation? Is there an extra step for using it with Python? Thanks for your help!

1 answer

Answer by nefo_x (accepted):

Databricks Connect functionality has some limitations:

The following Databricks features and third-party platforms are unsupported:

  • The following Databricks Utilities: credentials, library, notebook workflow, and widgets.
  • Structured Streaming (including Azure Event Hubs)
  • Running arbitrary code that is not a part of a Spark job on the remote cluster.
  • Native Scala, Python, and R APIs for Delta table operations (for example, DeltaTable.forPath). However, the SQL API (spark.sql(...)) with Delta Lake operations and the regular Spark API (for example, spark.read.load) on Delta tables are both supported.
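This is consistent with the error above. One way to read it (an assumption on my side, not something the docs state explicitly) is that with databricks-connect, sc._jvm resolves classes against the local client JVM via py4j, and since the azure-eventhubs-spark jar is only installed on the cluster, the class comes back locally as a bare JavaPackage. A minimal diagnostic sketch under that assumption:

# Diagnostic sketch (assumption: sc._jvm resolves against the local databricks-connect
# client JVM, not the cluster). A class the local JVM can load comes back as a py4j
# JavaClass; one it cannot load comes back as a JavaPackage, and calling it raises
# "TypeError: 'JavaPackage' object is not callable".
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

target = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils
print(type(target))  # expect py4j.java_gateway.JavaPackage if the jar is missing locally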

Besides, with Databricks Connect you also have to have all of the libraries on the local classpath. A typical approach is to package all non-Spark dependencies into a single jar-with-dependencies (fat JAR).
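If you do need the connector classes visible locally, here is a minimal sketch of one thing to try, assuming the Spark-based databricks-connect client and a locally downloaded copy of the connector jar; the jar path is hypothetical, and whether spark.jars is enough to make the class visible to sc._jvm in this setup is an assumption, not something confirmed by the Databricks docs:

# Sketch only: assumes a local copy of azure-eventhubs-spark_2.11-2.3.17.jar (plus its
# transitive dependencies) at a hypothetical path, added to the local session config.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars", "C:/libs/azure-eventhubs-spark_2.11-2.3.17.jar")  # hypothetical path
    .getOrCreate()
)
sc = spark.sparkContext

# If the class is now resolvable in the local JVM, this should return a string instead
# of raising "'JavaPackage' object is not callable". Note that Structured Streaming
# against Event Hubs itself remains unsupported over databricks-connect, per the list above.
encrypted = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt("my-connection-string")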