Spark-Kusto connector writeStream writes only one batch and then goes idle


We are trying to achieve low-latency Structured Streaming ingestion into ADX (Azure Data Explorer) from Databricks, using PySpark writeStream with the open-source Spark-Kusto connector.

  • We stream a small volume of data (<100 MB per second) to ADX, but the ingestion goes idle after processing one batch.
  • Sometimes it does write, but the latency is in minutes. Neither behaviour seems to follow a pattern.

Configurations we enabled and tests we performed so far:

  1. Defined the low-latency configuration for the ADX tables (at both database and table level).
  2. Enabled the streaming policy (set to true).
  3. Increased the ADX cluster size, to rule out capacity as the cause.
  4. Wrote the stream to object storage to make sure streaming itself is working; it works.
  5. By contrast, the managed "data ingest" native to ADX ingests the data from Event Hub with millisecond latency.
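For reference, items 1 and 2 can be sketched as the ADX management commands they presumably correspond to. This is only a sketch under assumptions: it assumes "low latency configuration" means the ingestion batching policy, the helper names and the database/table names in the usage note are hypothetical, and the policy values are illustrative rather than the ones we actually applied.

```python
import json


def batching_policy_command(scope, name, max_time_span="00:00:10",
                            max_items=500, max_raw_mb=1024):
    """Build an `.alter ... policy ingestionbatching` command string.

    scope is "database" or "table"; the default values are illustrative only.
    """
    if scope not in ("database", "table"):
        raise ValueError("scope must be 'database' or 'table'")
    policy = json.dumps({
        "MaximumBatchingTimeSpan": max_time_span,
        "MaximumNumberOfItems": max_items,
        "MaximumRawDataSizeMB": max_raw_mb,
    })
    return f".alter {scope} {name} policy ingestionbatching @'{policy}'"


def streaming_policy_command(scope, name):
    """Build an `.alter ... policy streamingingestion enable` command string."""
    if scope not in ("database", "table"):
        raise ValueError("scope must be 'database' or 'table'")
    return f".alter {scope} {name} policy streamingingestion enable"
```

For example, `print(batching_policy_command("table", "MyTable"))` and `print(streaming_policy_command("database", "MyDb"))` produce the strings to run in the ADX query editor. The helpers only build text; they do not connect to the cluster.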

Connector (Maven): com.microsoft.azure.kusto:kusto-spark_3.0_2.12:5.0.4

writeStream code:

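# Connector options: target cluster/database/table, AAD app credentials,
# and the write settings currently in use.
options = {
    "kustoCluster": f"{kusto_cluster}",
    "kustoDatabase": f"{kusto_db}",
    "kustoTable": f"{table}",
    "kustoAadAppId": f"{KUSTO_AAD_APP_ID}",
    "kustoAadAppSecret": f"{KUSTO_AAD_APP_SECRET}",
    "kustoAadAuthorityID": f"{KUSTO_AAD_AUTHORITY_ID}",
    "writeMode": "Queued",          # queued ingestion write mode
    "clientBatchingLimit": "100",   # client-side batching limit
}

# df is the streaming DataFrame being written to ADX.
kust_stream = (
    df.writeStream
    .queryName("ADX_WRITE")
    .format("com.microsoft.kusto.spark.datasink.KustoSinkProvider")
    .options(**options)
)

# Start the query and block until it terminates.
kust_stream.start().awaitTermination()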
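To tell whether the query really stops triggering after the first batch, or whether batches keep completing and the delay is on the ADX side, the query progress can be polled instead of blocking on awaitTermination(). This is a minimal sketch: summarize_progress and watch are hypothetical helper names, and it assumes lastProgress returns a dict-like record as in PySpark 3.x.

```python
import time


def summarize_progress(progress):
    """Reduce one StreamingQuery progress record to the fields relevant here."""
    if progress is None:  # no micro-batch has completed yet
        return None
    duration = progress.get("durationMs") or {}
    return {
        "batchId": progress.get("batchId"),
        "numInputRows": progress.get("numInputRows"),
        "addBatchMs": duration.get("addBatch"),
        "triggerExecutionMs": duration.get("triggerExecution"),
    }


def watch(query, polls=10, interval_s=5.0):
    """Print the query status and the last completed batch a few times."""
    for _ in range(polls):
        print(query.status, summarize_progress(query.lastProgress))
        if not query.isActive:  # the query stopped or failed
            break
        time.sleep(interval_s)
```

Usage would be `query = kust_stream.start()` followed by `watch(query)`. If batchId keeps increasing while rows show up late in ADX, the micro-batches are completing and the delay sits after the sink; if batchId stays fixed while the source has new data, the query itself is stalled.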

Expectation: low-latency writes (in the millisecond range) using the Spark-Kusto connector.

What could be the reason behind this problem?

https://github.com/Azure/azure-kusto-spark/pull/301 - This PR suggests the functionality has not been accepted yet. Does anyone know the timeline for the release, or of any beta version we could use to try the "stream" write mode?
