We are trying to achieve low-latency Structured Streaming ingestion into ADX (Azure Data Explorer) from Databricks, using PySpark writeStream with the open-source Spark-Kusto connector.
- We stream a small volume of data (<100 MB per second) to ADX, but the ingestion goes idle after processing one batch.
- Sometimes it does write, but the latency is in minutes. Neither behaviour seems to follow a pattern.
Configurations we enabled and tests we performed so far:
- Defined the low-latency configuration for the ADX tables (at both the database and table level)
- Enabled the streaming ingestion policy
- Increased the ADX cluster size, just to be sure
- Wrote the stream to object storage to check that streaming itself works; it does.
- By contrast, the managed ingestion ("data ingest") native to ADX ingests the same data from Event Hubs with ms latency.
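For reference, the two ADX policies mentioned above can be expressed as management commands. The sketch below only builds the command strings; the table name `Events` and the numeric limits are placeholder assumptions, not the values we actually used. Note that "Queued" ingestion is batched according to the ingestion batching policy, so its latency cannot be lower than the configured batching time span.

```python
import json


def to_timespan(seconds: int) -> str:
    """Format a number of seconds as an hh:mm:ss timespan string."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def batching_policy_command(table: str, max_seconds: int = 10,
                            max_items: int = 500, max_raw_mb: int = 1024) -> str:
    """Build the .alter command for a table-level ingestion batching policy.

    The default limits are illustrative assumptions only.
    """
    policy = {
        "MaximumBatchingTimeSpan": to_timespan(max_seconds),
        "MaximumNumberOfItems": max_items,
        "MaximumRawDataSizeMB": max_raw_mb,
    }
    return f".alter table {table} policy ingestionbatching @'{json.dumps(policy)}'"


def streaming_policy_command(table: str) -> str:
    """Build the command that enables the streaming ingestion policy on a table."""
    return f".alter table {table} policy streamingingestion enable"


if __name__ == "__main__":
    # "Events" is a hypothetical table name.
    print(batching_policy_command("Events"))
    print(streaming_policy_command("Events"))
```

The printed commands would be run against the database in the ADX query editor (or any Kusto client) by a principal with permission to alter table policies.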
Connector(Maven): com.microsoft.azure.kusto:kusto-spark_3.0_2.12:5.0.4
writeStream code:
# Connector options; cluster, database, table and AAD credentials are defined earlier in the notebook.
options = {
    "kustoCluster": f"{kusto_cluster}",
    "kustoDatabase": f"{kusto_db}",
    "kustoTable": f"{table}",
    "kustoAadAppId": f"{KUSTO_AAD_APP_ID}",
    "kustoAadAppSecret": f"{KUSTO_AAD_APP_SECRET}",
    "kustoAadAuthorityID": f"{KUSTO_AAD_AUTHORITY_ID}",
    "writeMode": "Queued",
    "clientBatchingLimit": "100",
}

# df is the streaming DataFrame being written to ADX.
kust_stream = (
    df.writeStream
    .queryName("ADX_WRITE")
    .format("com.microsoft.kusto.spark.datasink.KustoSinkProvider")
    .options(**options)
)

# Start the query and block until it terminates.
kust_stream.start().awaitTermination()
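To tell whether the query is really idle or whether each micro-batch is just slow inside the sink, one option is to keep the query handle and poll its progress instead of blocking on awaitTermination(). This is a minimal sketch: `summarize_progress` is a hypothetical helper, and it assumes the progress dictionary has the usual Structured Streaming fields (`batchId`, `numInputRows`, `durationMs`).

```python
def summarize_progress(progress):
    """Return a one-line summary of a streaming progress dict.

    `progress` is expected to look like StreamingQuery.lastProgress,
    which is None until the first micro-batch has completed.
    """
    if not progress:
        return "no progress reported yet"
    durations = progress.get("durationMs", {})
    return (
        f"batch={progress.get('batchId')} "
        f"rows={progress.get('numInputRows')} "
        f"addBatchMs={durations.get('addBatch')} "
        f"triggerMs={durations.get('triggerExecution')}"
    )


# Intended usage in the notebook (needs a running Spark session, so it is
# left commented out here):
#
#   import time
#   query = kust_stream.start()
#   while query.isActive:
#       print(summarize_progress(query.lastProgress))
#       time.sleep(10)
```

If the batch id stops advancing while new data keeps arriving, the query itself is stuck; if batches keep completing but addBatchMs is large, the time is being spent in the sink's write.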
Expectation: low-latency writes (ms latency) using the Spark-Kusto connector.
What could be the reason behind this problem?
https://github.com/Azure/azure-kusto-spark/pull/301 - This PR says the functionality is yet to be accepted. Does anyone know the timeline for the release, or whether there is a beta version available to try the write mode "stream" ingest?