Debug a Kafka pipeline by reading the same topic with two different Spark Structured Streaming queries


I have a Kafka topic streaming data in production. I want to consume the same data stream for debugging purposes without impacting the offsets of the existing pipeline.

I remember creating different consumer groups for this purpose in earlier versions, but I am using Spark Structured Streaming to read from Kafka, and it discourages setting a group ID when reading from Kafka.


1 Answer

Accepted answer by Michael Heil

Each Spark Structured Streaming query creates its own unique consumer group, as you can see in the Spark source code:

// Each running query should use its own group id. Otherwise, the query may be only assigned
// partial data since Kafka will assign partitions to multiple consumers having the same group
// id. Hence, we should generate a unique id for each query.
val uniqueGroupId = s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}"

Since Spark manages the offsets in its own checkpoint files and never commits any offsets back to Kafka, your two Structured Streaming jobs will not interfere with each other's offsets. Both run completely independently, and there is nothing for you to do beyond making sure each streaming query uses its own checkpoint directory.
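As an illustration, here is a minimal Scala sketch of such an independent debug query. The broker address, topic name, and checkpoint path are placeholders; substitute the values from your own setup:

import org.apache.spark.sql.SparkSession

object DebugSameTopic {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-debug-reader")
      .getOrCreate()

    // Debug reader: subscribes to the same topic as production, but with
    // its own checkpoint directory, so Spark tracks its offsets independently.
    val debugStream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
      .option("subscribe", "events")                       // assumed topic name
      .option("startingOffsets", "latest")                 // don't replay history
      .load()

    val query = debugStream
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("console")                                        // print records for debugging
      .option("checkpointLocation", "/tmp/debug-checkpoint")    // separate from production's
      .start()

    query.awaitTermination()
  }
}

Note that no group ID is set anywhere; Spark generates the unique one shown in the source snippet above, so this query and the production query end up in different consumer groups automatically.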

I have given a more detailed answer on offset management for Spark Structured Streaming jobs reading from a Kafka topic here.