Debug a Kafka pipeline by reading the same topic with two different Spark Structured Streaming queries


I have a Kafka topic that is streaming data in my production pipeline. I want to use the same data stream for debugging purposes without impacting the offsets of the existing pipeline.

I remember creating separate consumer groups for this purpose in earlier versions, but I am using Spark Structured Streaming to read data from Kafka, and it discourages setting a group.id when reading from Kafka.

Accepted answer (Michael Heil):

Each Spark Structured Streaming query creates a unique consumer group, as you can see in the Spark source code:

// Each running query should use its own group id. Otherwise, the query may be only assigned
// partial data since Kafka will assign partitions to multiple consumers having the same group
// id. Hence, we should generate a unique id for each query.
val uniqueGroupId = s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}"

As Spark manages the offsets in its own checkpoint files and never commits any offsets back to Kafka, your two Structured Streaming jobs will not interfere with each other regarding their offsets. Both run completely independently of each other, and there is nothing for you to do beyond giving each streaming query its own checkpoint directory.
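To illustrate, here is a minimal sketch of such an independent debug query reading the same topic. The broker address, topic name, and checkpoint path are placeholders I have assumed for illustration; the only requirement is that the debug query gets a checkpointLocation of its own:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-debug-stream")
  .getOrCreate()

// Debug query: reads the same topic as production. Spark generates a fresh,
// random consumer group id for it (the uniqueGroupId above), so it never
// shares partitions or offsets with the production query.
val debugQuery = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // placeholder broker address
  .option("subscribe", "events")                     // placeholder topic name
  .option("startingOffsets", "earliest")             // optionally replay history while debugging
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("console")                                 // print records for inspection
  .option("checkpointLocation", "/tmp/checkpoints/debug-stream") // must differ from production's
  .start()

debugQuery.awaitTermination()

Since Spark assigns the consumer group id itself, no group.id is set here; the debug query can even start from earliest while the production query keeps its own position, because neither job consults offsets committed in Kafka.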

I have given a more detailed answer on offset management for a Spark Structured Streaming job reading from a Kafka topic here.