Apache Beam Kafka IO uses a Single Consumer Thread ignoring Flink Parallelism

89 views Asked by At

I have a data pipeline built using Apache Beam (with Flink runner) which sources data from a single Kafka topic and processes it. The Kafka topic has 20 partitions and we have the same amount of Flink task slots/threads as well. The pipeline is not performing optimally, it is very slow to process the messages resulting in a huge backlog. There are 2 issues that I saw;

  1. There seems to be only one worker/thread/consumer consuming messages from Kafka, the remaining are marked as complete (please refer to the screenshot below). I believe if we have enough workers Flink/Beam should create multiple consumers, one for each Kafka partition. This also causes a larger issue where the consumer is not fetching from partitions where there is more data, it sometimes does not consume any data, which I assume is because of the consumer affinity on Kafka side where it is assigning a partition where don't have any data, even though there are other partitions that do.
  2. Since there is only one thread consuming from Kafka, the subsequent step in the DAG is only getting records in one of the workers and the remaining are idle, causing it to be busy and building back pressure.

Can anyone please help debugging this issue, I am looking at ways to optimize this pipeline.

0

There are 0 answers