I'm using the kafka-node HighLevelConsumer, and am having problems where I always receive duplicate messages on startup.
In order to maintain processing sequence, my consumer simply appends messages to a work queue, and I process the events serially. I pause the consumer if I hit a queue high-water mark, I have auto-commit disabled, and I commit "manually" after my client code fully processes each event.
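For reference, the consumer is shaped roughly like the sketch below. The `WorkQueue` helper, the high-water mark value, `handle()`, and the ZooKeeper address are all illustrative names of mine, not kafka-node API; the kafka-node pieces used are `HighLevelConsumer` with `autoCommit: false`, `pause()`, `resume()`, and `commit()`.

```javascript
// Sketch of the pattern described above: append messages to a queue,
// process them serially, pause the consumer at a high-water mark, and
// commit manually after each message is fully handled.

// Pure serial queue: calls onPause() when the backlog reaches the
// high-water mark and onResume() once it drains back below it.
function WorkQueue(highWater, processFn, onPause, onResume) {
  this.highWater = highWater;
  this.items = [];
  this.busy = false;
  this.paused = false;
  this.processFn = processFn;
  this.onPause = onPause;
  this.onResume = onResume;
}

WorkQueue.prototype.push = function (msg) {
  this.items.push(msg);
  if (!this.paused && this.items.length >= this.highWater) {
    this.paused = true;
    this.onPause();
  }
  this.drain();
};

WorkQueue.prototype.drain = function () {
  if (this.busy || this.items.length === 0) return;
  this.busy = true;
  var msg = this.items.shift();
  var self = this;
  this.processFn(msg, function () {       // invoked when handling finishes
    self.busy = false;
    if (self.paused && self.items.length < self.highWater) {
      self.paused = false;
      self.onResume();
    }
    self.drain();
  });
};

// kafka-node wiring (not executed here; needs a running broker):
function startConsumer() {
  var kafka = require('kafka-node');               // required lazily
  var client = new kafka.Client('localhost:2181'); // assumed ZK address
  var consumer = new kafka.HighLevelConsumer(client,
    [{ topic: 'test' }], { autoCommit: false, groupId: 'fnord' });

  var queue = new WorkQueue(
    100,                                  // illustrative high-water mark
    function (msg, done) {
      handle(msg, function (err) {        // handle() = the app's ETL step
        if (!err) consumer.commit(function () {}); // commit after success
        done();
      });
    },
    function () { consumer.pause(); },
    function () { consumer.resume(); }
  );

  consumer.on('message', function (msg) { queue.push(msg); });
}
```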
Despite committing, on startup I always get the last (previously committed) message from one or more partitions (depending on how many other HLCs are running in my group). I was a little surprised that the HLC wouldn't start me at (committed + 1), but I decided to just "ignore" any message whose offset is at or before the committed offset. As a quick test:
```javascript
offset.fetchCommits('fnord', [
  { topic: 'test', partition: 0 },
  { topic: 'test', partition: 1 },
  { topic: 'test', partition: 2 },
  { topic: 'test', partition: 3 }
], function (err, data) {
  // ...
});
```
This works if my payload list matches the number of partitions defined for the topic. If I exceed the number of partitions, I get a `[BrokerNotAvailableError: Could not find the leader]` error.
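For what it's worth, the "ignore" guard I mentioned is just a per-partition offset comparison. A minimal sketch (my own helper, not kafka-node API):

```javascript
// Track the last committed offset per topic/partition and flag any
// startup message at or below it as a duplicate.
function makeDuplicateFilter() {
  var committed = {}; // key "topic:partition" -> last committed offset

  return {
    recordCommit: function (msg) {
      committed[msg.topic + ':' + msg.partition] = msg.offset;
    },
    isDuplicate: function (msg) {
      var last = committed[msg.topic + ':' + msg.partition];
      return last !== undefined && msg.offset <= last;
    }
  };
}
```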
- Am I correct that I can't use auto-commit if I want a stronger guarantee that I won't lose messages when my message processing is asynchronous and may fail (e.g. an ETL job)? kafka-node just emits a 'message' event; there's no way to confirm that the message was successfully handled.
- Is it expected behavior that the HighLevelConsumer will re-read the message at the last committed offset (i.e. a duplicate) rather than starting at the next offset?
- What is the best way to get the number of partitions for a topic?
I dug into the kafka-node source, and there's an undocumented call I was able to use to get the partition info:
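Something like this, using `client.loadMetadataForTopics` — since it's undocumented, the results shape below is just what I observed (a mixed array with broker info at index 0 and per-topic metadata keyed by partition id at index 1) and may change between versions:

```javascript
// Extract partition ids from loadMetadataForTopics results.
// Assumed (observed, undocumented) shape: results[1].metadata[topic]
// is an object keyed by partition id.
function partitionIds(results, topic) {
  var topicMeta = (results[1] && results[1].metadata) || {};
  return Object.keys(topicMeta[topic] || {});
}

function fetchPartitionCount(topic, cb) {
  var kafka = require('kafka-node');               // required lazily
  var client = new kafka.Client('localhost:2181'); // assumed ZK address
  client.loadMetadataForTopics([topic], function (err, results) {
    if (err) return cb(err);
    cb(null, partitionIds(results, topic).length);
  });
}
```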
(I don't love calling something that doesn't appear to be a documented part of the public API, and I'm uncomfortable with the rather raw-feeling mixed array nature of the returned results, but it solves my problem for the moment.)