Kafka Cluster doesn't elect new leader after broker goes down

I'm trying to set up a Kafka cluster locally, mostly so I can learn the cluster's behavior before going to production. I'm running everything on bare metal.

For context:

My cluster consists of 3 controllers and 3 brokers, all in KRaft mode. I started from the sample KRaft files shipped with the default download (config/kraft/broker.properties and config/kraft/controller.properties) and copied them to config/server.properties on each node.
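
If I remember the shipped sample files correctly, they already set the role of each node type, and I kept those lines untouched (quoting roughly from memory, so treat them as approximate):

# from config/kraft/controller.properties, as shipped
process.roles=controller
controller.listener.names=CONTROLLER

# from config/kraft/broker.properties, as shipped
process.roles=broker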

My controllers are set up as

node.id=1 # Each controller with its own id (1 to 3)
controller.quorum.voters=1@localhost:9193,2@localhost:9293,3@localhost:9393
listeners=CONTROLLER://:9393
num.partitions=10
num.recovery.threads.per.data.dir=2
offsets.topic.replication.factor=2
transaction.state.log.replication.factor=2
transaction.state.log.min.isr=2

And my brokers are set up as

node.id=4 # Each broker with its own id (4 to 6)
controller.quorum.voters=1@localhost:9193,2@localhost:9293,3@localhost:9393
listeners=PLAINTEXT://localhost:9492
inter.broker.listener.name=PLAINTEXT
advertised.listeners=PLAINTEXT://localhost:9492
controller.listener.names=CONTROLLER
num.partitions=10
num.recovery.threads.per.data.dir=2
offsets.topic.replication.factor=2
transaction.state.log.replication.factor=2
transaction.state.log.min.isr=2

Everything else uses its default settings; the other controllers and brokers follow the same pattern, each with its own node.id and ports.

For the application I am using KafkaJS.

For each of those nodes I am using a simple script to start the server:

bin/kafka-storage.sh format -t ffQlsx4mQn-ipKduywm2Ig -c config/server.properties --ignore-formatted
bin/kafka-server-start.sh config/server.properties

Everything works as expected with these settings. However, if I shut down any of the broker nodes (regardless of whether it is a leader or not), everything stops working. I get some expected Connection error: connect ECONNREFUSED 127.0.0.1:9492 errors and tons of There is no leader for this topic-partition as we are in the middle of a leadership election. But this election never seems to complete: I let it run for close to 10 minutes, a new leader was never elected, and the errors kept piling up.
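
I haven't inspected the controller quorum state directly yet; as far as I know, Kafka ships a tool for that (kafka-metadata-quorum.sh, available since 3.3), which I believe can be pointed at any broker listener, e.g. my first broker:

bin/kafka-metadata-quorum.sh --bootstrap-server localhost:9492 describe --status
bin/kafka-metadata-quorum.sh --bootstrap-server localhost:9492 describe --replication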

My code spams messages at the cluster, specifically to simulate a close-to-real scenario where users won't wait for an election:

setInterval(async () => {
  await producer.send({
    topic: 'test-topic',
    messages: [ { value: 'text' } ],
  })
}, 500);
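
For completeness, the producer behind that loop is created more or less like this; the clientId and the middle broker's port are illustrative placeholders, and the retry settings are just KafkaJS's documented defaults (I haven't tuned anything):

const { Kafka } = require('kafkajs')

// Broker list is illustrative: 9492 and 9692 are the ports that appear in my
// config and logs, the remaining broker follows the same pattern.
const kafka = new Kafka({
  clientId: 'local-cluster-test',                // placeholder clientId
  brokers: ['localhost:9492', 'localhost:9592', 'localhost:9692'],
  retry: { initialRetryTime: 300, retries: 5 },  // KafkaJS defaults, untuned
})

const producer = kafka.producer()

// Connect once at startup, before the setInterval loop shown above kicks in.
producer.connect().catch(console.error)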

This is the error as the broker goes down:

{
  namespace: 'Producer',
  label: 'ERROR',
  log: {
    timestamp: '2023-09-26T16:35:56.593Z',
    message: 'Failed to send messages: This server is not the leader for that topic-partition',
    broker: undefined,
    clientId: undefined,
    error: undefined,
    logLevel: 1
  }
}

After a few seconds it turns into this:

{
  namespace: 'Connection',
  label: 'ERROR',
  log: {
    timestamp: '2023-09-26T16:36:05.807Z',
    message: 'Response Metadata(key: 3, version: 6)',
    broker: 'localhost:9692',
    clientId: '27a1ab62-7bd5-42e9-9c73-debfd284ac02',
    error: 'There is no leader for this topic-partition as we are in the middle of a leadership election',
    logLevel: 1
  }
}

and

{
  namespace: 'Producer',
  label: 'ERROR',
  log: {
    timestamp: '2023-09-26T16:36:06.098Z',
    message: 'Failed to send messages: Connection error: connect ECONNREFUSED 127.0.0.1:9492',
    broker: undefined,
    clientId: undefined,
    error: undefined,
    logLevel: 1
  }
}

After all that preamble, my questions are:

  • Would I have to handle producer.send failures manually while the election is going on?
  • Kafka is supposed to be fault tolerant, so I assume I missed something when setting up the environment. Is that correct?
  • Could having all the nodes running locally like this be polluting my tests?
  • I suppose adding a load balancer between my Node.js app and the cluster would help, but in this situation it just seems like a workaround for a misconfiguration.