I'm trying to set up a cluster locally, mostly so I can learn the cluster's behavior before deploying to production. I'm running everything on bare metal.
For context:
My cluster consists of 3 controllers and 3 brokers, all in KRaft mode. I started from the sample KRaft files shipped with the default download, config/kraft/broker.properties and config/kraft/controller.properties, and copied each one to config/server.properties on its node.
My controllers are set up as follows:
node.id=1 # Each controller with its own id (1 to 3)
controller.quorum.voters=1@localhost:9193,2@localhost:9293,3@localhost:9393
listeners=CONTROLLER://:9393 # Each controller with its own port (9193, 9293, 9393)
num.partitions=10
num.recovery.threads.per.data.dir=2
offsets.topic.replication.factor=2
transaction.state.log.replication.factor=2
transaction.state.log.min.isr=2
And my brokers are set up as follows:
node.id=4 # Each broker with its own id (4 to 6)
controller.quorum.voters=1@localhost:9193,2@localhost:9293,3@localhost:9393
listeners=PLAINTEXT://localhost:9492 # Each broker with its own port (9492 to 9692)
inter.broker.listener.name=PLAINTEXT
advertised.listeners=PLAINTEXT://localhost:9492
controller.listener.names=CONTROLLER
num.partitions=10
num.recovery.threads.per.data.dir=2
offsets.topic.replication.factor=2
transaction.state.log.replication.factor=2
transaction.state.log.min.isr=2
Everything else is left at its default settings.
For the application side, I am using KafkaJS.
On each of those nodes I use a simple script to format the storage and start the server:
bin/kafka-storage.sh format -t ffQlsx4mQn-ipKduywm2Ig -c config/server.properties --ignore-formatted
bin/kafka-server-start.sh config/server.properties
Everything works as expected with these settings. However, if I shut down any of the broker nodes (leader or not), everything stops working. I get some expected Connection error: connect ECONNREFUSED 127.0.0.1:9492 errors and tons of There is no leader for this topic-partition as we are in the middle of a leadership election. But this election never seems to complete: I let it run for close to 10 minutes, a new leader was never elected, and the errors kept piling up.
My code spams messages at the cluster specifically to test a close-to-real scenario where users won't wait for an election:
setInterval(async () => {
  await producer.send({
    topic: 'test-topic',
    messages: [{ value: 'text' }],
  })
}, 500);
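For completeness, the producer behind that loop is created along these lines (trimmed to the essentials; the broker list mirrors the controllers' one-port-per-node pattern, so the middle broker's 9592 is inferred, while 9492 and 9692 appear in the logs below):

const { Kafka } = require('kafkajs')

const kafka = new Kafka({
  // bootstrap list: one entry per broker (nodes 4 to 6)
  brokers: ['localhost:9492', 'localhost:9592', 'localhost:9692'],
})
const producer = kafka.producer()

const run = async () => {
  await producer.connect()
  // ...then the setInterval loop above starts sending
}
run().catch(console.error)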
The errors as the broker goes down:
{
namespace: 'Producer',
label: 'ERROR',
log: {
timestamp: '2023-09-26T16:35:56.593Z',
message: 'Failed to send messages: This server is not the leader for that topic-partition',
broker: undefined,
clientId: undefined,
error: undefined,
logLevel: 1
}
}
After a few seconds, it turns into this:
{
namespace: 'Connection',
label: 'ERROR',
log: {
timestamp: '2023-09-26T16:36:05.807Z',
message: 'Response Metadata(key: 3, version: 6)',
broker: 'localhost:9692',
clientId: '27a1ab62-7bd5-42e9-9c73-debfd284ac02',
error: 'There is no leader for this topic-partition as we are in the middle of a leadership election',
logLevel: 1
}
}
and
{
namespace: 'Producer',
label: 'ERROR',
log: {
timestamp: '2023-09-26T16:36:06.098Z',
message: 'Failed to send messages: Connection error: connect ECONNREFUSED 127.0.0.1:9492',
broker: undefined,
clientId: undefined,
error: undefined,
logLevel: 1
}
}
After all that preamble, my questions are:
- Would I have to manually handle producer.send while the election is going on? (See the sketch after this list for what I mean.)
- Kafka is supposed to be fault-tolerant, so I'm assuming I missed something when setting up the environment. Is that correct?
- Could having all nodes running locally like this be polluting my tests?
- I suppose adding a load balancer between my Node.js app and the cluster would help, but in this situation it just seems like a workaround for a misconfiguration.
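To illustrate the first question, this is the kind of manual handling I'm picturing, a sketch on top of KafkaJS's built-in retry options (the wrapper name and the timings are illustrative, not what I'm actually running):

const producer = kafka.producer({
  // stretch KafkaJS's own retry window for retriable errors such as
  // "no leader" / "not the leader for that topic-partition"
  retry: { initialRetryTime: 300, retries: 10, maxRetryTime: 30000 },
})

const sendWithRequeue = async (payload) => {
  try {
    await producer.send(payload)
  } catch (err) {
    // if the election outlasts the client's retries, re-queue the
    // message instead of dropping it
    console.error(`send failed, retrying later: ${err.message}`)
    setTimeout(() => sendWithRequeue(payload), 1000)
  }
}

Writing a wrapper like that feels like re-implementing what the client should already handle, which is why I suspect a misconfiguration on my side.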