Let's say I have a cheap and less reliable datacenter A, and an expensive and more reliable datacenter B. I want to run Kafka in the most cost-effective way, even if that means risking data loss and/or downtime. I can run any number of brokers in either datacenter, but remember that costs need to be as low as possible.
For this scenario, assume that no costs are incurred if brokers are not running. Also assume that producers/consumers run completely reliably with no concern for their cost.
Two ideas I have are as follows:
- Provision two completely separate Kafka clusters, one in each datacenter, but keep the cluster in the more expensive datacenter (B) powered off. Upon detecting an outage in A, power on the cluster in B. Producers/consumers will have logic to switch between clusters.
- Run the Zookeeper cluster in B, with powered on brokers in A, and powered off brokers in B. If there is an outage in A, then brokers in B come online to pick up where A left off.
Option 1 would be cheaper, but requires more complexity in the producers/consumers. Option 2 would be more expensive, but requires less complexity in the producers/consumers.
Is Option 2 even possible? If there is an outage in A, is there any way to have brokers in B come online, get elected as leaders for the topics and have the producers/consumers seamlessly start sending to them? Again, data loss is okay and so is switchover downtime. But whatever option needs to not require manual intervention.
Is there any other approach that I can consider?
Neither is feasible.
Topics and their records are unique to each cluster. Only one leader partition can exist for any Kafka partition in a cluster.
With these two pieces of information, example scenarios include:
bootstrap.servers
at any time.Another thing to keep in mind is that topic creation/update/delete events aren't automatically synced across clusters, so Kafka Streams apps, especially, will all be unable to maintain state with this approach.
You can use tools like MirrorMaker or Confluent Replicator / Cluster Linking to help with all this, but the client failover piece I've personally never seen handled very well, especially when record order and idempotent processing matters
Ultimately, this is what availability zones are for. From what I understand, the chances of a cloud provider losing more than one availability zone at a time is extremely rare. So, you'd setup one Kafka cluster across 3 or more availability zones, and configure "rack awareness" for Kafka to account for its installation locations.