What is the rationale behind requiring at least 3 ActiveMQ instances and 3 ZooKeeper servers for running master/slave setup with replicated LevelDB storage? If the requirement is imposed by the usage of ZooKeeper which requires at least 3 servers, what is the rationale for ZooKeeper to require at least 3 servers to provide reliability?
- Is it for guaranteeing consistency in cases of network partitions (by sacrificing availability on the smaller smaller partition) as in a 2-node primary backup configuration it is impossible distinguish between a failed peer or both nodes being in different network partitions?
- Is it for providing tolerance against Byzantine failures where you need 2f+1 nodes to survive f faulty nodes (considering ONLY crash failures requires only f+1 nodes to survive f faults)?
Or is there any other reason?
Thanks!
Zookeeper requires at least 3 servers because of how it elects a new Activemq Master. Zookeeper requires a majority (n/2+1) to elect a new master. If it does not have that majority, no master will be selected and the system will fail. This is the same reason for why you use an odd number of Zookeepers servers. (EG. 3 servers gives you the same failure rate as 4 because of majority, can still only lose 1 server.)
For Activemq, the necessity of at least 3 servers is derived from how the messages are synced, and the fact that when a new master is elected, it requires atleast a quorum of nodes (N/2+1) to be able to identify the latest updates. ActiveMQ will sync messages with 1 slave, and then respond with an OK. It will then sync asynchronously with all other slaves. If a quorum is not present when a node fails, then Zookeeper has no way to distinguish which node is the most currently updated. This is what happens when you have only 2 nodes originally, so at least 3 is recommended.
From ActiveMQ site, under How it Works: