I am going through this lecture series by Martin Kleppmann. In this video, at around 1:25, he says you can manually configure the distributed nodes to choose a leader.
If that's the case, can't we just automate the process by running a separate process that checks the health of the leader and chooses a new leader after the leader fails or a network partition occurs?
Why is this problem actually so hard? Why can't we solve the consensus problem by enforcing a new leader, without the nodes having to actually come to an agreement? What am I missing?
Let's say we have an active leader and a passive one. The passive one listens for the active one's heartbeat. When the heartbeat is not heard, the passive one switches to active mode and, maybe, tells everyone - "I am the leader...".
The problem is that just because the passive one hears no heartbeat, it does not mean that the true leader is down - maybe there is a network issue between these two boxes?
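To make this concrete, here is a minimal sketch of a timeout-based heartbeat check, using simulated time instead of real sockets. The names (`HeartbeatDetector`, `run`, the 3-tick timeout) are made up for illustration and not taken from the lecture. It only shows that, from the passive node's point of view, a crashed leader and an unreachable leader look exactly the same.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatDetector:
    """Hypothetical timeout-based detector running on the passive node."""
    timeout: float           # ticks of silence before suspecting the leader
    last_heard: float = 0.0  # time at which the last heartbeat arrived

    def on_heartbeat(self, now: float) -> None:
        self.last_heard = now

    def suspects_leader(self, now: float) -> bool:
        return now - self.last_heard > self.timeout


def run(leader_alive: bool, link_up: bool) -> bool:
    """Simulate 10 ticks; return True if the passive node suspects the leader."""
    detector = HeartbeatDetector(timeout=3.0)
    for now in range(1, 11):
        # A heartbeat arrives only if the leader is alive AND the network delivers it.
        if leader_alive and link_up:
            detector.on_heartbeat(float(now))
    return detector.suspects_leader(now=10.0)


crashed = run(leader_alive=False, link_up=True)      # leader really is dead
partitioned = run(leader_alive=True, link_up=False)  # leader is fine, link is broken
healthy = run(leader_alive=True, link_up=True)

print(crashed, partitioned, healthy)
```

The detector gives the same verdict for `crashed` and `partitioned`, because silence is the only input it has.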
Another option - the leader may go offline for a short period of time, long enough for the passive one to detect it; but later the original leader comes back online, and now there are two leaders.
The general problem to solve here is how to build a reliable failure detector, and that is tricky. In the last example, the old leader comes back and still thinks it is the leader, but that is no longer true.
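The two-leaders scenario can be sketched the same way. This is a toy simulation under assumed parameters (a 3-tick timeout, and node A being unresponsive during ticks 2 to 7); `Node` and `simulate` are hypothetical names, not a real library.

```python
class Node:
    """Hypothetical node; is_leader is what the node itself believes."""

    def __init__(self, name: str, is_leader: bool) -> None:
        self.name = name
        self.is_leader = is_leader
        self.paused = False   # True while the node is offline or unresponsive
        self.last_heard = 0   # last tick at which a heartbeat was received


def simulate(timeout: int = 3) -> list:
    active = Node("A", is_leader=True)
    passive = Node("B", is_leader=False)
    history = []
    for now in range(1, 11):
        # Assumption for the example: A is unresponsive during ticks 2..7.
        active.paused = 2 <= now <= 7
        # A sends a heartbeat only while it is running.
        if not active.paused and active.is_leader:
            passive.last_heard = now
        # B promotes itself after `timeout` ticks of silence.
        if not passive.is_leader and now - passive.last_heard > timeout:
            passive.is_leader = True
        # Record every running node that currently believes it is the leader.
        leaders = [n.name for n in (active, passive) if n.is_leader and not n.paused]
        history.append((now, leaders))
    return history


history = simulate()
for tick, leaders in history:
    print(tick, leaders)
```

Once A resumes, nothing in this setup tells it that it was replaced, so from that tick on both A and B act as leader. Getting A to step down requires the nodes to agree on who the current leader is, which is exactly the agreement the question was hoping to avoid.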