MongoDB: When Primary fails

I would like to understand what guarantees, if any, MongoDB provides regarding the durability of data in scenarios where the Primary fails permanently or temporarily, or becomes separated at the network level from the rest of the replica set.

I understand what happens with w:1 write concern. I understand the role of journaling.
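
For concreteness, here is roughly how I issue the write in step 1 below (a minimal pymongo sketch; the connection string and collection names are placeholders):

    # Minimal pymongo sketch; host and replica set name are placeholders.
    from pymongo import MongoClient
    from pymongo.write_concern import WriteConcern

    client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
    coll = client.test.docs

    # w:1 -- acknowledged once the primary alone has the write.
    coll.with_options(write_concern=WriteConcern(w=1)).insert_one(
        {"_id": 1, "note": "may be rolled back after failover"})

    # w:majority, j:true -- acknowledged only after a majority of voting
    # members have journaled the write.
    coll.with_options(
        write_concern=WriteConcern(w="majority", j=True, wtimeout=5000)
    ).insert_one({"_id": 2, "note": "survives a single-node failure"})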

I do not understand how MongoDB decides which writes to keep and which to discard when a new primary is selected. In a 4-node (+arbiter) cluster, with N1 being primary and N2, N3, N4 secondaries, under this scenario:

  1. {w:majority, j:true} write hits the Primary.
  2. Secondaries poll for changes, primary waits for majority to confirm.
  3. N2 confirms the change.
  4. N3 has received the change and is in process of applying it.
  5. Primary goes down.
  6. N3 is unable to confirm to Primary it has applied the change.
  7. Election is forced.

Questions:

  • Will the write be available after the election completes?
  • Does it matter which node becomes the new primary?
  • Are results different if step #4 did not occur?
  • Are results different if Primary has received confirmation from N3?
  • Are results different if Primary has acknowledged the write, and N2 replicated the fact the write was confirmed by majority?
  • Are results different if both N2 and N3 replicated the fact that the write was confirmed by majority?
1 Answer

Joe (accepted answer):

First some background:

A MongoDB replica set uses an operations log (the oplog) for replication. Each write on the primary is appended to the oplog.

Each secondary node queries its sync source for oplog entries, retrieving and applying them in batches.

Oplog entries are idempotent, and must be applied in order.

Entries are written to the journal before being written to the oplog itself.

If a node crashes while applying a batch, it is safe to replay the entire batch from the journal, since each entry is idempotent.
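
A toy illustration of why replaying a batch is safe (Python, not MongoDB source; MongoDB rewrites operations like $inc into absolute values in the oplog):

    # Toy sketch: entries record resulting values, not transformations,
    # so applying the same entry twice leaves the data unchanged.
    def apply_entry(store, entry):
        store.setdefault(entry["doc_id"], {}).update(entry["set_fields"])

    store = {}
    batch = [
        {"doc_id": 1, "set_fields": {"qty": 5}},
        {"doc_id": 1, "set_fields": {"qty": 4}},  # oplog form of {$inc: {qty: -1}}
    ]

    for entry in batch:           # first application
        apply_entry(store, entry)
    for entry in batch:           # full replay after a crash mid-batch
        apply_entry(store, entry)

    assert store[1] == {"qty": 4}   # same state as a single application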

Members report to each other the identifier of the most recent oplog event they have applied.

Since oplog events must be applied in order, this means that each node has applied every entry up to and including the reported entry, and none after it.

In a 5-node replica set, the majority is 3 nodes. The primary will not acknowledge to the client/application that a w:majority write has completed until a majority of the nodes report their oplog has reached or passed the point where the write was added.
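
The arithmetic is just floor(n/2) + 1 over the voting members (sketch):

    # Majority of voting members: floor(n/2) + 1.
    def majority(voting_members: int) -> int:
        return voting_members // 2 + 1

    assert majority(5) == 3   # the 4-node + arbiter set discussed here
    assert majority(4) == 3   # a 4th data node instead of an arbiter
                              # would not lower the threshold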

Each node in a replica set monitors each of the other members.

If the primary realizes that it cannot communicate with a majority of the voting nodes, it will step down to secondary.

If a secondary does not have any communication to/from the primary for several seconds (10s by default), it will call for an election.
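
That timeout is part of the replica set configuration; one way to confirm the default (pymongo sketch, same placeholder connection as above):

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
    cfg = client.admin.command("replSetGetConfig")["config"]
    print(cfg["settings"]["electionTimeoutMillis"])   # 10000 (10s) by default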

The node calling for election will solicit votes from every replica set member.

Each member will usually vote "Yes"; a couple of reasons they may vote "No" are:

  • The candidate's most recent oplog entry is not at least as recent as the voting node's
  • The voting node is currently in communication with a primary of equal or greater priority than the candidate

If the candidate node receives enough "Yes" votes to constitute a majority of the voting nodes in the replica set, it transitions to primary and begins accepting writes.

Each election increments the term value of the replica set. If the primary receives a heartbeat or other message from one of the other nodes that contains a higher term than when it was elected, it immediately steps down to secondary.
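
A toy predicate capturing those two "No" conditions (Python; names are hypothetical, and real elections also compare terms and configuration versions):

    # Toy model of a single member's vote.
    def votes_yes(voter_last_optime, primary_priority_seen,
                  candidate_last_optime, candidate_priority):
        # "No" #1: candidate's newest oplog entry is older than the voter's.
        if candidate_last_optime < voter_last_optime:
            return False
        # "No" #2: voter is in contact with a primary of equal or
        # greater priority than the candidate.
        if (primary_priority_seen is not None
                and primary_priority_seen >= candidate_priority):
            return False
        return True

    assert not votes_yes(5, None, 4, 1)   # candidate is behind the voter
    assert votes_yes(4, None, 5, 1)       # candidate is ahead, no primary seen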

In your scenario:

{w:majority, j:true} write hits the Primary.

The primary will:

  • apply the write to its local copy of the data set
  • write the operation to the journal
  • write the operation to the oplog

Let's say the primary's oplog now contains:

Time  Operation
T1    create collection
T2    create index
T3    insert document 1
T4    insert document 2
T5    update document 1
T6    insert document 3

I will use the update at T5 as the w:majority write.

Secondaries poll for changes, primary waits for majority to confirm.

  • secondaries may not receive the data at the same time
  • each secondary may have a different last applied entry
  • secondary nodes do not respond differently to oplog events based on write concern (I don't think secondary nodes are even aware of the write concern)

N1 is the primary, and N5 is the arbiter. Prior to the write at T5, the remaining nodes show the following most recently applied oplog event:

N2: T4
N3: T4
N4: T2

N2 confirms the change.

The most recently applied oplog events are now:

N2: T5
N3: T4
N4: T3

N3 has received the change and is in process of applying it.

This means that N3 will not have reported a new last applied oplog time to any other node, and the primary will still not have acknowledged success or failure to the application process that submitted the write.

Primary goes down

The moment prior to the primary failing, the last applied optime for each member was:

N1: T6
N2: T5
N3: T4
N4: T3
N5: N/A

N3 is unable to confirm to Primary it has applied the change.

Reasonable under the circumstances.

Election is forced.

After the appropriate timeout, one or more of the secondary nodes will call for an election.

Let's assume that N3 actually did complete applying the operation from T5.

If N4 happens to be the one to hit the timeout first, it will call for an election, and request all members to vote for it. The voting results look like:

N4: YES (always vote for yourself)
N5: NO - I can see a node with a more recent oplog event than you have
N3: NO - I have a more recent oplog event than you have
N2: NO - ditto

At this point either N2 or N3 will stand for election next, and the votes look like:

Candidate N3: YES - self
N5: YES - you have the most recent event
N4: YES - you are more current than me
N2: YES - you are at least as up to date as me

This election is successful, and the node transitions to primary.
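
The same outcome falls out of a toy tally over the last-applied optimes (Python; T-values as integers, N3 assumed at T5 per the assumption above, N1 down):

    # Toy tally: a candidate needs 3 of 5 votes; data-bearing voters say
    # "No" to candidates with older optimes, and the arbiter votes "No"
    # if it can see a more current node.
    optimes = {"N2": 5, "N3": 5, "N4": 3}

    def wins(candidate):
        newest = max(optimes.values())
        yes = 1                               # candidates vote for themselves
        yes += sum(1 for n, t in optimes.items()
                   if n != candidate and optimes[candidate] >= t)
        yes += 1 if optimes[candidate] == newest else 0   # arbiter N5
        return yes >= 3                       # majority of 5 voting members

    assert not wins("N4")   # N2 and N3 both have more recent entries
    assert wins("N3")       # at least as current as every voter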

This sequence of events can occur if N3 has applied the event from T5, regardless of whether or not it was reported to the primary. Since all nodes monitor all other nodes, N3 will be reporting its latest applied optime to N2, N4, and N5.

Note that if #4 had not happened, N3 would still be at T4, and would not be elected since N2 had a more recent event.

In either case, the most recent event applied by any secondary participating in the election will be retained.

When N1 recovers and attempts to rejoin the replica set, it is quite likely that the oplog on the new primary has moved beyond T6. The current primary has no knowledge of the write at T6, and the former primary cannot rejoin as a secondary unless it can replay the current primary's oplog entries, so the two nodes compare their oplog histories until a common point is found.

In this case, that will be T5.

The former primary then rolls back the change made by the operation at T6, and writes the rolled back data to a local disk file so it can be recovered later, if necessary. It then requests the oplog entries that came after T5 from the new primary, and begins replicating as a secondary node.
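
The common-point search can be pictured like this (toy sketch; real MongoDB compares full optimes, i.e. term plus timestamp, rather than labels):

    # Toy sketch of the rollback common-point search.
    def common_point(former_primary_log, new_primary_log):
        shared = set(new_primary_log)
        # walk the former primary's oplog backwards to the newest shared entry
        return next((e for e in reversed(former_primary_log) if e in shared),
                    None)

    n1_log = ["T1", "T2", "T3", "T4", "T5", "T6"]   # former primary N1
    n3_log = ["T1", "T2", "T3", "T4", "T5", "T7"]   # new primary, new term

    assert common_point(n1_log, n3_log) == "T5"     # N1 rolls back T6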

In that scenario the only write that is lost is one that was processed by the primary, but not replicated to any other node.

Are results different if Primary has acknowledged the write, and N2 replicated the fact the write was confirmed by majority?

Are results different if both N2 and N3 replicated the fact that the write was confirmed by majority?

This fact does not actually need to be replicated. Each node is monitoring each other node, so each secondary node will be well aware of which oplog entries have been seen by a majority of the replica set without being told.

The only difference would be that if N3 had also reported to the primary that it had replicated the operation, the primary could have reported to the application that the majority write was successful.

Alternate scenario

IF:

  • #4 did not occur
  • the failure was a network partition rather than a node failure
  • N1 and N2 were on one side of the partition, N3, N4, N5 on the other

In this case N1 would realize that it could not contact a majority of the replica set, and would step down.

If N1 or N2 attempted to call for an election, they would be unable to acquire the required 3 votes.

N4 would still not be elected, due to being behind N3.

When N3 calls for an election, N4 and N5 would both vote "Yes", so N3 would become the primary.

This means that when the network partition is healed, both N1 and N2 would find that they have writes the current primary has no knowledge of.

In this scenario, both the T5 and T6 writes would be rolled back.
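
Applying the same common-point idea to this scenario (toy sketch, reusing the function from the earlier sketch): with step #4 absent, the shared history ends at T4, so everything after it on the partitioned side rolls back.

    def common_point(a, b):   # same as the earlier sketch
        shared = set(b)
        return next(e for e in reversed(a) if e in shared)

    n1_log = ["T1", "T2", "T3", "T4", "T5", "T6"]   # N1's side of the partition
    n3_log = ["T1", "T2", "T3", "T4", "T8"]         # new primary N3's history

    assert common_point(n1_log, n3_log) == "T4"     # T5 and T6 roll back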