Replica election and promotion
Replica election and promotion is handled by replica nodes, with the help of master nodes that vote for the replica to promote. A replica election happens when a master is in FAIL state from the point of view of at least one of its replicas that has the prerequisites in order to become a master.
In order for a replica to promote itself to master, it needs to start an election and win it. All the replicas of a given master can start an election if the master is in FAIL state; however, only one replica will win the election and promote itself to master.
A replica starts an election when the following conditions are met:
- The replica’s master is in FAIL state.
- The master was serving a non-zero number of slots.
- The replica’s replication link was disconnected from the master for no longer than a given amount of time, in order to ensure the promoted replica’s data is reasonably fresh. This time is user configurable.
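The three conditions above can be read as a single predicate. The following is a minimal sketch, not the actual implementation: the `ReplicaView` type and all of its field names (including `max_link_down_ms` for the user-configurable limit) are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReplicaView:
    """What a replica knows about its master (hypothetical field names)."""
    master_in_fail_state: bool   # the master is flagged FAIL
    master_slot_count: int       # number of slots the master was serving
    ms_since_link_down: int      # time since the replication link dropped
    max_link_down_ms: int        # user-configurable freshness limit (assumed name)


def can_start_election(view: ReplicaView) -> bool:
    """Return True only when all three listed conditions hold."""
    return (
        view.master_in_fail_state
        and view.master_slot_count > 0
        and view.ms_since_link_down <= view.max_link_down_ms
    )
```

For example, a replica whose master is in FAIL state but served zero slots would not start an election under this sketch.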
In order to be elected, the first step for a replica is to increment its currentEpoch counter, and request votes from master instances.
Votes are requested by the replica by broadcasting a FAILOVER_AUTH_REQUEST packet to every master node of the cluster. Then it waits for a maximum time of two times the NODE_TIMEOUT for replies to arrive (but always for at least 2 seconds).
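As a rough illustration of the first step of an election, the sketch below increments the epoch and computes the reply deadline. It assumes NODE_TIMEOUT is expressed in milliseconds; the `Replica` class, the dictionary standing in for the packet, and the function names are hypothetical, and no real network broadcast is performed.

```python
from dataclasses import dataclass

MIN_VOTE_WAIT_MS = 2000  # "always for at least 2 seconds"


@dataclass
class Replica:
    current_epoch: int  # mirrors the currentEpoch counter


def vote_wait_ms(node_timeout_ms: int) -> int:
    """Maximum time to wait for replies: 2 * NODE_TIMEOUT, floor of 2 s."""
    return max(2 * node_timeout_ms, MIN_VOTE_WAIT_MS)


def start_election(replica: Replica, now_ms: int, node_timeout_ms: int) -> dict:
    """Bump currentEpoch and describe the vote request to broadcast.

    The returned dict is only a stand-in for a FAILOVER_AUTH_REQUEST
    packet; its layout is an assumption made for this example.
    """
    replica.current_epoch += 1
    return {
        "type": "FAILOVER_AUTH_REQUEST",
        "epoch": replica.current_epoch,
        "deadline_ms": now_ms + vote_wait_ms(node_timeout_ms),
    }
```

With a NODE_TIMEOUT of 500 ms the 2 second floor applies, while with 15000 ms the replica would wait up to 30000 ms.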
Once a master has voted for a given replica, replying positively with a FAILOVER_AUTH_ACK, it can no longer vote for another replica of the same master for a period of NODE_TIMEOUT * 2. In this period it will not be able to reply to other authorization requests for the same master. This is not needed to guarantee safety, but it is useful for preventing multiple replicas from getting elected (even if with a different configEpoch) at around the same time, which is usually not wanted.
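The per-master voting pause described above could be tracked with a small table keyed by the failed master. This is a sketch under stated assumptions: the `MasterVoter` class and its method are hypothetical, times are in milliseconds, it models only the NODE_TIMEOUT * 2 pause and none of the other checks a master may apply before voting, and the choice to allow a new vote at exactly NODE_TIMEOUT * 2 is arbitrary.

```python
class MasterVoter:
    """Tracks when this master last voted for a replica of each failed master."""

    def __init__(self, node_timeout_ms: int):
        self.node_timeout_ms = node_timeout_ms
        self.last_vote_ms = {}  # failed master id -> time of the last vote

    def try_vote(self, failed_master_id: str, now_ms: int) -> bool:
        """Return True (and record the vote) if a FAILOVER_AUTH_ACK may be sent."""
        last = self.last_vote_ms.get(failed_master_id)
        if last is not None and now_ms - last < 2 * self.node_timeout_ms:
            # Still inside the NODE_TIMEOUT * 2 window for this master:
            # the request is not answered.
            return False
        self.last_vote_ms[failed_master_id] = now_ms
        return True
```

Note that the window is per failed master, so a vote for a replica of one master does not block a vote for a replica of a different master.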
A replica discards any AUTH_ACK replies with an epoch that is less than the currentEpoch at the time the vote request was sent. This ensures it doesn’t count votes intended for a previous election.
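The stale-vote filter amounts to a single comparison against the epoch recorded when the request was sent. The helper below is a minimal sketch with hypothetical names; replies are represented simply as a list of epoch numbers.

```python
def count_valid_acks(ack_epochs: list[int], request_epoch: int) -> int:
    """Count AUTH_ACK replies whose epoch is not older than the request.

    request_epoch is the currentEpoch value at the time the vote
    request was sent; replies with a smaller epoch are discarded.
    """
    return sum(1 for epoch in ack_epochs if epoch >= request_epoch)
```

For a request sent at epoch 7, replies carrying epochs 6, 7 and 7 would count as two votes, since the epoch 6 reply belongs to an earlier election.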
Once the replica receives ACKs from the majority of masters, it wins the election. Otherwise, if the majority is not reached within the period of two times NODE_TIMEOUT (but always at least 2 seconds), the election is aborted and a new one will be tried after NODE_TIMEOUT * 4 (and always at least 4 seconds).
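The win condition and the retry delay can be sketched as two small functions. This assumes that "majority" means strictly more than half of the masters and that NODE_TIMEOUT is in milliseconds; the function names are hypothetical.

```python
MIN_RETRY_DELAY_MS = 4000  # "always at least 4 seconds"


def has_majority(valid_acks: int, master_count: int) -> bool:
    """True when more than half of the masters have acknowledged."""
    return valid_acks >= master_count // 2 + 1


def retry_delay_ms(node_timeout_ms: int) -> int:
    """Delay before a new election attempt: 4 * NODE_TIMEOUT, floor of 4 s."""
    return max(4 * node_timeout_ms, MIN_RETRY_DELAY_MS)
```

Under these assumptions, 2 ACKs are enough in a cluster with 3 masters but not in one with 4, where 3 are needed.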