Redis Cluster official documentation (selected excerpts)

Redis cluster tutorial

This tutorial tries to provide information about the availability and consistency characteristics of Redis Cluster from the point of view of the final user, stated in a simple to understand way.

Note this tutorial requires Redis version 3.0 or higher.


Redis Cluster 101

Redis Cluster provides a way to run a Redis installation where data is automatically sharded across multiple Redis nodes.

Redis Cluster also provides some degree of availability during partitions, that is, in practical terms, the ability to continue operations when some nodes fail or are not able to communicate. However, the cluster stops operating in the event of larger failures (for example when the majority of masters are unavailable).

So in practical terms, what do you get with Redis Cluster?

  • The ability to automatically split your dataset among multiple nodes. (The dataset is sharded across the cluster nodes.)
  • The ability to continue operations when a subset of the nodes are experiencing failures or are unable to communicate with the rest of the cluster. (Operations can continue even when some nodes in the cluster fail.)

Redis Cluster TCP ports

Every Redis Cluster node requires two open TCP connections: the normal Redis TCP port used to serve clients, for example 6379, plus a second port named the cluster bus port. The cluster bus port is derived by adding 10000 to the data port, 16379 in this example, or can be overridden with the cluster-port config.

This second high port is used for the Cluster bus, a node-to-node communication channel that uses a binary protocol. The Cluster bus is used by nodes for failure detection, configuration update, failover authorization and so forth. Clients should never try to communicate with the cluster bus port, but always with the normal Redis command port. However, make sure you open both ports in your firewall, otherwise Redis Cluster nodes will not be able to communicate.

Note that for a Redis Cluster to work properly you need, for each node:

  1. The normal client communication port (usually 6379) used to communicate with clients to be open to all the clients that need to reach the cluster, plus all the other cluster nodes (that use the client port for key migrations).
  2. The cluster bus port must be reachable from all the other cluster nodes.

If you don’t open both TCP ports, your cluster will not work as expected.

The cluster bus uses a different, binary protocol for node-to-node data exchange, which is better suited to exchanging information between nodes using little bandwidth and processing time.
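
For reference, here is a minimal sketch of the cluster-related settings each node would carry in its redis.conf (the directive names are the standard ones; the values are only illustrative):

    port 6379                       # normal client port, open to clients and to the other nodes
    cluster-enabled yes             # run this instance as a cluster node
    cluster-config-file nodes.conf  # where the node persists its view of the cluster
    cluster-node-timeout 15000      # node timeout in milliseconds
    # cluster-port 16379            # optional: override the default bus port (client port + 10000)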

Redis Cluster and Docker

Currently Redis Cluster does not support NATted (NAT – Network Address Translation) environments and in general environments where IP addresses or TCP ports are remapped.

Docker uses a technique called port mapping: programs running inside Docker containers may be exposed with a different port compared to the one the program believes to be using. This is useful in order to run multiple containers using the same ports, at the same time, in the same server.

In order to make Docker compatible with Redis Cluster you need to use the host networking mode of Docker. Please check the --net=host option in the Docker documentation for more information.

Redis Cluster data sharding

Redis Cluster does not use consistent hashing, but a different form of sharding where every key is conceptually part of what we call a hash slot. (It does not use consistent hashing; it uses hash slots.)

There are 16384 hash slots in Redis Cluster, and to compute the hash slot of a given key, we simply take the CRC16 of the key modulo 16384.

Every node in a Redis Cluster is responsible for a subset of the hash slots, so for example you may have a cluster with 3 nodes, where:

  • Node A contains hash slots from 0 to 5500.
  • Node B contains hash slots from 5501 to 11000.
  • Node C contains hash slots from 11001 to 16383.

This makes it easy to add and remove nodes in the cluster. For example, if I want to add a new node D, I need to move some hash slots from nodes A, B, C to D. Similarly, if I want to remove node A from the cluster I can just move the hash slots served by A to B and C. When node A is empty I can remove it from the cluster completely. (Hash slots can be reassigned dynamically; once all of a node's hash slots have been moved away, the node can be removed.)

Because moving hash slots from a node to another does not require stopping operations, adding and removing nodes, or changing the percentage of hash slots held by nodes, does not require any downtime. (Moving hash slots does not require stopping operations, so adding or removing nodes, or changing the share of hash slots held by each node, requires no downtime.)

Redis Cluster supports multiple key operations as long as all the keys involved in a single command execution (or whole transaction, or Lua script execution) belong to the same hash slot. The user can force multiple keys to be part of the same hash slot by using a concept called hash tags. (Users can force certain keys into the same hash slot by using hash tags.)

Hash tags are documented in the Redis Cluster specification, but the gist is that if there is a substring between {} brackets in a key, only what is inside the braces is hashed, so for example this{foo}key and another{foo}key are guaranteed to be in the same hash slot, and can be used together in a command with multiple keys as arguments. (The gist: if a key contains a substring wrapped in curly braces, only that substring is hashed.)
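
The slot computation is simple enough to sketch. Here is a small Python version, assuming the CRC-16 variant named in the Redis Cluster specification (CRC-16/XMODEM) and the basic hash tag rule described above; a real client library already does this for you:

    def crc16(data: bytes) -> int:
        # CRC-16/XMODEM: polynomial 0x1021, initial value 0, no reflection.
        crc = 0
        for byte in data:
            crc ^= byte << 8
            for _ in range(8):
                if crc & 0x8000:
                    crc = ((crc << 1) ^ 0x1021) & 0xFFFF
                else:
                    crc = (crc << 1) & 0xFFFF
        return crc

    def hash_slot(key: str) -> int:
        # Hash tag rule: if the key has a non-empty {...} section, only the part
        # between the first '{' and the next '}' is hashed.
        start = key.find("{")
        if start != -1:
            end = key.find("}", start + 1)
            if end > start + 1:
                key = key[start + 1:end]
        return crc16(key.encode()) % 16384

    # Both keys hash only the substring "foo", so they land in the same slot.
    assert hash_slot("this{foo}key") == hash_slot("another{foo}key")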

Redis Cluster master-replica model

In order to remain available when a subset of master nodes are failing or are not able to communicate with the majority of nodes, Redis Cluster uses a master-replica model where every hash slot has from 1 (the master itself) to N replicas (N-1 additional replica nodes). (The master-replica model gives every hash slot 1 to N copies: each hash slot is served by a master node, and each master can have multiple replica nodes.)

In our example cluster with nodes A, B, C, if node B fails the cluster is not able to continue, since we no longer have a way to serve hash slots in the range 5501-11000.

However when the cluster is created (or at a later time) we add a replica node to every master, so that the final cluster is composed of A, B, C that are master nodes, and A1, B1, C1 that are replica nodes. This way, the system is able to continue if node B fails.

Node B1 replicates B, so if B fails, the cluster will promote node B1 as the new master and will continue to operate correctly.

However, note that if nodes B and B1 fail at the same time, Redis Cluster is not able to continue to operate.

Redis Cluster consistency guarantees

Redis Cluster is not able to guarantee strong consistency. (It cannot guarantee strong consistency.) In practical terms this means that under certain conditions it is possible that Redis Cluster will lose writes that were acknowledged by the system to the client. (Under certain conditions, acknowledged writes can be lost.)

The first reason why Redis Cluster can lose writes is because it uses asynchronous replication. This means that during writes the following happens (the first reason writes can be lost: the cluster uses asynchronous replication):

  • Your client writes to the master B.
  • The master B replies OK to your client.
  • The master B propagates the write to its replicas B1, B2 and B3.

As you can see, B does not wait for an acknowledgement from B1, B2, B3 before replying to the client, since this would be a prohibitive latency penalty for Redis. So if your client writes something, B acknowledges the write but crashes before being able to send the write to its replicas; one of the replicas (that did not receive the write) can then be promoted to master, losing the write forever. (After the master receives a write and replies to the client that it succeeded, it may crash while propagating the write to its replicas. If a replica that never received the write is then promoted to master, the client's acknowledged write is lost forever, because the other replicas will resynchronize from the new master.)

This is very similar to what happens with most databases that are configured to flush data to disk every second, so it is a scenario you are already able to reason about because of past experiences with traditional database systems not involving distributed systems. Similarly you can improve consistency by forcing the database to flush data to disk before replying to the client, but this usually results in prohibitively low performance. That would be the equivalent of synchronous replication in the case of Redis Cluster.

Basically, there is a trade-off to be made between performance and consistency.

Redis Cluster has support for synchronous writes when absolutely needed, implemented via the WAIT command. This makes losing writes a lot less likely. (When it is really needed, the cluster supports synchronous writes via the WAIT command, which greatly reduces the chance of losing writes.) However, note that Redis Cluster does not implement strong consistency even when synchronous replication is used: it is always possible, under more complex failure scenarios, that a replica that was not able to receive the write will be elected as master. (Even with synchronous replication, strong consistency is not achieved: in some complex failure scenarios a replica that could not receive the write may still be promoted to master.)
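
As a concrete illustration, here is a minimal sketch using the redis-py client (the client library, host and port are assumptions; only the WAIT semantics come from the paragraph above):

    # Sketch: make a write "synchronous" by waiting for replica acknowledgement.
    import redis

    r = redis.Redis(host="127.0.0.1", port=7000)  # assumed address of a master

    r.set("balance{user:1000}", 42)
    # WAIT <numreplicas> <timeout-ms>: block until at least 1 replica has
    # acknowledged the write, or 100 ms have passed; returns how many acknowledged.
    acked = r.execute_command("WAIT", 1, 100)
    if acked < 1:
        print("write exists only on the master and could be lost on failover")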

There is another notable scenario where Redis Cluster will lose writes, which happens during a network partition where a client is isolated with a minority of instances including at least a master. (Another notable write-loss scenario: after a network partition, the client ends up on the minority side together with at least one master; the majority side may then elect a new master for those slots, and the writes the client sent to the master on the minority side can be lost.)

Take as an example our 6 nodes cluster composed of A, B, C, A1, B1, C1, with 3 masters and 3 replicas. There is also a client, that we will call Z1.

After a partition occurs, it is possible that in one side of the partition we have A, C, A1, B1, C1, and in the other side we have B and Z1.

Z1 is still able to write to B, which will accept its writes. If the partition heals in a very short time, the cluster will continue normally. However, if the partition lasts enough time for B1 to be promoted to master on the majority side of the partition, the writes that Z1 has sent to B in the meantime will be lost.

Note that there is a maximum window to the amount of writes Z1 will be able to send to B: if enough time has elapsed for the majority side of the partition to elect a replica as master, every master node in the minority side will have stopped accepting writes.

This amount of time is a very important configuration directive of Redis Cluster, and is called the node timeout. (The cluster has a node timeout setting: after a partition, a minority-side master can still accept writes during this window, but stops accepting them once it expires.)

After node timeout has elapsed, a master node is considered to be failing, and can be replaced by one of its replicas. Similarly, after node timeout has elapsed without a master node being able to sense the majority of the other master nodes, that master enters an error state and stops accepting writes.

Redis Cluster Specification

Main properties and rationales of the design

Redis Cluster goals

Redis Cluster is a distributed implementation of Redis with the following goals, in order of importance in the design:

  • High performance and linear scalability up to 1000 nodes. There are no proxies, asynchronous replication is used, and no merge operations are performed on values.
  • Acceptable degree of write safety: the system tries (in a best-effort way) to retain all the writes originating from clients connected with the majority of the master nodes. Usually there are small windows where acknowledged writes can be lost. Windows to lose acknowledged writes are larger when clients are in a minority partition.
  • Availability: Redis Cluster is able to survive partitions where the majority of the master nodes are reachable and there is at least one reachable replica for every master node that is no longer reachable. Moreover, using replica migration, masters no longer replicated by any replica will receive one from a master which is covered by multiple replicas. (Availability: after a network partition, the cluster remains usable on the side that can reach the majority of masters, provided every unreachable master has at least one replica on that side.)

What is described in this document is implemented in Redis 3.0 or greater.

Implemented subset

Redis Cluster implements all the single key commands available in the non-distributed version of Redis. Commands performing complex multi-key operations like Set type unions or intersections are implemented as well as long as the keys all hash to the same slot.

Redis Cluster implements a concept called hash tags that can be used in order to force certain keys to be stored in the same hash slot. However during manual resharding, multi-key operations may become unavailable for some time while single key operations are always available.

Redis Cluster does not support multiple databases like the standalone version of Redis. There is just database 0 and the SELECT command is not allowed.

Clients and Servers roles in the Redis Cluster protocol

In Redis Cluster nodes are responsible for holding the data, and taking the state of the cluster, including mapping keys to the right nodes. Cluster nodes are also able to auto-discover other nodes, detect non-working nodes, and promote replica nodes to master when needed in order to continue to operate when a failure occurs.

To perform their tasks all the cluster nodes are connected using a TCP bus and a binary protocol, called the Redis Cluster Bus. (All cluster nodes implement their functions over this Redis Cluster Bus.) Every node is connected to every other node in the cluster using the cluster bus. (Every node in the cluster is connected to every other node through the cluster bus.) Nodes use a gossip protocol to propagate information about the cluster in order to discover new nodes, to send ping packets to make sure all the other nodes are working properly, and to send cluster messages needed to signal specific conditions. (A gossip protocol is used to propagate information.) The cluster bus is also used in order to propagate Pub/Sub messages across the cluster and to orchestrate manual failovers when requested by users (manual failovers are failovers which are not initiated by the Redis Cluster failure detector, but by the system administrator directly).

Since cluster nodes are not able to proxy requests, clients may be redirected to other nodes using redirection errors -MOVED and -ASK. The client is in theory free to send requests to all the nodes in the cluster, getting redirected if needed, so the client is not required to hold the state of the cluster. However clients that are able to cache the map between keys and nodes can improve the performance in a sensible way.
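
A rough sketch of what a cluster-aware client does with a -MOVED error follows (a production client such as redis-py's RedisCluster does this internally; the reply format "MOVED <slot> <host>:<port>" is from the specification, while the helper names are illustrative):

    # Sketch: follow a -MOVED redirection and refresh the client-side slot map.
    slot_map = {}  # slot -> (host, port), the cached routing table

    def parse_moved(error_message: str):
        # A MOVED error reply looks like: "MOVED 3999 127.0.0.1:6381"
        parts = error_message.split()
        if len(parts) != 3 or parts[0] != "MOVED":
            return None
        slot = int(parts[1])
        host, _, port = parts[2].rpartition(":")
        return slot, host, int(port)

    def on_error_reply(message: str):
        moved = parse_moved(message)
        if moved is None:
            return None
        slot, host, port = moved
        slot_map[slot] = (host, port)  # remember the new owner of this slot
        return (host, port)            # the caller retries the command there

    print(on_error_reply("MOVED 3999 127.0.0.1:6381"))  # -> ('127.0.0.1', 6381)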

Write safety

Redis Cluster uses asynchronous replication between nodes, and a last-failover-wins implicit merge function. This means that the dataset of the last elected master eventually replaces all the other replicas. (The dataset of the most recently elected master is treated as the latest and is eventually replicated by all the other replicas.) There is always a window of time when it is possible to lose writes during partitions. However these windows are very different in the case of a client that is connected to the majority of masters, and a client that is connected to the minority of masters.

Redis Cluster tries harder to retain writes that are performed by clients connected to the majority of masters, compared to writes performed in the minority side. The following are examples of scenarios that lead to loss of acknowledged writes received in the majority partitions during failures:

  1. A write may reach a master, but while the master may be able to reply to the client, the write may not be propagated to replicas via the asynchronous replication used between master and replica nodes. If the master dies without the write reaching the replicas, the write is lost forever if the master is unreachable for a long enough period that one of its replicas is promoted. This is usually hard to observe in the case of a total, sudden failure of a master node since masters try to reply to clients (with the acknowledge of the write) and replicas (propagating the write) at about the same time. However it is a real world failure mode.
  2. Another theoretically possible failure mode where writes are lost is the following:
  • A master is unreachable because of a partition.
  • It gets failed over by one of its replicas.
  • After some time it may be reachable again.
  • A client with an out-of-date routing table may write to the old master before it is converted into a replica (of the new master) by the cluster.

The second failure mode is unlikely to happen because master nodes unable to communicate with the majority of the other masters for enough time to be failed over will no longer accept writes, and when the partition is fixed writes are still refused for a small amount of time to allow other nodes to inform about configuration changes. This failure mode also requires that the client’s routing table has not yet been updated.

Writes targeting the minority side of a partition have a larger window in which to get lost. For example, Redis Cluster loses a non-trivial number of writes on partitions where there is a minority of masters and at least one or more clients, since all the writes sent to the masters may potentially get lost if the masters are failed over in the majority side.

Specifically, for a master to be failed over it must be unreachable by the majority of masters for at least NODE_TIMEOUT, so if the partition is fixed before that time, no writes are lost. When the partition lasts for more than NODE_TIMEOUT, all the writes performed in the minority side up to that point may be lost. However the minority side of a Redis Cluster will start refusing writes as soon as NODE_TIMEOUT time has elapsed without contact with the majority, so there is a maximum window after which the minority becomes no longer available. Hence, no writes are accepted or lost after that time.

Availability

Redis Cluster is not available in the minority side of the partition. In the majority side of the partition assuming that there are at least the majority of masters and a replica for every unreachable master, the cluster becomes available again after NODE_TIMEOUT time plus a few more seconds required for a replica to get elected and failover its master (failovers are usually executed in a matter of 1 or 2 seconds).

This means that Redis Cluster is designed to survive failures of a few nodes in the cluster, but it is not a suitable solution for applications that require availability in the event of large net splits.

In the example of a cluster composed of N master nodes where every node has a single replica, the majority side of the cluster will remain available as long as a single node is partitioned away, and will remain available with a probability of 1-(1/(N*2-1)) when two nodes are partitioned away (after the first node fails we are left with N*2-1 nodes in total, and the probability that the only master without a replica also fails is 1/(N*2-1)).

For example, in a cluster with 5 nodes and a single replica per node, there is a 1/(5*2-1) = 11.11% probability that after two nodes are partitioned away from the majority, the cluster will no longer be available.
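
The arithmetic is easy to reproduce; here is a small sketch using the formula from the paragraphs above (the list of cluster sizes is just for illustration):

    # Probability that the cluster becomes unavailable once a second node is
    # partitioned away, for N masters each having exactly one replica.
    for n_masters in (3, 5, 10, 100):
        remaining = n_masters * 2 - 1      # nodes left after the first node is gone
        p_unavailable = 1 / remaining      # only losing the now-orphaned master is fatal
        print(n_masters, f"{p_unavailable:.2%}")
    # With 5 masters this prints 11.11%, matching the example above.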

Thanks to a Redis Cluster feature called replicas migration the Cluster availability is improved in many real world scenarios by the fact that replicas migrate to orphaned masters (masters no longer having replicas). So at every successful failure event, the cluster may reconfigure the replicas layout in order to better resist the next failure.

Performance

In Redis Cluster nodes don’t proxy commands to the right node in charge for a given key, but instead they redirect clients to the right nodes serving a given portion of the key space.

Eventually clients obtain an up-to-date representation of the cluster and which node serves which subset of keys, so during normal operations clients directly contact the right nodes in order to send a given command.

Because of the use of asynchronous replication, nodes do not wait for other nodes’ acknowledgment of writes (if not explicitly requested using the WAIT command).

Also, because multi-key commands are only limited to near keys, data is never moved between nodes except when resharding.

Normal operations are handled exactly as in the case of a single Redis instance. This means that in a Redis Cluster with N master nodes you can expect the same performance as a single Redis instance multiplied by N as the design scales linearly. At the same time the query is usually performed in a single round trip, since clients usually retain persistent connections with the nodes, so latency figures are also the same as the single standalone Redis node case.

Very high performance and scalability while preserving weak but reasonable forms of data safety and availability is the main goal of Redis Cluster.

Why merge operations are avoided

Redis Cluster design avoids conflicting versions of the same key-value pair in multiple nodes, since in the case of the Redis data model this is not always desirable. Values in Redis are often very large; it is common to see lists or sorted sets with millions of elements. Also data types are semantically complex. Transferring and merging these kinds of values can be a major bottleneck and/or may require the non-trivial involvement of application-side logic, additional memory to store meta-data, and so forth.

There are no strict technological limits here. CRDTs or synchronously replicated state machines can model complex data types similar to Redis. However, the actual run time behavior of such systems would not be similar to Redis Cluster. Redis Cluster was designed in order to cover the exact use cases of the non-clustered Redis version.

Overview of Redis Cluster main components

Keys distribution model

Keys hash tags

Cluster nodes attributes

The Cluster bus

Cluster topology

Nodes handshake

Redirection and resharding

MOVED Redirection

Cluster live reconfiguration

ASK redirection

Clients first connection and handling of redirections

Multiple keys operations

Scaling reads using replica nodes

Fault Tolerance

Heartbeat and gossip messages

Redis Cluster nodes continuously exchange ping and pong packets. Those two kinds of packets have the same structure, and both carry important configuration information. The only actual difference is the message type field. We’ll refer to the sum of ping and pong packets as heartbeat packets. (Heartbeat packets carry important configuration information.)

Usually nodes send ping packets that will trigger the receivers to reply with pong packets. However this is not necessarily true. It is possible for nodes to just send pong packets to send information to other nodes about their configuration, without triggering a reply. This is useful, for example, in order to broadcast a new configuration as soon as possible.

Usually a node will ping a few random nodes every second so that the total number of ping packets sent (and pong packets received) by each node is a constant amount regardless of the number of nodes in the cluster.

However every node makes sure to ping every other node that hasn’t sent a ping or received a pong for longer than half the NODE_TIMEOUT time. Before NODE_TIMEOUT has elapsed, nodes also try to reconnect the TCP link with another node to make sure nodes are not believed to be unreachable only because there is a problem in the current TCP connection.

The number of messages globally exchanged can be sizable if NODE_TIMEOUT is set to a small figure and the number of nodes (N) is very large, since every node will try to ping every other node for which they don’t have fresh information every half the NODE_TIMEOUT time.

For example in a 100 node cluster with a node timeout set to 60 seconds, every node will try to send 99 pings every 30 seconds, with a total amount of pings of 3.3 per second. Multiplied by 100 nodes, this is 330 pings per second in the total cluster.
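
The same back-of-the-envelope calculation, parameterized over cluster size and node timeout (a sketch that simply reproduces the numbers above):

    # Rough upper bound on failure-detection traffic: each node pings every peer
    # it lacks fresh information about at least once per NODE_TIMEOUT / 2.
    def cluster_ping_rate(num_nodes: int, node_timeout_s: float) -> float:
        pings_per_node = (num_nodes - 1) / (node_timeout_s / 2)
        return pings_per_node * num_nodes

    print(round(cluster_ping_rate(100, 60)))  # -> 330 pings per second cluster-wide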

There are ways to lower the number of messages, however there have been no reported issues with the bandwidth currently used by Redis Cluster failure detection, so for now the obvious and direct design is used. Note that even in the above example, the 330 packets per second exchanged are evenly divided among 100 different nodes, so the traffic each node receives is acceptable.

Heartbeat packet content

Failure detection

Configuration handling, propagation, and failovers

Cluster current epoch

Redis Cluster uses a concept similar to the Raft algorithm “term”. In Redis Cluster the term is called epoch instead, and it is used in order to give incremental versioning to events. When multiple nodes provide conflicting information, it becomes possible for another node to understand which state is the most up to date.

The currentEpoch is a 64 bit unsigned number.

At node creation every Redis Cluster node, both replicas and master nodes, sets the currentEpoch to 0.

Every time a packet is received from another node, if the epoch of the sender (part of the cluster bus messages header) is greater than the local node epoch, the currentEpoch is updated to the sender epoch.

Because of these semantics, eventually all the nodes will agree to the greatest currentEpoch in the cluster.

This information is used when the state of the cluster is changed and a node seeks agreement in order to perform some action.

Currently this happens only during replica promotion, as described in the next section. Basically the epoch is a logical clock for the cluster and dictates that, of two pieces of conflicting information, the one with the greater epoch wins.

Configuration epoch

Every master always advertises its configEpoch in ping and pong packets along with a bitmap advertising the set of slots it serves.

The configEpoch is set to zero in masters when a new node is created.

A new configEpoch is created during replica election. Replicas trying to replace failing masters increment their epoch and try to get authorization from a majority of masters. When a replica is authorized, a new unique configEpoch is created and the replica turns into a master using the new configEpoch.

As explained in the next sections the configEpoch helps to resolve conflicts when different nodes claim divergent configurations (a condition that may happen because of network partitions and node failures).

Replica nodes also advertise the configEpoch field in ping and pong packets, but in the case of replicas the field represents the configEpoch of its master as of the last time they exchanged packets. This allows other instances to detect when a replica has an old configuration that needs to be updated (master nodes will not grant votes to replicas with an old configuration).

Every time the configEpoch changes for some known node, it is permanently stored in the nodes.conf file by all the nodes that receive this information. The same also happens for the currentEpoch value. These two variables are guaranteed to be saved and fsync-ed to disk when updated before a node continues its operations.

The configEpoch values generated using a simple algorithm during failovers are guaranteed to be new, incremental, and unique.
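
A compact sketch of the two epoch rules described in this and the following sections (the data structures are illustrative; the rules are: adopt the greatest currentEpoch seen, and let a greater configEpoch win ownership of a slot):

    # Sketch: how a node reconciles epochs carried by an incoming heartbeat packet.
    current_epoch = 0
    slot_owners = {}   # slot -> (node_id, configEpoch)

    def on_heartbeat(sender_id, sender_current_epoch, sender_config_epoch, claimed_slots):
        global current_epoch
        # currentEpoch is a logical clock: always move it forward, never back.
        if sender_current_epoch > current_epoch:
            current_epoch = sender_current_epoch
        # A slot claimed with a greater configEpoch wins over the current owner.
        for slot in claimed_slots:
            owner = slot_owners.get(slot)
            if owner is None or sender_config_epoch > owner[1]:
                slot_owners[slot] = (sender_id, sender_config_epoch)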

Replica election and promotion

Replica election and promotion is handled by replica nodes, with the help of master nodes that vote for the replica to promote. A replica election happens when a master is in FAIL state from the point of view of at least one of its replicas that has the prerequisites in order to become a master.

In order for a replica to promote itself to master, it needs to start an election and win it. All the replicas for a given master can start an election if the master is in FAIL state, however only one replica will win the election and promote itself to master.

A replica starts an election when the following conditions are met:

  • The replica’s master is in FAIL state.
  • The master was serving a non-zero number of slots.
  • The replica replication link was disconnected from the master for no longer than a given amount of time, in order to ensure the promoted replica’s data is reasonably fresh. This time is user configurable.

In order to be elected, the first step for a replica is to increment its currentEpoch counter, and request votes from master instances.

Votes are requested by the replica by broadcasting a FAILOVER_AUTH_REQUEST packet to every master node of the cluster. Then it waits for a maximum time of two times the NODE_TIMEOUT for replies to arrive (but always for at least 2 seconds).

Once a master has voted for a given replica, replying positively with a FAILOVER_AUTH_ACK, it can no longer vote for another replica of the same master for a period of NODE_TIMEOUT * 2. In this period it will not be able to reply to other authorization requests for the same master. This is not needed to guarantee safety, but useful for preventing multiple replicas from getting elected (even if with a different configEpoch) at around the same time, which is usually not wanted.

A replica discards any AUTH_ACK replies with an epoch that is less than the currentEpoch at the time the vote request was sent. This ensures it doesn’t count votes intended for a previous election.

Once the replica receives ACKs from the majority of masters, it wins the election. Otherwise if the majority is not reached within the period of two times NODE_TIMEOUT (but always at least 2 seconds), the election is aborted and a new one will be tried again after NODE_TIMEOUT * 4 (and always at least 4 seconds).
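
Here is a sketch of the requesting replica’s side of this exchange (packet plumbing omitted; the epoch filtering and the majority rule follow the two paragraphs above):

    # Sketch: counting FAILOVER_AUTH_ACK replies for one election attempt.
    def election_won(ack_epochs, election_epoch, num_masters):
        # Discard ACKs with an epoch older than the one used for this attempt:
        # they belong to a previous election.
        valid = [e for e in ack_epochs if e >= election_epoch]
        # The election is won only with the votes of a majority of all masters.
        return len(valid) > num_masters // 2

    # Example: 3 masters, two fresh ACKs plus one stale ACK from an earlier attempt.
    print(election_won([7, 7, 5], election_epoch=7, num_masters=3))  # -> True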

Replica rank

As soon as a master is in FAIL state, a replica waits a short period of time before trying to get elected. That delay is computed as follows:

DELAY = 500 milliseconds + random delay between 0 and 500 milliseconds +
        REPLICA_RANK * 1000 milliseconds.

The fixed delay ensures that we wait for the FAIL state to propagate across the cluster, otherwise the replica may try to get elected while the masters are still unaware of the FAIL state, refusing to grant their vote.

The random delay is used to desynchronize replicas so they’re unlikely to start an election at the same time.

The REPLICA_RANK is the rank of this replica regarding the amount of replication data it has processed from the master. Replicas exchange messages when the master is failing in order to establish a (best effort) rank: the replica with the most updated replication offset is at rank 0, the second most updated at rank 1, and so forth. In this way the most updated replicas try to get elected before others.

Rank order is not strictly enforced; if a replica of higher rank fails to be elected, the others will try shortly.
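
A sketch of that delay computation (REPLICA_RANK would come from comparing replication offsets with the other replicas, as just described; here it is simply a parameter):

    import random

    def election_delay_ms(replica_rank: int) -> float:
        # DELAY = 500 ms fixed + 0..500 ms random jitter + 1000 ms per rank position.
        return 500 + random.uniform(0, 500) + replica_rank * 1000

    # Rank 0 (the most up-to-date replica) starts first; rank 1 about a second later.
    print(election_delay_ms(0), election_delay_ms(1))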

Once a replica wins the election, it obtains a new unique and incremental configEpoch which is higher than that of any other existing master. It starts advertising itself as master in ping and pong packets, providing the set of served slots with a configEpoch that will win over the past ones.

In order to speed up the reconfiguration of other nodes, a pong packet is broadcast to all the nodes of the cluster. Currently unreachable nodes will eventually be reconfigured when they receive a ping or pong packet from another node, or will receive an UPDATE packet from another node if the information they publish via heartbeat packets is detected to be out of date.

The other nodes will detect that there is a new master serving the same slots served by the old master but with a greater configEpoch, and will upgrade their configuration. Replicas of the old master (or the failed over master if it rejoins the cluster) will not just upgrade the configuration but will also reconfigure to replicate from the new master. How nodes rejoining the cluster are configured is explained in the next sections.

Masters reply to replica vote request

In the previous section it was discussed how replicas try to get elected. This section explains what happens from the point of view of a master that is requested to vote for a given replica.

Masters receive requests for votes in form of FAILOVER_AUTH_REQUEST requests from replicas.

For a vote to be granted the following conditions need to be met (a condensed sketch follows the list):

  1. A master only votes a single time for a given epoch, and refuses to vote for older epochs: every master has a lastVoteEpoch field and will refuse to vote again as long as the currentEpoch in the auth request packet is not greater than the lastVoteEpoch. When a master replies positively to a vote request, the lastVoteEpoch is updated accordingly, and safely stored on disk.
  2. A master votes for a replica only if the replica’s master is flagged as FAIL.
  3. Auth requests with a currentEpoch that is less than the master currentEpoch are ignored. Because of this the master reply will always have the same currentEpoch as the auth request. If the same replica asks again to be voted, incrementing the currentEpoch, it is guaranteed that an old delayed reply from the master can not be accepted for the new vote.

Example of the issue caused by not using rule number 3:

Master currentEpoch is 5, lastVoteEpoch is 1 (this may happen after a few failed elections)

  • Replica currentEpoch is 3.
  • Replica tries to be elected with epoch 4 (3+1), master replies with an ok with currentEpoch 5, however the reply is delayed.
  • Replica will try to be elected again, at a later time, with epoch 5 (4+1), the delayed reply reaches the replica with currentEpoch 5, and is accepted as valid.
  4. Masters don’t vote for a replica of the same master before NODE_TIMEOUT * 2 has elapsed if a replica of that master was already voted for. This is not strictly required as it is not possible for two replicas to win the election in the same epoch. However, in practical terms it ensures that when a replica is elected it has plenty of time to inform the other replicas and avoid the possibility that another replica will win a new election, performing an unnecessary second failover.
  5. Masters make no effort to select the best replica in any way. If the replica’s master is in FAIL state and the master did not vote in the current term, a positive vote is granted. The best replica is the most likely to start an election and win it before the other replicas, since it will usually be able to start the voting process earlier because of its higher rank as explained in the previous section.
  6. When a master refuses to vote for a given replica there is no negative response; the request is simply ignored.
  7. Masters don’t vote for replicas sending a configEpoch that is less than any configEpoch in the master table for the slots claimed by the replica. Remember that the replica sends the configEpoch of its master, and the bitmap of the slots served by its master. This means that the replica requesting the vote must have a configuration for the slots it wants to fail over that is newer than or equal to the one of the master granting the vote.
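
Condensing the numbered rules above into a Python sketch (the dictionaries stand in for real node state and are purely illustrative):

    # Sketch: a master deciding whether to grant a FAILOVER_AUTH_REQUEST.
    def grant_vote(master, request, now_ms, node_timeout_ms=15000):
        if request["current_epoch"] < master["current_epoch"]:
            return False                                  # rule 3: stale request, ignore
        if request["current_epoch"] <= master["last_vote_epoch"]:
            return False                                  # rule 1: only one vote per epoch
        if request["failing_master"] not in master["fail_flags"]:
            return False                                  # rule 2: that master is not in FAIL state
        last = master["last_vote_time"].get(request["failing_master"], float("-inf"))
        if now_ms - last < node_timeout_ms * 2:
            return False                                  # rule 4: voted for a sibling replica too recently
        for slot in request["claimed_slots"]:
            if request["config_epoch"] < master["slot_config_epochs"].get(slot, 0):
                return False                              # rule 7: replica's slot configuration is stale
        master["last_vote_epoch"] = request["current_epoch"]     # persisted and fsync-ed before replying
        master["last_vote_time"][request["failing_master"]] = now_ms
        return True                                       # grant the vote: reply with FAILOVER_AUTH_ACK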


Practical example of configuration epoch usefulness during partitions

This section illustrates how the epoch concept is used to make the replica promotion process more resistant to partitions.

  • A master is no longer reachable indefinitely. The master has three replicas A, B, C.
  • Replica A wins the election and is promoted to master.
  • A network partition makes A not available for the majority of the cluster.
  • Replica B wins the election and is promoted as master.
  • A partition makes B not available for the majority of the cluster.
  • The previous partition is fixed, and A is available again.

At this point B is down and A is available again with a role of master (actually UPDATE messages would reconfigure it promptly, but here we assume all UPDATE messages were lost). At the same time, replica C will try to get elected in order to fail over B. This is what happens:

  1. C will try to get elected and will succeed, since for the majority of masters its master is actually down. It will obtain a new incremental configEpoch.
  2. A will not be able to claim to be the master for its hash slots, because the other nodes already have the same hash slots associated with a higher configuration epoch (the one of B) compared to the one published by A.
  3. So, all the nodes will upgrade their table to assign the hash slots to C, and the cluster will continue its operations.

As you’ll see in the next sections, a stale node rejoining a cluster will usually get notified as soon as possible about the configuration change because as soon as it pings any other node, the receiver will detect it has stale information and will send an UPDATE message.
