https://redis.io/topics/cluster-spec

Implemented subset

Redis Cluster implements a concept called hash tags that can be used in order to force certain keys to be stored in the same hash slot. However during manual resharding, multi-key operations may become unavailable for some time while single key operations are always available.

Clients and Servers roles in the Redis Cluster protocol

In Redis Cluster nodes are responsible for holding the data, and taking the state of the cluster, including mapping keys to the right nodes. Cluster nodes are also able to auto-discover other nodes, detect non-working nodes, and promote replica nodes to master when needed in order to continue to operate when a failure occurs.

To perform their tasks all the cluster nodes are connected using a TCP bus and a binary protocol, called the Redis Cluster Bus. Every node is connected to every other node in the cluster using the cluster bus. Nodes use a gossip protocol to propagate information about the cluster in order to discover new nodes, to send ping packets to make sure all the other nodes are working properly, and to send cluster messages needed to signal specific conditions. The cluster bus is also used in order to propagate Pub/Sub messages across the cluster and to orchestrate manual failovers when requested by users (manual failovers are failovers which are not initiated by the Redis Cluster failure detector, but by the system administrator directly).

Since cluster nodes are not able to proxy requests, clients may be redirected to other nodes using redirection errors -MOVED and -ASK. The client is in theory free to send requests to all the nodes in the cluster, getting redirected if needed, so the client is not required to hold the state of the cluster. However clients that are able to cache the map between keys and nodes can improve the performance in a sensible way.

Write safety

Redis Cluster uses asynchronous replication between nodes, and last failover wins implicit merge function. This means that the last elected master dataset eventually replaces all the other replicas. There is always a window of time when it is possible to lose writes during partitions. However these windows are very different in the case of a client that is connected to the majority of masters, and a client that is connected to the minority of masters.

脑裂问题：

Redis Cluster tries harder to retain writes that are performed by clients connected to the majority of masters, compared to writes performed in the minority side. The following are examples of scenarios that lead to loss of acknowledged writes received in the majority partitions during failures:

A write may reach a master, but while the master may be able to reply to the client, the write may not be propagated to replicas via the asynchronous replication used between master and replica nodes. If the master dies without the write reaching the replicas, the write is lost forever if the master is unreachable for a long enough period that one of its replicas is promoted. This is usually hard to observe in the case of a total, sudden failure of a master node since masters try to reply to clients (with the acknowledge of the write) and replicas (propagating the write) at about the same time. However it is a real world failure mode.
Another theoretically possible failure mode where writes are lost is the following:

A master is unreachable because of a partition.
It gets failed over by one of its replicas.
After some time it may be reachable again.
A client with an out-of-date routing table may write to the old master before it is converted into a replica (of the new master) by the cluster.

Availability

Redis Cluster is not available in the minority side of the partition. In the majority side of the partition assuming that there are at least the majority of masters and a replica for every unreachable master, the cluster becomes available again after NODE_TIMEOUT time plus a few more seconds required for a replica to get elected and failover its master (failovers are usually executed in a matter of 1 or 2 seconds).

This means that Redis Cluster is designed to survive failures of a few nodes in the cluster, but it is not a suitable solution for applications that require availability in the event of large net splits.

Thanks to a Redis Cluster feature called replicas migration the Cluster availability is improved in many real world scenarios by the fact that replicas migrate to orphaned masters (masters no longer having replicas). So at every successful failure event, the cluster may reconfigure the replicas layout in order to better resist the next failure.

Overview of Redis Cluster main components

Keys distribution model

The key space is split into 16384 slots, effectively setting an upper limit for the cluster size of 16384 master nodes (however the suggested max size of nodes is in the order of ~ 1000 nodes).

Each master node in a cluster handles a subset of the 16384 hash slots. The cluster is stable when there is no cluster reconfiguration in progress (i.e. where hash slots are being moved from one node to another). When the cluster is stable, a single hash slot will be served by a single node (however the serving node can have one or more replicas that will replace it in the case of net splits or failures, and that can be used in order to scale read operations where reading stale data is acceptable).

The base algorithm used to map keys to hash slots is the following (read the next paragraph for the hash tag exception to this rule):

HASH_SLOT = CRC16(key) mod 16384

Keys hash tags

There is an exception for the computation of the hash slot that is used in order to implement hash tags. Hash tags are a way to ensure that multiple keys are allocated in the same hash slot. This is used in order to implement multi-key operations in Redis Cluster.

In order to implement hash tags, the hash slot for a key is computed in a slightly different way in certain conditions. If the key contains a “{…}” pattern only the substring between { and } is hashed in order to obtain the hash slot.

Then instead of hashing the key, only what is between the first occurrence of { and the following first occurrence of } is hashed.

Examples:

The two keys {user1000}.following and {user1000}.followers will hash to the same hash slot since only the substring user1000 will be hashed in order to compute the hash slot.
For the key foo{}{bar} the whole key will be hashed as usually since the first occurrence of { is followed by } on the right without characters in the middle.
For the key foo{{bar}}zap the substring {bar will be hashed, because it is the substring between the first occurrence of { and the first occurrence of } on its right.
For the key foo{bar}{zap} the substring bar will be hashed, since the algorithm stops at the first valid or invalid (without bytes inside) match of { and }.
What follows from the algorithm is that if the key starts with {}, it is guaranteed to be hashed as a whole. This is useful when using binary data as key names.

Cluster nodes attributes

Every node has a unique name in the cluster. The node name is the hex representation of a 160 bit random number, obtained the first time a node is started (usually using /dev/urandom). The node will save its ID in the node configuration file, and will use the same ID forever, or at least as long as the node configuration file is not deleted by the system administrator, or a hard reset is requested via the CLUSTER RESET command.

The node ID is used to identify every node across the whole cluster. It is possible for a given node to change its IP address without any need to also change the node ID. The cluster is also able to detect the change in IP/port and reconfigure using the gossip protocol running over the cluster bus.

The node ID is not the only information associated with each node, but is the only one that is always globally consistent. Every node has also the following set of information associated. Some information is about the cluster configuration detail of this specific node, and is eventually consistent across the cluster. Some other information, like the last time a node was pinged, is instead local to each node.

Every node maintains the following information about other nodes that it is aware of in the cluster: The node ID, IP and port of the node, a set of flags, what is the master of the node if it is flagged as replica, last time the node was pinged and the last time the pong was received, the current configuration epoch of the node (explained later in this specification), the link state and finally the set of hash slots served.

A detailed explanation of all the node fields is described in the CLUSTER NODES documentation.

The CLUSTER NODES command can be sent to any node in the cluster and provides the state of the cluster and the information for each node according to the local view the queried node has of the cluster.

The following is sample output of the CLUSTER NODES command sent to a master node in a small cluster of three nodes.

$ redis-cli cluster nodes
d1861060fe6a534d42d8a19aeb36600e18785e04 127.0.0.1:6379 myself - 0 1318428930 1 connected 0-1364
3886e65cc906bfd9b1f7e7bde468726a052d1dae 127.0.0.1:6380 master - 1318428930 1318428931 2 connected 1365-2729
d289c575dcbc4bdd2931585fd4339089e461a27d 127.0.0.1:6381 master - 1318428931 1318428931 3 connected 2730-4095

In the above listing the different fields are in order: node id, address:port, flags, last ping sent, last pong received, configuration epoch, link state, slots. Details about the above fields will be covered as soon as we talk of specific parts of Redis Cluster.

The Cluster bus

Every Redis Cluster node has an additional TCP port for receiving incoming connections from other Redis Cluster nodes. This port will be derived by adding 10000 to the data port or it can be specified with the cluster-port config.

Example 1:

If a Redis node is listening for client connections on port 6379, and you do not add cluster-port parameter in redis.conf, the Cluster bus port 16379 will be opened.

Cluster topology

Redis Cluster is a full mesh where every node is connected with every other node using a TCP connection.

In a cluster of N nodes, every node has N-1 outgoing TCP connections, and N-1 incoming connections.

These TCP connections are kept alive all the time and are not created on demand. When a node expects a pong reply in response to a ping in the cluster bus, before waiting long enough to mark the node as unreachable, it will try to refresh the connection with the node by reconnecting from scratch.

While Redis Cluster nodes form a full mesh, nodes use a gossip protocol and a configuration update mechanism in order to avoid exchanging too many messages between nodes during normal conditions, so the number of messages exchanged is not exponential.

Nodes handshake

Nodes always accept connections on the cluster bus port, and even reply to pings when received, even if the pinging node is not trusted. However, all other packets will be discarded by the receiving node if the sending node is not considered part of the cluster.

A node will accept another node as part of the cluster only in two ways:

If a node presents itself with a MEET message. A meet message is exactly like a PING message, but forces the receiver to accept the node as part of the cluster. Nodes will send MEET messages to other nodes only if the system administrator requests this via the following command:

CLUSTER MEET ip port
A node will also register another node as part of the cluster if a node that is already trusted will gossip about this other node. So if A knows B, and B knows C, eventually B will send gossip messages to A about C. When this happens, A will register C as part of the network, and will try to connect with C.

This means that as long as we join nodes in any connected graph, they’ll eventually form a fully connected graph automatically. This means that the cluster is able to auto-discover other nodes, but only if there is a trusted relationship that was forced by the system administrator.

This mechanism makes the cluster more robust but prevents different Redis clusters from accidentally mixing after change of IP addresses or other network related events.

Redirection and resharding

MOVED Redirection

A Redis client is free to send queries to every node in the cluster, including replica nodes. The node will analyze the query, and if it is acceptable (that is, only a single key is mentioned in the query, or the multiple keys mentioned are all to the same hash slot) it will lookup what node is responsible for the hash slot where the key or keys belong.

If the hash slot is served by the node, the query is simply processed, otherwise the node will check its internal hash slot to node map, and will reply to the client with a MOVED error, like in the following example:

GET x
-MOVED 3999 127.0.0.1:6381

The error includes the hash slot of the key (3999) and the ip:port of the instance that can serve the query. The client needs to reissue the query to the specified node’s IP address and port. Note that even if the client waits a long time before reissuing the query, and in the meantime the cluster configuration changed, the destination node will reply again with a MOVED error if the hash slot 3999 is now served by another node. The same happens if the contacted node had no updated information.

So while from the point of view of the cluster nodes are identified by IDs we try to simplify our interface with the client just exposing a map between hash slots and Redis nodes identified by IP:port pairs.

The client is not required to, but should try to memorize that hash slot 3999 is served by 127.0.0.1:6381. This way once a new command needs to be issued it can compute the hash slot of the target key and have a greater chance of choosing the right node.

An alternative is to just refresh the whole client-side cluster layout using the CLUSTER NODES or CLUSTER SLOTS commands when a MOVED redirection is received. When a redirection is encountered, it is likely multiple slots were reconfigured rather than just one, so updating the client configuration as soon as possible is often the best strategy.

Note that when the Cluster is stable (no ongoing changes in the configuration), eventually all the clients will obtain a map of hash slots -> nodes, making the cluster efficient, with clients directly addressing the right nodes without redirections, proxies or other single point of failure entities.

A client must be also able to handle -ASK redirections that are described later in this document, otherwise it is not a complete Redis Cluster client.

Cluster live reconfiguration

Redis Cluster supports the ability to add and remove nodes while the cluster is running. Adding or removing a node is abstracted into the same operation: moving a hash slot from one node to another. This means that the same basic mechanism can be used in order to rebalance the cluster, add or remove nodes, and so forth.

To add a new node to the cluster an empty node is added to the cluster and some set of hash slots are moved from existing nodes to the new node.
To remove a node from the cluster the hash slots assigned to that node are moved to other existing nodes.
To rebalance the cluster a given set of hash slots are moved between nodes.

The core of the implementation is the ability to move hash slots around. From a practical point of view a hash slot is just a set of keys, so what Redis Cluster really does during resharding is to move keys from an instance to another instance. Moving a hash slot means moving all the keys that happen to hash into this hash slot.

To understand how this works we need to show the CLUSTER subcommands that are used to manipulate the slots translation table in a Redis Cluster node.

The following subcommands are available (among others not useful in this case):

CLUSTER ADDSLOTS slot1 [slot2] … [slotN]
CLUSTER DELSLOTS slot1 [slot2] … [slotN]
CLUSTER ADDSLOTSRANGE start-slot1 end-slot1 [start-slot2 end-slot2] … [start-slotN end-slotN]
CLUSTER DELSLOTSRANGE start-slot1 end-slot1 [start-slot2 end-slot2] … [start-slotN end-slotN]
CLUSTER SETSLOT slot NODE node
CLUSTER SETSLOT slot MIGRATING node
CLUSTER SETSLOT slot IMPORTING node

The first four commands, ADDSLOTS, DELSLOTS, ADDSLOTSRANGE and DELSLOTSRANGE, are simply used to assign (or remove) slots to a Redis node. Assigning a slot means to tell a given master node that it will be in charge of storing and serving content for the specified hash slot.

After the hash slots are assigned they will propagate across the cluster using the gossip protocol, as specified later in the configuration propagation section.

The ADDSLOTS and ADDSLOTSRANGE command are usually used when a new cluster is created from scratch to assign each master node a subset of all the 16384 hash slots available.

The DELSLOTS and DELSLOTSRANGE are mainly used for manual modification of a cluster configuration or for debugging tasks: in practice it is rarely used.

The SETSLOT subcommand is used to assign a slot to a specific node ID if the SETSLOT <slot> NODE form is used. Otherwise the slot can be set in the two special states MIGRATING and IMPORTING. Those two special states are used in order to migrate a hash slot from one node to another.

When a slot is set as MIGRATING, the node will accept all queries that are about this hash slot, but only if the key in question exists, otherwise the query is forwarded using a -ASK redirection to the node that is target of the migration.
When a slot is set as IMPORTING, the node will accept all queries that are about this hash slot, but only if the request is preceded by an ASKING command. If the ASKING command was not given by the client, the query is redirected to the real hash slot owner via a -MOVED redirection error, as would happen normally.

ASK redirection

In the previous section we briefly talked about ASK redirection. Why can’t we simply use MOVED redirection? Because while MOVED means that we think the hash slot is permanently served by a different node and the next queries should be tried against the specified node, ASK means to send only the next query to the specified node.

This is needed because the next query about hash slot 8 can be about a key that is still in A, so we always want the client to try A and then B if needed. Since this happens only for one hash slot out of 16384 available, the performance hit on the cluster is acceptable.

127.0.0.1:7000> cluster slots
1) 1) (integer) 5461
   2) (integer) 10922
   3) 1) "127.0.0.1"
      2) (integer) 7001
   4) 1) "127.0.0.1"
      2) (integer) 7004
2) 1) (integer) 0
   2) (integer) 5460
   3) 1) "127.0.0.1"
      2) (integer) 7000
   4) 1) "127.0.0.1"
      2) (integer) 7003
3) 1) (integer) 10923
   2) (integer) 16383
   3) 1) "127.0.0.1"
      2) (integer) 7002
   4) 1) "127.0.0.1"
      2) (integer) 7005

Fault Tolerance

Heartbeat and gossip messages

Redis Cluster nodes continuously exchange ping and pong packets. Those two kind of packets have the same structure, and both carry important configuration information. The only actual difference is the message type field. We’ll refer to the sum of ping and pong packets as heartbeat packets.

Usually nodes send ping packets that will trigger the receivers to reply with pong packets. However this is not necessarily true. It is possible for nodes to just send pong packets to send information to other nodes about their configuration, without triggering a reply. This is useful, for example, in order to broadcast a new configuration as soon as possible.

Usually a node will ping a few random nodes every second so that the total number of ping packets sent (and pong packets received) by each node is a constant amount regardless of the number of nodes in the cluster.

However every node makes sure to ping every other node that hasn’t sent a ping or received a pong for longer than half the NODE_TIMEOUT time. Before NODE_TIMEOUT has elapsed, nodes also try to reconnect the TCP link with another node to make sure nodes are not believed to be unreachable only because there is a problem in the current TCP connection.

The number of messages globally exchanged can be sizable if NODE_TIMEOUT is set to a small figure and the number of nodes (N) is very large, since every node will try to ping every other node for which they don’t have fresh information every half the NODE_TIMEOUT time.

For example in a 100 node cluster with a node timeout set to 60 seconds, every node will try to send 99 pings every 30 seconds, with a total amount of pings of 3.3 per second. Multiplied by 100 nodes, this is 330 pings per second in the total cluster.

There are ways to lower the number of messages, however there have been no reported issues with the bandwidth currently used by Redis Cluster failure detection, so for now the obvious and direct design is used. Note that even in the above example, the 330 packets per second exchanged are evenly divided among 100 different nodes, so the traffic each node receives is acceptable

Failure detection

Redis Cluster failure detection is used to recognize when a master or replica node is no longer reachable by the majority of nodes and then respond by promoting a replica to the role of master. When replica promotion is not possible the cluster is put in an error state to stop receiving queries from clients.

As already mentioned, every node takes a list of flags associated with other known nodes. There are two flags that are used for failure detection that are called PFAIL and FAIL. PFAIL means Possible failure, and is a non-acknowledged failure type. FAIL means that a node is failing and that this condition was confirmed by a majority of masters within a fixed amount of time.

PFAIL flag:

A node flags another node with the PFAIL flag when the node is not reachable for more than NODE_TIMEOUT time. Both master and replica nodes can flag another node as PFAIL, regardless of its type.

The concept of non-reachability for a Redis Cluster node is that we have an active ping (a ping that we sent for which we have yet to get a reply) pending for longer than NODE_TIMEOUT. For this mechanism to work the NODE_TIMEOUT must be large compared to the network round trip time. In order to add reliability during normal operations, nodes will try to reconnect with other nodes in the cluster as soon as half of the NODE_TIMEOUT has elapsed without a reply to a ping. This mechanism ensures that connections are kept alive so broken connections usually won’t result in false failure reports between nodes.

FAIL flag:

The PFAIL flag alone is just local information every node has about other nodes, but it is not sufficient to trigger a replica promotion. For a node to be considered down the PFAIL condition needs to be escalated to a FAIL condition.

As outlined in the node heartbeats section of this document, every node sends gossip messages to every other node including the state of a few random known nodes. Every node eventually receives a set of node flags for every other node. This way every node has a mechanism to signal other nodes about failure conditions they have detected.

A PFAIL condition is escalated to a FAIL condition when the following set of conditions are met:

Some node, that we’ll call A, has another node B flagged as PFAIL.
Node A collected, via gossip sections, information about the state of B from the point of view of the majority of masters in the cluster.
The majority of masters signaled the PFAIL or FAIL condition within NODE_TIMEOUT * FAIL_REPORT_VALIDITY_MULT time. (The validity factor is set to 2 in the current implementation, so this is just two times the NODE_TIMEOUT time).

Configuration handling, propagation, and failovers

Cluster current epoch

Redis Cluster uses a concept similar to the Raft algorithm “term”. In Redis Cluster the term is called epoch instead, and it is used in order to give incremental versioning to events. When multiple nodes provide conflicting information, it becomes possible for another node to understand which state is the most up to date.

The currentEpoch is a 64 bit unsigned number.

At node creation every Redis Cluster node, both replicas and master nodes, set the currentEpoch to 0.

Every time a packet is received from another node, if the epoch of the sender (part of the cluster bus messages header) is greater than the local node epoch, the currentEpoch is updated to the sender epoch.

Because of these semantics, eventually all the nodes will agree to the greatest currentEpoch in the cluster.

This information is used when the state of the cluster is changed and a node seeks agreement in order to perform some action.

Currently this happens only during replica promotion, as described in the next section. Basically the epoch is a logical clock for the cluster and dictates that given information wins over one with a smaller epoch.

Replica election and promotion

replica election and promotion is handled by replica nodes, with the help of master nodes that vote for the replica to promote. A replica election happens when a master is in FAIL state from the point of view of at least one of its replicas that has the prerequisites in order to become a master.

In order for a replica to promote itself to master, it needs to start an election and win it. All the replicas for a given master can start an election if the master is in FAIL state, however only one replica will win the election and promote itself to master.

A replica starts an election when the following conditions are met:

The replica’s master is in FAIL state.
The master was serving a non-zero number of slots.
The replica replication link was disconnected from the master for no longer than a given amount of time, in order to ensure the promoted replica’s data is reasonably fresh. This time is user configurable.

Replica migration

For example:

Master A has a single replica A1.
Master A fails. A1 is promoted as new master.
Three hours later A1 fails in an independent manner (unrelated to the failure of A). No other replica is available for promotion since node A is still down. The cluster cannot continue normal operations.

If the map between masters and replicas is fixed, the only way to make the cluster more resistant to the above scenario is to add replicas to every master, however this is costly as it requires more instances of Redis to be executed, more memory, and so forth.

An alternative is to create an asymmetry in the cluster, and let the cluster layout automatically change over time. For example the cluster may have three masters A, B, C. A and B have a single replica each, A1 and B1. However the master C is different and has two replicas: C1 and C2.

Replica migration is the process of automatic reconfiguration of a replica in order to migrate to a master that has no longer coverage (no working replicas). With replica migration the scenario mentioned above turns into the following:

Master A fails. A1 is promoted.
C2 migrates as replica of A1, that is otherwise not backed by any replica.
Three hours later A1 fails as well.
C2 is promoted as new master to replace A1.
The cluster can continue the operations.

Replica migration algorithm

https://redis.io/topics/cluster-spec#replica-migration-algorithm

Ref

https://redis.io/topics/cluster-tutorial

Chat