Raft and its three subproblems

Yernsun
Mar 21, 2022

This article grew out of a frequently confused question: what is the difference between Quorum and consensus protocols such as Paxos and Raft? The short answer is simple: most consensus protocols use a Quorum mechanism, but a Quorum (R + W > N) alone cannot guarantee consistency. This article extends that question, using Raft as an example to show which mechanisms, Quorum included, a complete consensus protocol needs, and tries to explain both the completeness of these mechanisms and the indispensability of each one.

Consistency

To answer this question, we first need to explain what Raft does. Paxos and Raft are called consensus protocols; as the name implies, they solve the problem of keeping multiple nodes consistent. Note that the consistency meant here does not require all nodes to be in exactly the same state at every moment. Instead, it guarantees that:

Even if there are network partitions or node failures, the cluster as a whole can still provide service as consistently as a single machine: every operation sees all previously successful operations completed, in order.

There are two points to note here:

  1. The emphasis on network partitions and node failures is deliberate: without such faults, consistency is trivial to guarantee and a single node would suffice. Everything a consensus protocol does is about preserving consistency while tolerating these faults.
  2. The consistency here is defined from the point of view of users outside the cluster; the cluster is treated as a single whole.

Each operation on a Raft cluster is called a proposal. The goal is for the Raft cluster to hide internal network or node failures from the outside and respond to proposals one by one, so that every successfully committed proposal remains visible to all subsequent operations. Proposals here need to be idempotent, i.e. executing one repeatedly does not leave the cluster in a different state.

Next we will see how Raft implements this consistency guarantee. Raft splits the consistency problem into three subproblems and solves each one separately, which makes its implementation simple and easy to understand. This article first briefly introduces the three subproblems and how each is solved; it then argues that solving them is sufficient for consistency; finally, it tries to show that none of the three guarantees can be dropped.

Subproblems of Raft

1. Leader Election

A fault-tolerant Raft cluster needs at least three machines, and Raft requires that at most one node can initiate proposals at any given time; this restriction greatly simplifies the implementation of consistency. The node allowed to initiate proposals is called the Leader. So the first problems to be solved are:

  • How to ensure that there is at most one Leader at any time;
  • How to elect a new Leader as quickly as possible when the current Leader fails.

The election state transitions are as follows (a minimal code sketch is given after the list):

  • All nodes start in the Follower role;
  • The Leader periodically sends heartbeats to the other nodes;
  • A Follower that receives no heartbeat within its timeout becomes a Candidate, increments its Term, broadcasts Vote requests, and starts a new round of election;
  • The election ends in one of three ways:
  • A Candidate that receives votes from a majority of nodes becomes the Leader and sends AppendEntry messages carrying its own Term to the other nodes; within one Term each server grants at most one vote, on a first-come-first-served basis;
  • A Candidate that receives an AppendEntry with an equal or larger Term recognizes the sender as Leader and returns to the Follower role;
  • The election times out and a new round is started; randomized election timeouts keep repeated split votes unlikely.
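Below is a minimal Go sketch of these transitions, assuming a simplified node type; the field names, timeout values, and the single-process simulation are illustrative assumptions only, not any particular implementation.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

type Role int

const (
	Follower Role = iota
	Candidate
	Leader
)

// node is a hypothetical, heavily simplified view of the election-related
// state on one Raft server; real servers also track logs, peers, and RPCs.
type node struct {
	role            Role
	term            int
	votedFor        int // -1 means "no vote granted in this term"
	lastHeartbeat   time.Time
	electionTimeout time.Duration
}

// randomizedTimeout spreads election timeouts out so that repeated split
// votes (several candidates starting at once) become unlikely.
func randomizedTimeout() time.Duration {
	return 150*time.Millisecond + time.Duration(rand.Intn(150))*time.Millisecond
}

// tick checks whether a Follower/Candidate has gone too long without a
// heartbeat; if so it starts a new election: increment the term, vote for
// itself, and (in a real system) broadcast Vote requests to the other nodes.
func (n *node) tick(now time.Time, self int) {
	if n.role == Leader {
		return // the Leader sends heartbeats; it never times out here
	}
	if now.Sub(n.lastHeartbeat) >= n.electionTimeout {
		n.role = Candidate
		n.term++
		n.votedFor = self
		n.electionTimeout = randomizedTimeout()
		n.lastHeartbeat = now
		fmt.Printf("node %d: timed out, starting election for term %d\n", self, n.term)
	}
}

// onHeartbeat handles an AppendEntry/heartbeat from a claimed Leader.
// Any term >= ours makes this node (stay) a Follower under that Leader.
func (n *node) onHeartbeat(leaderTerm int, now time.Time) {
	if leaderTerm >= n.term {
		n.term = leaderTerm
		n.role = Follower
		n.lastHeartbeat = now
	}
}

func main() {
	n := &node{role: Follower, votedFor: -1, lastHeartbeat: time.Now(), electionTimeout: randomizedTimeout()}
	// Simulate a missing Leader: advance time past the election timeout.
	n.tick(time.Now().Add(400*time.Millisecond), 1)
	fmt.Printf("role=%v term=%d\n", n.role, n.term)
}
```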

2. Log Replication

As can be seen from the Raft state transitions above, any non-Leader node may become Leader in the future. To keep the whole cluster consistent from the outside even after the Leader changes, the Log Replication mechanism has to solve the following two problems:

  • Followers execute every successfully committed proposal in the same order as the Leader;
  • Every successfully committed proposal has enough replicas that all subsequent accesses see it.

The execution of a Raft proposal proceeds as follows:

  • The Leader receives the Client's request, writes it to its local Log, and then sends the Log Entry to all Followers in parallel via AppendEntry requests;
  • Each Follower verifies the received Entry, including checking that the preceding Log Entry matches the Leader's. If the check succeeds, it writes the Entry to its local Log and returns success to the Leader;
  • Once the Leader has received success replies from more than half of the nodes, it commits the current Log Entry (for example, applies it to the state machine) and then returns success to the Client;
  • Subsequent AppendEntry and heartbeat messages carry the Leader's commit position, and Followers commit all Log Entries up to that position.

When a Follower accepts an AppendEntry, it checks whether the Log Entry preceding it matches the Leader's. By mathematical induction it is then easy to prove that the logs on Leader and Followers stay consistent. Moreover, since only more than half of the nodes need to reply successfully, the availability of the cluster is improved without sacrificing consistency.
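The following is a minimal, self-contained sketch of that consistency check in Go. The Entry and AppendArgs types are simplified assumptions rather than any exact RPC format, and a real implementation only truncates its log when it finds an actual conflicting entry.

```go
package main

import "fmt"

// Entry is one log slot: the term in which it was created plus the command.
type Entry struct {
	Term    int
	Command string
}

// AppendArgs is a simplified AppendEntry request: the entries to append and
// the index/term of the entry that immediately precedes them on the Leader.
type AppendArgs struct {
	LeaderTerm   int
	PrevLogIndex int // 1-based index of the entry just before Entries; 0 = nothing before
	PrevLogTerm  int
	Entries      []Entry
	LeaderCommit int
}

// appendEntries rejects the request unless the Follower's log matches the
// Leader's log at PrevLogIndex; on success it replaces the suffix with the
// Leader's entries (simplified: real code only truncates on a conflict).
// This check is the induction step that keeps Leader and Follower logs
// identical up to every accepted index.
func appendEntries(log []Entry, a AppendArgs) ([]Entry, bool) {
	if a.PrevLogIndex > len(log) {
		return log, false // we are missing the preceding entry entirely
	}
	if a.PrevLogIndex > 0 && log[a.PrevLogIndex-1].Term != a.PrevLogTerm {
		return log, false // preceding entry exists but was written in a different term
	}
	log = append(log[:a.PrevLogIndex], a.Entries...)
	return log, true
}

func main() {
	followerLog := []Entry{{Term: 1, Command: "x=1"}}
	args := AppendArgs{LeaderTerm: 2, PrevLogIndex: 1, PrevLogTerm: 1,
		Entries: []Entry{{Term: 2, Command: "x=2"}}}
	newLog, ok := appendEntries(followerLog, args)
	fmt.Println(ok, newLog) // true [{1 x=1} {2 x=2}]
}
```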

W > N/2 & R > N/2 => W + R > N

It should be noted that proposals committed by the Leader have already been reported as successful to the user, so the Raft cluster must guarantee that these proposals are never lost.
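To make the commit rule concrete, here is a hedged Go sketch of how a Leader could derive its commit position from the replication progress of each node. The names matchIndex and advanceCommitIndex are hypothetical, and a real implementation must also apply the term restriction discussed under Safety below.

```go
package main

import (
	"fmt"
	"sort"
)

// advanceCommitIndex returns the highest log index stored on a majority of
// servers. matchIndex[i] is the last index known to be replicated on server
// i (the Leader counts itself). Because committing needs > N/2 copies and a
// new Leader also needs > N/2 votes, every committed entry is visible to
// some member of any later majority.
func advanceCommitIndex(matchIndex []int) int {
	sorted := append([]int(nil), matchIndex...)
	sort.Sort(sort.Reverse(sort.IntSlice(sorted)))
	// sorted[N/2] is replicated on at least N/2+1 servers (indices 0..N/2),
	// which is a majority of N.
	return sorted[len(sorted)/2]
}

func main() {
	// 5 servers: the Leader and two Followers have index 7, two lag behind.
	fmt.Println(advanceCommitIndex([]int{7, 7, 7, 4, 3})) // 7 -> safe to commit
	fmt.Println(advanceCommitIndex([]int{7, 7, 4, 4, 3})) // 4
}
```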

3. Safety

Most of the hard problems are solved by the two subproblems above, but two details remain:

  1. After the Leader crashes, another node becomes the new Leader. To avoid losing data, we want the new Leader to contain every entry that has already been committed. To avoid the complexity of copying data back from Followers to the Leader, Raft requires the new Leader to have a log at least as up-to-date as any node that votes for it, i.e. its last log entry has the largest term and, for equal terms, its log is the longest (a sketch of this comparison is given after the list).
  2. Normally the Leader commits a log entry by checking whether more than half of the nodes have appended it successfully. When nodes crash frequently, an entry committed this way by an old Leader can still be overwritten with a different entry by a later Leader, losing data. The root cause is that a Leader may crash right after committing, and the nodes holding that entry do not necessarily win the following election; the paper describes this scenario in detail. Raft solves it with a clever restriction: counting replicas only ever commits proposals from the Leader's own term, while proposals from older terms are committed indirectly through the rule that committing a log entry also commits all entries before it. There is more detail on this issue in another blog post, Why Raft never commits log entries from previous terms directly.
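As a rough illustration of the first restriction, the sketch below shows the "at least as up-to-date" comparison a voter could apply when deciding whether to grant its vote: compare the terms of the last log entries first, and only compare log lengths when the terms are equal. The function name and example values are assumptions for illustration, not taken from any specific codebase.

```go
package main

import "fmt"

// atLeastAsUpToDate reports whether a Candidate's log (described by the term
// and index of its last entry) is at least as up-to-date as the voter's log.
// A Follower only grants its vote when this holds, which guarantees that the
// elected Leader already contains every committed entry.
func atLeastAsUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm // higher last term wins
	}
	return candLastIndex >= myLastIndex // same term: the longer log wins
}

func main() {
	// The voter's log ends with (term 3, index 9).
	fmt.Println(atLeastAsUpToDate(4, 5, 3, 9)) // true: higher last term
	fmt.Println(atLeastAsUpToDate(3, 8, 3, 9)) // false: same term, shorter log
}
```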

Sufficiency of subproblems

By solving the above three subproblems we obtain a complete consensus algorithm. The paper gives a detailed and rigorous proof: it assumes that a committed proposal is lost at some later point and then derives a contradiction; the proof is not repeated here. Its key point is that Leader Election requires the new Leader to obtain votes from more than half of the nodes, and every Commit in Log Replication is acknowledged by more than half of the nodes, so the two majorities share at least one node, and that node both holds the committed proposal and voted for the new Leader.

Indispensability of subproblems

The sufficiency of the three subproblems for consistency has been discussed above; next we argue that, within the Raft framework, dropping any one of them leads to serious inconsistency:

  • Without Leader Election, suppose that at some moment two nodes act as Leader, accepting user requests and initiating proposals at the same time. The cluster then cannot determine the order of the two proposals, which leads to conflicts. Dynamo, for example, uses a Quorum write strategy but still has to resolve conflicts with vector clocks, or even push them up to the user.
  • Without Log Replication, suppose commits no longer guarantee R + W > N, i.e. the set of nodes involved in a write and the set involved in a later read may not intersect. Requests that were reported as successful can then obviously become invisible on subsequent accesses.
  • Without Safety, suppose the new Leader is not guaranteed to have the most up-to-date log; it may then lack the most recently committed data, again making previously successful commits invisible.

From the above discussion it can be seen that a complete consensus protocol involves much more effort than Quorum alone. By dividing the problem into three subproblems, Raft keeps the implementation of the whole consensus algorithm simple and easy to understand. Based on Raft we also implemented our own consistency library, Floyd, to satisfy the consistency requirements of features such as the Zeppelin metadata cluster and Pika cross-datacenter replication.

Resources

In Search of an Understandable Consensus Algorithm

https://github.com/Qihoo360/floyd

http://catkang.github.io/2017/06/30/raft-subproblem.html
