Consensus, FLP and Paxos
October 2023
This post is a draft. Content may be incomplete or missing.
To understand the algorithms that underpin distributed systems, one must first accept this fundamental truth of all distributed systems:
Stuff is broken all the time, and we have no hope of fixing it all. ¯\_(ツ)_/¯
Why? Think about it this way:
How reliable is the device you’re using to read this? I mean, it probably works fine most of the time, but occasionally I’m sure you run into snags: freezes, crashes, overheating, dead batteries, random network disconnects, etc. In distributed systems, these kinds of problems are called faults. So how often does your device fault? Would you say it’s on the order of hours, days, weeks?
Well, that’s just with one device. What if you had to manage two thousand of them? What if you had to keep all of them working all the time?
Let’s say you get hit by one of these little snags once every two weeks. That’s a smidge over 1 million seconds between faults:
\[14 \times 24 \times 60 \times 60 = 1,209,600 \; seconds\]
Not too bad. But now let’s say we have 2,000 devices to manage. We’re going to start seeing random device faults about 2,000 times more often, reducing the average time between faults to:
\[1,209,600 \div 2,000 = 604.8 \; seconds\]
That’s one new fault every 10 minutes . . . 24 hours a day, 7 days a week, forever. See the problem? Even if we do ever manage to get on top of all the weird stuff going on in our network, we’ll never be done for more than 10 minutes at a time.
The random crashes, freezes and disconnects that didn’t seem like a big deal before are insurmountable at scale. Cloud providers have this problem times 100: they operate huge networks with many thousands of computers distributed across the world. Every minute, there’s new nonsense cropping up somewhere in the network; it happens so much because the network is so large. They can spend as much money as they want and hire as many people as they like, and still never get ahead of all the problems constantly starting up.
So that’s that: the hardware running your code is not 100% reliable, the network is not 100% reliable, operating systems are not 100% reliable, and we have no path to 100% reliability for any of these things. Does that sound terrible? Because distributed systems people know all this, and they’re pretty zen about it.
[ this is fine dog meme ]
That’s because we’ve figured out how to write software that papers over these reliability problems. Today, large systems everywhere are underpinned by fault-tolerant software algorithms, which work even when the infrastructure they run on doesn’t.
But how can code work when the computer running it doesn’t? There’s no magic here; fault tolerant software works because systems fail in predictable, manageable ways. Any time your code asks the system to do something, one of three basic things will happen:
- The system does what you asked it to do
- It does what you asked, but it takes a reeeally long time to do it
- The thing you asked for just never happens at all
However, it’s generally safe to assume the system will not go rogue and start doing random things you didn’t ask it to. All the things that will happen are things you coded to happen … you just can’t be sure how soon anything will happen, if ever.
It’s also safe to assume faults aren’t very widespread: obviously if all our computers have crashed and there is nothing left to run our code, then there’s nothing our code can do about it; but if just a few machines are having problems, the other computers can compensate via two basic strategies:
- Keep backup copies of all data, in case a machine storing it crashes
- Keep retrying things until they happen, to deal with delays and dropped requests
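To make that second strategy concrete, here’s a minimal retry sketch in Python. The send_request callable is a made-up stand-in for whatever RPC mechanism you’re using; we assume it either returns a response or raises TimeoutError when a request is delayed too long or dropped.

import time

def send_until_done(send_request, request, timeout=1.0, max_backoff=30.0):
    """Keep retrying a request until it succeeds.

    `send_request` is a hypothetical stand-in for your system's RPC call;
    we assume it returns a response or raises TimeoutError if the request
    was delayed past `timeout` or dropped entirely.
    """
    backoff = timeout
    while True:
        try:
            return send_request(request, timeout=timeout)
        except TimeoutError:
            time.sleep(backoff)                      # wait, then try again
            backoff = min(backoff * 2, max_backoff)  # back off exponentially

One caveat: blindly retrying is only safe if the request can be repeated without doing harm, which is a whole topic of its own.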
In this post we’re going to take the first baby step on our journey writing fault tolerant code: we are going to reinvent the concept of variables for the world of fault-tolerant programs.
Fault Tolerant Variables
Say we have a network of computers. Call each computer a node. We might picture the network of nodes like this:
DIAGRAM
Neither the nodes nor the network are completely reliable. At a low rate, the network randomly delays messages, and sometimes drops them so they are never delivered at all. The nodes also crash sometimes.
Can we implement a fault tolerant variable in software for this network? By this I mean I want a software primitive with the following properties:
- There is one variable
- Any node can get it
- Any node can set it
. . . and all of these properties hold even if the network is having problems, or some nodes have crashed. The variables are supposed to be fault tolerant, after all.
DIAGRAM: nodes in a network, thought bubble question mark in the middle for a variable
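Spelled out as an interface, the primitive we’re after might look something like this Python sketch (the class and method names are just illustrative, not from any real library):

from abc import ABC, abstractmethod

class FaultTolerantVariable(ABC):
    """One logical variable shared by every node in the network."""

    @abstractmethod
    def get(self):
        """Return the current value; callable from any node."""

    @abstractmethod
    def set(self, value):
        """Update the value; callable from any node."""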
Oftentimes the hardest part of a problem is figuring out where to start. Let’s attack it this way: we’ll start with a simple approach we know doesn’t work, and then try to fix it into working. Maybe that’ll yield a good solution, or maybe along the way we’ll learn something important about the problem.
A Simple (But Wrong) Approach
We already know how to make regular, non-distributed variables, so let’s make a plain old variable on one of our nodes, and set up an RPC server on that node so the other nodes can get and set the variable remotely.
DIAGRAM
The node we picked to store the variable is now special; let’s call it the leader. For the leader, the variable is just a plain old local variable like any other. The other nodes, which we’ll call followers, access the variable by sending RPC messages to the leader.
The leader’s logic will roughly be . . .
leader {
    variable := null;

    // local access, for code running on the leader itself
    get {
        return variable;
    }
    set(value) {
        variable := value;
    }

    // RPC handlers, for requests coming from the other nodes
    on client get(request) {
        request.respond(variable);
    }
    on client set(request) {
        variable := request.value;
    }
}
The followers:
follower {
    // followers read and write the variable by sending RPCs to the leader
    get {
        return leader.rpc(get);
    }
    set(value) {
        leader.rpc(set, value);
    }
}
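If you’d like something you can actually run, here’s roughly the same design sketched with Python’s built-in xmlrpc module. The host, port, and two-file layout are assumptions for illustration only, not a hardened setup:

# leader.py: owns the variable and serves get/set over RPC
from xmlrpc.server import SimpleXMLRPCServer

variable = None  # the one and only copy of the variable

def get():
    return variable

def set_value(value):
    global variable
    variable = value
    return True  # XML-RPC handlers must return a marshallable value

server = SimpleXMLRPCServer(("localhost", 8000), allow_none=True)
server.register_function(get, "get")
server.register_function(set_value, "set")
server.serve_forever()

# follower.py: reads and writes the variable by calling the leader
from xmlrpc.client import ServerProxy

leader = ServerProxy("http://localhost:8000", allow_none=True)
leader.set("hello")
print(leader.get())  # prints "hello"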
Now we have one variable any node can get or set. That was easy!
. . . too easy.
What happens to our algorithm if a node crashes?
DIAGRAM: another copy of the RPC diagram, last one in the previous section
If a random node crashes, most likely that node will be a follower, since there are so many followers and only one leader. If a follower goes offline, the variable should be safe and sound: it’s stored on the leader, which is still online, and all the other followers are still in contact with the leader:
DIAGRAM: same diagram with a random follower Xed out
But the leader is not immune to problems. If a random node crashes, it very well could be the leader. So what if it’s the leader that goes down?
DIAGRAM: same diagram, but with the leader Xed out
Now we have a problem. With the leader gone, so is the variable. All the follower nodes are still up and running, but they’re only programmed to send RPCs to the leader, and the leader isn’t going to respond now that it’s offline. Our variable has vanished along with the leader; and since it only took one fault (the leader crashing) to do it, we have to accept our variable is not fault tolerant.
That’s fine though, this was the plan all along: now that we have a basic sketch of a design, we just need to figure out how to fix it into something fault tolerant. Let’s give it a shot.
Broadcast Replication
The single-leader design didn’t work because there is no safe quarter for our variable: no matter what node we put the variable on, it’s possible we could lose that node, and the variable with it. The only way to definitely survive one node crash is to have at least two copies of the variable on different nodes. That way, if any one node crashes, no matter which node, we’ll still have one live backup copy we can switch over to. But what if two nodes crash? Or three?
Heck, to be maximally safe, let’s just put a copy of the variable on every node:
DIAGRAM
Every copy of the variable is called a replica. The process of creating and updating replicas is called replication.
In this new setup, getting a variable is simple: every node already has its own local replica of the variable, so to get the variable, just read your local replica like any other variable. To set the variable, let’s use a broadcast protocol: the node that wants to update the variable sends a “set” RPC to every other node, and each other node in turn updates its local replica to reflect the update it just received:
DIAGRAM
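Here’s a small in-process Python sketch of the broadcast scheme. It pretends the network delivers every message instantly and reliably, which, as we’re about to see, is exactly the assumption that gets us into trouble:

class Node:
    def __init__(self):
        self.replica = None   # this node's local copy of the variable
        self.peers = []       # every node in the network, including self

    def get(self):
        return self.replica   # reads are purely local

    def set(self, value):
        # broadcast: tell every node (including ourselves) to apply the update
        for peer in self.peers:
            peer.apply(value)

    def apply(self, value):
        # handler for an incoming "set" broadcast
        self.replica = value

# wire up a tiny in-process "network" of three nodes
nodes = [Node() for _ in range(3)]
for node in nodes:
    node.peers = nodes

nodes[0].set("red")
print([node.get() for node in nodes])  # ['red', 'red', 'red']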
One good thing I can say about this new design: it’s definitely fault tolerant. Each node has its own replica of the variable, so as long as we have at least one live node which has not faulted, we also have a live replica of the variable. All remaining nodes can still send set RPCs to one another, so the variable keeps working even if some nodes crash. It is definitely fault tolerant.
Unfortunately, that’s just about the only good thing I can say about this idea. This design is majorly flawed.
Even with fast computers and fast networks, it still takes some amount of time for a broadcast to reach every node. What if two nodes happen to do an update simultaneously?
DIAGRAM
Both sets of RPCs race, so it’s possible for different nodes to see the set messages in different orders:
DIAGRAM
If the updates are processed in different orders, then different nodes will end up with different values for their respective replicas:
DIAGRAM
This means the different nodes disagree about what the current value of the variable is. That’s pretty messed up! I’m not sure what kind of programs you can write on top of variables that can get all confused like this. We’re going to have to find a way to keep them in sync.
However, keeping replicas of a variable in sync turns out to be a very hard problem. So hard, in fact, that we probably don’t want to tackle it right away. Let’s choose a slightly easier problem: how can we get the replicas in sync just once?
If we can solve that problem, and apply it to this broadcasting replication algorithm, the result will be a kind of “write-once” variable: one that starts out null, can be set once, and from then on is immutable. To implement that, we just start out with all replicas initialized to null, and then anytime someone wants to set the variable, we run this “get replicas in sync just once” algorithm to decide what the permanent value for the variable will be, and stick with that forever. A real implementation of fault tolerant variables will of course need to support any number of set() calls, but maybe once we’ve figured out write-once variables we can extend the idea into many-write variables.
So for now, let’s focus on the next step: if multiple nodes try to set the variable at the same time, how do we get the replicas in sync just once?
This question has a name. It’s called consensus.
Consensus
Consensus is the problem of getting a group of nodes to agree — just once! — on the value of some variable. In our case, the value we’re trying to agree upon is what value our fault tolerant write-once variable should be set to. This is easy if only one node calls set(); it’s harder if multiple nodes call set() simultaneously.
So, what do we need a consensus algorithm to do?
To start, the basic reason we want a consensus algorithm is to ensure all nodes agree on the value of the variable by the time the algorithm finishes. In “the literature,” this most basic property is sometimes called Agreement. When we say all nodes should agree “by the time the algorithm finishes,” we’re also assuming the algorithm should finish in the first place. That is called Termination.
If you like technicalities, there technically is a way to achieve Agreement and Termination without really solving the problem: you just hardcode an answer. For example, if you’re trying to write a consensus algorithm for integers, “always return 4” is technically a valid consensus algorithm according to our definition so far: all nodes will agree the value is 4, and the algorithm will terminate very quickly. But algorithms like this are useless for the fault tolerant variable problem; we’d end up with a fault tolerant constant instead!
We want the agreed-upon value to be the one somebody wanted. If only one node calls set(), the variable should be set to that value. If two nodes call set() at the same time with different values, the algorithm should pick one of those two values. If both set() calls specify the same value, the algorithm should always pick that value. This idea is called Integrity.
And, of course, we can’t forget the uber-goal of ending up with something Fault Tolerant.
Then it’s settled: we believe a consensus algorithm should provide:
Termination: The algorithm exits.
Agreement: When the algorithm exits, all replicas agree on a value.
Integrity: That value is one somebody proposed.
Fault Tolerance: A single fault cannot violate any of the properties above.
Let’s try to invent an algorithm that accomplishes all of this.
Do you know of any real-life algorithms for a group of people to come to an agreement?
For example, think of a group of friends that want to go out to eat somewhere. To do that, first they need to pick where to go; that’s an agreement problem. What might happen next? Maybe someone throws out an idea, someone throws out another idea, some people agree, some disagree, eventually a group opinion starts to form. The tide starts to turn when someone finally says, “can we all just vote on it?”
Oh yeah, voting! Majority-rules voting is an algorithm that results in a group agreement. Maybe we could code up something like that? Let’s try it.
Majority-Rules Voting
Let’s code up an algorithm where nodes throw out proposals and vote on them, just like in the restaurant example above. However, unlike messy real life where people have preferences, we’ll make it so each node has no preference whatsoever: each node will vote for whichever option it heard about first, and never change its mind.
To keep things as simple as possible (for now), let’s only allow two values to be proposed. Since I’m American, we’ll call those options red and blue; but these options can stand in for anything: 0 and 1, yes and no, apples and oranges, etc. Supporting only two options is too simplistic for a real consensus algorithm, but in computing you usually end up being able to have exactly zero of something, exactly one of something, or any number of something. If we can figure out how to handle exactly two options, there’s probably a way to extend the idea from 2 to $N$ options.
With that, let’s start by having each node track its own vote, initially null:
DIAGRAM
To get things going, some client of our system needs to call set() — a.k.a. throw out a proposal. We’ll implement that by broadcasting a message to all other nodes, telling them all to vote for a specific proposed value, which can either be red or blue.
DIAGRAM one proposal
Then we’ll have each node vote for the first proposal it hears about. Since there is only one proposal in this example, every node hears about the same proposal first, so they all end up in agreement immediately:
DIAGRAM all nodes colored in the same color
Lucky us! We won’t always be so lucky though. There’s no central coordination involved in creating proposals, so we have to anticipate multiple proposals could be thrown out at the same time. If we have multiple racing proposals, they might agree, but they also might not:
DIAGRAM three proposals, two blue and one red
With multiple competing proposals, different nodes can receive different proposals first, and end up voting for different things, just like when we were talking about our broadcast replication algorithm before:
DIAGRAM
Now the votes don’t agree. But that’s okay, we don’t need the votes to agree; we just need to figure out which proposal got the most votes. For that, we can have every node tell every other node who they voted for:
DIAGRAM
Once every node knows what the final votes were, they can each independently count the votes to pick the winner. Since it’s the same votes no matter who’s counting, all nodes end up picking the same winner:
DIAGRAM
As long as some value reaches a majority of votes, that value will be the one and only value every node picks as the winner.
Here’s the full algorithm, for reference:
consensus {
    vote: Red | Blue;   // this node's vote
    nodes: Node[];      // all nodes, including self

    init {
        vote := null    // haven't voted yet
    }

    // a caller wants to propose a value
    propose(value) {
        nodes.all.send(proposal, value)
    }

    // received a proposal from another node
    on received proposal(value) {
        // accept the first proposal, ignore all others
        if (vote == null) {
            vote := value
        }
    }

    // a caller wants to read the final value.
    // returns null if the decision is still in progress.
    get() {
        counts: map from value to int   // votes per value, default 0
        foreach node in nodes {
            v := node.get_current_value()   // ask each node how it voted (RPC)
            if (v != null) {
                counts[v] := counts[v] + 1
            }
        }
        // returns the value with a majority of votes,
        // or null if no value has a majority yet
        return get_majority_proposal(counts)
    }

    // another node is asking how we voted
    on received get_current_value() {
        return vote
    }
}
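For the curious, here’s a runnable Python version of the same voting logic. Like the earlier sketches, it simulates the network in-process with instant, reliable delivery, so it conveniently dodges the timing questions we’re about to run into:

from collections import Counter

class Node:
    def __init__(self):
        self.vote = None   # first proposal this node heard, else None
        self.nodes = []    # all nodes, including self

    def propose(self, value):
        # broadcast the proposal to every node (including ourselves)
        for node in self.nodes:
            node.on_proposal(value)

    def on_proposal(self, value):
        # vote for the first proposal we hear about; ignore the rest
        if self.vote is None:
            self.vote = value

    def get(self):
        # tally everyone's vote and look for a strict majority
        counts = Counter(node.vote for node in self.nodes
                         if node.vote is not None)
        for value, count in counts.items():
            if count > len(self.nodes) // 2:
                return value
        return None        # no majority yet (or a split vote)

nodes = [Node() for _ in range(5)]
for node in nodes:
    node.nodes = nodes

nodes[0].propose("blue")
nodes[3].propose("red")    # too late: every node already voted blue
print(nodes[2].get())      # prints "blue"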
Now we have a complete algorithm that pretty much works:
- Agreement: ✅ — only one value can reach a majority, and all nodes will see that value is the one that reached a majority
- Integrity: ✅ — nodes only vote for a value someone proposed; so the value that got the most votes was proposed by somebody
- Termination: . . . oh wait, does this algorithm always terminate?
In our new algorithm, agreement is reached once a value reaches a majority; so we can only terminate if some value reaches a majority. But what if we’re unlucky, and we end up with a split vote?
DIAGRAM 3 v 3 split vote
Now no value has reached a majority, and since there are no more nodes left to vote, no value ever will reach a majority. Our algorithm doesn’t terminate!
Well, there’s a simple workaround for that: just require an odd number of nodes. If there are only two values you can propose, and there’s an odd number of nodes, some proposal has to have reached a majority once all votes are in!
DIAGRAM
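A quick sanity check on that claim: with $2k+1$ nodes and only two possible values, the pigeonhole principle guarantees that once every vote is in, one of the two values has collected at least
\[\left\lceil \tfrac{2k+1}{2} \right\rceil = k + 1\]
votes, and $k + 1$ out of $2k+1$ is a strict majority.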
Great, let’s assume an odd number of nodes and check again:
- Agreement: ✅
- Integrity: ✅
- Termination: ✅ — with only two values and an odd number of nodes, some value will have reached a majority once all votes are in
- Fault Tolerance: . . . we might have a problem here
In a fault tolerant algorithm, it should be okay for one node to crash. But we just said it was really important for us to have an odd number of nodes . . . if a node crashes, we’re left with an even number of nodes again, and split votes become possible again:
DIAGRAM
Uh oh, the wheels are really starting to fall off again, aren’t they? Having an odd number of nodes isn’t sufficient to protect from split votes; we need to find another solution.
Okay, okay, hear me out. I have another idea:
Tiebreaks and Takebacks
There’s no way to prevent split votes, so there’s no way to ensure some value always reaches a majority. Maybe it’s not as bad as it sounds. Currently, once all the votes are in, we just pick whichever value reached a majority; maybe we can tweak that rule to also handle the case where the votes tie.
Let’s add this tiebreak rule: if a color has a majority of the votes, we pick that color, but in the event of a tie, we pick red (chosen fairly by asking my 4-year-old his favorite color). Now if a node crashes and leaves us with a split vote, we still get an answer:
DIAGRAM
But . . . no, wait, hold on now, we can’t do this. We’re making a decision before all the votes are in! Right now we have a tie vote and we picked red, but how do we know that last node is actually offline? What if it’s completely healthy, and just happens to be the last node to vote? What if it comes back moments later and votes blue?
DIAGRAM
We changed our mind! At one point in time, we had a 3v3 split vote, and all nodes agreed on red, per the tiebreak rule. But then that last vote came in, and everybody changed their mind to blue. You can’t do that! What if somebody already saw that the outcome was red, and then someone else sees the new outcome is blue? That doesn’t sound like Agreement to me.
Agreement is total; it requires all nodes to always agree on the same value. Once you declare a value decided, you have to stick with it forever.
Things get worse from here. The problem we just uncovered isn’t specific to the tiebreaking rule we chose; it’s going to be a problem with any tiebreaking rule. Using a tiebreaker is totally fine if the final node is offline and is guaranteed never to come back. But how do we know the final node is offline and not coming back? No matter how long we wait for the last node to enter its vote, it is always still possible for it to do so some time in the future. That means, no matter how much time has passed, it’s never safe to run the tiebreaker rule. If it’s never safe to run the tiebreaker rule, we simply can’t have one.
So we can’t prevent split votes, and we can’t resolve split votes either. This is looking an awful lot like a dead end.
Hm.
Something Has Gone Very Wrong
It would seem we have jumped down a rabbit hole much deeper than we at first imagined.
Let’s recap how we got here:
We started with the simple goal of inventing a fault tolerant variable — kind of the most basic thing you would need if you intend to write fault tolerant software.
Then, in about 30 seconds, we invented a single-leader algorithm that worked, except it made no attempt whatsoever at fault tolerance. To add fault tolerance, we then added backup copies of our variable. But with a replica on every node, every node effectively became a leader, and that led to a new problem: how do we keep those replicas in sync? That’s how we got started talking about consensus.
We came up with a pretty good base idea for consensus, one even rooted in a metaphor for how groups reach consensus in real life. But rather unexpectedly, it turned out to be a complete dead end.
We just keep layering on the complexity, and still we don’t have a solution in sight. Did we take a misstep somewhere? This seemed so simple at the start; how did we get so stuck?
If we’re looking for consolation, at least we’re in good company. At this stage, a generation of distributed systems researchers put an awful lot of thought into this problem. They were able to answer lots of related questions: things that don’t work, properties any solution must have, different ways of simplifying the problem and then solving the simplified problem. But for years, nobody had an answer to the main question: how does one design a fault-tolerant consensus algorithm?
I would like to invite you to join in the tradition by mulling it over yourself. What is wrong with the approaches we’ve tried so far? Can we fix them? If not, why not?
If you want a little bit of direction, consider the dead end we just ran into with split votes. We were left with two unacceptable choices:
- Include a tiebreaker, which lets the algorithm Terminate but breaks Agreement
- Omit the tiebreaker, which upholds Agreement but means the algorithm may never Terminate
Why do Agreement and Termination seem to be at odds with one another?
Think about it! This page will still be here when you get back.
Welcome back! How did it go? I’m guessing you’re still stuck, but don’t worry — everyone else gets stuck here too.
During the mid-1980s, in their paper Impossibility of Distributed Consensus with One Faulty Process, Fischer, Lynch and Paterson (the “FLP” in what later became known as the “FLP result”) explained exactly why nobody could come up with a fault-tolerant consensus algorithm: it’s impossible!
Wait, what??
That’s right, fault tolerant consensus is impossible. You cannot create an algorithm which satisfies Agreement, Integrity, Termination, and Fault Tolerance, all at the same time.
Strange … lots of stuff is built on the cloud, using fault tolerant programming techniques, which rely on fault tolerant consensus algorithms — which exist. But we also have this FLP proof, a well respected proof that fault tolerant consensus algorithms do not exist. 🤔
There’s no conundrum here: what FLP proves impossible is fault tolerant consensus as we have formulated the problem so far. Remember when we came up with our four properties, Agreement, Integrity, Termination and Fault Tolerance? We made a subtle error at that point, and ended up with a set of requirements that cannot all be satisfied simultaneously by one algorithm; that is what FLP proves. But there is also a loophole, a way of tweaking those properties to reformulate the consensus problem as something that can indeed be done. Paxos and Raft are both built on this loophole. So the next step in our mission to build a fault tolerant variable is to find that loophole ourselves. To do that, we had better sit down and work through FLP.