Consensus, FLP and Paxos
October 2023
This post is a draft. Content may be incomplete or missing.
To understand the algorithms that underpin distributed systems, one must first accept this fundamental truth of all distributed systems:
Stuff is broken all the time, and we have no hope of fixing it all. ¯\_(ツ)_/¯
Why? Think about it this way:
How reliable is the device you’re using to read this? I mean, it probably works fine most of the time, but occasionally I’m sure you run into snags: freezes, crashes, overheating, dead batteries, random network disconnects, etc. In distributed systems, these kinds of problems are called faults. So how often does your device fault? Would you say it’s on the order of hours, days, weeks?
Well, that’s just with one device. What if you had to manage two thousand of them? What if you had to keep all of them working all the time?
Let’s say you get hit by one of these little snags once every two weeks. That’s a smidge over 1 million seconds between faults:
\[14 \times 24 \times 60 \times 60 = 1,209,600 \; seconds\]
Not too bad. But now let’s say we have 2,000 devices to manage. We’re going to start seeing random device faults about 2,000 times more often, reducing the average time between faults to:
\[1,209,600 \div 2,000 = 604.8 \; seconds\]
That’s one new fault every 10 minutes . . . 24 hours a day, 7 days a week, forever. See the problem? Even if we do ever manage to get on top of all the weird stuff going on in our network, we’ll never be done for more than 10 minutes at a time.
The random crashes, freezes and disconnects that didn’t seem like a big deal before are insurmountable at scale. Cloud providers have this problem times 100: they operate huge networks with many thousands of computers distributed across the world. Every minute, there’s new nonsense cropping up somewhere in the network; it happens so much because the network is so large. They can spend as much money as they want and hire as many people as they like, and still never get ahead of all the problems constantly starting up.
So that’s that: the hardware running your code is not 100% reliable, the network is not 100% reliable, operating systems are not 100% reliable, and we have no path to 100% reliability for any of these things. Does that sound terrible? Because distributed systems people know all this, and they’re pretty zen about it.
[ this is fine dog meme ]
That’s because we’ve figured out how to write software that papers over these reliability problems. Today, large systems everywhere are underpinned by fault-tolerant software algorithms, which work even when the infrastructure they run on doesn’t.
But how can code work when the computer running it doesn’t? There’s no magic here; fault tolerant software works because systems fail in predictable, manageable ways. Any time your code asks the system to do something, one of three basic things will happen:
- The system does what you asked it to do
- It does what you asked, but it takes a reeeally long time to do it
- The thing you asked for just never happens at all
However, it’s generally safe to assume the system will not go rogue and start doing random things you didn’t ask it to. All the things that will happen are things you coded to happen … you just can’t be sure how soon anything will happen, if ever.
It’s also safe to assume faults aren’t very widespread: obviously if all our computers have crashed and there is nothing left to run our code, then there’s nothing our code can do about it; but if just a few machines are having problems, the other computers can compensate via two basic strategies:
- Keep backup copies of all data, in case a machine storing it crashes
- Keep retrying things until they happen, to deal with delays and dropped requests
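To make that second strategy concrete, here’s a minimal retry sketch in Python. The send_request callable is a made-up stand-in for whatever RPC mechanism you’re using; we assume it either returns a response or raises TimeoutError when a request is delayed too long or dropped.

import time

def send_until_done(send_request, request, timeout=1.0, max_backoff=30.0):
    """Keep retrying a request until it succeeds.

    `send_request` is a hypothetical stand-in for your system's RPC call;
    we assume it returns a response or raises TimeoutError if the request
    was delayed past `timeout` or dropped entirely.
    """
    backoff = timeout
    while True:
        try:
            return send_request(request, timeout=timeout)
        except TimeoutError:
            time.sleep(backoff)                      # wait, then try again
            backoff = min(backoff * 2, max_backoff)  # back off exponentially

One caveat: blindly retrying is only safe if the request can be repeated without doing harm, which is a whole topic of its own.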
In this post we’re going to take the first baby step on our journey writing fault tolerant code: we are going to reinvent the concept of variables for the world of fault-tolerant programs.
Fault Tolerant Variables
Say we have a network of computers. Call each computer a node. We might picture the network of nodes like this:
DIAGRAM
Neither the nodes nor the network are completely reliable. At a low rate, the network randomly delays messages, and sometimes drops them so they are never delivered at all. The nodes also crash sometimes.
Can we implement a fault tolerant variable in software for this network? By this I mean I want a software primitive with the following properties:
- There is one variable
- Any node can get it
- Any node can set it
. . . and all of these properties hold even if the network is having problems, or some nodes have crashed. The variables are supposed to be fault tolerant, after all.
DIAGRAM: nodes in a network, thought bubble question mark in the middle for a variable
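Spelled out as an interface, the primitive we’re after might look something like this Python sketch (the class and method names are just illustrative, not from any real library):

from abc import ABC, abstractmethod

class FaultTolerantVariable(ABC):
    """One logical variable shared by every node in the network."""

    @abstractmethod
    def get(self):
        """Return the current value; callable from any node."""

    @abstractmethod
    def set(self, value):
        """Update the value; callable from any node."""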
Oftentimes the hardest part of a problem is figuring out where to start. Let’s attack it this way: we’ll start with a simple approach we know doesn’t work, and then try to fix it into working. Maybe that’ll yield a good solution, or maybe along the way we’ll learn something important about the problem.
A Simple (But Wrong) Approach
We already know how to make regular, non-distributed variables, so let’s make a plain old variable on one of our nodes, and set up an RPC server on that node so the other nodes can get and set the variable remotely.
DIAGRAM
The node we picked to store the variable is now special; let’s call it the leader. For the leader, the variable is just a plain old local variable like any other. The other nodes, which we’ll call followers, access the variable by sending RPC messages to the leader.
The leader’s logic will roughly be . . .
leader {
    variable := null;

    // local access, for code running on the leader itself
    get {
        return variable;
    }
    set(value) {
        variable := value;
    }

    // RPC handlers, for requests coming from the other nodes
    on client get(request) {
        request.respond(variable);
    }
    on client set(request) {
        variable := request.value;
    }
}
The followers:
follower {
    // followers read and write the variable by sending RPCs to the leader
    get {
        return leader.rpc(get);
    }
    set(value) {
        leader.rpc(set, value);
    }
}
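If you’d like something you can actually run, here’s roughly the same design sketched with Python’s built-in xmlrpc module. The host, port, and two-file layout are assumptions for illustration only, not a hardened setup:

# leader.py: owns the variable and serves get/set over RPC
from xmlrpc.server import SimpleXMLRPCServer

variable = None  # the one and only copy of the variable

def get():
    return variable

def set_value(value):
    global variable
    variable = value
    return True  # XML-RPC handlers must return a marshallable value

server = SimpleXMLRPCServer(("localhost", 8000), allow_none=True)
server.register_function(get, "get")
server.register_function(set_value, "set")
server.serve_forever()

# follower.py: reads and writes the variable by calling the leader
from xmlrpc.client import ServerProxy

leader = ServerProxy("http://localhost:8000", allow_none=True)
leader.set("hello")
print(leader.get())  # prints "hello"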
Now we have one variable any node can get or set. That was easy!
. . . too easy.
What happens to our algorithm if a node crashes?
DIAGRAM: another copy of the RPC diagram, last one in the previous section
If a random node crashes, most likely that node will be a follower, since there are so many followers and only one leader. If a follower goes offline, the variable should be safe and sound: it’s stored on the leader, which is still online, and all the other followers are still in contact with the leader:
DIAGRAM: same diagram with a random follower Xed out
But the leader is not immune to problems. If a random node crashes, it very well could be the leader. So what if it’s the leader that goes down?
DIAGRAM: same diagram, but with the leader Xed out
Now we have a problem. With the leader gone, so is the variable. All the follower nodes are still up and running, but they’re only programmed to send RPCs to the leader, and the leader isn’t going to respond now that it’s offline. Our variable has vanished along with the leader; and since it only took one fault (the leader crashing) to do it, we have to accept our variable is not fault tolerant.
That’s fine though, this was the plan all along: now that we have a basic sketch of a design, we just need to figure out how to fix it into something fault tolerant. Let’s give it a shot.
Broadcast Replication
The single-leader design didn’t work because there is no safe quarter for our variable: no matter what node we put the variable on, it’s possible we could lose that node, and the variable with it. The only way to definitely survive one node crash is to have at least two copies of the variable on different nodes. That way, if any one node crashes, no matter which node, we’ll still have one live backup copy we can switch over to. But what if two nodes crash? Or three?
Heck, to be maximally safe, let’s just put a copy of the variable on every node:
DIAGRAM
Every copy of the variable is called a replica. The process of creating and updating replicas is called replication.
In this new setup, getting a variable is simple: every node already has its own local replica of the variable, so to get the variable, just read your local replica like any other variable. To set the variable, let’s use a broadcast protocol: the node that wants to update the variable sends a “set” RPC to every other node, and each other node in turn updates its local replica to reflect the update it just received:
DIAGRAM
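Here’s a small in-process Python sketch of the broadcast scheme. It pretends the network delivers every message instantly and reliably, which, as we’re about to see, is exactly the assumption that gets us into trouble:

class Node:
    def __init__(self):
        self.replica = None   # this node's local copy of the variable
        self.peers = []       # every node in the network, including self

    def get(self):
        return self.replica   # reads are purely local

    def set(self, value):
        # broadcast: tell every node (including ourselves) to apply the update
        for peer in self.peers:
            peer.apply(value)

    def apply(self, value):
        # handler for an incoming "set" broadcast
        self.replica = value

# wire up a tiny in-process "network" of three nodes
nodes = [Node() for _ in range(3)]
for node in nodes:
    node.peers = nodes

nodes[0].set("red")
print([node.get() for node in nodes])  # ['red', 'red', 'red']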
One good thing I can say about this new design: it’s definitely fault tolerant. Each node has its own replica of the variable, so as long as we have at least one live node which has not faulted, we also have a live replica of the variable. All remaining nodes can still send set RPCs to one another, so the variable keeps working even if some nodes crash. It is definitely fault tolerant.
Unfortunately, that’s just about the only good thing I can say about this idea. This design is majorly flawed.
Even with fast computers and fast networks, it still takes some amount of time for a broadcast to reach every node. What if two nodes happen to do an update simultaneously?
DIAGRAM
Both sets of RPCs race, so it’s possible for different nodes to see the set messages in different orders:
DIAGRAM
If the updates are processed in different orders, then different nodes will end up with different values for their respective replicas:
DIAGRAM
This means the different nodes disagree about what the current value of the variable is. That’s pretty messed up! I’m not sure what kind of programs you can write on top of variables that can get all confused like this. We’re going to have to find a way to keep them in sync.
However, keeping replicas of a variable in sync turns out to be a very hard problem. So hard, in fact, that we probably don’t want to tackle it right away. Let’s choose a slightly easier problem: how can we get the replicas in sync just once?
If we can solve that problem, and apply it to this broadcasting replication algorithm, the result will be a kind of “write-once” variable: one that starts out null, can be set once, and from then on is immutable. To implement that, we just start out with all replicas initialized to null, and then anytime someone wants to set the variable, we run this “get replicas in sync just once” algorithm to decide what the permanent value for the variable will be, and stick with that forever. A real implementation of fault tolerant variables will of course need to support any number of set() calls, but maybe once we’ve figured out write-once variables we can extend the idea into many-write variables.
So for now, let’s focus on the next step: if multiple nodes try to set the variable at the same time, how do we get the replicas in sync just once?
This question has a name. It’s called consensus.
Consensus
Consensus is the problem of getting a group of nodes to agree — just once! — on the value of some variable. In our case, the value we’re trying to agree upon is what value our fault tolerant write-once variable should be set to. This is easy if only one node calls set(); it’s harder if multiple nodes call set() simultaneously.
So, what do we need a consensus algorithm to do?
To start, the basic reason we want a consensus algorithm is to ensure all nodes agree on the value of the variable by the time the algorithm finishes. In “the literature,” this most basic property is sometimes called Agreement. When we say all nodes should agree “by the time the algorithm finishes,” we’re also assuming the algorithm should finish in the first place. That is called Termination.
If you like technicalities, there technically is a way to achieve Agreement and Termination without really solving the problem: you just hardcode an answer. For example, if you’re trying to write a consensus algorithm for integers, “always return 4” is technically a valid consensus algorithm according to our definition so far: all nodes will agree the value is 4, and the algorithm will terminate very quickly. But algorithms like this are useless for the fault tolerant variable problem; we’d end up with a fault tolerant constant instead!
We want the agreed-upon value to be the one somebody wanted. If only one node calls set(), the variable should be set to that value. If two nodes call set() at the same time with different values, the algorithm should pick one of those two values. If both set() calls specify the same value, the algorithm should always pick that value. This idea is called Integrity.
And, of course, we can’t forget the uber-goal of ending up with something Fault Tolerant.
Then it’s settled: we believe a consensus algorithm should provide:
Termination: The algorithm exits.
Agreement: When the algorithm exits, all replicas agree on a value.
Integrity: That value is one somebody proposed.
Fault Tolerance: A single fault cannot violate any of the properties above.
Let’s try to invent an algorithm that accomplishes all of this.
Do you know of any real-life algorithms for a group of people to come to an agreement?
For example, think of a group of friends that want to go out to eat somewhere. To do that, first they need to pick where to go; that’s an agreement problem. What might happen next? Maybe someone throws out an idea, someone throws out another idea, some people agree, some disagree, eventually a group opinion starts to form. The tide starts to turn when someone finally says, “can we all just vote on it?”
Oh yeah, voting! Majority-rules voting is an algorithm that results in a group agreement. Maybe we could code up something like that? Let’s try it.
Majority-Rules Voting
Let’s code up an algorithm where nodes throw out proposals and vote on them, just like in the restaurant example above. However, unlike messy real life where people have preferences, we’ll make it so each node has no preference whatsoever: each node will vote for whichever option it heard about first, and never change its mind.
To keep things as simple as possible (for now), let’s only allow two values to be proposed. Since I’m American, we’ll call those options red and blue; but these options can stand in for anything: 0 and 1, yes and no, apples and oranges, etc. Supporting only two options is too simplistic for a real consensus algorithm, but in computing you usually end up being able to have exactly zero of something, exactly one of something, or any number of something. If we can figure out how to handle exactly two options, there’s probably a way to extend the idea from 2 to $N$ options.
With that, let’s start by having each node track its own vote, initially null:
DIAGRAM
To get things going, some client of our system needs to call set() — a.k.a. throw out a proposal. We’ll implement that by broadcasting a message to all other nodes, telling them all to vote for a specific proposed value, which can either be red or blue.
DIAGRAM one proposal
Then we’ll have each node vote for the first proposal it hears about. Since there is only one proposal in this example, every node hears about the same proposal first, so they all end up in agreement immediately:
DIAGRAM all nodes colored in the same color
Lucky us! We won’t always be so lucky though. There’s no central coordination involved in creating proposals, so we have to anticipate multiple proposals could be thrown out at the same time. If we have multiple racing proposals, they might agree, but they also might not:
DIAGRAM three proposals, two blue and one red
With multiple competing proposals, different nodes can receive different proposals first, and end up voting for different things, just like when we were talking about our broadcast replication algorithm before:
DIAGRAM
Now the votes don’t agree. But that’s okay, we don’t need the votes to agree; we just need to figure out which proposal got the most votes. For that, we can have every node tell every other node who they voted for:
DIAGRAM
Once every node knows what the final votes were, they can each independently count the votes to pick the winner. Since it’s the same votes no matter who’s counting, all nodes end up picking the same winner:
DIAGRAM
As long as some value reaches a majority of votes, that value will be the one and only value every node picks as the winner.
Here’s the full algorithm, for reference:
consensus {
    vote: Red | Blue;   // this node's vote
    nodes: Node[];      // all nodes, including self

    init {
        vote := null    // haven't voted yet
    }

    // a caller wants to propose a value
    propose(value) {
        nodes.all.send(proposal, value)
    }

    // received a proposal from another node
    on received proposal(value) {
        // accept the first proposal, ignore all others
        if (vote == null) {
            vote := value
        }
    }

    // a caller wants to read the final value.
    // returns null if the decision is still in progress.
    get() {
        counts: map from value to int   // votes per value, default 0
        foreach node in nodes {
            v := node.get_current_value()   // ask each node how it voted (RPC)
            if (v != null) {
                counts[v] := counts[v] + 1
            }
        }
        // returns the value with a majority of votes,
        // or null if no value has a majority yet
        return get_majority_proposal(counts)
    }

    // another node is asking how we voted
    on received get_current_value() {
        return vote
    }
}
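For the curious, here’s a runnable Python version of the same voting logic. Like the earlier sketches, it simulates the network in-process with instant, reliable delivery, so it conveniently dodges the timing questions we’re about to run into:

from collections import Counter

class Node:
    def __init__(self):
        self.vote = None   # first proposal this node heard, else None
        self.nodes = []    # all nodes, including self

    def propose(self, value):
        # broadcast the proposal to every node (including ourselves)
        for node in self.nodes:
            node.on_proposal(value)

    def on_proposal(self, value):
        # vote for the first proposal we hear about; ignore the rest
        if self.vote is None:
            self.vote = value

    def get(self):
        # tally everyone's vote and look for a strict majority
        counts = Counter(node.vote for node in self.nodes
                         if node.vote is not None)
        for value, count in counts.items():
            if count > len(self.nodes) // 2:
                return value
        return None        # no majority yet (or a split vote)

nodes = [Node() for _ in range(5)]
for node in nodes:
    node.nodes = nodes

nodes[0].propose("blue")
nodes[3].propose("red")    # too late: every node already voted blue
print(nodes[2].get())      # prints "blue"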
Now we have a complete algorithm that pretty much works:
- Agreement: ✅ — only one value can reach a majority, and all nodes will see that value is the one that reached a majority
- Integrity: ✅ — nodes only vote for a value someone proposed; so the value that got the most votes was proposed by somebody
- Termination: . . . oh wait, does this algorithm always terminate?
In our new algorithm, agreement is reached once a value reaches a majority; so we can only terminate if some value reaches a majority. But what if we’re unlucky, and we end up with a split vote?
DIAGRAM 3 v 3 split vote
Now no value has reached a majority, and since there are no more nodes left to vote, no value ever will reach a majority. Our algorithm doesn’t terminate!
Well, there’s a simple workaround for that: just require an odd number of nodes. If there are only two values you can propose, and there’s an odd number of nodes, some proposal has to have reached a majority once all votes are in!
DIAGRAM
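A quick sanity check on that claim: with $2k+1$ nodes and only two possible values, the pigeonhole principle guarantees that once every vote is in, one of the two values has collected at least
\[\left\lceil \tfrac{2k+1}{2} \right\rceil = k + 1\]
votes, and $k + 1$ out of $2k+1$ is a strict majority.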
Great, let’s assume an odd number of nodes and check again:
- Agreement: ✅
- Integrity: ✅
- Termination: ✅ — with only two values and an odd number of nodes, some value will have reached a majority once all votes are in
- Fault Tolerance: . . . we might have a problem here
In a fault tolerant algorithm, it should be okay for one node to crash. But we just said it was really important for us to have an odd number of nodes . . . if a node crashes, we’re left with an even number of nodes again, and split votes become possible again:
DIAGRAM
Uh oh, the wheels are really starting to fall off again, aren’t they? Having an odd number of nodes isn’t sufficient to protect from split votes; we need to find another solution.
Okay, okay, hear me out. I have another idea:
Tiebreaks and Takebacks
There’s no way to prevent split votes, so there’s no way to ensure some value always reaches a majority. Maybe it’s not as bad as it sounds. Currently, once all the votes are in, we just pick whichever value reached a majority; maybe we can tweak that rule to also handle the case where the votes tie.
Let’s add this tiebreak rule: if a color has a majority of the votes, we pick that color, but in the event of a tie, we pick red (chosen fairly by asking my 4-year-old his favorite color). Now if a node crashes and leaves us with a split vote, we still get an answer:
DIAGRAM
But . . . no, wait, hold on now, we can’t do this. We’re making a decision before all the votes are in! Right now we have a tie vote and we picked red, but how do we know that last node is actually offline? What if it’s completely healthy, and just happens to be the last node to vote? What if it comes back moments later and votes blue?
DIAGRAM
We changed our mind! At one point in time, we had a 3v3 split vote, and all nodes agreed on red, per the tiebreak rule. But then that last vote came in, and everybody changed their mind to blue. You can’t do that! What if somebody already saw that the outcome was red, and then someone else sees the new outcome is blue? That doesn’t sound like Agreement to me.
Agreement is total; it requires all nodes to always agree on the same value. Once you declare a value decided, you have to stick with it forever.
Things get worse from here. The problem we just uncovered isn’t specific to the tiebreaking rule we chose; it’s going to be a problem with any tiebreaking rule. Using a tiebreaker is totally fine if the final node is offline and is guaranteed never to come back. But how do we know the final node is offline and not coming back? No matter how long we wait for the last node to enter its vote, it is always still possible for it to do so some time in the future. That means, no matter how much time has passed, it’s never safe to run the tiebreaker rule. If it’s never safe to run the tiebreaker rule, we simply can’t have one.
So we can’t prevent split votes, and we can’t resolve split votes either. This is looking an awful lot like a dead end.
Hm.
Something Has Gone Very Wrong
It would seem we have jumped down a rabbit hole much deeper than we at first imagined.
Let’s recap how we got here:
We started with the simple goal of inventing a fault tolerant variable — kind of the most basic thing you would need if you intend to write fault tolerant software.
Then, in about 30 seconds, we invented a single-leader algorithm that worked, except it made no attempt whatsoever at fault tolerance. To add fault tolerance, we then added backup copies of our variable. But with a replica on every node, every node effectively became a leader, and that led to a new problem: how do we keep those replicas in sync? That’s how we got started talking about consensus.
We came up with a pretty good base idea for consensus, one even rooted in a metaphor for how groups reach consensus in real life. But rather unexpectedly, it turned out to be a complete dead end.
We just keep layering on the complexity, and still we don’t have a solution in sight. Did we take a misstep somewhere? This seemed so simple at the start; how did we get so stuck?
If we’re looking for consolation, at least we’re in good company. At this stage, a generation of distributed systems researchers put an awful lot of thought into this problem. They were able to answer lots of related questions: things that don’t work, properties any solution must have, different ways of simplifying the problem and then solving the simplified problem. But for years, nobody had an answer to the main question: how does one design a fault-tolerant consensus algorithm?
I would like to invite you to join in the tradition by mulling it over yourself. What is wrong with the approaches we’ve tried so far? Can we fix them? If not, why not?
If you want a little bit of direction, consider the dead end we just ran into with split votes. We were left with two unacceptable choices:
- Include a tiebreaker, which lets the algorithm Terminate but breaks Agreement
- Omit the tiebreaker, which upholds Agreement but means the algorithm may never Terminate
Why do Agreement and Termination seem to be at odds with one another?
Think about it! This page will still be here when you get back.
Welcome back! How did it go? I’m guessing you’re still stuck, but don’t worry — everyone else gets stuck here too.
During the mid-1980s, in their paper Impossibility of Distributed Consensus with One Faulty Process, Fischer, Lynch and Paterson (the “FLP” in what later became known as the “FLP result”) explained exactly why nobody could come up with a fault-tolerant consensus algorithm: it’s impossible!
Wait, what??
That’s right, fault tolerant consensus is impossible. You cannot create an algorithm which satisfies Agreement, Integrity, Termination, and Fault Tolerance, all at the same time.
Strange … lots of stuff is built on the cloud, using fault tolerant programming techniques, which rely on fault tolerant consensus algorithms — which exist. But we also have this FLP proof, a well respected proof that fault tolerant consensus algorithms do not exist. 🤔
There’s no conundrum here: what FLP proves impossible is fault tolerant consensus as we have formulated the problem so far. Remember when we came up with our four properties, Agreement, Integrity, Termination and Fault Tolerance? We made a subtle error at that point, and ended up with a set of requirements that cannot all be satisfied simultaneously by one algorithm; that is what FLP proves. But there is also a loophole, a way of tweaking those properties to reformulate the consensus problem as something that can indeed be done. Paxos and Raft are both built on this loophole. So the next step in our mission to build a fault tolerant variable is to find that loophole ourselves. To do that, we had better sit down and work through FLP.