What FLP Tells Us, and Why It Matters

June 2023

This post is a draft. Content may be incomplete or missing.

If you ask some rando off the street, “Hey, what are some foundational problems in the field of distributed systems?” they’d probably say something like, “What? Who are you? Get away from me!” Others might suggest the problem of distributed consensus — namely, getting a bunch of nodes in a network to agree upon some value. That’s the problem you solve with fancy algorithms like Raft and Paxos.

If you’re going to do deep distributed systems work, sooner or later you’re probably going to need to get a feel for consensus — what it is, why it’s hard, and basically how one goes about solving it. Sadly, this is no mean feat. This problem space is notoriously difficult to understand. Where does one even begin?

Start with the FLP result.

Fischer, Lynch, Paterson

Named after the three researchers (Fischer, Lynch, and Paterson) who first published it in the 1980s, FLP was a key stepping stone on the path to modern consensus algorithms. We have those algorithms today, but FLP remains a useful stepping stone for learning how they work.

Back when FLP was published, there were plenty of consensus algorithms, but none of them worked reliably. Every known algorithm had a ‘window of vulnerability’ — a critical point in the algorithm where, if the wrong machine crashed at exactly the wrong time, the whole algorithm would get wedged. The only way out at that point would be to send someone in to manually repair the algorithm’s internal state — a process that certainly could never scale. From the start, it was clear we needed fault-tolerant algorithms: ones which were entirely free of a window of vulnerability.

Researchers kept trying and trying to find a fault-tolerant consensus algorithm — kept trying to close the window of vulnerability — and kept coming up empty. Fischer, Lynch, and Paterson found the common pattern: a reason these windows of vulnerability kept reappearing no matter what anyone tried. That insight ended up being key to the invention of Paxos, the first practical fault-tolerant consensus algorithm.

Just as FLP was the insight we needed to create something like Paxos, so it remains a useful insight for understanding those algorithms today. If you want to know why modern consensus algorithms like Raft and Paxos work the way they do, start with FLP.

The formal mathematical proof you’ll find in the original FLP paper is not very approachable, to say the least; but luckily for us, there’s a straightforward line of reasoning wrapped up in all that notation! The crux of their idea falls right out of the design requirements for a consensus algorithm, so that’s a good place for us to start.

What guarantees would a consensus algorithm need to provide? Let’s think of an example where consensus might be useful:

Example: Splitting the Check

Let’s say you’re making an app that lets users split the check at restaurants. The restaurant sends your app a set of users and a bill; you send your users the bill, and let them decide how to split the payment. Then you withdraw money from each user’s account to pay the restaurant.

How do you write the core payment logic for this app? Seems simple enough — here’s a first stab:

public class Transaction {
  // ...
  public void pay(Bill bill, Share[] shares) {
    // First pass: check that every user can cover their share,
    // and add up the total while we're at it.
    Money total = Money.none();
    for (Share share : shares) {
      Money amount = share.amount();
      if (!share.account().has(amount)) {
        throw new IllegalStateException("Insufficient funds!");
      }
      total = total.add(amount);
    }

    // The shares have to add up to the whole bill.
    if (!total.equals(bill.total())) {
      throw new IllegalArgumentException("Check your math!");
    }

    // Second pass: move each share from the user's account
    // to the restaurant's account.
    for (Share share : shares) {
      Money amount = share.amount();
      share.account().debit(amount);
      bill.account().credit(amount);
    }
  }
}

The basic idea seems sound, doesn’t it? Make sure every user has enough money in their account to pay their part of the bill, make sure you’ve paid the entire bill, and then move funds to the restaurant’s bank account. What could go wrong?

Things go wrong

Actually, a lot can go very wrong! For example, what if a user’s account gets debited (by someone else) after the first for loop, but before the second? Then we’ll think they have sufficient funds to cover their share — at the time we checked, they really did! — but by the time we’re actually crediting and debiting accounts, the money’s gone. Uh oh!
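
To see the gap concretely, here's a toy sketch of that interleaving. The Account class below is invented for illustration (integer balances, no real banking API); the point is only that the has() check and the debit() happen at different times, and the world can change in between.

// A toy account, just enough to make the interleaving concrete.
class Account {
  private int balance;
  Account(int balance) { this.balance = balance; }
  boolean has(int amount) { return balance >= amount; }
  void debit(int amount) { balance -= amount; }
  int balance() { return balance; }
}

public class RaceDemo {
  public static void main(String[] args) {
    Account alice = new Account(10);

    // Loop 1 of pay(): the check passes; Alice can cover her share of 10.
    boolean sufficient = alice.has(10);   // true

    // ...but before loop 2 runs, some other client debits the account.
    alice.debit(10);

    // Loop 2 goes ahead anyway, because we already decided the check passed.
    if (sufficient) {
      alice.debit(10);
    }

    System.out.println(alice.balance());  // -10: Alice is overdrawn
  }
}

In the real app the two steps run in separate loops and the competing debit comes from another process entirely, but the shape of the bug is the same: check, then act, with nothing holding the checked condition steady in between.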

There are a variety of different ways to solve this problem. One option is to use a very simple, general-purpose strategy called two-phase commit (often abbreviated 2PC).

2PC is useful when you’re updating multiple things in tandem, and you need an all-or-nothing guarantee: either all the updates go through, or none of them do. In this example, we’re updating multiple bank account balances in tandem, and we need the all-or-nothing guarantee that either the bill gets paid (everyone pays their share) or the transaction gets rejected, with no payments going through.

The idea behind 2PC is, unsurprisingly, to fit the work into two phases:

During the prepare phase, you contact each resource you're going to update and tell it about the update you'd like to perform. The resource validates the operation and, if it's valid, locks itself so that the operation cannot become invalid in the future. In our example, this means we contact each user's bank and tell it how much money we'd like to debit; the bank either sets aside that much money so nobody else can withdraw it, or it returns an insufficient-funds error.

Once this is done, we start the second phase: the commit phase. This phase can go one of two ways. If something went wrong during the prepare phase, we reach out to each resource and say “never mind;” the resource then releases its lock without doing anything. If everything looks good, we instead contact each resource and say “do it;” the update goes through, and then the lock is released. Either way, every lock taken during the prepare phase gets released.
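
To make the two phases concrete, here's a minimal sketch of a 2PC coordinator. The Participant interface and the class names are invented for illustration, not taken from any particular library; a production coordinator would also need to record its decision durably before the commit phase so a crash doesn't leave participants guessing.

import java.util.ArrayList;
import java.util.List;

// What each resource (e.g. a user's bank) would expose to the coordinator.
interface Participant {
  // Prepare: validate the update and lock whatever is needed (e.g. set
  // aside the funds). Returns false if the update can't be done, such as
  // an insufficient-funds error.
  boolean prepare();

  // Commit, happy path: apply the update and release the lock.
  void commit();

  // Commit, "never mind" path: release the lock without applying anything.
  void abort();
}

class TwoPhaseCommitCoordinator {
  // Returns true if every participant committed, false if the whole
  // transaction was rolled back.
  boolean run(List<Participant> participants) {
    List<Participant> prepared = new ArrayList<>();

    // Phase 1: ask everyone to prepare.
    for (Participant p : participants) {
      if (p.prepare()) {
        prepared.add(p);
      } else {
        // Somebody said no: tell everyone who already locked to let go.
        for (Participant ok : prepared) {
          ok.abort();
        }
        return false;
      }
    }

    // Phase 2: everybody said yes, so tell everyone to go through with it.
    for (Participant p : participants) {
      p.commit();
    }
    return true;
  }
}

In the check-splitting app, each user's bank would be one participant and the restaurant's bank another: pay() becomes “prepare everyone, then commit everyone.”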

Things go wrong-er

Properties of a (Useful) Consensus Algorithm

TODO

  • Coherence: eventually, everybody agrees on the same value
  • Conflict-resolution: if multiple conflicting values are proposed, one is chosen arbitrarily
  • No-decoherence: the moment one of the conflicting values is chosen, it is never possible to go back
  • Fault-tolerance: the algorithm must have no window of vulnerability

“Obvious” requirements like termination. Termination seems to imply we need determinism.
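
To pin down what those properties are promising, here's a hypothetical interface for a consensus service; the names and signatures are mine, not from any real library, and the interesting part is the comments.

// A hypothetical consensus API, just to attach the properties above to
// concrete guarantees. Not taken from any real library.
interface Consensus<V> {
  // Any node may call this with the value it wants. If several nodes
  // propose conflicting values, conflict-resolution says exactly one of
  // them gets chosen (it doesn't matter which).
  void propose(V value);

  // Coherence: every node that asks eventually gets an answer, and they
  // all get the same one. No-decoherence: once any node has seen a
  // decision, no later call anywhere may return a different value.
  V awaitDecision() throws InterruptedException;

  // Fault-tolerance doesn't show up in the signatures at all: it's the
  // demand that the guarantees above keep holding even when some nodes
  // crash at the worst possible moment.
}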

FLP: You can’t have it all!

Historical Impact

Why It’s Useful Today