This post is part of a series. Click here to start from the beginning.

Some of the background story

• algorithm is presented as a story about an ancient parliament
• some initial reviewers (ACM?) thought it was a joke

Story

• pretend archaeology
• “fragmentary history:” we will omit parts of the algorithm for generality
• island of Paxos
• part-time parliament
• legislators like processes, leaving the chamber like failing
• but while they’re in the chambers they behave correctly and promptly: non-byzantine failures
• “mutual trust” – legislators pass any proposed decree
• “ledgers with indelible ink” provide durable storage (HDDs)
• also state on “slips of paper” providing volatile storage (RAM)
• messengers carry messages, forgetful: zero- or multiple-delivery, arbitrary delay
• sequence of decrees as the consensus variable
• legislators and messengers are “scrupulously honest” so they exhibit non-Byzantine failures
• no idea how they handled cheating, so we don’t handle Byzantine failure modes
• multiple times we say ‘manuscript is lost’ meaning you provide this when implementing paxos

Synod algorithm built in steps

• a preliminary protocol from consensus constraints
• a basic protocol guaranteeing consistency but not progress
• a synod protocol

Mathematical results (2.1)

• build up some tools we will need to discuss algorithms later
• notation
• ballots: a ballot number, decree, quorum of priests, and a set of priests who voted for it
• a ballot is successful if every priest in its quorum voted for it
• ballot numbers give a total order, but not necessarily the chronological order in which voting was carried out
• each ballot number is unique
• any two quorums have at least one priest in common
• the negotiators’ value-selection rule
• a vote has a decree, a ballot number, and the priest who voted
• partial order of votes given by the ballot number of each vote
• total order over votes needed, but how is not prescribed (‘manuscript lost’)
• ‘max’ vote according to total order is the value propagated forward later
• define three properties as follows:
• P1: each ballot has a unique ballot number
• P2: the quorums of any two ballots overlap by at least one priest
• P3: the decree of a ballot is the decree of the ‘max’ vote in its quorum
• Lemma 1: if P1-P3 hold, any ballot later than a successful ballot has the same decree
• Theorem 1: if two ballots are successful, both have the same decree; follows directly from the lemma
• Theorem 2: if you have a set of ballots which satisfy P1-P3, can always add new ballots which do too
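The value-selection machinery above can be sketched in a few lines. This is a minimal illustration, assuming integer ballot numbers as the total order and string decrees; the `Vote` type and function names are mine, not the paper's.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Vote:
    ballot: int   # ballot number (gives the total order over votes)
    decree: str   # the decree voted for
    priest: str   # the priest who cast the vote

def max_vote(votes: list[Vote]) -> Optional[Vote]:
    """The 'max' vote: highest ballot number, or None if no one has voted."""
    return max(votes, key=lambda v: v.ballot, default=None)

def choose_decree(quorum_votes: list[Vote], own_decree: str) -> str:
    """P3: a new ballot's decree must equal the decree of the max vote
    among its quorum; only if no quorum member has voted may we pick freely."""
    mv = max_vote(quorum_votes)
    return mv.decree if mv is not None else own_decree
```

Theorem 2 is essentially the claim that `choose_decree` always has a legal answer: given any set of ballots satisfying P1-P3, a new ballot built this way also satisfies them.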

The preliminary protocol (2.2)

• in summary, this is the protocol we discussed in post 1 (steps 1-4) + ‘learning’ (steps 5-6)
• designed solely to produce ballots satisfying P1-P3
• the set of ballots is distributed; it’s likely no one priest will see all ballots produced
• a priest initiates a ballot with a ballot number, decree and quorum
• the ballot initiation and voting algorithms are constructed to satisfy P1-P3
• P1: ballot numbers are (priest ID + sequence number) to ensure never a collision
• P2: quorums must be a majority set
• prefer priests with better attendance (uptime)
• P3: determine ‘max’ vote by simply asking each priest in the quorum via messages
• this leads us to the first two steps of the preliminary protocol
• S1: priest p chooses ballot number and sends NextBallot(b) to some priests
• S2: each priest q receiving NextBallot(b) returns LastVote(b, v) with the decree + ballot no. of its most recent vote
• LastVote contains that priest’s MaxVote value
• to keep LastVote equal to MaxVote, q promises not to vote in any ballot numbered between that vote and b
• sending LastVote implicitly makes that promise; if q can’t promise, it does not reply
• promise written to ledger (durable storage)
• next steps of the preliminary protocol
• S3: p sends BeginBallot with ballot number b and the decree of the max vote to the quorum of priests who sent LastVote
• S4: priests q vote for ballot unless promised not to do so, responding with Voted
• all steps in the protocol are optional: prevents progress but does not violate safety
• all steps are either idempotent, or have a mechanism to prevent replays from causing problems
• S5: if the vote is successful, p sends a Success(decree) message to every priest
• S6: upon receiving a Success message, q enters decree in durable storage
• S5-6 are the ‘learning’ phase I chose not to cover in post 1
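The priest's (acceptor's) side of S2 and S4 can be sketched as below. This is an illustration under my own assumptions: integer ballot numbers, in-memory state standing in for the ledger, and `None` standing in for "exercise the option to do nothing".

```python
class Priest:
    def __init__(self):
        self.promised = -1     # highest ballot promised via LastVote; ledger-backed in a real system
        self.last_vote = None  # (ballot, decree) of the most recent vote cast

    def on_next_ballot(self, b):
        # S2: reply LastVote(b, v); the reply is an implicit promise not to
        # vote in any ballot numbered between last_vote's ballot and b
        if b > self.promised:
            self.promised = b
            return ("LastVote", b, self.last_vote)
        return None            # every step is optional: silence is always safe

    def on_begin_ballot(self, b, decree):
        # S4: vote for the ballot unless an earlier promise forbids it
        if b >= self.promised:
            self.promised = b
            self.last_vote = (b, decree)
            return ("Voted", b, decree)
        return None
```

Note how refusing to answer (returning `None`) only stalls progress, matching the "all steps are optional" point above.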

The basic protocol (2.3)

• the preliminary protocol, but with some restrictions to reduce storage requirements
• we only really need to store three variables
• lastTried: number of the last ballot p tried to initiate
• prevVote: the vote cast by q in the highest-numbered ballot q voted in
• nextBal: the ballot number which q promised via a LastVote() message
• in preliminary protocol, p could conduct any number of ballots concurrently
• in basic protocol, p must conduct one ballot at a time
• p ignores messages for past ballots after starting a new ballot
• ballot may abruptly end if p goes offline (‘loses slip of paper’)
• page 12 of the PTP paper restates the six steps in terms of these variables
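A sketch of the basic protocol's durable state, using the paper's variable names. The ballot-number encoding (stride by priest count) and the "one ballot at a time" handling are my own illustrative choices; a real implementation writes these variables to the ledger before sending any message.

```python
class BasicPriest:
    def __init__(self, priest_id, num_priests):
        self.priest_id = priest_id
        self.num_priests = num_priests
        self.lastTried = priest_id   # last ballot this priest tried to initiate
        self.prevVote = None         # vote cast in the highest-numbered ballot voted in
        self.nextBal = -1            # largest ballot number promised via LastVote
        self.responses = []

    def start_ballot(self):
        # unique ballot numbers: each priest strides by the priest count,
        # so no two priests ever pick the same number (P1)
        self.lastTried += self.num_priests
        self.responses = []          # abandon any state from the previous ballot
        return ("NextBallot", self.lastTried)

    def on_last_vote(self, b, vote):
        # one ballot at a time: replies to abandoned ballots are ignored
        if b != self.lastTried:
            return
        self.responses.append(vote)
```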

The Synod protocol (2.4)

• now focus on ensuring progress
• description of the livelock problem of the basic protocol
• to address, first define timeouts during which some large percentile of operations complete
• choose a president to initiate ballots
• okay to temporarily have multiple presidents: impedes progress, not correctness
• requirement: if no priest enters or leaves the chamber for T minutes, exactly one priest will consider itself president
• how to satisfy this requirement?
• define some total order over the priests (maybe sort by name)
• a priest is a president after receiving no messages from any higher-named priest for T minutes
• a priest stops being a president immediately after receiving such a message
• ‘missing parts of manuscript’ for optimal algorithm for sorting priests
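The timeout-based presidency rule can be sketched as below. This is a simplification under my own assumptions: priests are totally ordered by name, T is a constant the implementer picks, and the caller supplies timestamps.

```python
T = 30.0  # minutes; an implementation choice, not prescribed by the paper

class PresidencyTracker:
    def __init__(self, my_name):
        self.my_name = my_name
        self.last_higher_msg = float("-inf")  # never heard from anyone higher

    def on_message(self, sender, now):
        # hearing from a higher-named priest immediately ends our presidency
        if sender > self.my_name:
            self.last_higher_msg = now

    def is_president(self, now):
        # president iff no message from a higher-named priest for T minutes
        return now - self.last_higher_msg >= T
```

Multiple priests can transiently satisfy `is_president` while messages are in flight, which matches the note above: duplicate presidents impede progress but never correctness.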

Multi-decree (3)

• Now the focus is on passing more than one decree as the algorithm progresses
• Start by electing one president
• Then all proposals proceed by running Synod in parallel
• One NextBallot message with one ballot number for each active Synod instance
• One LastVote message with all the max votes for all Synod instances
• One value-selection rule / BeginBallot for each Synod instance
• Etc
• When a new decree is requested, the president starts a new Synod instance
• Once the president learns the result, it notifies the client success
• No need to continue rerunning Synod after the value is in the president’s ledger
• If a president takes over to find a gap in the ledger, they pass a ‘null’ decree and move on
• This way an old node (Bob) can’t complete a very old timed-out request?
• This also gives a primitive way to NACK client requests?
• It also enforces ordering: if A proposed then B, A has a lower decree no. than B
• If the presidency is stable, you can reduce to BeginBallot-Voted-Success to save on latency
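The gap-filling step can be sketched as follows, assuming the ledger is a dict from decree number to decree (my representation, not the paper's). The paper's no-op is the "ides of February is national olive day" decree; passing it through a full Synod instance is what blocks stale timed-out proposals from landing later.

```python
NULL_DECREE = "national olive day"  # the paper's no-op decree

def gaps_to_fill(ledger):
    """Decree numbers missing below the highest decree already in the ledger.

    A new president runs one Synod instance per gap, proposing NULL_DECREE,
    before it starts numbering new client requests past the top.
    """
    if not ledger:
        return []
    return [n for n in range(1, max(ledger)) if n not in ledger]
```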

Miscellaneous concerns and optimizations (3.3)

• 3.3.1: How to elect a president?
• ID-based ordering can cause disruptions (if node AAAA comes back online)
• Additionally can cause delays while AAAA gets 6 months of ledger data
• No provably optimal solution, but there are many options
• Option: once a president is singularly elected, no new president until that one times out
• Option: allow disruption, but sort using an uptime-based metric
• Option: allow disruption, but sort based on available bandwidth
• 3.3.2: How to bound the size of the ledger to keep things performant?
• Our ‘decrees’ log is really an append-only file system
• Periodically collapse that down into the end state of the system
• State is (last collapsed state) + (replay of newer log messages)
• Can extend this to always updating collapsed state once there are no gaps in the log
• 3.3.3: Use Paxos to assign roles instead of data
• Put a particular node in charge of some piece of data (e.g. ‘cheese fit for sale’)
• The hard part of this is handing over control atomically
• Solution: use leases, assign a role for a fixed amount of time everyone agrees upon
• Renew the lease early to prevent unnecessary turnover
• Can also force quit by passing a decree which ends the lease immediately
• Deal with desynced clocks by adding some delay before actually starting delegation
• 3.3.4: Notifying clients the outcome of state replication
• Can’t examine just any ledger, because any individual ledger could be missing newer decrees
• Naive solution: pass a dummy decree and wait until it passes, then read up until that decree
• Problem with this is it doesn’t scale
• Better solution: clients keep track of the decree number they’re up to date with
• Say clients A and B are working together, A is up to date and B is out of date
• A notifies B it is out of date, and B contacts a legislator and catches up to the point of A
• Then if A and B agree they can work together
• Facebook likes: say the user has seen data up to ledger pos 250, and they select refresh
• When you refresh, the new data has to be from ledger pos 250 or newer, but never before
• Trick is you can track the ledger position client-side
• Alternately, you can set up designated learners for different kinds of data
• Like designated learners for particular users’ facebook feeds
• Whenever a client needs data for a user’s facebook feed, it contacts the designated learner
• Designated learners assigned through leases, like in 3.3.3
• 3.3.5: Retracting errors, e.g. due to bugs in the implementation (or disk corruption)
• A simple solution is to re-pass the same decree redundantly every once in a while
• 3.3.6: Adding and removing nodes from the network
• This isn’t fully solved by the paper
• There are some basic ideas but nothing concrete
• Particularly, if you ever pass a decree which unenlists everybody… gg!
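The 3.3.2 compaction idea can be sketched as below. This assumes a toy state machine where each decree sets a key to a value; `apply` is a stand-in for whatever the real system's state transition is.

```python
def apply(state: dict, decree: tuple) -> dict:
    """Toy state machine: a decree is a (key, value) assignment."""
    key, value = decree
    new_state = dict(state)
    new_state[key] = value
    return new_state

def compact(snapshot: dict, log: list) -> tuple[dict, list]:
    """Fold the decree log into the snapshot and truncate the log."""
    for decree in log:
        snapshot = apply(snapshot, decree)
    return snapshot, []

def current_state(snapshot: dict, log: list) -> dict:
    """State = (last collapsed state) + (replay of newer log entries)."""
    state = snapshot
    for decree in log:
        state = apply(state, decree)
    return state
```

The "no gaps" extension above amounts to calling `compact` eagerly on the contiguous prefix of the log instead of waiting for a periodic trigger.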

Relating all this to real systems (4)

• Can have a ‘fast’ read if stale cache is fine, ‘slow’ read if you need the real data now
• Lamport suggests his earlier papers for more expensive real-time algorithms (1978, 1984)
• Paxos does not have fixed time bounds on decrees passing, those other papers do

Standard terminology

• negotiators are really called proposers
• storage nodes are really called acceptors
• majorities are called quorums