Paxos Made Complete

Back in 2007 I collected some notes I had written on Paxos over the years into a survey paper and published it to Wikipedia under the Paxos Protocol article.  In the years since, there has been an explosion of Paxos-related papers, blogs, and projects.  I am honored to find quotes, graphs and commentary on my article as I read through the new papers and projects in fault-tolerant distributed computing.

Unfortunately, only the most basic discussion of the protocol and system-related design decisions seem to be covered. As an expert in the protocol and its implementation, I find the lack of community discussion regarding advanced, systemic uses of Paxos concerning.

I hope this blog series will help correct that. I will build a fault-tolerant distributed execution engine for the state machine approach, covering all aspects of the craft. Each article will add a small step, describing the tradeoffs and selecting one of the many paths available on the way to a complete solution.

Paxos Made Trivial

In order to set the stage for this series, we will need a Classic Paxos implementation with nothing but the bare essentials. I’ll be writing in Haskell, but practitioners of the art should recognize the patterns I’m using and find easy translation to other languages. I’m not using any Haskell magic — just basic data structures and simple functions such as less-than, max, and increment.

If you’re new to Paxos, I recommend my Wikipedia article, or any of the other papers, blogs, etc. which cover the core protocol (Google it!).

Let’s get started!

There are only a few data types used in Paxos:

  • We need a way to tell two nodes apart. Let’s call this a NodeID. It’s generally represented as an Integer. NodeIDs are unique and totally ordered.
  • We need a way to specify what we want to agree on. Let’s call this a Value. We’ll ignore the content of the Value for now (it’s irrelevant), but it is often a Byte Array.  Additionally, Values need to be compared to each other for equality, and there’s a couple of special Values we need; NoValue and NoOp.
  • Finally, we need a way to tell which Value is currently winning. This is called a Round in the literature, and may be most easily expressed as a pair of an Integer and a NodeID. Rounds are compared lexicographically; the Integer first, then the NodeID.
  • Every Value is bound to a Round, and they must never be separated or changed!  We’ll call this a Vote.

In Haskell:

data NodeID = NodeA | NodeB | NodeC deriving (Eq, Ord)
data Value  = NoValue | NoOp | ValueA | ValueB | ValueC
data Round  = Round Int NodeID deriving (Eq, Ord)
data Vote   = Vote Round Value

This says “A NodeID can be the value ‘NodeA’ or the value ‘NodeB’ or …”.  Likewise for the Value type. For the Round, it says “A Round is a data structure constructed using the reserved name ‘Round’ followed by an Integer and a Node ID”. The ‘deriving’ clause tells the compiler to write a couple of useful functions for us, in this case equivalence and ordering.

I’m using enumerations for the NodeIDs and Values to reduce the state space in this example code. This will allow us to test deeper into the protocol when we build the test framework. It should be clear that these may contain arbitrarily complex data as needed for the system you are building. Typical choices might be IP addresses for NodeIDs, and byte arrays for the Values.

That’s it! Four very simple data types. These might be classes or structs in other languages, but they would look about the same.

Roles : Leader, Acceptor, Learner

There are three distinct Roles in Paxos. While these may be combined in various ways, it will help the discussion to keep them separate for now.

Here they are:

data Node = Leader
            { preparedRound :: Round
            , promiseQuorum :: Set NodeID
            , highestVote   :: Vote }
          | Acceptor
            { promisedRound :: Round
            , vote          :: Vote }
          | Learner
            { learnQuorum   :: Set NodeID
            , learnedVote   :: Vote }

You should note the lack of complexity here — each Role holds just a few variables! More advanced versions of Paxos add just a fraction more complexity to this basic form.

They all keep one Vote, although its meaning differs slightly between the Roles. The Leader must also maintain its latest Prepared round and a quorum of Promises. The Acceptor just keeps its latest Promise. The Learner needs a quorum to track Votes.

I want to point out that this is essentially the minimal set of data required for any system which could solve distributed consensus. It could be combined or split along different lines, but every unit serves a necessary purpose in the protocol — there is no cruft.

Messages

We will be exchanging messages between the Roles:

data Message
 = Timeout
 | Request  { requestValue :: Value  }
 | Prepare  { prepareRound :: Round  }
 | Promise  { promiseRound :: Round
            , lastVote     :: Vote   }
 | Propose  { proposeVote  :: Vote   }
 | Accepted { acceptVote   :: Vote   }

I won’t go over each of these, as they are thoroughly covered in the literature. I’ve given (arguably) readable names to the fields so we can distinguish them in the code.

Protocol

Now we reach the meat of the protocol. In the following, we will define a function for each role. The input is the current state and an input message. The output is the new state and an output message.

-- Functions defining each Role 
leader   :: (Node, Message) -> (Node, Message)
acceptor :: (Node, Message) -> (Node, Message)
learner  :: (Node, Message) -> (Node, Message)

This pattern is quite common in Haskell so we use the State Monad to encapsulate the Role. The final function types are as follows:

-- Functions defining each Role 
leader   :: Message -> State Node Message
acceptor :: Message -> State Node Message
learner  :: Message -> State Node Message

NOTE: I’ve rearranged the code below so it follows the path of a message along a successful round of consensus.

We start with a Timeout at the Leader. The Leader sends a Prepare message to the Acceptors to ask for their Promise, and their current votes.

leader Timeout = do
 ld <- get            -- Load current Leader state
 return Prepare { prepareRound = preparedRound ld }

Acceptors receive the Prepare message and always send a reply, but only Promise to equal-or-higher round numbers. This reply serves as both the regular Promise message, and also a Negative-Ack to other Leaders of lower rounds.

acceptor (Prepare prepRound) = do
 acc <- get
 let newRound = max prepRound (promisedRound acc)
 modify (\a -> a { promisedRound = newRound })
 return Promise  { promiseRound = newRound
                 , lastVote = vote acc }

Upon receiving a Promise, the Leader will ignores Promises for older rounds. If the promise matches the prepare this leader sent, it adds the Acceptor to the Quorum. If the promise is for some higher round, the leader abandons the current round and records the higher round. Along the way, the vote from the highest round is selected as the value to propose next.

leader (Promise promise lastVote) = do
 ld <- get
 case (compare promise (preparedRound ld)) of
   LT -> return ()
   EQ -> modify (\a -> a
      { promiseQuorum = Set.insert from (promiseQuorum ld)
      , highestVote = chooseHighest lastVote (highestVote ld) })
   GT -> modify (\a -> a
      { promiseQuorum = Set.empty
      , preparedRound = promoteRound promise id })
 return NoMessage

At some point (possibly immediately after the previous message), the system tries to get a Value agreed. We simulate this by sending a Request message to the Leader. If the Leader has a quorum, then it selects a value to propose. If a value already exists, as discovered in the Promise messages, we must choose it. If not, the Leader can propose any value (ie: the one being offered in the Request).

leader (Request value) = do
 ld <- get
 -- Are we leading?
 if (Set.size (promiseQuorum ld)) < quorumNeeded
 then return NoMessage
 else do
    let proposeVote = chooseProposal (preparedRound ld) (highestVote ld) value
    modify (\a -> a { highestVote = proposeVote })
    return Propose  { proposeVote = proposeVote }

The Proposal reaches an Acceptor, who ignores proposals for older rounds. For current or newer rounds, the Acceptor votes for the proposed value and sends an Accepted message to the Learners.

acceptor (Propose proposedVote) = do
 acc <- get
 let Vote proposedRound proposedValue = proposedVote
 if proposedRound < (promisedRound acc)
 then return NoMessage
 else do
    let newVote = chooseHighest (vote acc) proposedVote
    modify (\a -> a { promisedRound = proposedRound
                    , vote = newVote })
    return Accepted { acceptVote = newVote
                    , acceptFrom = acceptorID acc }

The Learner receives the Vote from the Acceptor. If it matches the current vote, increment the learn Quorum. If it is newer, wipe the current vote and start a new Quorum.

learner (Accepted acceptedVote) = do
 ln <- get
 if acceptedVote == learnedVote ln
 then modify (\a -> a { learnQuorum = Set.insert from (learnQuorum ln) })
 else do
    let Vote acceptedRound _ = acceptedVote
    let Vote learnedRound  _ = learnedVote ln
    case (compare acceptedRound learnedRound) of
      LT -> return ()
      _ -> modify (\a -> a { learnQuorum = Set.singleton from
                           , learnedVote = acceptedVote })
    return NoMessage

There you have it; about 100 lines to implement the complete protocol!

Design Decisions

  • Many papers and blogs discuss the need for a new message in the protocol, the Nak message, to be returned when a leader’s prepare round is too low. This situation is never necessary. The protocol already has a message perfectly suited for this (Promise).
  • The paper “Paxos Made Live” describes a situation in which a replica behind a partition continuously increments the round number trying to gain leadership, only to cause immediate disruption of a stable group when the partition is repaired. Here we see this situation never needs to occur. The Leader-to-be simply does not increment its round number on timeout. It only increments the round number when receiving a Promise to a higher round from an Acceptor. This eliminates the need for the added complexity of “bumping” the round number at regular intervals to avoid the disruption, as well as the necessary tuning of this new parameter.
  • When a Leader receives a Request, but is forced to choose a different value, what happens to the Request? The answer to this question significantly affects how the client must behave, and what expectations it has of the distributed system. In the code above, the request is silently discarded. Most implementors would choose to return a “retry-later” message, or queue the request internally.
  • How do we know when a decision has been made? There are many different ways to count a Quorum, such as majority-rules and weighted votes. Most Paxos implementations require a majority of votes from Acceptors to achieve a Quorum.

Conclusion

I hope this post has set the stage: Paxos is a small, simple protocol for solving a difficult distributed computing problem. In roughly 100 lines of Haskell, we’ve defined all the core types, data structures for each Role, all the messages used, and implemented the basic protocol.

Next we will fill out some infrastructure for testing, then look at adding some features which bring the implementation closer to real-world needs.

 

Leave a comment