"The Byzantine Generals Problem"

Post by **Martin Hash** » Tue Jan 19, 2016 12:41 pm

THE BYZANTINE GENERALS PROBLEM
by Martin Hash, Computer Science Department, University of Idaho

ABSTRACT

Reliable computer systems must handle malfunctioning components, unreliable transmission mediums, and unsynchronized clocks that give conflicting information to different parts of the system. This area of study is known as “Byzantine” agreement, and has well over one hundred publications devoted to its solution. This paper provides a brief description of the Byzantine agreement problem plus a short survey of the related fields of study and proposed solutions.

INTRODUCTION

The most formidable problem in distributed computing is how to get all of the processors to agree on a result. This problem includes: maintaining replicated data, monitoring computation, detecting a failed processor, and synchronizing clocks. The problem is best stated as an abstraction using an adversary argument, where the adversary (one of the processors in the system) has the ability to destroy or modify messages. This assumes that the value in question is not globally visible to every processor such that they only need to examine it without agreement. Instead the value is said to be an “oral message” (OM), meaning it could have been privately changed. Interestingly enough, the problem was first posed in an adversarial argument by Lamport [PSL80] as loyal and traitorous generals in the Byzantine empire, hence it has become known as the Byzantine General (BG) problem. The adversary argument does not identify likely executions, but instead identifies faulty executions that may be entirely unlikely but possible. Without repeating the long Byzantine analogy, the BG problem can be succinctly stated as an agreement algorithm in which a faulty processor can send arbitrary messages to the network.

It is important to identify exactly what kind of real-world distributed processing problems BG agreement is intended to solve:

The sending operating system might be low on memory and could drop a packet.
The network interface can drop the transmission.
The router or gateways may drop packets when overloaded.
A message may get inadvertently duplicated.
The receiver may be low on buffering space.
A message can get fragmented at a timeout.
There may be actual data corruption.
There may be an overlong delay due to swapping applications.

LAMPORT’S OM SOLUTION

The difficulty for BG agreement is simple to describe. If three processors are hooked together such that they can directly communicate with each other (the communications are private), and a single processor was to ask the other two for a value, and the answers were different, which one is right? It takes at least one more processor to break the tie, and since any number of processors could be regrouped into this scenario, the number of processors required for BG agreement is always a multiple of 3 times the number of faults plus 1.

There are some BG agreement assumptions:
1) Nonfaulty processors always respond to a message within a given time limit, meaning the distributed system is synchronous, so every processor knows whether it should be broadcasting, listening, or doing nothing.
2) The receiving processor can accurately identify the sender, meaning every processor knows which other processors are on the message delivery list.
3) If the sender is loyal, the loyal receivers will agree.
4) Though there is a specified commanding processor in the BG problem, any processor can serve in this capacity.

It is interesting to note that system reliability for the Lamport OM BG algorithm actually decreases when the number of processors is not exactly one greater than a multiple of three faults: i.e. 5 or 6 processors is less reliable than 4 processors. Later, we will discuss other, more practical algorithms for which this anomaly is not true. Also, unfortunately, the BG algorithm requires a large number of messages to be passed, depending on the number of faults, on the order of Mk, where M is the number of processors and k is the number of faults. The great number of messages, combined with the very low probability of a traitorous type failure, makes the BG algorithm not worth implementing, in an engineering sense. Also, through tortured machinations and logic, it can be proven that the BG problem is impossible to solve when the system is asynchronous [FLM86].

It is fair to assume that most of the time all processors are correct (i.e. not faulty). An easy way to reduce the number of messages sent in a BG agreement algorithm would be to not run the algorithm at all if all of the processors are correct. Instead, a full fault detection algorithm is run first, one that is much more efficient than BG agreement, then if there are faults detected, run a BG algorithm.

SIGNED MESSAGES

Lamport proposed a possible real solution to the BG problem in another paper in which he discussed OM [LSP82]. The BG problem is made difficult by the traitor’s ability to lie. Restricting that ability by allowing the processors to communicate via unforgeable signed messages (SM), simplifies the solution tremendously. In fact, with unforgeable messages, the BG problem is solvable with any number of faulty and non-faulty processors.

The assumptions for SM agreement include the OM assumptions plus one more:
5) A non-faulty processor’s signature cannot be forged, and any alteration of the contents can be detected.
6) Any processor can verify the authenticity of a signature.

Signature protocol is common among the security industry. When a processor sends a signed message, each intermediary processor appends their signature. Essentially, every signed message received by a processor is signed again and repeatedly sent to all other connected processors. Interestingly, there are no assumptions made concerning a faulty processor’s signature. In fact, a faulty processor’s signature may be forged by another faulty processor – in essence, collusion is allowed among faulty processors.

BYZANTINE CLOCKS

Byzantine agreement reaches to include clock synchronization [SCH87]. Byzantine processors may exhibit arbitrary two-faced behavior, informing each of their neighbors of a different clock value at the same pulse, (though such behavior is impossible when each processor writes its clock value in a communication register that can be read by every other processor). We assume that every two processors have an independent means of communication, either by using communications registers between every processor pair, or by message passing. Randomization is used to ensure that the set of clock values of the non-faulty processors will eventually include only a single clock value. One technique is for every suspect processor with a “0” clock value to randomly decide whether to increment its clock value or not. Curiously, this will eventually cause all processors to stabilize.

FAULT PARTITIONING

Applications in the real world are restricted in some way: number of processors, processor speed, cost, reliability, inter-processor communication, etc. Since the BG algorithm is expensive to implement due to its factorial amount of message passing, most work has dealt with restricted scenarios. Arbitrary failure modes are not considered: faults are limited to omissions or timeouts. It is assumed that the probability of other failure modes is negligible relative to the system reliability specification. Can a designer build a system that can take advantage of constrained fault behavior? Thambidurai and Park [THPA80] partition faults into two subsets: non-malicious, and malicious. A non-malicious fault, by definition, can be detected by every non-faulty receiver. Examples of non-malicious faults are: timeouts, omissions, and crashes. A malicious fault is the case when not every non-faulty processor can detect that a fault has occurred. It is these faults that require the use of BG agreement algorithms with rounds of message exchange. Malicious faults further subdivide into “symmetric” and “asymmetric” faults. Asymmetric faults are the case where a message is not received identically by all of the non-faulty processors. If a processor sends a message to other processors via a different communication channel, the received message could differ either due to the processor or due to a fault in the medium, therefore, the distinguishing feature of an asymmetric fault is that not all processors receive the same message. A symmetric fault is the case where all processors receive exactly the same wrong message. Symmetric faults may occur due to a faulty processor or communication channel.

Fault partitioning is useful because the probabilities of the different types of faults can be dramatically different. Non-malicious faults are far more likely to occur than malicious ones, and their agreement algorithms can be simple and efficient. When a processor is faulty, the probability that it will generate valid messages is very low, and the probability that it will send valid but different messages is infinitesimally low.

asymmetric << symmetric << benign

Neither the OM or SM BG algorithms take advantage of the case where some faults may be arbitrary while others are restricted. Essentially, restricted faults do not require the number of processors to be greater than three times the number of faults, while still allowing for the possibility of arbitrary faults. It is possible to guarantee BG agreement with less processors if the faults can be partitioned into the three groups: asymmetric, symmetric, and benign.

Given that any processor has a failure rate, additional processors can be utilized to cover non-asymmetric faults. In fact, the fault partitioning allows a designer to increase the number of processors for reliability without increasing the number of rounds. Essentially, fault partitioning does not treat all faults as arbitrary, but instead takes advantage of the different types of faults, as shown in the chart below:

rounds = 1
asymmetric = 0 asymmetric = 1
symmetric 0 1 2 3 0 1 2 3
benign = 0 4 6 8 4 6 8 10
benign = 1 3 5 7 9 5 7 9 11
benign = 2 4 6 8 10 6 8 10 12
benign = 3 5 7 9 11 7 9 11 13
benign = 4 6 8 10 12 8 10 12 14
benign = 5 7 9 11 13 9 11 13 15
benign = 6 8 10 12 14 10 12 14 16

From the same paper, in the actual figures below it is interesting to note that an OM BG model with 10 processors is less reliable than the fault partitioning model using processors.

Model Processors Failure Faults
BG 4 6.0 X 10-8 1 arbitrary
BG 5 1.0 X 10-7 1 arbitrary
BG 6 1.5 X 10-7 1 arbitrary
Unified 4 6.0 X 10-8 1 asymmetric, 0 benign, 0 symmetric
Unified 5 1.0 X 10-11 1 asymmetric, 1 benign, 0 symmetric
Unified 6 2.0 X 10-11 1 asymmetric, 0 benign, 1 symmetric
Unified 7 1.1 X 10-15 1 asymmetric, 2 benign, 0 symmetric

SURVEY OF DETERMINISTIC AGREEMENTS

Exact verses Approximate [DOL82]:
In an exact agreement, all non-faulty processors must agree. Approximate agreement allows non-faulty processors a range of agreed upon values.

Authenticated verses Non-authenticated [LSP82]:
Authenticated algorithms include signatures. Forged messages cannot be detected with non-authenticated agreements.

Immediate verses Eventual [KRFE99]:
Immediate algorithms require all non-faulty processors to stop on the same round. Relaxing this requirement so that processors can stop on arbitrary rounds is Eventual agreement.

Fault Partitioning [THPA80]:
Fault classification improves the reliability of a system.

Network Connectivity [DAW78]:
A network that is not fully connected complicates the agreement. As connectivity weakens, more processors and more rounds are required.

Message Complexity [BMD93]:
The number of messages sent, combined with the size of the messages, uniquely identifies the agreement algorithm used. The number of messages can be either exponential or polynomial depending on the algorithm:

Lamport OM N > 3m +1 r = m + 1
Lamport SM N > m + 2 r > m + 1
Davis & Wakerly N > 2t + 1 s = t + 1 (no asymmetrics)
Meyer & Pradhan N > 3m + b r > m
Thambidurai & Park N > 2a + 2s + b + r r > a
Dolev N > t2 + 3t + 4 r = min(m+2, t+1)

Where:
N is the number of processors
r is the number of rounds
m is the number of malicious messages
s is symmetric faults
a is asymmetric faults
b is benign faults
t is the number of rebroadcasts

REFERENCES

[BMD93] M. Barborak, M. Malek, and A. Dahbura. The Consensus Problem in Fault Tolerant Computing. ACM Computing Surveys, 25(2):171-220, 1993.
[DOL82] D. Dolev. Byzantine Generals Strike Again. Journal of Algorithms, 3:14-30, 1982.
[FLM86] M. Fischer, N. Lynch, and M. Merritt, Easy Impossibility Proofs for Distributed Consensus Problems. Distributed Computing, 1:26-39, 1986.
[LSP82] L. Lamport, R. Shostak, and M. Pease. The Byzantine Generals Problem. ACM Transactions on Programming Languages, 4(3):382-401, 1982.
[PSL80] M. Pease, R. Shostak, and L. Lamport. Reaching Agreement in the Presence of Faults. Journal of the ACM, 27(2):228-234, 1980.l
[KRFE99] A. Krings, T. Feyer, The Byzantine Agreement Problem: Optimal Early Stopping. Proceedings of the 32nd Hawaii International Conference on System Sciences,1999.
[THPA80] P. Thambidurai, Y. Park, Interactive Consistency With Multiple Failure Modes. 7th Reliable Distributed Systems Symposium, Oct. 1988.
[SCH87] F. Schneider, Understanding Protocols for Byzantine Clock Synchronization. Dept. of Computer Science, Cornell University, Aug 1987.
[DAW78] D. Davies, J. Wakerly, Synchronization and Matching in Redundant Systems. IEEE Transactions and Computing, vol C-27, 1978

Byzantine.doc