How Amazon web services uses formal methods

  • Link2Paper
  • [Author: Chris Newcombe et al.]
  • [Keywords: Formal Verification]

Motivation

Conventional design documents consist of prose, static diagrams, and perhaps pseudo-code in an ad-hoc untestable language. Formal verification has two major benefits to writing a precise design:

  1. the author is forced to think more clearly, which helps eliminate ‘plausible hand-waving’;
  2. tools can be applied to check for errors in the design, even while it is being written.

The authors wanted a language that is simple to learn and apply, avoiding esoteric concepts - an off-the-shelf method with high return on investment.

TLA+ [1]

General Info

It is used to describe both the desired correctness properties of the system (the ‘what’), and the design of the system (the ‘how’). In TLA+, correctness properties and system designs are just steps on a ladder of abstraction, with correctness properties occupying higher levels, systems designs and algorithms in the middle, and executable code and hardware at the lower levels.

# Abstraction
+----------------------------+
|   Correctness Properties   |
+----------------------------+
|    System Design & Algo    |
+----------------------------+
| Executable Code & Hardware |
+----------------------------+

Practical PoV

Coreectness properties of TLA+ can be easily and quickly checked by tools such as TLC model checker [2].

The syntax and idioms of TLA+ are somewhat unfamiliar to programmers. Fortunately, TLA+ is accompanied by a second language called PlusCal which is closer to a C-style programming language, but much more expressive as it uses TLA+ for expressions and values. However, PlusCal can be automatically translated to TLA+ with a single key press.

Refer to [1] for tutorials, and [3] for an example of a TLA+ specification from industry that is similar in size and complexity to some of the larger specifications at Amazon.

In general, we could adopt the practice of first writing a conventional prose design document, then incrementally refining parts of it into PlusCal or TLA+.

Pros & Cons

Pros

A better way of designing systems.

When using formal specification, the authors begin by precisely stating “what needs to go right?” First they state what the system should do, by defining correctness properties. These come in two varieties:

  1. Safety properties: “what the system is allowed to do”

Example: at all times, all committed data is present and correct.

Or equivalently: at no time can the system have lost or corrupted any committed data.

  1. Liveness properties: “what the system must eventually do”

Example: whenever the system receives a request, it must eventually respond to that request.

After defining correctness properties, we then precisely describe an abstract version of the design along with an abstract version of its operating environment.

Improved understanding, productivity and innovation

Formal specification pays several dividends over the lifetime of the system: adding features, re-design, optimization, documentation…

Cons

We are concerned with two major classes of problems with large distributed systems:

  1. bugs and operator errors that cause a departure from the logical intent of the system, and
  2. surprising ‘sustained emergent performance degradation’ of complex systems that inevitably contain feedback loops.

Problems in the second category can cripple a system even though no logic bug is involved.

The authors also don’t yet know of a feasible way to model a real system that would enable tools to predict such emergent behavior. They use other techniques to mitigate those risks.

Bias

Formal verification methods are only suitable for tiny problems and give very low return on investment.

Overcoming the bias against formal methods required evidence that they work on real- world systems. This evidence was provided in a paper by Pamela Zave [4]

Related Works

  • TLA+ specification in the appendix of the Paxos consensus algorithm [5]
  • Conventional informal proofs can miss very subtle problems [6]
  • Using the TLC model checker to find ‘edge cases’ in the design on which to test the code [7]
  • Farsite project [8], a complex system but different
  • Abrial [9] lists applications to commercial safety-critical control systems, but which seem less complex

Ref