It is possible to achieve five-nines reliability with everyday commercial-quality hardware and software. The key is the way in which these components are combined.
Our society has come to expect uninterrupted service from many of the systems that it employs, such as phones, automated teller machines, and credit card verification networks. Many of these are implemented as embedded systems.
Embedded designers are being asked more often to create systems that run reliably to the degree that they're in service 99.999% of the time (termed "five-nines" availability), which is equivalent to less than one second of downtime per day. These are called high availability systems.
The design of high availability systems is based on a combination of redundant hardware components and software to manage fault detection and c orrection without human intervention. In this article, we'll quickly review some definitions tied to high availability and fault management, and then go on to discuss some hardware and software design patterns for fault tolerant systems.
Fault vs. failure
When we design a high availability system, we need to focus a major proportion of our design effort on failures and faults. For want of a better definition, we can define a failure as a situation in which the service provided by a system does not meet its specification. I'd prefer a definition that compares the service provided by a system with our expectations, rather than a spec. But often our expectations are subjective and not well documented. So we'll stick with a comparison of a system's services to its specification, even though that leaves us open to problems coming from specification errors or omissions.
A fault, on the other hand, is a failure of an interac ting system. It's what we think of as the possible cause of something undesirable happening. So a fault might be a failure of a subsystem of our system, a component failure, a failure of an external system, or a programming mistake. Faults can trigger more faults of various kinds. Or they can trigger failures. Or they can happen without triggering any failure at all.
For example, a fault might be the failure of a trucker to respect the load limit of the Golden Gate Bridge. If the truck exceeds the load limit by 20% or 30%, we don't expect the bridge to fail. But another fault might be that same truck crashing into the median divider of the bridge roadway. This wouldn't cause the bridge to physically fail either, but it might cause the bridge to fail in its specified purpose of bringing people and vehicles across the Golden Gate waterway for a number of hours.
Faults may be transient, permanent, or intermittent. When they are activated, they may cause errors in the state of a system or subsystem. Thes e errors may trigger failures in your system. There are four main approaches to dealing with faults:
- Fault forecasting
- Fault avoidance
- Fault removal
- Fault tolerance
Fault forecasting uses mathematical models as well as practical experiments to provide estimates about the presence of faults and their consequences. For example, one practical fault forecasting technique is to inject faults into a system and study any resulting failures.
Fault avoidance and removal are implemented by using rigorous system, hardware, and software development processes, perhaps including formal specification and verification techniques.
Fault tolerance is implemented by using redundant, perhaps diverse, implementations of a system to avoid the effects of faults. One way to implement fault tolerance is called "graceful degradation" (or "fail soft")-providing sensible partial functionality if full system performance cannot be provided. Another way is called "fail safe" (or "fail stop")-stopping the system in a safe state rather than continuing operation when a fault appears.
The main concept for fault tolerance is redundancy. It's based on the idea (or the hope) that multiple independent faults will not strike your system together. A fault tolerant system should be designed to avoid single points of failure. In other words, if a part of the system can suffer a fault, there should be a redundant part of the system that can cover for the fault and thus avoid a failure.
Redundancy comes in a variety of forms:
- Hardware redundancy (low-level, high-level, or both)
- Software redundancy
- Time redundancy
- Information redundancy
Examples of hardware redundancy include self-checking logic circuits, and multiple flight computers in a single airplane. Software redundancy might use two completely different algorithms to calculate the same result. Time redundancy may be done by communication re-transmissions. And information redundancy can be done using backups, checksums, and error correction codes.
Redundancy may be dynamic or static. Both use replicated system elements. In static redundancy, all replicas are active at the same time. If one replica "throws a fault," the other replicas can be used immediately to allow the system to continue correct operation. In dynamic redundancy, one replica is active and others are passive. If the active replica throws a fault, a previously passive replica is activated and takes over critical operations.
So what does all of this have to do with achieving high availability? First, a definition. High availability is the ability of a system to tolerate faults and continue to provide service according to its spec. Systems can use all of the concepts and approaches described here to achieve high availability.
Availability is often measured in units of "availability percentage" or "down time per year." Typical fault tolerant systems may be expected t o be available 99.99% of the time, or down less than an hour per year (10 seconds on a typical day). But high availability systems are expected to be available 99.999% of the time, or down less than five minutes per year (roughly, one second per day). This generally means that when a fault appears, it must be handled automatically-humans are too slow to remove or mask any problem within the time required.
Rather than building hardware as individual super-reliable modules constructed of super-reliable components, it is usually more cost-effective to use redundant replicated modules of everyday commercial quality hardware built with everyday commercial quality components.
Each of the replicas is often designed to exhibit "fail fast" or "fail stop" behavior. This vastly simplifies fault management decision making: Every failure will make the hardware stop dead in its tracks; rather than attempting to l imp along and challenge a fault manager to figure out which outputs of the module are now faulty and which are still good.
For fault tolerance using static redundancy, each replica module can be of everyday commercial reliability. Using two replicas is called pairing or duplexing. If N replicas are used, this is called N-plexing.
Figure 1: Triple redundant hardware
Figure 1 shows a 3-plexed or triple redundant hardware design. The three replicas are shown near the bottom of the diagram. They provide their outputs to a "voter," which determines the actual final output of the subsystem. When N >= 3 in an N-plexed design, a "voter" will usually use majority decision making. However, this needs to be a majority of the non-failed replicas, not just a simple majority of the total number of replicas (failed and non-failed).
But isn't a voter just a piece of hardware and software that could fail like any of the other modules in our system? In fact, it could; and it would bring disaster to our system if it did. But voters are normally very simple units that can be designed and tested to ensure their robustness. Alternatively, you can create designs involving replicated voters and second-level voters of voters, etc. But we won't discuss that here.
For fault tolerance using dynamic redundancy, the replicated modules can still be of everyday commercial reliability. One way to do this uses a redundant pair consisting of one active module and one standby module. Another way is to use a cluster of modules. The modules do not have to be precise duplicates of one another, and can have different characteristics, interfaces, and capacities. A cluster requires a failover strategy to decide how to manage multiple modules when the primary module throws a fault. Here are some choices:
- Standby backup. While the primary module runs in the system, a backup module is waiting in "st andby," watching the primary module for faults and ready to fire up and take over. For example, high-availability web servers can be designed using this approach.
- Rotating standby. While the primary module runs in the system, there may be a number of backup modules. One backup will take over running the system in case of a fault in the primary. The flight computers on the space shuttle have been designed with this philosophy: the primary module consists of a pair of computers that must always agree with one another. The first backup module is a similar pair. But the second backup module on the space shuttle is a single computer that can take over only via human command.
- Failover to non-critical module. The primary module runs the critical resources of the system. A backup module can run other non-critical things, but it can take over the most critical services of the primary in case of fault. This is what we do, as humans, when a PC's high-speed Internet connection fails as we're trying t o send an urgent e-mail, and we quickly switch over to that old modem we never thought we'd need again.
- Mutual takeover. Each module runs its own critical resources, but can take over the critical resources of another module in case of fault. For example, in a cardiac intensive care ward, there should be a heart monitoring computer for every eight patients. But each one can handle an additional eight patients (perhaps with some graceful degradation) if a neighboring heart monitoring computer goes down.
The correct implementation of failover is crucial. It would be a disaster if a primary module that was throwing faults were to continue running, while at the same time another module would try to take over its services. Their services could collide in unexpected ways. And it might be equally disastrous if a primary module were stopped after throwing a fault, and no other module came up to take over its services. So the validation and testing of failover is critical, although it's not something that many of us enjoy doing.
Most hardware faults are random and result from physical defects, either sustained during manufacturing or developed over time as components wear out or suffer shocks from the surrounding physical world. Software faults, on the other hand, are not physical; software does not wear out. Instead, software faults result from the invocation of software paths that contain defects in the software design or implementation that were always there. Since software is typically more complex than hardware, it can be expected to have many more built-in defects, resulting in more software faults than hardware faults. Software fault tolerance is also more expensive to design than hardware fault tolerance.
N-version programming is a veteran design pattern for software fault tolerance. When I first ran across it in the '70s, it was called dissimilar software. It's the software eq uivalent of hardware N-plexing (see Figure 1). But it's not as simple as the replication of hardware N-plexing, where the N copies of the same software would contain the same faults and produce the same errors N times. In N-version programming, if N units of some software functionality need to run in parallel, they need to be N disparate implementations of that functionality, independently implemented by N separate development teams. That's N-version programming.
In June 1996, the first flight of the Ariane-5 satellite-launcher rocket exploded in a ball of fire as it reached an altitude of 4,000 meters. The failure was caused by faults in the rocket's inertial reference system (a part of its digital flight controls) despite hardware redundancy, because software redundancy had not been handled properly. There were two inertial reference computers on the Ariane-5, one active and the other a "hot" standby. Both ran in parallel, with identical hardware and software. The software was nearly the same as that o n the Ariane-4, an older and successful launch vehicle. But some of the flight parameter values on the Ariane-5 were larger than on the Ariane-4, so data values overflowed. The error was handled by shutting down the computer. Since the redundant computer was running the same software, it was hit by data overflow as well, and it was quickly shut down too, and the whole inertial reference system went dead. As a result, the nozzles of the engines swiveled to extreme positions, causing the rocket to turn abruptly and break apart before self-destructing. The way the data overflow errors were handled would have been appropriate for randomly occurring hardware errors, but it was not appropriate for similar software errors occurring on two computers. And similar software errors on the two computers could have been avoided with N-version programming.
Back in the '70s we thought of N-version programming as the state-of-the-art in software fault tolerance. Sinc e then, a number of problems with this design pattern have emerged: software development costs skyrocket when you use it, since you need to pay the N independent groups to implement the N independent software designs. But if you try to skimp on some of the costs, you'll run into what's called the "Average IQ" problem: less expensive development teams have less-qualified software engineers who will produce lower-quality code. So you may end up with N diverse programs that are all riddled with faults, created in N different ways.
Another downfall of N-version programming is the issue of what to provide as input to the N independent development teams. In general, a single specification is photocopied and provided to all N development teams. But if this specification is flawed, you'll get N independently developed versions of similarly flawed software that'll all do the wrong thing. And if specification or usage errors are identified after the system is released, each new error must be fixed N times-once in each of the N diverse implementations, thus making maintenance costs exorbitant. Nowadays, the pricetag of N-version programming is usually thought to be better spent by asking one top-notch software development team to develop one high-quality software version using the best available infrastructure, software development tools, techniques, and testing.
In contrast to the static redundancy of N-version programming, a number of software fault tolerance design patterns are based on dynamic redundancy. All of them take a four-stage approach:
- Error detection
- Damage assessment and confinement (sometimes called "firewalling")
- Error recovery
- Fault treatment and continued service
In the second of these stages, a fail-stop approach is often taken when software errors are detected. But software is highly complex, so it is often unclear how t o undo the effects of faulty software behavior surrounding an error. One helpful tool in this regard is the concept of a transaction. A transaction is a collection of operations on the state of an application, such that the beginning and end of a transaction are points at which the application is in a consistent state. For example, every city's town hall has a file system containing information about all the residents of the town. When two people get married, their names and date of marriage are recorded in a file called "Married Couples." In addition, the status of the groom gets changed from single to married in the file called "Male Residents." And the status of the bride gets changed from single to married in the file called "Female Residents." If one of the three file updates fails or if the software crashes along the way, we need to go back to the beginning of the marriage "transaction." Otherwise we might end up with an inconsistent state like the groom being listed as married while his wife is still listed as single. Things are only consistent at the beginning of the marriage transaction, and at the successful end of the marriage transaction.
If we want to use the concept of transactions for fault tolerance, our system has to be able to save its state at the beginning of a transaction. This is called checkpointing. It involves taking a "snapshot" of the software's state just as it is about to begin the first step of the new transaction. The snapshot is only taken if the previous transaction was completed in an error-free state. The basic recovery strategy here is re-execution: when an error is detected during a transaction, the transaction is fail-stopped and the system is reloaded back to the last saved checkpoint. Then service is continued from this checkpoint, allowing new transactions to build on its consistent state. However, the transaction that was fail-stopped is lost. This kind of error recovery is referred to as backward error recovery, since the software state is rolled back to a past err or-free point.
Figure 2: Checkpoint-rollback scenario
Simple checkpointing has its own dangerous single point of failure. There might be a fault during the taking of the snapshot of the checkpoint itself. But there's a solution to this, sometimes called checkpoint-rollback. It's illustrated in Figure 2. In this figure, the ellipses represent a software client and a software server that communicate with one another by sending messages through queues. A single transaction may include many Request messages going from client to server, and many Response messages going from server to client. During a transaction, data are modified within the server. At the end of a transaction, a consistent set of data should be recorded on each of the two persistent mass-storage devices shown on the right. Together with the data, a transaction sequence number should be recorded. If an error is detected sometime later and the server is stopped, this server may be restarted (or a replica server may be started). As part of the startup recovery procedure, the transaction sequence numbers on the two mass-storage devices are checked. The server data will be recovered from the device containing the highest sequence number. (The other mass-storage device may contain a lower sequence number because the fault occurred during the checkpointing to that device.)
A limitation of this checkpointing design pattern is that recovery after a fault may take a long time. Starting or restarting a server might require a lot of processing, as might recovering the data back to the checkpoint. A "hot backup" server working directly with persistent mass storage device(s) of its own could speed up recovery. This design pattern is called process pairs; it is illustrated in Figure 3.
Figure 3: Process pairs
Here we see a primary server at the center of the diagram working much as it does in the previous checkpointing scenario. Clients work directly with this primary server. Whenever the primary server successfully completes an entire transaction, it passes information about its new consistent state to a backup server (on the right). Both the primary and backup servers record the data in their persistent mass-storage devices. In this way, the backup server is kept current about completed transactions. While the primary server is up and available to clients, it sends regular heartbeat messages to the backup server. These heartbeat messages may be combined with some of the checkpointing messages. If the backup server detects that the stream of heartbeat messages has stopped, it understands that the primary server is dead or unavailable, and it will very quickly take over as a new primary server.
This design pattern can work when c lient, primary server, and backup server are all on different processors or modules, as well as when they're all on a single processor.
With all its sophistication, the process pairs design pattern still leaves some issues open. For example, checkpointing messages may get lost or arrive at the backup server out of order. Or there may be problems if a primary server is the controller of a physical device and it fails in mid-operation. The backup server may not know how to complete the operation when it takes over, since it does not know how far the primary got before failing. (For example: the primary server has moved the graphite rods of a nuclear reactor 10cm of the required 30cm when it fails.)
Once again, in the process pairs design pattern the transaction that was underway when the primary server failed may well be lost or dropped in mid-execution. The backup server will take over in the state of the last previously completed transaction reported to it by the primary server before it failed.
Another design pattern for dynamic software redundancy is called recovery blocks. It's based on checkpointing, but it adds an acceptance test for the results of software processing. You need to prepare a number of alternative ways of processing your data-as in N-version programming-but you don't run all of your processing alternatives for each transaction. Instead, you select one way of processing your data as your primary alternative, and first try to do the transaction using the primary alternative's processing only. At the end of this primary processing, you run the acceptance test. If the acceptance test is passed, you're done. If the acceptance test is failed, you go on to try the next alternative processing, as shown in Figure 4.
Figure 4: Recovery blocks
As seen in this diagram, checkpoi nting must be done before first entering the recovery block. And the software must roll back to the checkpoint state after every failed acceptance test. The acceptance test is performed after every processing alternative that is tried. In this way, processing alternatives are run until a processing alternative succeeds in delivering results that pass the acceptance test. These "good" outputs comprise the final results of the recovery block.
Forward error recovery
Checkpoint-rollback, process pairs, and recovery blocks are all ways of doing backward error recovery. Most dynamic software fault tolerance is done using backward error recovery, but forward error recovery is another option. Its main idea is to continue from an erroneous state and make corrections that will clean up the damaged state. Forward error recovery depends on accurate damage assessment and is often system-specific. An example of forward error recovery is exception handling, which passes control to specialized exception handling software when a problem is detected, rather than rolling back and continuing from some previously checkpointed state.
One design pattern for forward error recovery is called alternative processing. This approach can be used when there are two (or more) processing alternatives for a transaction, with one alternative normally being very precise but computationally complex, and the other alternative being more simplistic and of higher performance. The second alternative can be used as both a test and a secondary results provider, if the primary processing is faulty. For example, a square-root algorithm can be the primary processing and a table-lookup-interpolation can be its alternative. Once the algorithm has run, table-lookup-interpolation can be used both to test the quality of the square-root algorithm results and to quickly fix some errors in those results.
Figure 5: Alternative processing for aircraft flight control
In Figure 5, we see two digital flight control systems being run at the same time to control a single airplane (Boeing 777). The decision logic on the right side of the diagram uses the outputs of the simpler flight control system as a yardstick against which to test the outputs of the complex primary flight control system. If the acceptance test fails, then the outputs of the simpler flight control system could also be used as an alternative to the failed primary outputs. (The arrows going to the left show that the results of the alternative processing may also be used to provide feedback to the primary processing.)
Design patterns such as these allow everyday commercial-quality hardware and software to be used as building blocks for true high-availability systems, systems that can, without human intervention, achieve "five-nines" or greater availability.
David Kalinsky is director of customer education at OSE Systems. He is a lecturer and seminar leader on technologies for embedded software. In recent years, David has built high-tech training programs on aspects of software engineering for the development of real-time and embedded systems. Before that, he was involved in the design of many embedded medical and aerospace systems. David holds a PhD in nuclear physics from Yale. He can be reached at firstname.lastname@example.org.
1. SIAM (Society for Industrial and Applied Mathematics) News, vol. 29, #8, October 1996.
2. Chandra, Ramesh and Lui Sha. "On Scheduling Tasks in Reliable Real-Time Control Systems," Proceedings of the 20th IEEE Real-Time Systems Symposium, 1998.