How to Design Safety into the Electro-Mechanical System Around Software

Posted February 01, 2001

When it comes to safety-critical applications, sometimes you have to protect users from the software. And sometimes you have to protect users from themselves.

Murphy's Law should have stated that "Anything that can go wrong, will go wrong while interrupts are disabled." To implement feature-rich software within commercially feasible deadlines, programmers have to accept that conditions will be encountered in the field that were not exercised during development and testing. You may be able to state how you expect your software to behave in a given case, but most programs are too complex for their behavior to be predicted with complete certainty. Even extensive testing is no guarantee that an application will work the same way next time. A timing condition or an uninitialized value may turn out okay on one run, but not on the next.

Technischer Überwachungs-Verein (TÜV) is the regulatory body that certifies medical and other electronic devices for sale in the European Union. It functions similarly to the FDA in the U.S. One of the criteria TÜV reviewers have used is to ask the following question: what is the worst thing the software might do? The assumption is that the combination of processor and software is so complex that we cannot predict what instructions might get executed, should either hardware or software go wrong.

Assessing risk

One approach to dealing with this lack of certainty is to imagine that a malicious programmer, intent on causing the maximum amount of damage, has reprogrammed your device. If the device is a Game Boy, that does not amount to much. However, many embedded programmers control devices on which lives directly depend, or devices that could put lives in danger. In other cases, no lives are at stake, but a malfunction may cost the user money through, say, factory downtime or loss of material. In those cases, it is justifiable to spend a fraction of the money at risk as insurance against failure.

Let's examine the risk again. Most programmers would consider it fanciful that software, in a failure mode, would just happen to follow the worst possible course of action. It is true that some sophisticated actions are unlikely to happen by chance. An automated guided vehicle, for example, is not likely to chase a victim around the factory. If you look more closely at some real examples, though, many cases exist in which the worst case can easily happen. In antilock brakes, for instance, the system simply has to do nothing (a common failure mode in many programs) to achieve the worst case. If the brakes depended entirely on software, a driver would be blissfully unaware that the protection had been removed, and when it came time to press the brakes, nothing would happen! Because of this hazard, if the microcontroller controlling your brakes fails, the mechanical coupling between the pedal and the braking mechanism still functions. It is still possible to stop, just not as smoothly as in the computer-assisted case.

In the past, I have worked on medical ventilators. These devices pump air into a patient's lungs according to parameters defined by the physician, and monitor the patient's response. An alarm goes off if the patient is not making the effort expected, or if rates or pressures are higher or lower than expected. The worst case here is that the air pressure could be driven very high, without the physician being made aware of it. In some ventilators, the air is driven by a piston, which is powered by a motor. Ideal performance often means moving the motor as fast as possible at the start of the breath, and reducing speed later on, when the pressure has to be controlled more carefully, as in a fairly typical feedback control loop. If the software halts immediately after instructing the motor to move at its top speed, then the motor will not be instructed to slow down or to halt. That would be okay if our alarm mechanism could alert the physician that all is not well. However, if our software has halted, the alarm code may not get a chance to run again. So a single point of failure has led to our control mechanism going out of control and our monitoring system being disabled. Maybe you should stop reading for a moment and consider whether a similar failure could occur on your current project.

A more malicious failure would not just halt the monitoring system, but also display misleading data on the user interface. Explicitly telling the user that everything is fine when just the opposite is true lures the user into a false sense of security. In one incident, an anesthesiologist became suspicious when a blood pressure monitor displayed data that did not fit with the patient's other symptoms. When the monitor's display was examined more carefully, it turned out that the words "Demo mode" appeared occasionally in small print.¹

The monitor had been placed in this mode during a routine service, and then put into use. The indication on the display that it was operating in demonstration mode was small and intermittent, not enough to attract attention in a busy ward. "Ah, but," objects the software engineer, "it was not a bug. They were just not using it properly." That objection does not hold up. Software engineers are responsible not just for writing software that conforms to requirements; they must also ensure that those requirements are safe. In this case, the requirements should have stated that the user interface must make it very obvious at all times when the device is in demonstration mode.

Now that we have seen that our mind game with maliciously inclined software can correspond to real failures, what can we do about it? Double checks in software will only give us limited protection, and it is always possible that our malicious program will disable that check at the most important time. So the solution has to be outside of software. A mechanical or electronic mechanism that restricts the actions that software can take is called an interlock. This is a lock that prevents certain actions, or makes one action dependent on a certain state, or forces a certain combination of actions to happen in a particular order.

One example is elevator doors. When the doors are open, the elevator is usually electrically inhibited from moving. Similarly, a train's doors prevent the train from moving while they are open.

Lock-in and lock-out

Going back to our medical ventilator example, one of the components was a safety valve that would open if the air pressure reached a certain threshold. Software could instruct the safety valve to open at a lower pressure if it detected pressure rising too quickly. Software could also close the valve, so long as the valve had been opened by software. Once the safety valve had opened because it had reached its mechanical cracking pressure, it remained open until the power was cycled. Effectively, the valve was making the decision that the software could not be trusted, on the grounds that pressure had reached a value that would not have been reached with software executing its control loop properly. Once a power cycle occurred, you could be sure that the software was reset. It was also assumed that an operator was present—otherwise who would turn the device off and on? The safety valve was a limiting component, but it also provided a lock-out. Once software allowed the controlled activity to go outside of its operating envelope, it was locked out, and was not allowed to control the process again until some kind of human intervention took place.
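
The lock-out logic of that valve can be modeled in a few lines of C. This is only a sketch of the behavior described above, not the ventilator's design: the real latch was mechanical and sat outside software, and every name here (`SafetyValve`, `CRACKING_PRESSURE`, the `valve_*` functions) and the pressure units are assumptions made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cracking pressure, in arbitrary units. */
#define CRACKING_PRESSURE 100

typedef struct {
    bool open;        /* valve is currently venting */
    bool locked_out;  /* opened mechanically; only a power cycle clears this */
} SafetyValve;

/* Power-up state: valve closed, software trusted again. */
void valve_power_cycle(SafetyValve *v) {
    assert(v != NULL);
    v->open = false;
    v->locked_out = false;
}

/* Models the physics: the valve cracks open by itself at the threshold
 * and latches, whatever the software is doing at the time. */
void valve_apply_pressure(SafetyValve *v, int pressure) {
    assert(v != NULL);
    if (pressure >= CRACKING_PRESSURE) {
        v->open = true;
        v->locked_out = true;
    }
}

/* Software may open the valve early, at a lower pressure. */
void valve_software_open(SafetyValve *v) {
    assert(v != NULL);
    v->open = true;
}

/* Software may close the valve only if the valve is not locked out.
 * Returns true if the request was honored. */
bool valve_software_close(SafetyValve *v) {
    assert(v != NULL);
    if (v->locked_out) {
        return false;
    }
    v->open = false;
    return true;
}
```

In this model, a call to `valve_software_close()` succeeds after `valve_software_open()`, but is refused once `valve_apply_pressure()` has seen the cracking pressure, until `valve_power_cycle()` is called.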

Some devices cleverly restrict you to a certain mode once you are there. Many MCUs come equipped with a watchdog timer.² Typically, the device can operate in one of two modes. In watchdog-enabled mode, the software must set some register on a regular basis as a confirmation that the software is still running normally. In watchdog-disabled mode, the software does not have to provide this strobe. For a safety-critical system, you always want to operate in watchdog-enabled mode. However, you have to contend with our imaginary malicious programmer, and assume that one of the first things he would do, once he has taken control, is turn that watchdog timer off. This is why many MCUs that employ a watchdog timer use a lock-in. Once you enable the watchdog, you cannot disable it without a processor reset. So long as your software functions correctly up to the point where the watchdog is enabled, you are guaranteed that it will be enabled from that time forward, regardless of any bugs you may have.
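
A simulated watchdog shows how little logic the lock-in needs. The sketch below is plain C running on a desktop, not a driver for any real MCU; the `Watchdog` structure, the `wdt_*` function names, and the tick-based timeout are all hypothetical. On real hardware the sticky enable bit is implemented in silicon, which is exactly why buggy software cannot clear it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical watchdog peripheral, simulated in software. */
typedef struct {
    bool enabled;          /* sticky: set by software, cleared only by reset */
    unsigned timeout;      /* ticks allowed between strobes */
    unsigned counter;      /* ticks since the last strobe */
    unsigned reset_count;  /* number of resets the watchdog has forced */
} Watchdog;

/* Processor reset: the only path that clears the enable bit. */
void wdt_reset(Watchdog *w) {
    assert(w != NULL);
    w->enabled = false;
    w->counter = 0;
}

void wdt_init(Watchdog *w, unsigned timeout) {
    assert(w != NULL);
    w->timeout = timeout;
    w->reset_count = 0;
    wdt_reset(w);
}

/* Software write to the enable bit. Writing true sets it; writing false
 * is silently ignored, which is the lock-in. */
void wdt_write_enable(Watchdog *w, bool value) {
    assert(w != NULL);
    if (value) {
        w->enabled = true;
        w->counter = 0;
    }
}

/* The strobe that healthy software must issue regularly. */
void wdt_strobe(Watchdog *w) {
    assert(w != NULL);
    w->counter = 0;
}

/* One tick of the watchdog's own clock. */
void wdt_tick(Watchdog *w) {
    assert(w != NULL);
    if (!w->enabled) {
        return;
    }
    w->counter++;
    if (w->counter >= w->timeout) {
        w->reset_count++;
        wdt_reset(w);
    }
}
```

After `wdt_write_enable(&w, true)`, a later `wdt_write_enable(&w, false)` has no effect, so the only ways forward are to keep calling `wdt_strobe()` or to be reset.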

As well as having a lock-in for certain modes, you may want a lock-out for others. Let's return to the example of the patient monitor that was left in demonstration mode. To the designers' credit, they realized that it would not be a good thing if the operator accidentally entered demonstration mode while the device was in use. So they put password protection on the demonstration mode. Typically, service technicians would know the password, while the end-users would not. This lock kept users out of demonstration mode. In the example described previously, the problem arose because the service technician had put the device into demonstration mode before returning it to use.

Once the problem was discovered by the anesthesiologist, the easy solution was to switch back to normal operating mode. Unfortunately, making that change also required a password. The medical staff did not know the password and had to replace the monitor with a spare one. The system was designed with the rule that any change into or out of demonstration mode would require a password. Such symmetry appeals to engineers, but it was not the right design decision in this case. The demonstration mode required a lock-out, but once the device was in demonstration mode, a lock-in was not needed. In general, you want to make it difficult to change from a safe state to an unsafe state, but easy to move from an unsafe state to a safe one.
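
The asymmetric rule is easy to express in code. The following C sketch is my own illustration, not the monitor's firmware: the `Monitor` type, the function names, and the hard-coded `SERVICE_PASSWORD` are hypothetical, and a real device would not store a password this way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical service password, hard-coded only to keep the sketch short. */
#define SERVICE_PASSWORD "service-1234"

typedef enum { MODE_NORMAL, MODE_DEMO } MonitorMode;

typedef struct {
    MonitorMode mode;
} Monitor;

/* Safe -> unsafe: locked out behind the service password.
 * Returns true if the mode change was accepted. */
bool monitor_enter_demo(Monitor *m, const char *password) {
    assert(m != NULL && password != NULL);
    if (strcmp(password, SERVICE_PASSWORD) != 0) {
        return false;
    }
    m->mode = MODE_DEMO;
    return true;
}

/* Unsafe -> safe: always allowed, no password required. */
void monitor_exit_demo(Monitor *m) {
    assert(m != NULL);
    m->mode = MODE_NORMAL;
}
```

The absence of a password parameter on `monitor_exit_demo()` is the design decision: anyone standing at the device can return it to the safe state.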

Forcing functions

In my book, I describe an interlock used on a paper-cutting machine.³ The goal of this interlock was not to prevent the software from damaging the device, but to prevent the user from damaging himself. The paper cutter has a table with a bed of air holes. The operator slides blocks of paper around on the air bed and when he positions them correctly, a blade descends, cutting the paper to the required size. The hazard here is that the operator may have a hand in the way of the blade when it comes down. We want to force the operator to put his hands elsewhere while the blade is moving. Donald Norman calls interlocks that do this "forcing functions."⁴ The solution in the case of the paper cutter is to force the user to place each hand on a button on the device. Because the buttons are too far apart to be pressed by one hand, it is a reasonable way to ensure that the operator's hands are away from the cutting area of the blade.

My inspiration to write about this mechanism came from seeing such a device in action during a tour of a printing plant. A reader of my book, Peter Turpin, worked on software for a similar device and wrote to me with an interesting anecdote. Peter told me that many of the operators of these machines were paid bonuses for productivity and could work faster if both hands were free to control one block of paper while another block was being cut. Placing both hands on the buttons while the blade was moving slowed the operator down, which in turn cost him money. As a result, many operators taped a ruler across the two buttons and leaned against it to press them, keeping both hands free. The operators increased their productivity, but decreased their safety, since accidents became more likely.

This is a classic trade-off between safety and cost. While the operator chose cost over safety, even his own, it was in the designer's interest to put safety first. To make the device safer, the designers moved the buttons to the sides of the machine, a bit like a pinball machine. The operators responded by setting up a levering mechanism on each side that was tied to the operator's belt. By moving his hips, the operator caused the buttons to be depressed and again the blade could drop with both hands free.

The designers eventually won this battle of wits. The device had a login sequence in which the operator identified himself. The operator was instructed to place his hands on the metallic buttons and the device measured the capacitance of his body. Later, when the blade was due to descend, instead of simply checking if the buttons were being pressed, it also checked that the capacitance across the buttons matched the operator. This method has not been fooled (yet), but Peter figures that it is only because it is difficult for the operators to find out how the device works. If they knew that it was based on capacitance, a countermeasure would probably not be that difficult to implement. Capacitance is a crude measure, and is not accurate enough to be sure of a particular operator's presence, but it is accurate enough to distinguish between an operator and a wooden ruler.
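
The final check can be sketched as a single predicate. I do not know how the real machine sampled its buttons or what tolerance it used, so the `ButtonSample` structure, the `blade_permitted()` function, the arbitrary capacitance units, and the tolerance parameter below are all assumptions for the sake of illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* One reading of the two-hand control (hypothetical layout). */
typedef struct {
    bool left_pressed;
    bool right_pressed;
    int capacitance;  /* measured across the two buttons, arbitrary units */
} ButtonSample;

/* The blade may descend only if both buttons are pressed AND the measured
 * capacitance is within `tolerance` of the value enrolled at login. */
bool blade_permitted(const ButtonSample *s, int enrolled, int tolerance) {
    assert(s != NULL);
    if (!s->left_pressed || !s->right_pressed) {
        return false;
    }
    return abs(s->capacitance - enrolled) <= tolerance;
}
```

With an enrolled value of 105 and a tolerance of 20, a sample of 100 with both buttons pressed is accepted, while a ruler that presses both buttons but reads 5 is rejected. The wide tolerance reflects the point made above: the check only needs to tell a person from a piece of wood.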

The purpose of the interlock just described was to protect the user from himself. Many other interlocks serve the same purpose. Industrial robots are often surrounded by a fence. If a gate in the fence is opened, the robot stops moving. In a factory in Japan, a worker hopped over such a fence and was killed when the robot moved unexpectedly.⁵ You have to assume that your users will have priorities that differ from yours, and interlocks that make certain actions impossible will always be a better design than interlocks that depend on a cooperative user.

Dead man controls

One interesting type of interlock is called a dead man control. This is a control that detects if the operator has died, or is no longer at the controls for some reason. The paper cutter I mentioned previously is one example. If the operator walks away from the machine, the blade will stop moving. Dead man controls are used on trains to ensure that the driver is at the controls at all times. In one London Underground incident, a driver had weighed down the dead man lever to avoid the inconvenience of having to hold it while the train was moving.⁶ The driver left the cabin while the train was at a platform, to see if one of the doors was stuck open. Another interlock ensured that the train could not leave the station while a door was open, to prevent the hazard of a passenger falling out, or of a passenger being dragged if he were stuck in a door. While the driver was on the platform, the train pulled away; presumably, whatever had blocked one of the doors had been removed. The train halted automatically at a red light at the next station, and the driver was able to catch up on another train. No injuries resulted, but the incident shows how a number of interlocks can interact in ways that the designer, and certainly the driver, does not anticipate.

Speedboats commonly have a dead man control in which a pin in the engine is attached by a cable to the driver's wrist. If the pin is pulled out, the engine will stop. This control will not actually detect if the driver has died, but it will detect if he has left the driving position. On a boat, the main objective is to ensure that the boat halts if the driver falls overboard. Unfortunately, the driver can easily leave the strap off his wrist. I witnessed the kind of danger that is created when a driver bypasses such a control. Two boats collided near a beach where I was walking. The force of the collision left one boat submerged, and the occupants of both boats were sent flying into the water. The other boat, with no occupants or driver, propelled itself madly in circles, at speed, within feet of four people bobbing in the water. A third boat had to ram the out-of-control craft to prevent what could easily have become a fatal accident. Again, the moral of the story is that users will often bypass safety mechanisms that designers thought would be used in all cases.

On medical ventilators that I helped design, we usually allowed the operator to enter settings and then leave the machine unattended. If a problem arose with the machine or the patient, an audible alarm would attract attention. In one case we implemented a dead man control because it was too dangerous to allow the physician to leave the device and patient. One way to test the strength of a patient's lungs is to not allow them any air, and measure how much negative pressure they can generate while they attempt to breathe. Obviously, any situation that cuts off the patient's air supply should be managed carefully. In our design, the physician initiates the maneuver by holding down a button. This closes the valves that allow air to flow to the patient, while the device continuously displays the measured pressure. When the physician is satisfied that he has seen the peak pressure, he releases the button and the ventilator reverts to normal operation, and the patient, thankfully, resumes breathing. If the physician is distracted by something during this maneuver and walks away from the patient, the ventilator reverts to normal operation because the physician has let go of the button. If no dead man control were used, and the maneuver were started with one button press and ended with a second button press, the patient could be left without air if the operator was distracted, or if someone accidentally pressed the start button. The ventilator also applied a timeout to this maneuver, so that it could not go on indefinitely, but using the timeout as the only safety mechanism would not have been sufficient in this case.
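
The combination of a held button and a timeout might look something like the following C sketch. It is not the ventilator's code. The `Maneuver` structure, the `maneuver_tick()` function, and the three-tick limit are hypothetical, and I have added one assumption that the text does not state: after a timeout, the button must be released before the maneuver can start again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limit on how long the airway may stay closed, in ticks. */
#define MANEUVER_MAX_TICKS 3

typedef struct {
    bool occluded;     /* valves closed: the patient is getting no air */
    unsigned elapsed;  /* ticks spent occluded in the current maneuver */
    bool timed_out;    /* limit reached; cleared only by releasing the button */
} Maneuver;

/* Called once per control tick with the current state of the button.
 * The airway is closed only while the button is held, and never for more
 * than MANEUVER_MAX_TICKS in one maneuver. */
void maneuver_tick(Maneuver *m, bool button_held) {
    assert(m != NULL);
    if (!button_held) {
        /* Dead man control: letting go always restores normal operation. */
        m->occluded = false;
        m->elapsed = 0;
        m->timed_out = false;
        return;
    }
    if (m->timed_out) {
        return;  /* button still held after a timeout: stay in normal operation */
    }
    if (m->elapsed >= MANEUVER_MAX_TICKS) {
        m->occluded = false;
        m->timed_out = true;
        return;
    }
    m->occluded = true;
    m->elapsed++;
}
```

The safe state is reached by the absence of an action (no button held), so a distracted or absent operator gets the safe outcome by default, and the timeout covers a button that is stuck or taped down.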

Can we trust software more than the user?

Another interesting class of interlock protects against both software and user error. The Therac-25 is a radiation therapy device that overdosed a number of patients over several years, before the cause of the failures was discovered.⁷, ⁸ It is an important case, partly because of the influence it had on the FDA's approach to device safety, and because of the lessons that it teaches. Many of the discussions of the Therac-25 focus on management failure, and problems with code written without proper care regarding reentrant functions. I discuss some of the user interface issues of the Therac-25 in my book. Here, I will discuss one of the mistakes made in the area of interlocks.

The Therac series of radiation devices were capable of producing beams of different strengths, depending on whether X-rays or electron therapy was required. They used a filter, which was rotated into place on a turntable, to ensure that the patient did not receive a raw high-energy beam directly. In the Therac-20 (an earlier model), an interlock ensured that the high-energy beam could not be activated unless the table had been rotated to the correct position. By detecting the turntable position with microswitches, a simple electrical circuit inhibited the high-energy beam if the filter was not in position.

On the Therac-25, the table was rotated by a motor controlled by software, and software also read the microswitches. It was assumed that the software would never choose to turn on the high-energy beam with the filter out of position, so, as a cost-saving measure, the interlock was removed from the design. Without the interlock, our hypothetical malicious programmer was free to cause serious damage. A race condition in the code meant that under some rare conditions, the software would indeed apply the wrong settings. The interlock that had prevented the user from applying a high-energy beam with the filter out of position would have prevented software from making the same mistake. The company claimed that the interlock was implemented in software, but that is a case of the fox guarding the henhouse. Software checking itself is useful, but it is not enough in safety-critical situations.
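
Reduced to a truth table, the difference between the two designs is one AND gate. The C functions below are a deliberately simplified model that I have made up to show this; they do not describe the actual circuits of either machine, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Software-only design: whatever the software requests reaches the beam.
 * If a bug makes the software request the beam at the wrong time,
 * nothing stands in the way. */
bool beam_on_software_only(bool software_requests_beam) {
    return software_requests_beam;
}

/* Hardware-interlocked design: the microswitch signal is combined with the
 * software request outside of software, so the request is honored only
 * while the filter is in position. */
bool beam_on_with_interlock(bool software_requests_beam,
                            bool filter_switch_closed) {
    return software_requests_beam && filter_switch_closed;
}
```

For the dangerous input, a beam request while the filter switch is open, `beam_on_software_only(true)` returns true while `beam_on_with_interlock(true, false)` returns false. The second function stands in for wiring, which is the whole point: faulty software cannot edit it.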

Because both high-energy and low-energy beams are invisible, the safety design of the device cannot depend on the operator to spot a flaw in the system. In many of the Therac-25 overdoses, patients did not start to suffer symptoms of the damage for several days.

Allowing software to view the interlock

I have tried to emphasize that some mechanism outside of software should limit software, and that software should not have control over the interlock, because software could then bypass it at the worst possible time. However, it is often software that has to explain to the operator, by means of the user interface, why some action has not taken place. This can only be done well if software can detect when the interlock has inhibited an action. For this reason, software should be given read-only access to the interlocks. In some cases, this allows the operator to rectify the situation. In other cases, such a message would indicate a device fault that might lead to the device being returned to the manufacturer. The ability to read the state of the interlock would allow the Therac-25 to inform the user that the beam was not fired because the turntable was in the wrong position. On the underground train, an indicator should be placed in the cabin to indicate if the train is not moving due to an open door. An interlock that baffles the operator or service technician as to why the device will not function creates undue frustration.
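
One way to picture read-only access is a function that receives the interlock state through a `const` pointer and turns it into a message for the operator. This is a sketch under my own assumptions: the `InterlockStatus` structure, the `BeamOutcome` values, and the message wording are hypothetical, and the outcome is inferred from the inputs rather than measured at the beam.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Snapshot of the interlock inputs as seen by software. Software can read
 * these signals but has no way to drive them. */
typedef struct {
    bool filter_switch_closed;
} InterlockStatus;

typedef enum {
    BEAM_NOT_REQUESTED,
    BEAM_FIRED,
    BEAM_INHIBITED_TURNTABLE
} BeamOutcome;

/* Works out what happened to a beam request. The const pointer documents
 * that this code only observes the interlock. */
BeamOutcome beam_outcome(bool beam_requested, const InterlockStatus *status) {
    assert(status != NULL);
    if (!beam_requested) {
        return BEAM_NOT_REQUESTED;
    }
    if (!status->filter_switch_closed) {
        return BEAM_INHIBITED_TURNTABLE;
    }
    return BEAM_FIRED;
}

/* Text for the user interface. */
const char *beam_outcome_message(BeamOutcome outcome) {
    switch (outcome) {
    case BEAM_NOT_REQUESTED:
        return "No beam requested.";
    case BEAM_FIRED:
        return "Beam delivered.";
    case BEAM_INHIBITED_TURNTABLE:
        return "Beam not fired: turntable is out of position.";
    }
    return "Unknown interlock state.";
}
```

A wrong answer from this code can only produce a misleading message; it cannot defeat the interlock, because the interlock does not depend on it.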

I hope some of these examples will give you a few ideas for ways to make your own product safer, or at least make you stop and think about the worst thing that can happen, and when it might happen. Remember, Murphy was an optimist!

Endnotes

1. Doyle, D. John. "Potentially life-threatening medical equipment failure." The Risks Digest, Volume 20: Issue 48, July 15, 1999.

2. Murphy, Niall. "Watchdog Timers."

3. Murphy, Niall. Front Panel: Designing Software for Embedded User Interfaces. Lawrence, KS: R&D Books, 1998.

4. Norman, Donald. The Design of Everyday Things. New York: Doubleday, 1990.

5. Neumann, Peter. Computer Related Risks. Reading, MA: Addison-Wesley, 1995.

6. Page, Stephen. "Tube train leaves...without its driver." The Risks Digest, Volume 9: Issue 81, April 18, 1990.

7. Leveson, Nancy. Safeware: System Safety and Computers. Reading, MA: Addison-Wesley, 1995.

8. Leveson, Nancy and Clark S. Turner. "An Investigation of the Therac-25 Accidents." IEEE Computer, July 1993, p. 18.