Recovery: Detection, Identification, Troubleshooting, Resumption

Recovering from exceptional states

When the operator requests an action, the system should always respond, if only to provide feedback. Therefore, the behavior should be defined for every operator request, in every system state. Enumerating every combination of request and state is impractical, so to manage this state complexity we need to define default behavior. The default behavior may be a feedback message explaining that the system is in an exceptional state and that it is not safe to execute the requested command.
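As a minimal sketch of this idea, the Python fragment below defines explicit handlers for a few (state, request) pairs and lets every other combination fall through to a default feedback message. The state names, the request names, and the `handle_request` function are hypothetical and chosen only for illustration; the separate message for an unknown request in the normal state is likewise an assumption.

```python
from enum import Enum, auto


class State(Enum):
    # Hypothetical system states, for illustration only.
    NORMAL = auto()
    OVERHEATED = auto()
    SHUTDOWN = auto()


# Explicitly defined (state, request) pairs. Every pair that is not listed
# here is covered by the default behavior in handle_request().
HANDLERS = {
    (State.NORMAL, "start_pump"): lambda: "Pump started.",
    (State.NORMAL, "stop_pump"): lambda: "Pump stopped.",
    (State.OVERHEATED, "stop_pump"): lambda: "Pump stopped.",
}


def handle_request(state: State, request: str) -> str:
    """Return feedback for any request in any state; never stay silent."""
    handler = HANDLERS.get((state, request))
    if handler is not None:
        return handler()
    if state is State.NORMAL:
        # Assumed behavior: an unknown request in the normal state.
        return f"Request '{request}' is not recognized."
    # Default behavior for exceptional states.
    return (
        f"The system is in an exceptional state ({state.name}); "
        f"it is not safe to execute '{request}'."
    )


print(handle_request(State.NORMAL, "start_pump"))
print(handle_request(State.OVERHEATED, "start_pump"))
```

Because the lookup falls back to a default, adding a new state or request never leaves a combination without a response; only the combinations that need special treatment have to be listed.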
Recovery operation

This term refers to the interaction in exceptional situations, in an attempt to troubleshoot a system fault or a crash due to a user error. The following examples involve users who failed to carry out the system's emergency procedures:

  • The NYC blackout
  • The crash of the Airbus A320 operating Air France Flight 296 in 1988
  • The grounding of the Torrey Canyon in 1967
  • The Three Mile Island (TMI) nuclear plant accident in 1979
  • Engine overheating when the coolant runs out.
Layers of default behavior

Sometimes the feedback message may include instructions for recovering from the exceptional state, such as restarting or rolling back to an earlier normal state. The instructions may depend on the particular action, but also on the values of some of the state variables. To handle multiple instructions, we need to define a behavioral architecture. The instructions must be arranged in order of priority, so that exactly one instruction applies to any particular combination of action and state.
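One possible way to arrange such layers is a prioritized rule list, sketched below in Python. Each rule names the action it applies to (or any action), a condition on the state variables, and a recovery instruction; the first matching rule in priority order wins, and a catch-all rule at the lowest priority provides the default. The `Rule` class, the rule contents, and the state variables `coolant_level` and `temperature_c` are assumptions made for this illustration, not part of any described system.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Rule:
    priority: int                      # lower number = checked first
    action: Optional[str]              # None matches any requested action
    condition: Callable[[dict], bool]  # test on the state variables
    instruction: str                   # recovery instruction shown to the operator


# Hypothetical layers, from most specific to the catch-all default.
RULES = [
    Rule(0, "start_engine", lambda s: s["coolant_level"] == 0,
         "Refill the coolant before starting the engine."),
    Rule(1, None, lambda s: s["temperature_c"] > 110,
         "Let the engine cool down, then restart."),
    Rule(9, None, lambda s: True,
         "The system is in an exceptional state; the command was not executed."),
]


def recovery_instruction(action: str, state: dict) -> str:
    """Return the highest-priority instruction that matches action and state."""
    for rule in sorted(RULES, key=lambda r: r.priority):
        if rule.action in (None, action) and rule.condition(state):
            return rule.instruction
    # Unreachable as long as the catch-all rule is present.
    raise AssertionError("catch-all rule missing")


hot_and_dry = {"coolant_level": 0, "temperature_c": 120}
print(recovery_instruction("start_engine", hot_and_dry))
print(recovery_instruction("stop_engine", hot_and_dry))
```

In this sketch the same state yields different instructions for different actions: the action-specific rule takes precedence for `start_engine`, while `stop_engine` falls to the next layer.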
Recovery instructions

What if the operator is not competent enough to solve the problem? The NYC blackout is an example of such a case. In most practical installations, most of the recovery instructions are printed in separate documents and stored in a place that nobody can recall when they are needed, because at design time it was not clear when and how the user would need them. However, if troubleshooting is integrated into the system management program, as suggested earlier, then it is only a small step to provide the full instructions for fixing the problem.