Software Fault Tolerance, Software Audits, Rollback, Exception Handling and Recovery

Software Fault Tolerance
Most Realtime systems focus on hardware fault tolerance. Software fault tolerance is often overlooked. This is surprising because hardware components have much higher reliability than the software that runs over them. Most system designers go to great lengths to limit the impact of a hardware failure on system performance. However, they pay little attention to the system's behavior when a software module fails.
In this article we will cover several techniques that can be used to limit the impact of software faults (read: bugs) on system performance. The main idea is to contain the damage caused by software faults. Software fault tolerance is not a license to ship the system with bugs. The real objective is to improve system performance and availability when the system encounters a software or hardware fault. The techniques covered are:
Timeouts
Audits
Exception Handling
Task Rollback
Incremental Reboot
Voting
Timeouts
Most Realtime systems use timers to keep track of feature execution. A timeout generally signals that some entity involved in the feature has misbehaved and corrective action is required. The corrective action can take one of two forms:
Retry: When the application times out waiting for a response, it can retry the message interaction. You might argue that we do not need to implement application-level retries as lower level protocols will automatically recover from message loss. Keep in mind that message loss recovery is not the only objective of implementing retries; retries help in recovering from software faults too. Consider a scenario where a message sent to a task is not processed because of a task restart or processor reboot. An application-level retry will recover from this condition.
Abort: In this case a timeout leads to aborting the feature. This might seem drastic, but in reality aborting the feature might be the simplest and safest way to recover from the error. The feature can then be retried by the user who invoked it. Consider a case where a call has to be cleared because the task originating the call did not receive a response in time. If this condition can occur only in rare scenarios, the simplest action on timeout might be to clear the call and let the user retry it.
The choice between retrying and aborting on timeout depends on several factors. Consider all of them before you decide either way:
If the feature being executed is fairly important for system stability, it might be better to retry. For example, a system startup feature should not be aborted on one timeout.
If the lower layer protocol is not robust, retry might be a good option. For example, message interactions using an inherently unreliable protocol like slotted ALOHA should always be retried.
Complexity of implementation should also be considered before retrying a message interaction. Aborting a feature is a simpler option. More often than not system designers just default to retrying without even considering the abort option. Keep in mind that retry implementation complicates the code and state machine design.
If the entity invoking this feature will retry the feature, the simplest action might be to abort the feature and wait for an external retry.
Retrying every message in the system will lower system performance because of frequent timer start and stop operations. In many cases, performance can be improved by just running a single timer for the complete feature execution. On timeout the feature can simply be aborted.
For most external interactions the designer might have no choice, as the timeouts and retry actions are generally specified by the external protocol.
Many times the two techniques are used together: the task retries a message a certain number of times, and if no response is received after exhausting this limit, the feature is aborted.
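As a rough illustration, here is a minimal sketch in C of this combined approach. The helpers (send_message, start_response_timer, abort_feature), the context structure and the retry limit are assumptions made for the sketch, not the interface of any particular system.

    #define MAX_RETRIES 3   /* illustrative retry limit */

    /* Hypothetical per-feature context carrying the retry count. */
    typedef struct {
        int retries;          /* retries attempted so far     */
        int pending_msg_id;   /* message awaiting a response  */
    } feature_ctx_t;

    /* Assumed services provided elsewhere in the system. */
    void send_message(int msg_id);
    void start_response_timer(feature_ctx_t *ctx);
    void abort_feature(feature_ctx_t *ctx);

    /* Invoked by the timer service when the response timer expires. */
    void on_response_timeout(feature_ctx_t *ctx)
    {
        if (ctx->retries < MAX_RETRIES) {
            /* Retry: resend the request and rearm the timer. */
            ctx->retries++;
            send_message(ctx->pending_msg_id);
            start_response_timer(ctx);
        } else {
            /* Retry limit exhausted: abort the feature and rely on
               the invoking entity (or the user) to retry it. */
            abort_feature(ctx);
        }
    }

A single timer covering the whole feature, with an abort on expiry, is simply the degenerate case of setting the retry limit to zero.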
Audits
Most Realtime systems comprise software running across multiple processors. This implies that data is also distributed. The distributed data may become inconsistent at run time due to reasons like:
independent processor reboot
software bugs
race conditions
hardware failures
protocol failures
The system must behave reliably under all these conditions. A simple strategy for overcoming data inconsistency is to implement audits. An audit is a program that checks the consistency of data structures across processors by performing predefined checks.
Audit Procedure
The system may trigger audits for several reasons:
periodically
failure of certain features
processor reboots
processor switchovers
certain cases of resource congestion
Audits perform checks on data and look for data inconsistencies between processors.
Since audits have to run on live systems, they need to filter out conditions where the data inconsistency is caused by transient data updates. On detecting a data inconsistency, audits perform multiple checks to confirm it. An inconsistency is considered valid if and only if it is detected on every iteration of the check.
When an inconsistency is confirmed, audits may perform data structure cleanups across processors.
At times audits may not directly clean up inconsistencies; instead, they may trigger appropriate feature aborts or other recovery actions.
An Example
Let's consider the Xenon Switching System. If the call occupancy on the system is much less than the maximum that could be handled and calls are still failing due to lack of space-slot resources, the call processing subsystem will detect this condition and trigger the space-slot audit. The audit will run on the XEN and CAS processors and cross-check whether a space-slot that is busy at CAS actually has a corresponding call at XEN. If no active call is found on XEN for a space-slot, the audit will recheck the condition several times, with a small delay between attempts. If the inconsistency holds on every attempt, the space-slot resource is marked free at CAS. The audit performs several rechecks to eliminate the scenario in which the space-slot release message is still in transit.
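A minimal sketch of this recheck logic might look as follows. The function names (cas_space_slot_busy, xen_has_call_for_slot, cas_mark_slot_free, sleep_ms), the recheck count and the delay are illustrative assumptions standing in for the actual XEN/CAS interfaces.

    #include <stdbool.h>

    #define AUDIT_RECHECKS     3    /* iterations needed to confirm  */
    #define RECHECK_DELAY_MS 200    /* wait between iterations       */

    /* Assumed interfaces towards the CAS and XEN processors. */
    bool cas_space_slot_busy(int slot);
    bool xen_has_call_for_slot(int slot);
    void cas_mark_slot_free(int slot);
    void sleep_ms(int ms);

    /* Confirm a suspected inconsistency by rechecking it several times,
       so that a space-slot release message still in transit is not
       mistaken for a stuck resource. */
    void audit_space_slot(int slot)
    {
        for (int i = 0; i < AUDIT_RECHECKS; i++) {
            if (!cas_space_slot_busy(slot) || xen_has_call_for_slot(slot)) {
                return;    /* consistent; no corrective action needed */
            }
            sleep_ms(RECHECK_DELAY_MS);
        }
        /* The inconsistency held on every iteration: reclaim the slot. */
        cas_mark_slot_free(slot);
    }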
Exception Handling
Whenever a task receives a message, it performs a series of defensive checks before processing it. The defensive checks should verify the consistency of the message as well as the internal state of the task. The exception handler should be invoked on a defensive check failure.
Depending on the severity, the exception handler can take any of the following actions (a handler of this shape is sketched after the list):
Log a trace for developer post-processing.
Increment a leaky-bucket counter for the error condition.
Trigger the appropriate audit.
Trigger a task rollback.
Trigger a processor reboot.
Leaky-bucket counters are used to detect a flurry of error conditions. To ignore rare error conditions, the counters are periodically leaked, i.e. decremented. If a counter reaches a certain threshold, appropriate exception handling is triggered. Note that the threshold will never be crossed by rare occurrences of the associated error condition. However, if the error condition occurs rapidly, the counter will overflow, i.e. cross the threshold.
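A minimal leaky-bucket counter along these lines is sketched below; the threshold, the leak period implied by the periodic call, and the escalation hook are assumptions for illustration.

    #define BUCKET_THRESHOLD 10    /* overflow level (illustrative)     */

    typedef struct {
        int count;                 /* current fill level of the bucket  */
    } leaky_bucket_t;

    void escalate_error_handling(void);   /* assumed escalation hook */

    /* Called on every occurrence of the associated error condition. */
    void leaky_bucket_hit(leaky_bucket_t *b)
    {
        if (++b->count >= BUCKET_THRESHOLD) {
            b->count = 0;
            escalate_error_handling();    /* a burst of errors detected */
        }
    }

    /* Called periodically (the "leak"), so that rare, isolated errors
       never accumulate up to the threshold. */
    void leaky_bucket_leak(leaky_bucket_t *b)
    {
        if (b->count > 0) {
            b->count--;
        }
    }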
Task Rollback
In a complex Realtime system, a software bug in one task leading to a processor reboot may not be acceptable. A better option in such cases is to isolate the erroneous task and handle the failure at the task level. The task may then decide to roll back, i.e. resume operation from a known or previously saved state. In other cases, it may be cheaper to simply forget the context by deleting the offending task and informing the other associated tasks.
For example, if the SpaceSlot Manager on the CAS card encounters an exception condition leading to a task rollback, it might resume operation by recovering the space-slot allocation status from the connection memory. On the other hand, an exception in a call task might simply be handled by clearing the call task and releasing all the resources assigned to it.
Task rollback may be triggered by any of the following events:
Hardware exception conditions like divide by zero, illegal address access (bus error)
Defensive check leaky-bucket counter overflows.
An audit-detected inconsistency that is to be resolved by task rollback.
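The rollback itself can be as simple as restoring a checkpoint of the task context taken at the last known good point. The sketch below is illustrative only, with a hypothetical context structure; a real task would also re-synchronize with associated tasks and release or re-acquire resources as needed.

    /* Hypothetical task context; only data that can be meaningfully
       restored is checkpointed. */
    typedef struct {
        int state;            /* state machine position      */
        int resources_held;   /* e.g. allocated space slots  */
    } task_ctx_t;

    static task_ctx_t current_ctx;
    static task_ctx_t saved_ctx;   /* last known good state */

    /* Checkpoint: taken whenever the task reaches a stable point. */
    void task_checkpoint(void)
    {
        saved_ctx = current_ctx;
    }

    /* Rollback: discard the possibly corrupt context and resume from
       the last checkpoint; associated tasks are informed so that they
       can abort or retry any features in progress. */
    void task_rollback(void)
    {
        current_ctx = saved_ctx;
        /* ...re-enter the task's main loop from here... */
    }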
Incremental Reboot
Rebooting a processor's software can be time consuming, leading to an unacceptable amount of downtime. To reduce the system reboot time, complex Realtime systems often implement incremental system initialization procedures. For example, a typical Realtime system may implement three levels of system reboot:
Level 1 Reboot: Operating system reboot
Level 2 Reboot: Operating system reboot along with configuration data download
Level 3 Reboot: Code reload followed by operating system reboot along with configuration data download
Incremental Reboot Procedure
A defensive check leaky-bucket counter overflow will typically lead to rollback of the offending task.
In most cases a task rollback will fix the problem. In some cases, however, the problem is not fixed and further rollbacks follow in quick succession. This causes the task-level rollback counter to overflow, leading to a Level 1 Reboot.
Most of the time, a Level 1 Reboot will fix the problem. But in some cases the processor may continue to hit Level 1 Reboots repeatedly. This will cause the Level 1 Reboot counter to overflow, leading to a Level 2 Reboot.
In the majority of cases, a Level 2 Reboot is able to fix the problem. If it is not, the processor will repeatedly hit Level 2 Reboots, causing the Level 2 Reboot counter to overflow and leading to a Level 3 Reboot.
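The escalation logic can be sketched roughly as below. The counter limits, the reset policy between levels and the function names are illustrative assumptions; in a real system the counters would live in memory that survives a reboot and would be cleared after a period of stable operation.

    #define MAX_LEVEL1_REBOOTS 3   /* illustrative limits */
    #define MAX_LEVEL2_REBOOTS 3

    /* Assumed reboot entry points; each level includes the work of the
       levels below it. */
    void level1_reboot(void);   /* operating system reboot               */
    void level2_reboot(void);   /* Level 1 + configuration data download */
    void level3_reboot(void);   /* code reload + Level 2                 */

    /* Counters that must survive a reboot (e.g. protected memory). */
    static int level1_count;
    static int level2_count;

    /* Invoked when the task rollback counter overflows, i.e. task-level
       rollbacks are not fixing the problem. */
    void escalate_reboot(void)
    {
        if (++level1_count <= MAX_LEVEL1_REBOOTS) {
            level1_reboot();
        } else if (++level2_count <= MAX_LEVEL2_REBOOTS) {
            level1_count = 0;
            level2_reboot();
        } else {
            level2_count = 0;
            level3_reboot();
        }
    }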
Voting
This technique is used in mission critical systems where software failure may lead to loss of human life, e.g. aircraft navigation software. Here, the Realtime system software is developed by at least three distinct teams. All the teams develop the software independently, and in a live system all three implementations run simultaneously. All the inputs are fed to the three versions of the software, and their outputs are voted on to determine the actual system response. In such systems, a bug in one of the three modules will get voted out by the other two versions.
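A minimal 2-out-of-3 voter for a single integer output might look like this; real voters compare richer outputs, vote field by field, and raise an alarm when no majority exists.

    /* Majority (2-out-of-3) voter for a single integer output. Each of
       the three independently developed versions computes its output
       from the same inputs; the value agreed on by at least two of them
       is used as the system response. */
    int vote_2_of_3(int out_a, int out_b, int out_c)
    {
        if (out_a == out_b || out_a == out_c) {
            return out_a;    /* version A agrees with at least one peer */
        }
        if (out_b == out_c) {
            return out_b;    /* A is the odd one out and gets voted out */
        }
        /* No two versions agree: a real system would raise an alarm and
           fall back to a designated safe value. */
        return out_a;
    }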