Handling Processor Reboot and Recovery

Handling Processor Reboot

Realtime systems typically consist of multiple processors implementing different parts of the system's functionality. Each of these processors can encounter a hardware or software failure and reboot. Realtime systems should be designed to handle processor failure and recovery smoothly.

Processor failure and recovery handling can be divided into the following steps:

  1. A processor in the system fails. Other processors in the system detect the failure.
  2. All other processors in the system clean up all features that are involved in interactions with the failed processor.
  3. The failed processor reboots and comes up.
  4. Once the processor comes back up, it establishes protocol with all the processors in the system.
  5. After establishing protocol, the rebooted processor reconciles all its data structures with the system.
  6. Data structure audits are initiated with other processors to weed out inconsistencies that might have taken place due to processor reboot.
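
On each surviving processor, these steps can be thought of as a small per-peer recovery state machine driven by fault handling events. The sketch below is illustrative only; the state and event names are assumptions, not Xenon interfaces.

    /* Illustrative per-peer recovery state machine (all names are assumed). */
    typedef enum {
        PEER_IN_SERVICE,      /* peer operating normally                          */
        PEER_FAILED,          /* failure detected, local cleanup in progress      */
        PEER_REBOOTING,       /* waiting for the failed processor to come back up */
        PEER_RECONCILING,     /* protocol re-established, data being reconciled   */
        PEER_AUDITING         /* audits running before returning to full service  */
    } peer_state_t;

    typedef enum {
        EV_FAILURE_DETECTED,  /* health message loss or protocol fault            */
        EV_CLEANUP_DONE,      /* affected features cleared on this processor      */
        EV_PEER_UP,           /* recovered processor is back and ready            */
        EV_RECONCILE_DONE,    /* data structures reconciled                        */
        EV_AUDIT_DONE         /* post-recovery audits completed                    */
    } peer_event_t;

    /* Advance the recovery state machine for one peer processor. */
    peer_state_t peer_next_state(peer_state_t state, peer_event_t ev)
    {
        switch (state) {
        case PEER_IN_SERVICE:  return ev == EV_FAILURE_DETECTED ? PEER_FAILED      : state;
        case PEER_FAILED:      return ev == EV_CLEANUP_DONE     ? PEER_REBOOTING   : state;
        case PEER_REBOOTING:   return ev == EV_PEER_UP          ? PEER_RECONCILING : state;
        case PEER_RECONCILING: return ev == EV_RECONCILE_DONE   ? PEER_AUDITING    : state;
        case PEER_AUDITING:    return ev == EV_AUDIT_DONE       ? PEER_IN_SERVICE  : state;
        }
        return state;
    }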

In the following discussion we will cover each of the steps mentioned above, taking the example of a XEN card reboot in Xenon.

Processor Failure Detection

When a processor reboots in the system, other processors will detect its failure in one of the following ways:

  • Loss of periodic health messages: In an idle system with very little traffic, loss of periodic health messages may be the only mechanism to detect processor failure. This mechanism places an upper bound on the time it takes to detect processor failure. For example, if a XEN card sends a health message to CAS every 5 seconds and it takes 3 timeouts to declare the card failed, the worst-case XEN failure detection time would be 20 seconds (15 seconds for the timeouts plus an additional 5 seconds for the case where the XEN card failed right after sending a health message). A watchdog sketch of this mechanism follows this list.

  • Protocol faults: Protocol faults are the quickest way to detect the failure of a processor in a busy system. As soon as a node sends a message to the failed processor, the protocol software will time out waiting for the peer protocol entity on the failed processor. This failure is reported to the fault handling software. Note that this technique works only when a message is sent to the failed node, so no upper bound can be specified on the failure detection time. In most situations, however, protocol fault detection will be fast, as there will be some message traffic towards the failed node. For example, a XEN card failure will be detected by other XEN and CAS processors as soon as they try to send a message to the failed XEN.
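
The health-message mechanism from the first bullet can be sketched as a simple per-card watchdog: a periodic timer counts missed health messages and declares the card failed after three consecutive timeouts. The constants and function names below are assumptions chosen to match the numbers in the example.

    /* Illustrative health-message watchdog (names and layout assumed). */
    #include <stdbool.h>

    #define HEALTH_PERIOD_SECONDS  5   /* XEN sends a health message every 5 seconds */
    #define MAX_MISSED_TIMEOUTS    3   /* declare failure after 3 consecutive misses */

    typedef struct {
        int  missed;   /* consecutive health periods with no message received */
        bool failed;   /* set once the card has been declared failed          */
    } health_monitor_t;

    /* Called whenever a health message arrives from the monitored card. */
    void health_message_received(health_monitor_t *m)
    {
        m->missed = 0;
    }

    /* Called every HEALTH_PERIOD_SECONDS by a periodic timer.
       Returns true when the card should be reported to fault handling. */
    bool health_timer_expired(health_monitor_t *m)
    {
        if (m->failed)
            return false;
        if (++m->missed >= MAX_MISSED_TIMEOUTS) {
            m->failed = true;
            return true;
        }
        return false;
    }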

Cleaning Up on Processor Failure

Whenever a node fails, all the other nodes in the system that were involved in feature interactions with this node need to be notified so that they can clean up any feature that might be affected by the failure of this node.

For example, when a XEN card fails, all the other XEN cards are informed so that they can clear all calls that had one leg of the call in the failed XEN. This may appear to be fairly straightforward, but consider that the system suddenly has to clear a large number of calls. This may lead to a sudden increase in memory buffer and CPU utilization. Designers should take this into account when dimensioning resources.
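
A minimal sketch of such a cleanup pass is shown below; the call record layout and the release_call() helper are hypothetical. Because a single pass may release thousands of calls, a real implementation might spread this work over several timer ticks to smooth out the buffer and CPU spike mentioned above.

    /* Illustrative cleanup of calls with one leg on the failed card.
       The call table layout and release_call() are assumptions. */
    #include <stddef.h>

    #define MAX_CALLS 10000

    typedef struct {
        int in_use;      /* non-zero if this record describes an active call */
        int near_card;   /* card handling the local leg of the call          */
        int far_card;    /* card handling the remote leg of the call         */
    } call_record_t;

    static call_record_t call_table[MAX_CALLS];

    /* Placeholder: a real system would send release messages to the
       surviving leg and free any associated resources. */
    static void release_call(call_record_t *call)
    {
        call->in_use = 0;
    }

    /* Clear every active call that had a leg on the failed card. */
    void cleanup_calls_for_failed_card(int failed_card)
    {
        for (size_t i = 0; i < MAX_CALLS; i++) {
            call_record_t *call = &call_table[i];
            if (call->in_use &&
                (call->near_card == failed_card || call->far_card == failed_card))
                release_call(call);
        }
    }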

Processor Recovery

Once a failed processor reboots and comes up, it will communicate with the central processor, informing it that it has recovered and is ready to resume service. At this point the central processor would inform all other processors so that they can reestablish protocol with the just recovered processor.

In the XEN example, when the XEN card recovers, it will inform the CAS card about its recovery. CAS will then inform the other XEN cards so that they can resume protocol with the recovered card. This also involves changing the status of all terminals and trunk groups handled by the XEN card to in-service.
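
A rough sketch of the central processor's side of this step is shown below; the card counts, the messaging helper and the terminal status table are assumptions, not actual Xenon interfaces.

    /* Illustrative handling of a XEN recovery indication on the CAS card.
       Card counts, terminal table and messaging helper are assumed. */
    #include <stdio.h>

    #define NUM_XEN_CARDS      16
    #define TERMINALS_PER_XEN  512

    typedef enum { TERM_OUT_OF_SERVICE, TERM_IN_SERVICE } terminal_status_t;

    static terminal_status_t terminal_status[NUM_XEN_CARDS][TERMINALS_PER_XEN];

    /* Placeholder for the real inter-processor messaging layer. */
    static void send_resume_protocol(int to_card, int recovered_card)
    {
        printf("CAS -> XEN %d: resume protocol with XEN %d\n", to_card, recovered_card);
    }

    /* Invoked when the recovered XEN reports that it is ready for service. */
    void handle_xen_recovered(int recovered_card)
    {
        /* Ask every other XEN card to re-establish protocol with the peer. */
        for (int card = 0; card < NUM_XEN_CARDS; card++)
            if (card != recovered_card)
                send_resume_protocol(card, recovered_card);

        /* Return all terminals handled by the recovered card to service. */
        for (int t = 0; t < TERMINALS_PER_XEN; t++)
            terminal_status[recovered_card][t] = TERM_IN_SERVICE;
    }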

Data Reconciliation

When the failed card comes up, it has to recover the context that was lost due to the failure. The context is recovered by the following mechanisms:

  • Getting the configuration data from the operations and maintenance module.
  • Periodically backing up the state data with the operations and maintenance module so that this information can be recovered on reboot.
  • Reconciling data structures with other processors in the system to rebuild data structures.

When a XEN card recovers, it obtains the V5.2 interface definition, trunk group data, etc. from the operations and maintenance module. Permanent status change information like circuit failure status would be obtained from the backed-up data. Transient state information like circuit blocking status would be recovered by exchanging blocking messages with other exchanges.
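
The recovery order described above might be sketched as follows; the three helper functions are placeholders standing in for the operations and maintenance interface, the backup store and the peer message exchange.

    /* Illustrative data reconciliation sequence on the rebooted card.
       The helper functions are placeholders for the real interfaces. */
    #include <stdio.h>

    static void load_configuration_from_oam(void)
    {
        /* Configuration data, e.g. V5.2 interface definition and trunk groups. */
        printf("restoring configuration from operations and maintenance\n");
    }

    static void restore_backed_up_state(void)
    {
        /* Permanent status changes, e.g. circuit failure status. */
        printf("restoring periodically backed-up state data\n");
    }

    static void reconcile_with_peers(void)
    {
        /* Transient state, e.g. circuit blocking status, rebuilt by
           exchanging blocking messages with other exchanges. */
        printf("exchanging messages with peers to rebuild transient state\n");
    }

    /* Run the full reconciliation sequence after the card comes back up. */
    void reconcile_after_reboot(void)
    {
        load_configuration_from_oam();
        restore_backed_up_state();
        reconcile_with_peers();
    }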

Audits

A processor reboot might have created a lot of inconsistencies in the system. Software audits are run just after processor recovery to catch these inconsistencies. Once the inconsistencies are fixed, the system designers may opt to keep the audits running periodically to counter inconsistencies that might arise during the normal course of operation.

When the XEN card recovers, it triggers the following audits:

  • Space slot resource audit with CAS
  • Time slot resource audit with other XEN cards
  • Call audit with XEN and CAS

The above audits will clean up any hanging slot allocations or hanging calls in the system.
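
An audit of this kind typically compares two views of the same resource and releases entries seen on only one side. The sketch below illustrates a time-slot audit between two cards; the slot table layout and the peer's view are assumptions for illustration.

    /* Illustrative time-slot audit: free slots the local card thinks are
       allocated but the peer no longer knows about (hanging slots).
       The data layout is assumed for this sketch. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_SLOTS 1024

    static bool local_allocated[NUM_SLOTS];  /* this card's view               */
    static bool peer_allocated[NUM_SLOTS];   /* view received in audit message */

    /* Release a hanging slot; placeholder for the real resource manager. */
    static void free_slot(int slot)
    {
        local_allocated[slot] = false;
        printf("audit: released hanging slot %d\n", slot);
    }

    /* Weed out slots allocated locally but unknown to the peer. */
    void run_slot_audit(void)
    {
        for (int slot = 0; slot < NUM_SLOTS; slot++)
            if (local_allocated[slot] && !peer_allocated[slot])
                free_slot(slot);
    }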