System Reliability and Availability Calculation

System Reliability and Availability
We have already discussed reliability and availability basics in a previous article. This article will focus on techniques for calculating system availability from the availability information for its components.
The following topics are discussed in detail:
System Availability
  Availability in Series
  Availability in Parallel
  Partial Operation Availability
Availability Computation Example
  Understanding the System
  Reliability Modeling of the System
  Calculating Availability of Individual Components
  Calculating System Availability
System availability is calculated by modeling the system as an interconnection of parts in series and parallel. The following rules are used to decide whether components should be placed in series or parallel:
If failure of a part leads to the combination becoming inoperable, the two parts are considered to be operating in series.
If failure of a part leads to the other part taking over the operations of the failed part, the two parts are considered to be operating in parallel.

As stated above, two parts X and Y are considered to be operating in series if failure of either of the parts results in failure of the combination. The combined system is operational only if both Part X and Part Y are available. From this it follows that the combined availability is the product of the availabilities of the two parts. The combined availability is shown by the equation below:

A(combined) = A(X) * A(Y)
The implication of the above equation is that the combined availability of two components in series is always lower than the availability of its individual components. Consider a system in which Part X and Part Y are connected in series. The table below shows the availability and downtime for the individual components and for the series combination.
Component           Availability        Downtime
X                   99% (2-nines)       3.65 days/year
Y                   99.99% (4-nines)    52 minutes/year
X and Y combined    98.99%              3.69 days/year
From the above table it is clear that even though a very high availability Part Y was used, the overall availability of the system was pulled down by the low availability of Part X. This proves the saying that a chain is only as strong as its weakest link. More precisely, a chain is weaker than its weakest link.
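These figures are easy to verify with a short script. The sketch below (Python, not part of the original article) applies the series rule to the X and Y example above; the function and variable names are illustrative.

```python
# A minimal sketch: the availability of parts in series is the product
# of the individual part availabilities.

def series_availability(*parts):
    """Combined availability of parts that must all be operational."""
    combined = 1.0
    for a in parts:
        combined *= a
    return combined

a_x = 0.99      # Part X: 2-nines availability
a_y = 0.9999    # Part Y: 4-nines availability

a_xy = series_availability(a_x, a_y)
print(f"X and Y in series: {a_xy:.2%}")                # ~98.99%
print(f"Downtime: {(1 - a_xy) * 365:.2f} days/year")   # ~3.69 days/year
```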

As stated above, two parts are considered to be operating in parallel if the combination is considered failed only when both parts fail. The combined system is operational if either part is available. From this it follows that the combined availability is 1 minus the probability that both parts are unavailable. The combined availability is shown by the equation below:

A(combined) = 1 - (1 - A(X)) * (1 - A(Y))
The implication of the above equation is that the combined availability of two components in parallel is always significantly higher than the availability of its individual components. Consider a system in which two instances of Part X are connected in parallel. The table below shows the availability and downtime for an individual component and for the parallel combinations.
Component                                   Availability            Downtime
X                                           99% (2-nines)           3.65 days/year
Two X components operating in parallel      99.99% (4-nines)        52 minutes/year
Three X components operating in parallel    99.9999% (6-nines)      31 seconds/year
From the above table it is clear that even though a very low availability Part X was used, the overall availability of the system is much higher. Thus parallel operation provides a very powerful mechanism for building a highly reliable system from low-reliability components. For this reason, all mission-critical systems are designed with redundant components. (Different redundancy techniques are discussed in the Hardware Fault Tolerance article.)
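The parallel figures in the table can likewise be checked with a short sketch (again Python, not from the original article), assuming identical parts with independent failures:

```python
# A minimal sketch: parts in parallel fail only when every instance fails,
# so the combined availability is 1 - (1 - A)^n for n identical parts.

def parallel_availability(part_availability, n):
    """Combined availability of n identical parts operating in parallel."""
    return 1.0 - (1.0 - part_availability) ** n

a_x = 0.99  # a single 2-nines part

for n in (2, 3):
    a = parallel_availability(a_x, n)
    downtime_minutes = (1 - a) * 365 * 24 * 60
    print(f"{n} X components in parallel: {a:.4%}, about {downtime_minutes:.1f} minutes/year down")
# 2 components -> 99.99% (~52.6 minutes/year)
# 3 components -> 99.9999% (~0.5 minutes/year, i.e. ~31.5 seconds)
```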
Consider a system like the Xenon switching system. In Xenon, XEN cards handle the call processing for digital trunks connected to the XEN cards. The system has been designed to incrementally add XEN cards to handle subscriber load. Now consider the case of a Xenon switch configured with 10 XEN cards. Should we consider the system to be unavailable when one XEN card fails? This doesn't seem right, as 90% of subscribers are still being served.
In such systems, where failure of a component leads to some users losing service, system availability has to be defined by considering the percentage of users affected by the failure. For example, in Xenon the system might be considered unavailable if 30% of the subscribers are affected. This translates to 3 XEN cards out of 10 failing. The availability for this system can be computed by calculating A(p,q) as specified below:
A(p,q) = C(q,p) * A^(q-p) * (1-A)^p

Here p is the number of failed units, q is the total number of units, A is the availability of an individual unit, and C(q,p) is the number of ways of choosing p failed units out of q. A(p,q) is the probability that exactly p of the q units are down; the overall system availability is obtained by summing A(p,q) over all values of p for which the system is still considered operational (p = 0, 1 and 2 in the Xenon example above).
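A short sketch of this calculation for the 10-card configuration is shown below (Python, not from the original article). The article does not give a per-card availability, so the value of 0.999 used here is an assumed figure purely for illustration.

```python
from math import comb

def exactly_p_failed(a, p, q):
    """Probability that exactly p of q identical units (availability a each) are failed."""
    return comb(q, p) * a ** (q - p) * (1.0 - a) ** p

def partial_availability(a, q, max_failed):
    """System availability when up to max_failed of q units may fail."""
    return sum(exactly_p_failed(a, p, q) for p in range(max_failed + 1))

a_card = 0.999      # assumed availability of one XEN card (illustrative value)
q_cards = 10        # 10 XEN cards in the example configuration

availability = partial_availability(a_card, q_cards, max_failed=2)  # 3 failed cards = outage
print(f"System availability: {availability:.6%}")
print(f"Downtime: {(1 - availability) * 365 * 24 * 3600:.1f} seconds/year")
```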
In this section we will compute the availability of a simple signal processing system.
As a first step, we prepare a detailed block diagram of the system. This system consists of an input transducer which receives the signal and converts it to a data stream suitable for the signal processor. This output is fed to a redundant pair of signal processors. The active signal processor acts on the input, while the standby signal processor ignores the data from the input transducer. The standby just monitors the sanity of the active signal processor. The output from the two signal processor boards is combined and fed into the output transducer. Again, the active signal processor drives the data lines; the standby keeps the data lines tristated. The output transducer outputs the signal to the external world.
The input and output transducers are passive devices with no microprocessor control. The signal processor cards run a real-time operating system and signal processing applications.
Also note that the system stays completely operational as long as at least one signal processor is in operation. Failure of an input or output transducer leads to complete system failure.

The second step is to prepare a reliability model of the system. At this stage we decide the parallel and serial connectivity of the system. The complete reliability model of our example system is described below:

A few important points to note here are:
The signal processor hardware and software have been modeled as two distinct entities. The software and the hardware are operating in series as the signal processor cannot function if the hardware or the software is not operational.
Each signal processor (software + hardware in series) forms one unit of the signal processing complex. Within the signal processing complex, the two units are placed in parallel, as the system can function when one of the signal processors fails.
The input transducer, the signal processing complex and the output transducer have been placed in series as failure of any of the three parts will lead to complete failure of the system.
The third step involves computing the availability of individual components. MTBF (Mean Time Between Failures) and MTTR (Mean Time To Repair) values are estimated for each component (see the Reliability and Availability basics article for details). For hardware components, MTBF information can be obtained from hardware manufacturers' data sheets. If the hardware has been developed in house, the hardware group would provide MTBF information for the board. MTTR estimates for hardware are based on the degree to which the system will be monitored by operators. Here we estimate the hardware MTTR to be around 2 hours.
Once the MTBF and MTTR are known, the availability of the component can be calculated using the following formula:

Availability = MTBF / (MTBF + MTTR)
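As a quick check of the formula, the sketch below (Python, not from the original article) computes the availability and annual downtime of the signal processor hardware from its MTBF and MTTR; the function name is illustrative.

```python
# A minimal sketch of the formula above: Availability = MTBF / (MTBF + MTTR).

def availability_from_mtbf_mttr(mtbf_hours, mttr_hours):
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Signal processor hardware from the example: MTBF 10,000 hours, MTTR 2 hours.
a_hw = availability_from_mtbf_mttr(10_000, 2)
print(f"Hardware availability: {a_hw:.4%}")                 # ~99.98%
print(f"Downtime: {(1 - a_hw) * 365 * 24:.2f} hours/year")  # ~1.75 hours/year
```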
Estimating software MTBF is a tricky task. Software MTBF is really the time between subsequent reboots of the software. This interval may be estimated from the defect rate of the system. The estimate can also be based on previous experience with similar systems. Here we estimate the MTBF to be around 4000 hours. The MTTR is the time taken to reboot the failed processor. Our processor supports automatic reboot, so we estimate the software MTTR to be around 5 minutes. Note that 5 minutes might seem to be on the higher side, but the MTTR should include the following:
Time wasted in activities aborted due to signal processor software crash
Time taken to detect signal processor failure
Time taken by the failed processor to reboot and come back in service
Component                   MTBF            MTTR        Availability    Downtime
Input Transducer            100,000 hours   2 hours     99.998%         10.51 minutes/year
Signal Processor Hardware   10,000 hours    2 hours     99.98%          1.75 hours/year
Signal Processor Software   2,190 hours     5 minutes   99.9962%        20 minutes/year
Output Transducer           100,000 hours   2 hours     99.998%         10.51 minutes/year
Things to note from the above table are:
Availability of the software is higher, even though the hardware MTBF is higher. The main reason is that the software has a much lower MTTR. In other words, the software does fail more often, but it recovers quickly, thereby having less impact on system availability.
The input and output transducers have fairly high availability; thus fairly high system availability can be achieved even without making these components redundant.
The last step involves computing the availability of the entire system. These calculations are based on the serial and parallel availability formulas presented earlier.
Component                                                     Availability    Downtime
Signal Processing Complex (software + hardware)               99.9762%        2.08 hours/year
Signal Processing Complexes 0 and 1 operating in parallel     99.99999%       3.15 seconds/year
Complete System                                               99.9960%        21.08 minutes/year
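To tie the steps together, the sketch below (Python, not from the original article) recomputes the whole model from the MTBF and MTTR values in the component table above; the function names and structure are illustrative, and the results match the table to within rounding.

```python
# A sketch of the complete availability model, assuming the structure described
# above: input transducer -> redundant pair of (hardware + software) signal
# processors -> output transducer.

def availability(mtbf_hours, mttr_hours):
    """Steady-state availability from MTBF and MTTR."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def series(*parts):
    """All parts must be operational."""
    combined = 1.0
    for a in parts:
        combined *= a
    return combined

def parallel(*parts):
    """The combination fails only when every part fails."""
    unavailability = 1.0
    for a in parts:
        unavailability *= (1.0 - a)
    return 1.0 - unavailability

a_in  = availability(100_000, 2)      # input transducer
a_hw  = availability(10_000, 2)       # signal processor hardware
a_sw  = availability(2_190, 5 / 60)   # signal processor software (5 minute MTTR)
a_out = availability(100_000, 2)      # output transducer

a_sp      = series(a_hw, a_sw)        # one signal processor (hardware + software)
a_sp_pair = parallel(a_sp, a_sp)      # redundant pair of signal processors
a_system  = series(a_in, a_sp_pair, a_out)

print(f"Signal processing complex: {a_sp:.4%}")       # ~99.9762%
print(f"Redundant pair:            {a_sp_pair:.6%}")  # ~99.999994%
print(f"Complete system:           {a_system:.4%}")   # ~99.9960%
print(f"System downtime:           {(1 - a_system) * 365 * 24 * 60:.1f} minutes/year")
```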