[Horwitz02] Chapter 8. Service Outages


You will learn about the following in this chapter:

  • The seven most common types of service outages and their causes

  • How to schedule maintenance for minimum business disruption

  • Best practices for performing maintenance within scheduled outage times

  • Assessing service performance for compliance with service level agreements

  • Effective procedures for responding to and resolving service outages

  • Analyzing the root causes of service outages

Information technology is the lifeblood of most organizations. Revenues, production, scheduling, sales, and many other business functions rely on fully functional IT services. Service outages can represent tremendous losses in revenues, personnel time, customer goodwill, and other important business commodities. As such, outages, even those that are necessary for routine maintenance or system repair, are considered by most to be the great evil of information technology. As a system administrator, your company expects you to rise up and quickly resolve every outage that rears its ugly head, whether it occurs during regular 9-to-5 weekday working hours or at 3:00 a.m. on Saturday. System administrators are on-call heroes who must faithfully respond to outage notifications and quickly get systems back up and running, in order to minimize losses to the organization. All too often, in fact, system administrators are recognized more for their performance during outages than for the time and energy they invest in designing and implementing system infrastructures that suffer a minimum of such outages. This chapter discusses some common types of Unix system service outages and how they apply to you, as the Unix system administrator, and to your business. You learn about the metrics surrounding outages, including maintenance windows and service level agreements. And you learn the most effective procedures for dealing with outages and how to use each outage as a learning tool that can help you minimize similar outages in the future.

Types of Outages

After you've spent years as a system administrator and are looking back at all the service outages you've dealt with in your organization during that time, you'll probably discover that they can be divided into two groups: those that are preventable, and those that aren't. Preventable outages are either caused by human error or can be seen coming based on current monitoring data. Human error, although impossible to eliminate entirely, can be minimized with procedures that remove guesswork on the part of an administrator. Creating procedures for routine tasks like removing a server from a rack or rebooting a production server can prevent the occasional human mishap.

Other outages can be prevented because you can see them coming long before they happen. For example, a server with a disk at 50% capacity and usage increasing by 10% of the total capacity per week is likely headed for disaster in five weeks; you can see it coming well ahead of time. This is why proactive monitoring is so important: you're looking for potential problems rather than reacting to them. You can read more about proactive monitoring in Chapter 6, “Monitoring Services.” Graphing these trends over time can help you monitor your system for developing problems.
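The capacity projection in this example is simple to automate. The following sketch assumes the linear growth described above; the thresholds and growth rate are the hypothetical figures from the example.

```python
def weeks_until_full(used_pct, growth_pct_per_week, limit_pct=100):
    """Project how many whole weeks remain before usage crosses the limit,
    assuming linear growth (as in the 10%-per-week example above)."""
    if growth_pct_per_week <= 0:
        return None  # usage is flat or shrinking; no projected exhaustion
    remaining = limit_pct - used_pct
    # ceiling division: the week in which the limit is crossed
    return -(-remaining // growth_pct_per_week)

# The chapter's example: 50% used, growing 10 points per week.
print(weeks_until_full(50, 10))  # 5
```

A monitoring script could run a projection like this nightly against real disk usage data and alert when the horizon drops below some comfortable lead time.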

But within those two broad outage categories are several more specific categories based on the cause, duration, and extent of the outage. There is much more to an outage than the unavailability of a service. Although to your users they may all look the same, a system administrator needs to know more than just “the server is down” to assess the severity of each outage; proper categorization of an outage may determine how quickly you are required to respond, who in your organization you need to inform about the outage, and whether you should notify users. There are countless possible causes for an outage, but they do fall into the following categories:

  • Scheduled maintenance

  • Unscheduled outages

  • Degraded service

  • Partial service outages

  • Complete service outages

  • Distributed service outages

  • Third-party outages

The sections that follow look more carefully at each of these types of service outage and some of the special demands each can place upon you as a system administrator.

Scheduled Maintenance

Routine maintenance is common on any system, whether it's a computer or a car. Some routine maintenance occurs with new technology or software releases. Just as you change the oil in your car every 3,000 miles, you patch or upgrade your operating systems as new patches and releases become available. Some routine maintenance is unexpected, yet still predictable. For example, you don't know when a disk, memory, or processor will go bad on your server, but you know that at some point you'll need to replace them, just as you know that eventually you'll need to replace your car's tires.

At the same time, you know a tire blow-out can cause a nasty accident. To avoid such surprises, you watch your tires for signs of wear, and you note their “wear guarantees” and the number of miles they've logged. That way, you don't depend on a blow-out to tell you that your tires are ready for replacement. The same is true of your Unix network. With proper logging and monitoring, you can anticipate and avoid many software and hardware meltdowns related to age and overuse. And you can schedule the replacement of those components for a time that causes you and your business the least disruption.

Planning for Routine Maintenance Outages

Scheduled, routine maintenance rarely has to be a critical “show stopper” for your IT department or your business. Routine maintenance issues won't bring down an entire system. Without question, processor, disk, and memory failures can have a dire impact on a system, but that's why smart administrators use logging to provide a “heads up” for impending failures. The administrator can then schedule a time to fix the problems on the system and make appropriate announcements so nobody is taken by surprise, especially the users.

Scheduled maintenance puts unique responsibilities on the system administrator. Because this kind of maintenance (and the outage it requires) occurs at the administrator's discretion, he or she must gauge the severity of the problem and choose the best time to take action. You can find detailed information about monitoring logs and other metrics throughout both Chapter 6 and Chapter 11, “Performance Tuning and Capacity Planning.”

Don't Procrastinate Routine Maintenance

There are some inherent problems that come with the freedom of scheduling routine maintenance at your discretion, such as procrastination. Don't allow yourself to let a minor problem slip deeper and deeper into your pile of “must do” tasks until you eventually forget about the problem completely. Most neglected problems force themselves back into your attention when they escalate into full-blown outages. As you become aware of system maintenance needs, schedule a time to perform the maintenance and fix developing problems. Then stick with the schedule.


Scheduling Routine Maintenance Outages

Scheduled maintenance creates an outage in order to prevent an outage. If you think that sounds like nonsense, think again: patching a server to avoid a potential problem typically involves rebooting the server, which causes a short outage. If your company is busily using a service, the employees are likely to consider the patching outage an unnecessary loss of productive time.

The justification for scheduling such outages can be difficult for nontechnical management to understand, as they may take the “if it ain't broke, don't fix it” attitude, especially in high-availability environments. To enlist the support of management, do your research and have your information ready to explain how the process you'll perform during this outage will prevent longer, more costly outages in the future.

If you're scheduling an outage to upgrade some part of the Unix system, make sure reluctant managers understand the benefits the upgrade will bring to the business. For example, replacing a slow, aging Web server with a new, top-of-the-line multiprocessor Web server may involve an outage while the switch is made. Managers are likely to accept the inconvenience of the outage, however, when they understand the benefits of the upgrade, such as the ability to handle more concurrent connections.

When you are actually scheduling your maintenance, you need to choose a time that minimizes the impact on your users but still allows you to perform the required work. The following guidelines will help you choose an appropriate time:

  • When possible, schedule work during your maintenance window, which should coincide with the periods of least usage for your services. Maintenance windows are described in detail in the “Maintenance Windows” section later in this chapter.

  • Urgent maintenance, such as applying a patch to stabilize a crashing system, should be performed as soon as possible. You may want to obtain the approval of your management on this issue.

  • If you require a support engineer from a vendor to perform the maintenance, ensure that your support contract covers the times you need the engineer to be present. Also verify that the engineer can meet your scheduling requirements.

  • Coordinate your maintenance schedule around the availability of your own staff. If it is impossible for the required staff to be available at the time you need them, you may need to reschedule for a time when those resources are available.

Commit to a Scheduled Maintenance Time

Scheduled maintenance is the only type of outage with a fixed time limit. System administrators don't typically release announcements that a critical server will be taken down at 2:00 a.m. without also announcing when the server will be put back into service. After all, the point of a scheduled maintenance is that users and management know when to expect a service to be down and how long the outage will last. Maintenance windows can help enforce time limits; you learn more about them in the section titled “Maintenance Windows.”


Perhaps the most important thing a system administrator can do to eliminate excess resistance to scheduled maintenance outages is to provide management with a clear and accurate plan for when the maintenance will take place and how long the outage will last. When management learns to trust your ability to schedule maintenance outages so that they cause minimal disruption to the business, and to keep the outage “on schedule,” they'll be more likely to stop second-guessing you on this issue. You learn more about maintaining maintenance schedules in “Working Within the Window,” later in this chapter.

Unscheduled Outages

An unscheduled outage is any outage that occurs without warning. Even the most watchful and careful system administrator can expect to experience unscheduled outages on his or her network. The causes of unscheduled outages take many forms, such as a configuration glitch, hardware failure, human error, or even a building going up in flames.

Human error is one common cause of unscheduled outages. Some examples of human error that could cause an outage are as follows:

  • Typing the wrong command as root

  • Disconnecting the wrong cable from a server

  • Making a typo in a configuration file

  • Experimenting with new software or configurations on a production server

Most human error can be prevented if the system administrator and other technical staff take extreme care when working with or around production systems. Double-check all changes that you make, or even better, use a change management system to approve and communicate your changes ahead of time. See Chapter 14, “Internal Communication,” for more details on change management.
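One lightweight safeguard along these lines is a guard in your administrative scripts that refuses to act on production hosts unless the change is explicitly confirmed. This is only a sketch; the host-naming pattern is a hypothetical convention, not anything the chapter prescribes.

```python
import re

# Hypothetical naming scheme for production hosts at an example site.
PRODUCTION_PATTERN = re.compile(r"^(www|db|mail)\d+\.example\.com$")

def guard_production(hostname, force=False):
    """Return True if it is safe to proceed on this host.
    Production hosts require an explicit force flag, giving the
    administrator one extra chance to double-check the change."""
    if PRODUCTION_PATTERN.match(hostname) and not force:
        return False
    return True

print(guard_production("test3.example.com"))            # True: not production
print(guard_production("db1.example.com"))              # False: refuse by default
print(guard_production("db1.example.com", force=True))  # True: explicitly confirmed
```

A wrapper like this does not replace a real change management process, but it turns "typed the wrong command on the wrong host" from a silent mistake into a visible refusal.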

Whatever the cause of an unscheduled outage, the events that prompt the maintenance are unexpected. Most unscheduled outages are out of the system administrator's control, as is the amount of downtime they cause. Minimizing the downtime incurred by an unscheduled outage is one of the many challenges a system administrator faces.

Document Outage Procedures

When you experience and resolve an unscheduled outage, document the procedures you took to solve the problem. You are likely to experience the outage more than once, and having documentation up front will help improve the response time for subsequent outages.


Real-World Example: Human Error Outage

An administrator at an ISP was decommissioning an old server in a densely populated rack, and he needed to unplug the server's power cable. The server's cables weren't labeled, and the administrator mistakenly pulled out the power cable for a Network Appliance filer that served email and Web site data for 60,000 users. This accident caused a massive outage. Needless to say, it was heart-stopping for the administrator to hear the critical filer's fans stop spinning! The filer was plugged back in and the administrator verified that everything was functioning properly before continuing to remove the old server. The outage lasted only 5 minutes, but did not go unnoticed by many of the system's 60,000 users. This outage didn't have to occur; had the administrator properly labeled the cables and/or taken the time necessary to trace the proper cable in a densely populated rack that contained business-critical hardware, he could have avoided this embarrassing and potentially costly outage. In fact, the next day the administrator organized a cable labeling project to prevent this kind of mistake from happening again. However, even in well-designed systems, human error can bring everything crashing down and is the cause of more unscheduled outages than anyone is willing to admit.


Partial Service Outages

Some services provide only a single function. For example, POP has the sole purpose of allowing a user to download mail from a mail server. A system administrator can easily determine when a single-function service is down and the repercussions of that outage for the business. In the POP example, an outage means users get an error message when they try to download their mail.

Other services provide more functionality and, therefore, can present more complex diagnostic and maintenance issues. A Web server, in addition to providing simple static Web pages, can also support file uploading; CGI scripts can process forms, and code on the back end can interface with databases. Add SSL to this mix, and you have an entirely separate secure Web server. Each Web server has its own unique functions, turning the simple HTTP protocol into a complex Internet service.

When somebody reports a Web server outage, therefore, the system administrator may have no clear idea what is causing the outage or what services the outage has taken down with it. Is the entire Web site down, or just one part of the site? Is just one script failing, or is a database on the back end down?

Complex services (such as this Web server example) can experience partial service outages in which the service as a whole is up and running, but parts of the service are failing. For example, a banking Web site might be available on the Internet, but the page that lets you check your account balances produces an error. For these complex services, the sooner you understand what part of the service is down, the better you'll be able to resolve the problem and end the partial outage. You need users to give detailed problem reports that help pinpoint the nonfunctioning part of the service, and the responsibility for helping users supply this information lies with your help desk. Help desk staff should ask users to describe exactly what they were doing when they experienced the problem, as well as any error messages they received. The more detailed the problem report, the easier it will be for you to track down a small problem in a large service.
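One way to pinpoint the failing part of a complex service is to monitor each component separately rather than the service as a whole. In this sketch the components and their checks are hypothetical stand-ins; in practice each check would probe a real URL, port, or back-end query.

```python
def find_failing_components(checks):
    """Run one check per component of a complex service and return the
    names of the parts that are failing. Each check is a callable that
    returns True when that part of the service works."""
    failing = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a crashing check counts as a failure
        if not ok:
            failing.append(name)
    return failing

# Hypothetical checks for the banking-site example: the front page is up,
# but the balance page returns an error.
checks = {
    "front page": lambda: True,
    "account balances": lambda: False,
    "funds transfer": lambda: True,
}
print(find_failing_components(checks))  # ['account balances']
```

Component-level results turn a vague "the Web site is down" report into "the account-balances page is failing," which is exactly the detail the help desk would otherwise have to coax out of users.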

Automatically Generated Error Reports

Many applications and operating systems provide a mechanism for automatically generating detailed error reports and sending them to the vendor's help desk. Although no Unix operating systems offer this yet, Network Appliance filers (network file servers) provide an “autosupport” functionality in which system status is periodically emailed to Network Appliance support, including after every reboot or crash. This report helps the support staff gather evidence that you would normally have to provide yourself, as well as allowing them to see potential problems before they cause an outage.
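An autosupport-style report is essentially a periodic snapshot of system metrics assembled into a message. The sketch below only builds the report text; a real script would collect live metrics (df, uptime, log excerpts) and mail the result to a support address, and the host name and metric names here are hypothetical.

```python
from datetime import datetime, timezone

def build_status_report(hostname, metrics):
    """Assemble a plain-text system status report, in the spirit of the
    'autosupport' mechanism described above. Metrics are passed in as a
    dict for clarity; a deployed version would gather them itself."""
    lines = [
        f"Status report for {hostname}",
        f"Generated: {datetime.now(timezone.utc).isoformat()}",
        "-" * 40,
    ]
    for key, value in sorted(metrics.items()):
        lines.append(f"{key}: {value}")
    return "\n".join(lines)

report = build_status_report("filer1.example.com",
                             {"uptime_days": 112, "disk_used_pct": 71})
print(report)
```

Run from cron, with the result piped to mail, a report like this gives support staff (or your future self) a baseline to compare against when something does go wrong.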


Complete Outages and Degraded Service

A complete service outage is the nightmare all system administrators fear. In a complete service outage, a service is 100% unavailable to its users. These outages are all too often the cause of 3:00 a.m. pages on a Saturday.

Not all outages are as catastrophic as those just described. Sometimes problems just cause a service to become degraded. Much as a power brownout causes lights to dim but not fail, users can still use a degraded service; it just doesn't perform as well as it normally would.

A POP3 mail service example can help illustrate degraded service outages. Users of these servers are used to clicking a “Retrieve Mail” button in their clients and receiving their new mail within seconds. However, during an outage in which the mail server is overloaded with incoming mail, users experience delays of up to two minutes before receiving any messages, and another two-minute delay separates subsequent message retrievals. While the mail service is working, it is painfully slow and not practical to use.

Other examples of degraded service include the following:

  • Slow Web server response due to excessive network traffic

  • A single failed Web server in a pool of servers, causing some fraction of HTTP requests to fail

  • An application causing excessive paging, slowing down other services on the same server

Service Monitors Detect Degraded Services

As you learned in Chapter 6, service monitors can detect degraded service using timeouts. If, in the POP3 example just given, the system administrator had used NetSaint's POP3 service monitor and set its timeout to 30 seconds, the administrator would have received an alert for that two-minute delay.
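A minimal version of such a check can be sketched in Python. The host name in the usage note is hypothetical, and the 30-second timeout mirrors the sidebar's example; this is a sketch of the idea, not the NetSaint plugin itself.

```python
import socket

def check_pop3(host, port=110, timeout=30):
    """Connect to a POP3 server and verify that it sends its '+OK'
    greeting within the timeout. Returns True on a healthy response,
    False on a refused connection, a timeout, or an unexpected banner."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            banner = sock.recv(512)
            return banner.startswith(b"+OK")
    except OSError:
        return False
```

A monitoring system would call something like `check_pop3("mail.example.com", timeout=30)` on a schedule and page you when it returns `False`; a two-minute greeting delay, as in the POP3 example above, trips the timeout and registers as degraded service.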


Degraded service, while not a complete service outage, still causes significant problems for end users. Degraded service problems can be among the most difficult to diagnose, because the system is basically working; you have to find the parts that are failing, investigate the causes, and fix the problems.

Degraded service usually indicates that one or more parts of the service are under stress. In the earlier example, the POP3 service was slow. The logical place to start looking for problems in that situation is the server itself. Many subsystems can be stressed on a Unix server, including CPU, disk, network, and memory, and the system administrator must examine them all to find the problem. Chapter 11 discusses such diagnostic examinations in detail.

Distributed Service Outages

A running joke among system administrators offers this definition of a distributed service:

A distributed service is one in which a server you have never heard of, in a place you've never been, can cause the machine on your desktop to crash.

As Homer Simpson likes to say, “It's funny because it's true.” Distributed service outages differ from other kinds of outages in one important way: although most outages on server A result from a failure on server A, a distributed service outage can occur when a failure on server B causes a failure on server A.

A distributed service (all jokes aside) is one that resides on a remote system but is critical to the functioning of another system. DNS (domain name system) and NFS (network file system) are good examples of distributed services. DNS provides critical name resolution for most Internet applications, but it resides on remote servers. NFS serves file systems to remote clients, some of which are critical to the operation of those clients. If DNS servers become unavailable, Internet applications will grind to a halt. A failed NFS server housing shared applications could render all of those applications unavailable to its clients.

To better understand the dynamics of distributed service outages, consider the example of the NFS file server. Instead of installing gigabytes of software on each system in your infrastructure, you can install the software once on an NFS file server and share it with the other servers. Now imagine that you've installed user shells on that NFS server: shells such as bash and tcsh that aren't available by default on your other servers. If the file server ever becomes unavailable, so do the shells. If the shells are unavailable, users can't log in to any of your servers. A failure on the NFS file server has caused an outage on your other servers.

Perhaps the most frustrating aspect of a distributed outage is that the failing remote server may be under somebody else's control. You might be responsible for Web servers, but an entirely different team in your company may administer the DNS servers that just went down. Although you have done nothing wrong and your servers are functioning normally, somebody else's servers that you rely on can cause an outage in your system, and there's nothing you can do about it but complain and wait.

Real-World Example: A Distributed File System

AFS is a large-scale distributed file system often found at universities. Files are stored in volumes located on any number of file servers anywhere on the network. At one university, the popular email program Pine was installed in an AFS volume, along with all of the other shared applications that over 60,000 students, staff, and faculty used every day. Pine was the university's primary email program at the time. The AFS file servers were running on aging hardware, and as such were constantly crashing due to ever-increasing load. The administrators responsible for the machines running Pine were fed up with the outages, especially when the volume housing Pine became unavailable. To resolve this problem, the administrators installed Pine locally on each server, making the most popular application available regardless of AFS volume unavailability. Working with multiple copies of the program was less convenient, especially during upgrades. But the administrators preferred that small hassle to repeatedly explaining outages that weren't caused by any factor within their control.


Distributed outages are very difficult to prevent, since the whole point of distributed services is to offload services onto other systems, and you are often not in control of those systems. However, there are several steps you can take to minimize the impact a distributed service outage has on your own systems:

  • Ensure that there are redundant resources providing each service—especially for critical services like DNS and file servers.

  • Document the dependents of each distributed service so you know what will be affected when a service fails.

  • Deploy locally any “problem services” that fail too often, so you can control those services yourself. See the previous “Real-World Example” sidebar.
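The second guideline, documenting dependents, can be kept machine-readable so that the blast radius of a failure is one query away. The sketch below assumes a hand-maintained dependency map with hypothetical service names.

```python
def affected_by(failure, dependents):
    """Given a map of service -> services that depend on it directly,
    return everything transitively affected when `failure` goes down."""
    affected = set()
    stack = [failure]
    while stack:
        svc = stack.pop()
        for dep in dependents.get(svc, ()):
            if dep not in affected:
                affected.add(dep)
                stack.append(dep)  # dependents of dependents are affected too
    return affected

# Hypothetical dependency map: DNS underpins mail and web1; the NFS
# server holds the shells that logins on web1 and web2 require.
dependents = {
    "dns": ["mail", "web1"],
    "nfs": ["web1", "web2"],
    "web1": ["load-balancer"],
}
print(sorted(affected_by("nfs", dependents)))  # ['load-balancer', 'web1', 'web2']
```

When the NFS server fails, a map like this immediately tells you which downstream services to check and which users to notify, instead of discovering the dependents one complaint at a time.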

Third-Party Outages

Third-party outages occur when a system owned by another organization fails and causes a failure on one of your systems. Though similar to distributed outages, third-party outages differ in one important way: a distributed outage actively involves the use of a service on the remote system that causes the failure. In a third-party outage, the system administrators whose systems suffer the outage don't even know the remote system exists, and they certainly don't use any services on it.

The classic example of a third-party outage is a backbone failure. Everyone depends on backbones, which typically run at speeds between 45Mbps (DS3) and 2.4Gbps (OC-48), to connect networks around the world to form the Internet. Multiple backbones provide several different high-speed paths between any two machines on the Internet; the Internet wouldn't function without them.

When a major backbone goes down, everyone in the country knows it! No traffic can get from one part of the backbone to another until routers eventually remove the route to the failed backbone and find other ways to route packets to their destinations. The most obvious symptom of this problem is a loss of connectivity to servers you access every day, especially those across the country. If your company suffers this kind of third-party outage, many of your customers will lose connectivity to your services. It's scary to know that a piece of hardware you never asked to use could cause such a major outage for your organization, but that's the nature of the shared network called the Internet.

Some examples of third-party outages include the following:

  • A major Internet backbone outage (such as a backhoe cutting into fiber optic cabling underground) prevents you from reaching certain sites.

  • A client's mail server is down, preventing you from sending email to them.

  • The router that terminates your T1 line at your ISP fails, causing your organization to lose all connectivity to the Internet.

Third-party outages are out of your control as a system administrator. The most important thing you can do is report the outage to the third party and keep track of any tickets that the third party opens for you. Report these tickets to your help desk and explain the situation to help desk staff so they can adequately update your users who call in to report the problem. Users should know that the problem is out of your control, but that it has been reported and is being worked on. Check back periodically with the third party to verify that progress is being made on resolving the outage.

Maintenance Windows

One of your first responsibilities as a Unix system administrator is to specify your organization's maintenance window: the time reserved for routine scheduled maintenance tasks such as rebooting routers, upgrading servers, adding disk drives, and so on. Maintenance windows specify a time when service is not guaranteed, so that administrators have time to fix minor problems or upgrade servers.

Routine work such as hardware racking and application installation can be done outside of the maintenance window. But if you are planning to do any work that requires system downtime, or even work that has only a slight chance of bringing something down, do it during the maintenance window. You'll save yourself a lot of trouble if something does go wrong.

You need to consider three factors when choosing a maintenance window: time of least usage, maximum maintenance time, and business requirements. The following sections discuss these factors in detail.

Time of Least Usage

Common sense dictates that you don't want to bring down your systems when all of your users are using their services. The best times for a maintenance window are during the low points of system usage. By routinely monitoring your services, you can easily determine the hours during which they receive the least usage. Throughout Chapter 6, you will find many of the tools you can use to do this; MRTG is one such tool.

Figure 8.1 shows a graph of Internet traffic at a fictitious company. The graph clearly indicates that the low point of usage for this system is at about 5:00 a.m., and that makes the perfect time around which to specify a maintenance window. Graphing your own system use can help you determine the best time for your maintenance window, as well.

Figure 8.1. An rrdtool (an MRTG-like application) graph of network bandwidth usage clearly shows that this organization's optimum maintenance window is between 3:00 a.m. and 7:00 a.m.


Track Usage over Time

Note that you shouldn't choose a maintenance window based solely on one day's usage logs; look at trends of usage over a week or two, and find the average usage lows. Also take all of your services into account and look for common low points that you can take advantage of.


Different types of businesses have different trends in high and low usage points. ISPs usually peak around 8:00 p.m., when everyone is home checking their mail and surfing the Web, and have low points around 4:00 a.m. Universities tend to have a lot of night-owl students, so their usage may peak around 10:00 p.m., with the least usage at 4:00 a.m. Regular 9-to-5 businesses peak around 1:00 p.m., with a small dip around noon for lunch; minimum usage is between 6:00 p.m. and 6:00 a.m. International business complicates the analysis even further: users in London might be using your service heavily while everyone in the United States is still sleeping.
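Averaging usage over several days, as recommended above, can be sketched in a few lines. The traffic figures here are invented for illustration; real input would come from MRTG or rrdtool logs.

```python
from collections import defaultdict

def quietest_hour(samples):
    """Average usage per hour of day across several days of samples and
    return the hour with the lowest average, which is a natural center
    for a maintenance window. `samples` is a list of (hour, usage) pairs."""
    totals, counts = defaultdict(float), defaultdict(int)
    for hour, usage in samples:
        totals[hour] += usage
        counts[hour] += 1
    return min(totals, key=lambda h: totals[h] / counts[h])

# Two days of (hour, requests-per-second) readings, invented for illustration.
samples = [(4, 12), (13, 95), (20, 120), (4, 9), (13, 90), (20, 140)]
print(quietest_hour(samples))  # 4
```

Running this across all of your services, not just one, helps find the common low point the sidebar above recommends.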

Only a thorough analysis of your data can tell you the low-usage time for your own system, but determining when that time occurs is critical for assigning an effective maintenance window. You need to understand the daily operations of your business in order to specify the most effective (and least intrusive) maintenance window for everyone.

Maximum Maintenance Time

After you've discovered the time of least usage for your services, you need to decide how much time to allow for maintenance. Allow yourself enough time to fix the most complex of problems without extending maintenance time into periods of significant usage. Typical maintenance windows last anywhere between 2 and 6 hours, with 3 to 4 hours being the norm.

Leave Back-out Time Within Your Window

Always include enough time in your maintenance window to back out any changes you made before the window expires. Not every maintenance job is successful, and you want to allow yourself enough time to clean up any mistakes you made and regroup. You should reserve at least 25% at the end of your maintenance window for this back-out time. If the processes you are performing are unfamiliar to you, allocate even more time to account for the learning curve.
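The 25% rule of thumb is easy to apply; this small sketch computes the point in the window at which new work must stop so the back-out reserve stays intact.

```python
def backout_deadline(window_minutes, reserve_fraction=0.25):
    """Return how many minutes into the window the work itself must stop,
    reserving the final fraction (25% by the rule of thumb above) for
    backing out failed changes."""
    return int(window_minutes * (1 - reserve_fraction))

# A 4-hour (240-minute) window: stop new work at the 180-minute mark.
print(backout_deadline(240))  # 180
```

For unfamiliar procedures, pass a larger `reserve_fraction` to account for the learning curve, as the note suggests.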


Business Requirements

Your business may have specific requirements that will play a role in determining your optimum maintenance windows. Client contracts may guarantee that services will be available during certain hours; sometimes client contracts even specify the maintenance window for you. To make matters even more complicated, different contracts could specify different maintenance windows, a situation that becomes a real nightmare when working on shared systems, such as a router.

Beyond contractual requirements, some systems operations may depend on services being up at certain times. If a bank generates monthly statements between 12:00 a.m. and 6:00 a.m. on the last day of each month, you can't fix servers at that time. Remember to take your backup schedules into account as well; don't interfere with backup infrastructure without either disabling or moving the backup schedule for that day.

One very effective method of coordinating all of this information is to keep a simple calendar and post each event, including maintenance windows, scheduled outages, and uptimes required by service level agreements. Recording events on a paper calendar might work for small environments, but an electronic calendar works best for larger organizations. Calendaring software comes standard with the GUI in most operating systems. These calendars immediately catch conflicts and warn you, for example, if scheduled maintenance falls in a time frame during which a client has required your services to be available.
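Even without calendaring software, the conflict check such a calendar performs is a simple interval-overlap test. This sketch uses hypothetical contract data, with times expressed as minutes since midnight.

```python
def conflicts(maintenance, commitments):
    """Return the SLA commitments that overlap a proposed maintenance
    interval. Intervals are (start, end) pairs in any consistent unit;
    commitments maps a client name to its required-uptime interval."""
    m_start, m_end = maintenance
    # Two intervals overlap when each one starts before the other ends.
    return [name for name, (c_start, c_end) in commitments.items()
            if m_start < c_end and c_start < m_end]

# Proposed window 03:00-07:00 (minutes 180-420), against two hypothetical contracts.
commitments = {
    "bank statements": (0, 360),    # must be up midnight-06:00
    "daytime client": (480, 1080),  # must be up 08:00-18:00
}
print(conflicts((180, 420), commitments))  # ['bank statements']
```

Here the proposed window collides with the bank's statement run, so the maintenance would have to move later or wait for a different day.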

Working Within the Window

After you've established your maintenance window, you should honor its boundaries and perform only routine scheduled maintenance within that window. If you start to bring services down before or after the window, you will likely affect users who expect the services to be up, and lose the trust of those users as well as your management.

One problem with maintenance windows is that your work often unintentionally runs past the end of the window into the normal operating hours of your services. One way to stay within your window is to set a maximum time for instituting the scheduled changes, after which you will back out the changes and end the outage, regardless of circumstances. This is a tricky game to play, however; you must balance the need to complete the change with the need to stay on schedule. If you are only running 10 minutes behind schedule, it might not be worth it to back out all of your work; but if you are running an hour behind, it makes sense to back out because your users will definitely notice the longer outage.

Ultimately, your management should make these kinds of decisions, especially if the outage affects a large portion of your user base. If you do back out because of time constraints, regroup and figure out where the bottlenecks occurred before trying again; don't repeat the same mistakes you made the first time.

Monitoring Compliance with Service Level Agreements

Customers expect a certain level of service from their providers, especially those customers who have signed contracts specifying those levels. These agreements, called service level agreements (SLAs), require administrators to closely monitor the uptime of their services, because contracts depend on those numbers. An SLA can be specified for any number of metrics, though the two most common are uptime and response time.

Monitoring Uptime Compliance

The most important measure of service is uptime. Simply put, for what length of time can your users access and use your services? In Chapter 6, you learned the difference between availability and usability. The distinction between these two conditions is important when monitoring uptime and even more important when measuring it.

While a service may be available for use, it is not considered “up” unless users can use the service as they normally would. A mail server that accepts user connections but fails with a “permission denied” error is a system that is available, but not usable. Uptime includes only usable service hours.

Uptime is often measured as a percentage of the total time the service could and should be usable. Some businesses like to report uptime once a month, others once per year. In any case, the goal most organizations strive for is 99.999% uptime, or “the five nines.” This exceptional amount of uptime assumes that all downtime will be used for short-lived routine maintenance.

A ratio of 99.999% uptime works out to just over 5 minutes of downtime per year. Even the more forgiving “three nines” (99.9%) allows just under 9 hours per year; assuming you have no unscheduled outages, that amounts to 2 major 4-hour maintenance windows per year, or 8 short 1-hour outages. Throw some random outages in there, and you will ultimately have less time for patching, upgrades, and whatever else you do during maintenance windows.

Other organizations go for the gold and try to reach “the seven nines,” or 99.99999% uptime. This uptime percentage allows only about 3 seconds of downtime per year, a lofty goal for sure, but not completely out of reach, as you learn in the discussion of high availability in Chapter 10, “Providing High Availability in Your Unix System.”
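You can verify any of these downtime budgets yourself; the arithmetic is only a few lines of Python:

```python
# Allowed downtime per year for an uptime level expressed as a "number of nines".
SECONDS_PER_YEAR = 365 * 24 * 3600

def downtime_budget(nines):
    """Seconds of downtime per year permitted by an uptime of 1 - 10**-nines."""
    return SECONDS_PER_YEAR * 10 ** -nines

# Print the budget for three through seven nines:
for n in range(3, 8):
    pct = 100 * (1 - 10 ** -n)
    print(f"{pct:.5f}% uptime -> {downtime_budget(n) / 60:.2f} minutes/year")
```

Each extra nine divides the yearly allowance by ten, which is why the higher tiers leave essentially no room for scheduled maintenance windows.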

Deciding Uptime Requirements

Your services' uptime requirements should be dictated by your organization's management and the clients with whom you have service level agreements. However, every service requires downtime for routine maintenance; for example, if you need one hour per month of downtime for patches or upgrades, communicate that to your management so no contracts are signed with an SLA allowing less than one hour per month of downtime.


Reporting uptime can be tricky. Accurate reporting requires constant monitoring of all of your services, without failures in the monitoring system. In addition, the granularity of your monitoring intervals becomes more critical as your uptime demands increase. If you monitor each of your services once every 15 minutes, you'll miss many outages that last less than 15 minutes, and every outage the system does report carries an uncertainty of up to 30 minutes. Consider a 17-minute outage and a 43-minute outage in a system with 15-minute interval monitoring: both might appear as 15-minute failures, as shown in Figure 8.2. That's a discrepancy of 26 minutes, time for which you do not know the status of your service. A 30-minute uncertainty is unacceptable in a five-nines environment, where the entire yearly downtime allowance is only about 5 minutes.

Figure 8.2. You can't always tell the difference between a 17-minute outage and a 43-minute outage if the monitoring interval is 15 minutes. The gray area represents the outage length of 15 minutes that would be reported by the monitoring software.


A monitoring interval of 1 minute might be more appropriate in this environment; in that case, it is much easier to tell the difference between a 1-minute outage and a 5-minute outage. In addition, with that much data, it's much easier to prove your uptime to clients.

The one rule you should take away from this section is that the monitoring interval for a service should be well below the smallest amount of downtime its SLA obliges you to account for. These smaller intervals allow you to report actual downtime with more precision, as well as detect short-lived failures that would otherwise go unnoticed.
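The uncertainty that Figure 8.2 illustrates is easy to model. The sketch below is a simplified model, not how any particular monitoring product computes downtime: times are in minutes, polls occur at exact multiples of the interval, and the reported outage runs from the first failed poll to the last failed poll.

```python
def observed_outage(start, end, interval):
    """Outage length as reported by a poller checking at t = 0, interval, 2*interval, ...
    The reported length spans from the first failed poll to the last failed poll."""
    first_fail = -(-start // interval) * interval  # first poll at or after the outage starts
    last_fail = (end // interval) * interval       # last poll at or before the outage ends
    if first_fail > last_fail:
        return 0                                   # the outage fell entirely between polls
    return last_fail - first_fail

# Two outages of very different lengths can report identically (compare Figure 8.2):
short17 = observed_outage(14, 31, 15)  # 17-minute outage spanning the polls at t=15 and t=30
long43 = observed_outage(16, 59, 15)   # 43-minute outage spanning the polls at t=30 and t=45
```

Shrinking the interval shrinks the gap between the reported length and the actual length, which is exactly why the monitoring interval has to be much smaller than the downtime your SLA lets you accrue.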

Netcool's Reporting Functionality

Netcool contains powerful SLA reporting functionality. It can report on historical service levels from a database and even notify you when current service levels cross a predetermined threshold, such as 99% availability per hour.


Monitoring Response Time Compliance

After uptime, the most tangible aspect of any network service is its response time. How long does a service take to perform and respond to a user's requests? Because it plays such a large part in the user experience, most companies dedicate large chunks of time to optimizing their services' response times. Chapter 6 introduced several representative monitoring tools that can provide response time statistics, including Netcool and NetSaint.

Response time failures and other timeouts usually qualify as downtime when measuring service levels and should be recorded as such. If you are lucky enough to actually be involved in a service level contract negotiation, look for this clause and verify that you can perform at the levels specified. If the contract expects a Web site to respond within 5 seconds for every request, make sure your systems can meet that requirement! Both Chapter 6 and Chapter 11 present information that can help you determine whether your services can perform as requested, and if not, whether they can be tuned to do so.
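A simple way to treat response time as an SLA metric is to time each probe and count anything over the limit, including errors and timeouts, as downtime. The helper below is an illustrative sketch; `within_sla` and its probe callable are names invented here, and in practice the probe would be a real request against your service:

```python
import time

def within_sla(probe, limit_seconds):
    """Time one request (any zero-argument callable) against a response-time SLA.
    Returns (elapsed_seconds, compliant); errors count as non-compliant, per the SLA."""
    start = time.monotonic()
    try:
        probe()  # e.g. lambda: urllib.request.urlopen(url, timeout=limit_seconds)
        elapsed = time.monotonic() - start
        return elapsed, elapsed <= limit_seconds
    except Exception:
        return time.monotonic() - start, False

# Simulate a service that answers in about 100 ms against a 5-second SLA:
elapsed, ok = within_sla(lambda: time.sleep(0.1), 5.0)
```

Run a probe like this at your monitoring interval and record each result, and you have the raw data needed to prove (or disprove) response-time compliance.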

Observing Production Values

This isn't a book on morality, but every system administrator should know and obey his or her own set of production values. Production values are the rules that minimize risk on production servers and can prevent outages from occurring in the first place. Production values may differ from organization to organization, and even from person to person, but they should all include your department's commitment to honor these basic promises:

  • To use production servers appropriately

  • To announce all maintenance

  • To watch logs and monitors

  • To respond quickly to outages

Establishing and honoring production values is essential to building credibility and respect for you and your IT department. To better understand the issues involved in each of the basic values listed here, read the sections that follow.

Using Production Servers Appropriately

Systems are often broken down into three categories: development, staging, and production. Each of these system types has a specific use, and the production system is the most critical to the business of your organization. Your first production value should include a commitment to use the production servers as they should be used, to protect their service to your organization.

To use production servers wisely, you need to understand how all three system categories are used. Development systems are used for testing and developing new services. It doesn't matter if development systems are up or down; a business's financial well-being doesn't depend on those systems (although some developers might complain).

Staging systems are where new services are migrated for testing in a production-like environment before actual deployment onto production systems. Staging systems usually are designed to look exactly like the production environment, so people can get a good idea of how services will behave in production. Not everyone has or can afford a separate staging environment; in that case, development systems often play this role.

Production systems are the key to your business. They are where the final versions of your services are deployed and made available to your users. Their uptime is critical to your organization's success.

It's important to use these systems appropriately. Installing a production Web server on a development server is a bad idea: you're probably not monitoring that server, and it may not have the capacity to handle your production load. At the same time, you shouldn't develop on a production machine. Systems in production have one purpose, and that is to serve users. Developing on those systems takes away vital resources, such as CPU and memory, from your production applications; the loss of those resources can cause production applications to underperform. Even worse, you might overwrite a configuration file and cause a complete service failure.

You can drastically minimize service outages simply by using production systems appropriately. Do your development and testing on development systems, and let the production environment do its noble job of servicing your users.

Announcing All Maintenance

Users usually don't know and don't care about the day-to-day work you perform on your systems, but they do care if the services they use go down without warning. When you are faced with maintenance that could potentially cause an outage, no matter how minor, you should announce that maintenance to your users.

Your announcement should specify what you are doing in high-level layman's terms and give users an accurate estimate of when the work will be done. An email like this would be appropriate:

From: Chad Admin
To: Widget Users
Subject: Maintenance

On Sunday February 3 at 2 AM, Widget system administrators will be replacing a
failing disk in the disk array that stores the data for the Widget application.
The work should take no longer than 30 minutes, and no downtime is expected.

Thank you,

The IT Staff



A follow-up email documenting the success or failure of the maintenance would be appropriate as well. You should also think about what means of communication to use for these announcements; an email delivers the announcement to each user, whether he or she goes looking for it or not. Making sure all users are informed is essential for critical situations. Less critical work can be posted on a Web site or a newsgroup so users aren't force-fed information they don't need, but can still be informed about upcoming issues. Chapter 15, “Interacting with Users,” discusses the use of these and other forums for communicating information to your users.

Grabbing Users' Attention

Emphasize the urgency of maintenance announcements by crafting attention-grabbing subjects. Words like “URGENT,” “WARNING,” “NOTICE,” and “OUTAGE” in all capitals will cause most users to pay attention to announcements they would otherwise ignore or delete. Only use these words for announcing downtime or other urgent events, though; you don't ever want to cry wolf in these situations.


Watching Logs and Monitors

You could have the most verbose logging in recorded history and the most precise monitoring that today's technology can offer, but you'll gain nothing from them if you don't pay attention to their output. When your monitoring system notifies you that there's a problem, even with the most minor parts of a system, take it seriously and investigate further. Even a minor problem can be an indication of a greater one.

You should never rely solely on your monitoring system to reveal system problems; review your logs daily to note any anomalies. System logs contain more information than could possibly be understood by log analyzers such as logsurfer (documented in the section titled “Log Monitoring” in Chapter 6), and it's up to you to look for any anomalies that you haven't configured your software to detect. Log analyzer programs are invaluable tools, but they are useless without your configuration. Take some time every day to look at the logs for your critical systems and become familiar with their contents. After you get to know the usual contents of a log file, it is much easier to pick problems out of the thousands of familiar log messages.
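One way to make this daily review faster is to filter out the messages you already know are routine and read only what's left. The sketch below illustrates the idea in Python; the "familiar" patterns are invented examples (build yours from your own logs), and this is a reading aid, not a substitute for a real analyzer such as logsurfer:

```python
import re

# Patterns already classified as routine; anything else deserves a human look.
# These example patterns are hypothetical.
FAMILIAR = [
    re.compile(r"sendmail\[\d+\]: .* stat=Sent"),
    re.compile(r"sshd\[\d+\]: Accepted publickey"),
    re.compile(r"CRON\[\d+\]:"),
]

def unfamiliar_lines(log_lines):
    """Return the log lines that match none of the familiar patterns."""
    return [line for line in log_lines
            if not any(p.search(line) for p in FAMILIAR)]

sample = [
    "Jan 30 13:34:02 goat sendmail[4711]: g0UIZw: stat=Sent",
    "Jan 30 13:35:58 goat scsi: WARNING: ssd15 Error for Command: write(10)",
]
suspects = unfamiliar_lines(sample)  # only the SCSI warning survives the filter
```

As the next note suggests, each anomaly you diagnose should graduate into the familiar-pattern list (or into your analyzer's configuration), so the residue you read by hand keeps shrinking.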

Tweak Log Analyzers Along the Way

As you discover the log patterns that correspond to new problems, reconfigure your log analysis tools to recognize those new patterns (a simple process with tools such as logsurfer). These periodic reconfigurations save you time and effort when dealing with recurring problems.


Responding Quickly to Outages

During an outage every minute counts, especially when service levels are involved. When you receive notification of an outage or potential outage, respond quickly. Not only will a quick response reduce the total length of the outage, but people (and clients) are less likely to notice short-lived problems. This is why you should institute proximity requirements for all on-call staff (see Chapter 5, “Support Administration”). Someone who is no more than 30 minutes from your data center is probably going to respond to an emergency faster than someone who is 2 hours away visiting family.

Provide Remote Access for Administrators

Many outages are software-related and can be solved remotely, eliminating the need for traveling to the data center. To take advantage of this capability, however, all of your on-call administrators must have some kind of remote access—a dial-up ISP, ISDN, DSL, or a cable modem. If remote access becomes a job requirement, your organization should pay for this access.


Outage Procedures

Some administrators joke that they'd need to start ripping cables out of their data centers and put them back an hour later to be recognized for outstanding performance during an “outage.” Right or wrong, a Unix network's outages (or lack thereof) are often the metric by which the Unix system administrator is judged. What was your uptime this year? Did you meet your clients' service levels? Remember that time your mail server was down? Both clients and management care about outage issues, so it's worth your time to craft procedures that will help you minimize outages and their downtimes.

Escalation procedures exist to move a problem up the chain of command until eventually someone in the chain can solve it. Procedures shouldn't end at that point, though. Developing procedures for handling outages ensures that nobody misses critical tasks such as handling communication and updating trouble tickets. The actual procedures will be specific to your organization, but the general guidelines discussed in the following sections can help you get started.

Assigning Problems to Appropriate Staff

While the help desk may assign a problem to your group, not everyone, including the on-call person, is the best fit for every problem. Your group probably has a variety of expertise; some of you may be senior administrators, some junior. Others may know more about Linux than Solaris. Still others may have extensive experience with operating systems but won't touch hardware with a 10-foot pole.

Know your IT staff's strengths and weaknesses and use that information to assign problems to the right person. Even if you are the on-call person, don't spend 2 hours on a problem that another member of your group can fix in 2 minutes. Keep a contact list for your team and ask for assistance when necessary. On-call duty shouldn't mean that you have to solve every problem, but you should certainly be responsible for orchestrating the problem-solving process in the most efficient way possible.

Maintain Ongoing Communication

It's only natural when dealing with a difficult problem to focus all of your energy on solving it, while blocking out all other external stimuli. This may speed up your own problem-solving process, but it leaves everyone else in your organization wondering what's going on. Always keep the communication lines open and send back as much information as possible to the help desk, your team, and your managers if necessary.

They in turn can keep other parties informed, like clients and senior management. Periodic check-ins can help facilitate this communication. During long outages, checking in with the help desk every hour or so is a good practice. These periodic check-ins also keep you informed of how serious the outage is perceived to be from the users' point of view.

Use a Headset to Let You Keep Working

Purchase a phone headset to use during emergencies at the data center, so you can work and talk at the same time. The headset is especially helpful when talking to technical support, who usually ask you to type commands.


Of course, after the problem is resolved or you've come to a crossroads (maybe you need to order parts or wait for vendor support), contact the help desk immediately and update the status of the problem. This can also be done with a trouble ticket system like Remedy or req, which streamlines the whole process for you.

Maintain Activity Logs

You may remember and understand everything about an outage the instant you finish working on it, but it's a good bet that you'll forget about 50% of what you did by the next day. Keeping a detailed activity log helps you document the entire problem-solving process, including command output, vendor contacts, and timelines.

All good trouble ticket management systems provide some sort of logging functionality (Remedy likes to get personal and calls it a diary). These logs are invaluable tools both for future reference and for analysis of a problem that just occurred. A very simple but typical log entry might look like this:

Wed Jan 30 2002 21:32 brian: Users experiencing slow response time on the mail server "goat". I am working on the problem.

Wed Jan 30 2002 21:44 brian: Logs indicate a failing disk (see below). Will call vendor to replace the disk.

Jan 30 13:35:58 goat scsi: [ID 107833 kern.warning] WARNING: /pci@1f,2000/SUNW,ifp@1/ssd@w21000020374f91d8,0 (ssd15):
Jan 30 13:35:58 goat Error for Command: write(10) Error Level: Retryable
Jan 30 13:35:58 goat scsi: [ID 107833 kern.notice] Requested Block: 12214112 Error Block: 12214112
Jan 30 13:35:58 goat scsi: [ID 107833 kern.notice] Vendor: SEAGATE Serial Number: LS934473
Jan 30 13:35:58 goat scsi: [ID 107833 kern.notice] Sense Key: Aborted Command
Jan 30 13:35:58 goat scsi: [ID 107833 kern.notice] ASC: 0x47 (scsi parity error), ASCQ: 0x0, FRU: 0x3

Wed Jan 30 2002 21:56 brian: Vendor thinks it's a bad Gigabit card. Will send new card and disk by 10:30 AM tomorrow. Our case number is 234763.

Thu Jan 31 2002 10:45 brian: Disk & card received. Sending steve to replace the card during the maintenance window tonight.

Fri Feb 1 2002 02:35 steve: Card replaced successfully. Errors seem to have disappeared. Will ship new disk and bad card back to vendor.



This log shows the progress of the problem resolution process, including important log data. The data will be very useful in the future: it was initially assumed that goat had a disk problem when it was really the gigabit card that was failing, and that mistaken assumption can be avoided next time now that the data has been logged.

Reference the Activity Logs During Outages

You should consult old activity logs when new outages occur. There may be tips and tricks in those logs that can save you time and effort when dealing with identical or similar problems.


Remain Calm

It is difficult to remain calm in highly visible outage situations, but you can't debug a problem and execute highly technical processes while you're running around like a chicken with its head cut off. As an old coworker once said, “You get ice water in your veins,” meaning that as you experience various outages and problems over the years, you become more and more calm even in the most dire of situations; your blood no longer boils at the mention of the word “outage.”

The more panicked you are during an outage, the more likely you are to make a mistake, possibly worsening the situation. What's worse, panic is contagious: if you are running around your office or data center screaming, other members of your staff are likely to start doing the same.

Lead by example and stay calm; analyze the problems you need to solve, and take things one step at a time. If other administrators around you are panicking, ask them politely to leave, as they only add to the problem at hand. Nontechnical coworkers are likely to stop by and ask what's going on; it is very easy to get angry at them for bothering you during an outage, so simply ask them to go back to their desks and let you do your job. Your managers may need to be told this as well; you cannot possibly remedy a major outage with a manager looking over your shoulder reminding you how much money the outage is costing the company. Just ask everyone to leave you alone so you can fix the problem.

Root Cause Analysis

All problems, no matter how complex, have a root cause. A root cause is where a problem originated: the spark that caused the fire. Sometimes finding the root cause of a problem is easy. In the activity log example in the preceding section, the root cause of the slow response time on goat was a failing gigabit card. Sometimes it's not so easy; often the problem must be traced back through many steps to find out what action truly caused it.

For example, a user might call your help desk saying that she isn't receiving any of the email her friends are sending her. Upon further investigation, you find that the file system housing her mailbox is full. You reclaim some space, and mail begins flowing into the system again. What is the root cause of the problem? Was it the full file system? That certainly caused the user's problem, but what caused the full file system? Maybe you were suddenly sent a massive amount of spam that filled up the mailboxes on your system. In that case, the spammers are to blame, and you can block them from any further access. The root cause was the spammers, and you remedy the problem by altering your SMTP rule set to deny them access to your mail systems.

Perhaps instead the full file system was the result of a gradual increase in usage over the past few weeks. Administrators were either not aware of or ignored the trend of increasing disk usage. The root cause of the problem in this case is the administrators' inattention, which could be remedied by implementing new monitoring procedures or configuring a software monitor to report excessive disk usage before it becomes critical. This example perfectly demonstrates a situation in which proactive monitoring (discussed at length in Chapter 6) is more effective than reactive monitoring.
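A software monitor of the kind just described can turn a creeping trend into a warning long before the file system fills. The sketch below fits a crude linear trend to hypothetical (day, percent-used) samples; the function name, sample data, and the linear model itself are all invented here for illustration:

```python
def days_until_full(history):
    """Project when a file system fills, from (day, percent_used) samples.
    Returns the estimated days of headroom left, or None if usage isn't growing.
    A simple least-squares slope: crude, but enough to flag a trend."""
    n = len(history)
    mean_x = sum(d for d, _ in history) / n
    mean_y = sum(u for _, u in history) / n
    num = sum((d - mean_x) * (u - mean_y) for d, u in history)
    den = sum((d - mean_x) ** 2 for d, _ in history)
    slope = num / den                 # percent of disk consumed per day
    if slope <= 0:
        return None                   # flat or shrinking usage: no projection needed
    _, current = history[-1]
    return (100.0 - current) / slope  # days until 100% at the current rate

# Usage creeping up about 1% per day and currently at 90%:
headroom = days_until_full([(0, 86.0), (1, 87.0), (2, 88.0), (3, 89.0), (4, 90.0)])
```

Feed a projection like this into your alerting with a threshold (say, warn at 14 days of headroom), and the gradual-fill scenario above becomes a routine ticket instead of an outage.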

What can root cause analysis do for you beyond assigning blame? It helps you identify the real causes of your problems rather than the immediate causes or symptoms. While you can deal with immediate causes as they are found, eliminating root causes can make their resulting problems disappear forever, as there is no more seed from which outages can spawn.

In general, identifying a root cause requires you to trace the development of a problem from its symptoms back to the event or condition that set the problem in motion. After determining what caused the actual symptoms of the problem, you must determine what caused those conditions, and so on, until you reach a cause that cannot be traced back any further; that is the root cause of the problem. In essence, you are creating a genealogy of the problem, tracing its roots back to the beginning.

Avoid Band-Aid Solutions

Band-Aid solutions are those that mask the symptoms of a problem but do not actually eliminate its cause. For example, in the spamming example mentioned previously, adding more disk space would prevent the mailbox file system from filling up, but only temporarily. This is a Band-Aid solution; to truly solve the problem, the root cause, in this case the spammers, must be found and remedied.


Summary

Managing outages can be a challenging part of a system administrator's job; often, these outages can be scary and overwhelming. However, you can't let outages control your daily life or take over your IT department. As you resolve problems over the years, you learn how to better analyze new problems and fix them using previous experience. Your calm will eventually overtake your anxiety, and you will be handling outage situations with composure you didn't know you had.

This chapter introduced the most common types of outages that can occur and how to manage them. Some outages are created on purpose to service hardware and software; these scheduled outages should be performed within a designated maintenance window. When outages do occur, it is important to accurately measure how long they last so you can monitor your compliance with service level agreements (SLAs). Your actions during an outage are important as well; establishing production values, such as responding to problem reports in a timely manner, and documenting outage procedures will help you and other administrators deal with problems more effectively. Finally, when an outage is resolved, you should perform a root cause analysis to determine the true cause of the problem and fix it; this goes a long way toward eradicating those outages from your infrastructure for good, and fewer outages make for a happier system administrator!