[Cockcroft98] Chapter 14. Kernel Algorithms and Tuning



This chapter explains some of the inner workings of the kernel and describes buffer sizes and variables that can be changed by a system administrator when tuning a kernel. “Tunables Quick Reference” on page 557 contains a quick summary of what can safely be changed.

The kernel algorithms are noted as being based on System V Release 4 (SVR4) if they are generic, and as Solaris 2-based if Solaris 2 is different from generic SVR4. Later releases of SunOS 4 have Solaris 1 names, which I have avoided for clarity (although this is not considered the correct use of the terms). The kernel part of Solaris 2 is known as SunOS 5.

There are no “magic silver bullet” tunable values that will make a big difference to performance. If you look at the relative balance of user and system CPU time, and user CPU time is much higher than system time, then kernel tuning can have little or no effect.

Kernel Tuning

Tuning the kernel is a hard subject to deal with. Some tunables are well known and easy to explain. Others are more complex or change from one release to the next. The settings in use are often based on out-of-date folklore. This chapter identifies and explains some of the variables that are safe to tune. If you also use other versions of Unix, you may be accustomed to having a long list of tunable parameters to set up when you rebuild the kernel. You are probably looking for the equivalent list for Solaris 2, so I will compare tunables with other Unix implementations to identify the ones that size automatically in Solaris 2 and don’t need tuning.

A fundamental issue is the distinction between interfaces, implementations, and behaviors.

Interfaces

Interfaces are designed to stay the same over many releases of the product. This way, users or programmers have time to figure out how to use the interface. A good analogy is that the controls used to drive a car are an interface that stays relatively constant. The basic controls for stop, go, and steer are always in the same place. You don’t need to know how many cylinders there are in the engine before you can drive the car.

Implementations

The implementation hides behind the interface and does the actual work. Bug fixes, performance enhancements, and underlying hardware differences are handled by changes in the implementation. There are often changes from one release of the product to the next, or even from one system to another running the same software release. If a car engine starts to misfire and you need to lift the hood and change the spark plugs, you are suddenly confronted with a lot of implementation details. Outwardly identical cars might have a four-, a six-, or an eight-cylinder engine and many other detailed differences that change year by year as well.

Behaviors

Even with no knowledge of the implementation details, the behavior of a system changes from one implementation to the next. For example, Solaris 2.6 on an Ultra 1 has the same set of interfaces as Solaris 2.4 on a SPARCstation 5. The behavior is quite different because Solaris has been tuned in several ways and the hardware implementation’s performance is much higher. To take the car analogy again, a BMW 518i and a BMW 540i look very similar, but one has a 1.8-liter four-cylinder engine, and the other has a 4.0-liter eight-cylinder engine. They don’t sound the same, and they don’t behave the same way when you push the accelerator pedal!

Self-Tuning Systems

In normal use there is no need to tune the Solaris 2 kernel; it dynamically adapts itself to the hardware configuration and the application workload. If it isn’t working properly, you may have a configuration error, a hardware failure, or a software bug. To fix a problem, check the configuration, make sure all the hardware is OK, and load the latest software patches.

Documented Configuration Tunables

The tunable parameters that are mentioned in the Solaris 2 AnswerBook Performance Section configure the sizes or limits of some data structures. The size of these data structures has no effect on performance, but if they are set too low, an application might not run at all. Configuring shared memory allocations for databases falls into this category.

Kernel configuration and tuning variables are normally edited into the /etc/system file by hand. Unfortunately, any kernel data that has a symbol can be set via this file at boot time, whether or not it is a documented tunable. The kernel is supplied as many separate modules (type % ls /kernel/* to see some of them). To set a variable in a module or device driver when it is loaded, prefix the variable name with the module name and a colon. For example:

set pt_cnt = 1000 
set shmsys:shminfo_shmmax = 0x20000000

The History of Kernel Tuning

So why is there so much emphasis on kernel tuning? And why are there such high expectations of the performance boost available from kernel tweaks? I think the reasons are historical, and I’ll return to my car analogy to explain it.

Compare a 1970s car with a 1998 car. The older car has a carburetor, needs regular tune-ups, and is likely to be temperamental at best. The 1998 car has computerized fuel injection and self-adjusting engine components, and is easier to live with, consistent, and reliable. If the old car won’t start reliably, you get out the workshop manual and tinker with a large number of fine adjustments. The 1998 car’s computerized ignition and fuel injection systems have no user-serviceable components.

Unix started out in an environment where the end users had source code and did their own tuning and support. If you like this way of working, you probably already run the free Unix clone, Linux, on your PC at home. As Unix became a commercial platform for running applications, the end users changed. Commercial users just want to run their application, and tinkering with the operating system is a distraction. SunSoft engineers have put a lot of effort into automating the tuning for Solaris 2. It adaptively scales according to the hardware capabilities and the workload it is running. The self-tuning nature of modern cars is now a major selling point. The self-configuring and tuning nature of Solaris contributes to its ease of use and greatly reduces the gains from tweaking it yourself. Each successive version of Solaris 2 has removed tuning variables by converting hand-adjusted values into adaptively managed limits.

If SunSoft can describe in detail a tunable variable and when it should be tuned, they can either document this in the manual or implement the tuning automatically. In most cases, automatic tuning has been implemented. The tuning manual should really tell you which things don’t need to be tuned any more, but it doesn’t. This is one of my complaints about the manual, which is really in need of a complete rewrite. It is too closely based on the original Unix System V manual from many years ago, when things did need tuning.

Tuning to Incorporate Extra Information

An adaptively managed kernel can react only to the workload it sees. If you know enough about the workload, you may be able to use the extra information to effectively preconfigure the algorithms. In most cases, the gains are minor. Increasing the size of the name caches on NFS servers falls into this category. One problem is that the administrator often knows enough to be dangerous, but not enough to be useful.

Tuning During Development

The primary reason why there are so many obscure “folkloric” kernel tunables is that they are used to provide options and allow tuning during the development process. Kernel developers can read the source code and try things out under controlled conditions. When the final product ships, the tunables are often still there. Each bug fix and new version of the product potentially changes the meaning of the tunables. This is the biggest danger for an end user, who is guessing what a tunable does from its name or from knowledge of an older Unix implementation.

Tuning to Solve Problems

When a bug or performance problem is being fixed, the engineer tries to find an easy workaround that can be implemented immediately. It takes much longer to rewrite and test the code to eliminate the problem, and the proper fix will be part of a patch or turn up in the next release of the operating system. There may be a kernel tunable that can be changed to provide a partial workaround, and this information will be provided to the end user. The problem is that these “point patch” fixes sometimes become part of the folklore and are propagated indiscriminately, where they may cause problems.

In one real-life case, a large SPARCcenter 2000 configuration was running very slowly. The problem turned out to be a setting in /etc/system that had been supplied to fix a problem on a small SPARCstation 2 several years before. The administrator had carefully added it during installation to every machine at his site. Instead of increasing the size of a dynamically configured kernel table on a machine with 32 Mbytes of RAM, it was drastically reducing its size on a machine with 1 Gbyte of RAM. The underlying problem did not even exist in the version of Solaris 2 that was currently being used at the site!

The message is: clean out your /etc/system file when you upgrade.

The Placebo Effect

You may be convinced that setting a tunable has a profound effect on your system when it is truly doing nothing. In one case, an administrator was adamant that a bogus setting could not be removed from /etc/system without causing serious performance problems. When the “variable not found” error message that displayed during boot was pointed out, it still took a while to convince him that this meant that the variable no longer existed in this release, and so it could not be having any effect.

Tunable Parameters

The kernel tunable values listed in this book include the main tunables that are worth worrying about. A huge number of global values are defined in the kernel; if you hear of a tweak that is not listed here or in “Tunables Quick Reference” on page 557, think twice before using it. The algorithms, default values, and existence of many of these variables vary from one release to the next. Do not assume that an undocumented tweak that works well for one kernel will apply to other releases, other kernel architectures of the same release, or even a different patch level.

The Ones That Went Away

I looked at HP-UX 9.0 on an HP 9000 server; the sam utility provides an interface for kernel configuration. Like Solaris 1/SunOS 4, the HP-UX kernel must be recompiled and relinked if it is tuned or if drivers and subsystems are added. In Solaris 2, file systems, drivers, and modules are loaded into memory when they are used, and the memory is returned if the module is no longer needed. Rather than a GUI being provided, the whole process is made transparent. There are 50 or more tunable values listed in sam. Some of them are familiar or map to dynamically managed Solaris 2 parameters. There is a maxusers parameter that must be set manually, and the size of several other parameters is based upon maxusers in a way similar to sizing in Solaris 2. Of the tunables that I can identify, the Solaris 2 equivalents are either unnecessary or listed in “Tunables Quick Reference” on page 557.

Dynamic Kernel Tables in Solaris 2

Solaris 2 dynamically manages the memory used by the open file table, the lock table (in 2.5), the callout queue, the streams subsystem, the process table, and the inode cache. Unlike other Unix implementations that statically allocate a full-size array of data structures, wasting a lot of precious memory, Solaris 2 allocates memory as it goes along. Some of the old tunables that are used to size the statically allocated memory in other Unixes still exist. They are now used as limits to prevent too many data structures from being allocated. This dynamic allocation approach is one reason why it is safe to let maxusers scale automatically to very high levels. In Solaris 1 or HP-UX 9, setting maxusers to 1024 and rebuilding the kernel would result in a huge kernel (which might not be able to boot) and a huge waste of memory. In Solaris 2, the relatively small directory name lookup cache is the only statically sized table derived from maxusers.

Take a look at your own /etc/system file. If there are things there that are not listed in this book and that you don’t understand, you have a problem. There should be a large comment next to each setting that explains why it is there and how its setting was derived. You could even divide the file into sections for configuration, extra information, development experiments, problem fixes, and placebos.

I hope I have convinced you that there are very few Solaris tunables that should be documented and supported for general use. Why worry about tweaking a cranky and out-of-date Unix system, when you can use one that takes care of itself?

SunOS and Solaris Release Overview

The number of fixed-size tables in the kernel has been reduced in each release of Solaris. Most are now dynamically sized or are linked to the maxusers calculation, which is now also sized automatically. There is no need for general-purpose tuning in recent Solaris releases; the main performance improvements come from keeping patch levels up to date. Specific situations may require tuning for configuration or performance reasons that are described later. My own personal recommendations for each release vary, as summarized in Table 14-1.

Table 14-1. Tuning SunOS Releases
Release                       Recommendations
Older releases                Upgrade to a more recent release
SunOS 5.4/Solaris 2.4         Add kernel patch to fix pager; add TCP/IP and year2000 patches
SunOS 5.5/Solaris 2.5         Add kernel patch to fix pager; add TCP/IP and year2000 patches
SunOS 5.5.1/Solaris 2.5.1     Add kernel patch, TCP/IP and year2000 patches, hme patch
SunOS 5.6/Solaris 2.6         Add TCP/IP patch

Solaris 2 Performance Tuning Manuals

There is a manual section called Administering Security, Performance, and Accounting in Solaris. The manual was revised and corrected for Solaris 2.4 but has not changed significantly since and so is not very useful. The SMCC NFS Server Performance and Tuning Guide is kept up to date in each release and contains useful information. Parts of the NFS guide were originally written by Brian Wong and me, along with NFS engineering staff. The first of these manuals is part of the SunSoft System Software AnswerBook; the second is part of the Sun Microsystems Computer Corporation Hardware AnswerBook. Both can be read via the online documentation service at http://docs.sun.com.

Using /etc/system to Modify Kernel Variables in Solaris 2

In SunOS 4, the kernel must be recompiled after values in param.c or conf.c are tweaked to increase table sizes. In Solaris 2, there is no need to recompile the kernel; it is modified by changing /etc/system and rebooting. /etc/system is read by the kernel at startup. It configures the search path for loadable kernel modules and allows kernel variables to be set. See the manual page for system(4) for the full syntax.[1]

[1] The command to use is man -s 4 system, since there are other things called “system” in the manual.

Be very careful with set commands in /etc/system; they cause arbitrary, unchecked, and automatic changes to variables in the kernel, so there is plenty of opportunity to break your system. If your machine will not boot and you suspect a problem with /etc/system, use the boot -a option. With this option, the system prompts (with defaults) for its boot parameters. One of these parameters is the configuration file /etc/system. Either enter the name of a backup copy of the original /etc/system file or enter /dev/null. Fix the file and reboot the machine immediately to check that it is again working properly.

Watch for messages at boot time. If an error is detected or a variable name is misspelled or doesn’t exist, then a message is displayed on the console during boot.

General Solaris 2 Performance Improvements

The changes in Solaris 2.4 focus on improving interactive performance in small memory configurations, improving overall efficiency, and providing better support for large numbers of connected time-sharing users with high-end multiprocessor machines.

Some of the changes that improve performance on the latest processor types and improve high-end multiprocessor scalability could cause a slight reduction in performance on earlier processor types and on uniprocessors. Trade-offs like this are carefully assessed; in most cases, when changes to part of the system are made, an improvement must be demonstrated on all configurations.

In some areas, internationalization and stricter conformance to standards from the SVID, POSIX, and X/Open cause higher overhead compared to that of SunOS 4.1.3.

The base Solaris 2 operating system contains support for international locales that require 16-bit characters, whereas the base SunOS 4 is always 8-bit. The internationalized and localized versions of SunOS 4 were recoded and released much later than the base version. Solaris 2 releases have a much simpler localization process. One side effect of this simplification is that Solaris 2 commands such as sort deal with 16-bit characters and localized collation ordering, which slows them down. The standard sort algorithm taken from SVR4 is also very inefficient and uses small buffers. If you use sort a lot, then it would be worth getting a copy of a substitute from the GNU archives. Heavy-duty commercial sorting should be done with a commercial package such as SyncSort.

Using Solaris 2 with Large Numbers of Active Users

To connect large numbers of users into a system, Ethernet terminal servers using the Telnet protocol are normally used. Characters typed by several users at one terminal server cannot be multiplexed into a single Ethernet packet; a separate packet must be sent for any activity on each Telnet session. The system calls involved include poll, which is implemented with single-threaded code in releases before Solaris 2.4. This call prevents raw CPU power from being used to solve the problem; the effect in Solaris 2.3 and earlier releases is that a lot of CPU time is wasted in the kernel contending on poll, and adding another CPU has no useful effect. In Solaris 2.4, poll is fully multithreaded; so, with sufficient CPU power, Telnet no longer limits the number of users, and kernel CPU time stays low to 500 users and beyond. Now that the Telnet and poll limits have gone, any subsequent limit is much more application dependent. Tests on a 16-processor SPARCcenter 2000 failed to find any significant kernel-related limits before the machine ran out of CPU power.

Solaris 2.5 further increased efficiency for network-connected users. The telnet and rlogin daemons are not in the normal path; a single daemon remains to handle the protocol itself, but the regular character data that is sent backward and forward is processed entirely within the kernel, using new streams modules that take the characters directly from the TCP/IP stack and put them directly into the pseudoterminal. This technique makes the whole process much more efficient and, since the poll system call is no longer called, also bypasses the contention problems.

The underlying efficiency of the network protocol is part of the problem, and one solution is to abandon the Telnet protocol completely. There are two good options, but both have the disadvantage of using a proprietary protocol. The terminal server that Sun supplies is made by Xylogics, Inc. Xylogics has developed a multiplexed protocol that takes all the characters from all the ports on their terminal server and sends them as a single large packet to Solaris. A new protocol streams driver that Xylogics supplies for Solaris demultiplexes the packet and writes the correct characters to the correct pseudoterminal. Another possible alternative is to use the DECnet LAT protocol, which addresses the same issues. Many terminal servers support LAT because of its widespread use in DEC sites. There are several implementations of LAT for Solaris; for example, Meridian does one that is implemented as a streams module and so should be quite efficient.

Directly connected terminals also work well, but the Sun-supplied SBus SPC 8-port serial card is very inefficient and should be avoided. A far better alternative is a SCSI-connected serial port multiplexer. This choice saves valuable SBus slots for other things, and large servers usually have spare SCSI bus slots that can be used.

The Performance Implications of Patches

Every release has a set of patches for the various subsystems. The patches generally fall into three categories: security-related patches, reliability-related patches, and performance-related patches. In some cases, the patches are derived from the ongoing development work on the next release, where changes have been backported to previous releases for immediate availability. Some patches are also labeled as recommended patches, particularly for server systems. Some of the reliability-related patches are for fairly obscure problems and may reduce the performance of your system. If you are benchmarking, try to get benchmark results with and without patches so that you can see the difference.

Solaris 2.6 Performance Improvements

Solaris 2.6 is a different kind of release from Solaris 2.5 and Solaris 2.5.1. Those releases were tied to very important hardware launches: UltraSPARC support in Solaris 2.5 and Ultra Enterprise Server support in Solaris 2.5.1. With a hard deadline, you have to keep functionality improvements under control, so there were relatively few new features. Solaris 2.6 is not tied to any hardware launch; new systems released in early 1998 all run an updated version of 2.5.1 as well as Solaris 2.6. The current exception is the Enterprise 10000 (Starfire), which was not a Sun product early enough (Sun brought in the development team from Cray during 1996) to have Solaris 2.6 support at first release. During 1998, an update release of Solaris 2.6 will include support for the E10000.

Because Solaris 2.6 had a more flexible release schedule and fewer hardware dependencies, it was possible to take longer over the development and add far more new functionality. Some of the projects that weren’t quite ready for Solaris 2.5 (like large-file support) ended up in Solaris 2.6. Other projects, like the integration of Java 1.1, were important enough to delay the release of Solaris 2.6 for a few months. Several documents on www.sun.com describe the new features, so I’ll concentrate on explaining some of the performance tuning that was done for Solaris 2.6 and tell you about some small but useful changes to the performance measurements that sneaked into Solaris 2.6. Some of them were filed as Requests For Enhancements (RFEs) by Brian Wong and me over the last few years.

Web Server Performance

The most dramatic performance change in Solaris 2.6 is to web servers. The published SPECweb96 results show that Solaris 2.6 is far faster than Solaris 2.5.1. The message should be obvious: upgrade busy web servers to Solaris 2.6 as soon as you can. The details are discussed in “Internet Servers” on page 57.

Database Server Performance

Database server performance was already very good and scales well with Solaris 2.5.1. There is always room for improvement, though, and several changes have been made to increase efficiency and scalability even further in Solaris 2.6. A few features are worth mentioning.

The first new feature is a transparent increase in efficiency on UltraSPARC systems. The intimate shared memory segment used by most databases is now mapped by 4-Mbyte pages, rather than by lots of 8-Kbyte pages.

The second new feature is direct I/O. This feature enables a database table that is resident in a file system to bypass the filesystem buffering and behave more like a piece of raw disk. See “Direct I/O Access” on page 161.

New and Improved Performance Measurements

A collection of Requests For Enhancement (RFEs) had built up over several years, asking for better measurements in the operating system and improvements for the tools that display the metrics. Brian Wong and I filed some of them, while others came from database engineering and from customers. These RFEs have now been implemented, so I’m having to think up some new ones! You should be aware that Sun’s bug-tracking tool has three kinds of bugs in it: problem bugs, RFEs, and Ease Of Use (EOU) issues. If you have an idea for an improvement or think that something should be easier to use, you can help everyone by taking the trouble to ask Sun Service to register your suggestion. It may take a long time to appear in a release, but it will take even longer if you don’t tell anyone!

The improvements we got this time include new disk metrics, new iostat options, tape metrics, client-side NFS mount point metrics, a network byte counter, and accurate process memory usage measurements.

Parameters Derived from maxusers

When BSD Unix was originally developed, its designers addressed the problem of scaling the size of several kernel tables and buffers by creating a single sizing variable. The scaling needed was related to the number of time-sharing terminal users the system could support, so the variable was named maxusers. Nowadays, so much has changed that there is no direct relationship between the number of users a system supports and the value of maxusers. Increases in memory size and more complex applications require much larger kernel tables and buffers for the same number of users.

The calculation of parameters derived from maxusers is shown in Table 14-2. The inode and name caches are described in more detail in “Directory Name Lookup Cache” on page 308. The other variables are not performance related.

Table 14-2. Default Settings for Kernel Parameters
Kernel Resource        Variable      Default Setting
Processes              max_nprocs    10 + 16 * maxusers
Inode Cache            ufs_ninode    max_nprocs + 16 + maxusers + 64
Name Cache             ncsize        max_nprocs + 16 + maxusers + 64
Quota Table            ndquot        (maxusers * NMOUNT)/4 + max_nprocs
User Process Limit     maxuprc       max_nprocs - 5
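To make the table’s arithmetic concrete, the formulas can be evaluated directly. This is an illustrative sketch only: maxusers=128 (roughly a 128-Mbyte machine) is an assumed example value, and ndquot is omitted because its NMOUNT constant is release dependent.

```shell
# Evaluate the maxusers-derived defaults from Table 14-2.
# maxusers=128 is an assumed example value, not a recommendation.
maxusers=128
max_nprocs=$((10 + 16 * maxusers))               # Processes
ufs_ninode=$((max_nprocs + 16 + maxusers + 64))  # Inode cache
ncsize=$((max_nprocs + 16 + maxusers + 64))      # Name cache
maxuprc=$((max_nprocs - 5))                      # User process limit
echo "max_nprocs=$max_nprocs ufs_ninode=$ufs_ninode ncsize=$ncsize maxuprc=$maxuprc"
```

As a cross-check, plugging in the manual maximum of maxusers = 2048 gives ncsize = 34906, which matches the DNLC limit quoted later in this chapter.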

Changing maxusers and Pseudo-ttys in Solaris 2

The variable that really limits the number of user logins on the system is pt_cnt. It may be necessary to set the number of pseudo-ttys higher than the default of 48, especially in a time-sharing system that uses Telnet-from-Ethernet terminal servers to connect users to the system. If you are configuring a time-shared multiuser system with more than a hundred active users, make sure you have first read “Using Solaris 2 with Large Numbers of Active Users” on page 356. A practical limit is imposed by the format of the utmp file entry: 62*62 = 3844 Telnets and another 3844 rlogins; until this limit is changed, keep pt_cnt under 3000.

To actually create the /dev/pts entries, run boot -r after you set pt_cnt; see Figure 14-1.

Figure 14-1. Example Pseudo-tty Count Setting in /etc/system
set pt_cnt = 1000
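The 3844 figure quoted above is simple arithmetic: each session id appears to use two characters drawn from a 62-character set (an assumption here; presumably 0-9, a-z, A-Z).

```shell
# Two id positions, 62 possible characters each, gives the utmp limit.
echo $((62 * 62))
```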

Autoconfiguration of maxusers in Solaris 2.3 Through Solaris 2.6

The maxusers setting in Solaris 2 is automatically set, via the physmem variable, to be approximately equal to the number of Mbytes of RAM configured into the system. maxusers is usually set to 2 or 3 less than the number of Mbytes of RAM in the system. The minimum limit is 8 and the maximum automatic limit is 1024, corresponding to systems with 1 Gbyte or more of RAM. maxusers can still be set manually in /etc/system, but the manual setting is checked and limited to a maximum of 2048. This setting was tested on all kernel architectures but could waste kernel memory. In most cases, you should not need to set maxusers explicitly.
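The sizing rule just described can be sketched as follows. The exact offset the kernel subtracts and its clamping details are assumptions here, and 512 Mbytes is an example configuration.

```shell
# Approximate the automatic maxusers calculation: about one per Mbyte
# of RAM, a little less than the total, clamped to the range 8..1024.
ram_mbytes=512
maxusers=$((ram_mbytes - 2))
if [ "$maxusers" -lt 8 ]; then maxusers=8; fi
if [ "$maxusers" -gt 1024 ]; then maxusers=1024; fi
echo "maxusers=$maxusers"
```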

Filesystem Name and Attribute Caching

This section provides a simplified view of how the kernel algorithms work. Some of the details are skipped, as the intent is to understand the measurements provided by sar. This topic has already been covered in some detail in “File Access Caching with Local Disk” on page 307.

Vnodes, Inodes, and Rnodes

Unix traditionally uses inodes to hold the information required to access files, such as the size, owner, permissions, modification date, and the location of the data blocks for the file. SunOS 4, SVR4, and Solaris 2 all use a higher-level abstraction called a virtual node, or vnode. This scheme allows all filesystem types to be implemented in the same way, since the kernel works in terms of vnodes and each vnode contains some kind of inode that matches the filesystem type. For UFS, these structures are still called inodes. For NFS, they are called rnodes.

Directory Name Lookup Cache

The directory name lookup cache (DNLC) is used whenever a file is opened. The DNLC associates the name of a file with a vnode. Since the file system forms a hierarchy, many directories need to be traversed to get at a typical, deeply nested, user file. Short file names are cached, and names that are too long to be cached are looked up the slow way by reading the directory. The number of name lookups per second is reported as namei/s by the sar -a command; see Figure 14-2.

Figure 14-2. Example sar Command to Monitor Attribute Cache Rates
% sar -a 1 
SunOS hostname 5.4 sun4m 06/19/94
09:00:00 iget/s namei/s dirbk/s
09:00:01 4 9 2

For SunOS 4, names up to 14 characters long are cached. For Solaris 2, names of up to 30 characters are cached. A cache miss or oversized entry means that more kernel CPU time and perhaps a disk I/O may be needed to read the directory. The number of directory blocks read per second is reported as dirbk/s by sar -a. It is good policy to keep heavily used directory and symbolic link names down to 14 or 30 characters.

The DNLC is sized to a default value based on maxusers, and a large cache size (ncsize in Table 14-2) significantly helps NFS servers that have a lot of clients. The command vmstat -s shows the DNLC hit rate since boot. A hit rate of less than 90 percent will need attention. Every entry in the DNLC cache points to an entry in the inode or rnode cache (only NFS clients have an rnode cache).
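The 90 percent guideline can be checked with a couple of lines of shell. The two counter values below are made-up sample data standing in for the lookup and hit totals that vmstat -s reports since boot.

```shell
# Compare a DNLC hit rate against the 90 percent guideline.
# lookups and hits are assumed sample values, not real measurements.
lookups=2000000
hits=1840000
rate=$((100 * hits / lookups))
if [ "$rate" -lt 90 ]; then
    echo "DNLC hit rate ${rate}%: consider increasing ncsize"
else
    echo "DNLC hit rate ${rate}%: OK"
fi
```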

The only limit to the size of the DNLC cache is available kernel memory. For NFS server benchmarks, the limit has been set as high as 16,000; for the maximum maxusers value of 2048, the limit would be set at 34,906. Each DNLC cache entry is quite small; it basically just holds the 14- or 30-character name and a vnode reference. Increase it to at least 8000 on a busy NFS server that has 512 Mbytes or less of RAM by adding the line below to /etc/system. Figure 14-3 illustrates DNLC operation.

Figure 14-3. The Directory Name Lookup Cache and Attribute Information Flows


set ncsize = 8000

The Inode Cache and File Data Caching

UFS stores inodes on the disk; the inode must be read into memory whenever an operation is performed on an entity in UFS. The number of inodes read per second is reported as iget/s by the sar -a command. The inode read from disk is cached in case it is needed again, and the number of inodes that the system will cache is influenced by a kernel tunable called ufs_ninode. The inodes are kept on a linked list rather than in a fixed-size table.

A UFS file is read or written by paging from the file system. All pages that are part of the file and are in memory will be attached to the inode cache entry for that file. When a file is not in use, its data is cached in memory by an inactive inode cache entry. When an inactive inode cache entry that has pages attached is reused, the pages are put on the free list; this case is shown by sar -g as %ufs_ipf. This number is the percentage of UFS inodes that were overwritten in the inode cache by iget and that had reusable pages associated with them. These pages are flushed and cannot be reclaimed by processes. Thus, this number is the percentage of igets with page flushes. Any non-zero values reported by sar -g indicate that the inode cache is too small for the current workload.

In Solaris 2.4, the inode algorithm was reimplemented. A reuse list of blank inodes is maintained for instant use. The number of active inodes is no longer constrained, and the number of idle inodes (inactive but cached in case they are needed again) is kept between ufs_ninode and 75 percent of ufs_ninode by a kernel thread that scavenges the inodes to free them and maintains entries on the reuse list. If you use sar -v to look at the inode cache, you may see a larger number of existing inodes than the reported “size.”

The only upper limit is the amount of kernel memory used by the inodes. The tested upper limit in Solaris 2 corresponds to maxusers = 2048, which is the same as ncsize at 34,906. Use sar -k to report the size of the kernel memory allocation; each inode uses about 300 bytes of kernel memory. Since it is just a limit, ufs_ninode can be tweaked with adb on a running system with immediate effect. On later Solaris 2 releases, you can see inode cache statistics by using netstat -k to dump out the raw kernel statistics information, as shown in Figure 14-4.

Figure 14-4. Example netstat Output to Show Inode Statistics
% netstat -k 
inode_cache:
size 1200 maxsize 1200 hits 722 misses 2605 mallocs 1200 frees 0
maxsize_reached 1200
puts_at_frontlist 924 puts_at_backlist 1289 dnlc_looks 0 dnlc_purges 0

If the maxsize_reached value is higher than maxsize (this variable is equal to ufs_ninode), then the number of active inodes has exceeded the cache size at some point in the past, so you should increase ufs_ninode. Set it on a busy NFS server by editing param.c and rebuilding the kernel in SunOS 4, or by adding the following line to /etc/system for Solaris 2.

set ufs_ninode=10000                       For Solaris 2.2 and 2.3 only 
set ufs_ninode=5000 For Solaris 2.4 only
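The check just described can be sketched as follows, using the counter names and values from the Figure 14-4 netstat output (a rough illustration, not a supported tool):

```python
def ufs_ninode_too_small(maxsize, maxsize_reached):
    # maxsize mirrors ufs_ninode; if the high-water mark ever exceeded it,
    # the number of active inodes outgrew the cache at some point.
    return maxsize_reached > maxsize

# Values from the example output: the limit was reached but not exceeded.
print(ufs_ninode_too_small(1200, 1200))  # False
```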

Rnode Cache

A similar cache, the rnode cache, is maintained on NFS clients to hold information about files in NFS. The data is read by NFS getattr calls to the NFS server, which keeps the information in a vnode of its own. The default rnode cache size is twice the DNLC size and should not need to be changed.

Buffer Cache

The buffer cache is used to cache all UFS disk I/O in SunOS 3 and BSD Unix. In SunOS 4, generic SVR4, and Solaris 2, it is used to cache only inode-, indirect block-, and cylinder group-related disk I/O.

In Solaris 2, nbuf keeps track of how many page-sized buffers have been allocated, and a new variable called p_nbuf (default value 100) defines how many new buffers are allocated in one go. A variable called bufhwm controls the maximum amount of memory allocated to the buffer cache and is specified in Kbytes. The default value of bufhwm allows up to two percent of system memory to be used. On SPARCcenter 2000 systems that have a large amount of memory, two percent of the memory is too much, and the buffer cache can cause kernel memory starvation, as described in “Kernel Memory Allocation” on page 365. The bufhwm tunable can be used to fix this case by limiting the buffer cache to a few Mbytes, as shown below.

set bufhwm = 8000

In Solaris 2, the buffer cache can be monitored by sar -b, which reports a read and a write hit rate for the buffer cache, as shown in Figure 14-5. “Administering Security, Performance, and Accounting in Solaris 2” contains unreliable information about tuning the buffer cache.

Figure 14-5. Example sar Output to Show Buffer Cache Statistics
# sar -b 5 10 
SunOS hostname 5.2 Generic sun4c 08/06/93
23:43:39 bread/s lread/s %rcache bwrit/s lwrit/s %wcache pread/s pwrit/s
...
Average 0 25 100 3 22 88 0 0

An alternative look at the buffer cache hit rate can be calculated from part of the output of netstat -k, as shown in Figure 14-6.

Figure 14-6. Example netstat Output to Show Buffer Cache Statistics
% netstat -k 
biostats:
buffer cache lookups 9705 buffer cache hits 9285 new buffer requests 0
waits for buffer allocs 0 buffers locked by someone 3 duplicate buffers
found 0

Comparing buffer cache hits with lookups (9285/9705) shows a 96 percent hit rate since reboot in this example, which seems to be high enough.
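The arithmetic behind that figure is simply hits divided by lookups, expressed as a percentage; a small sketch using the biostats counters from Figure 14-6:

```python
def buffer_cache_hit_rate(lookups, hits):
    # Percentage of buffer cache lookups satisfied from the cache.
    return 100.0 * hits / lookups

print(round(buffer_cache_hit_rate(9705, 9285)))  # 96
```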

Measuring the Kernel

This section explains how you can use the tools provided to monitor the algorithms described earlier in this chapter.

The sar utility has a huge number of options and some very powerful capabilities. One of its best features is that you can log its full output in date-stamped binary form to a file. You can even look at a few selected measures interactively, then go back to look at all the other measures if you need to. sar generates average values automatically and can be used to produce averages between specified start and end times from a binary file. Many of the sar options have already been described in “Vnodes, Inodes, and Rnodes” on page 360 and “Understanding vmstat and sar Output” on page 320.

One particularly useful facility is that the system comes set up to capture binary sar records at 20-minute intervals and to maintain one month's worth of past records in /var/adm/sa; the feature just needs to be enabled, as discussed in “Collecting Measurements” on page 48.

Using sar to Examine Table Sizes

sar likes to average sizes over time, so sar -v 1 tells sar to make one measure over a one-second period. The file table is no longer a fixed-size data structure in Solaris 2, so its size is given as zero. The examples in Figure 14-7 were taken on a 128-Mbyte desktop machine with maxusers set at the default value of 123.

Figure 14-7. Example sar Output to See Table Sizes in Solaris 2
% sar -v 1 
SunOS hostname 5.5.1 Generic_103640-14 sun4u 01/19/98
11:22:51 proc-sz ov inod-sz ov file-sz ov lock-sz
11:22:52 72/1978 0 3794/3794 0 526/526 0 0/0

Kernel Memory Allocation

The kernel is probably using more memory than you expect. Kernel memory usage increases as you add RAM, CPUs, and processes to a system. You need to monitor usage on very large active systems because it is possible for the kernel to reach its kernelmap limit. This is a problem on SPARCcenter 2000 systems, which have a 512-Mbyte limit for the whole kernel address space. UltraSPARC-based systems have a separate 4-Gbyte address space for the kernel; this limit is plenty and avoids any problems. In the future, the 64-bit Solaris kernel will not have even this limit.

A completely new kernel memory allocation system is implemented in Solaris 2.4. It has less CPU overhead and allocates packed data structures in a way that saves space and improves CPU cache hit rates. On desktop workstations, this allocation frees up a megabyte or more and helps bring the memory requirements closer to SunOS 4 levels. The allocation statistics are summarized in groups for reporting via sar -k, but the details of the allocations at each block size can be seen via part of the crash kmastat command output, as shown in Figure 14-8 in cut-down form. As you can see, the memory allocator contains special support for individual types of data, and it can be interesting to monitor the memory usage of each kernel subsystem in this way. Nonspecific memory allocations are made by means of the kmem_alloc_xxxx pools. The kmem_magazine concept was introduced in Solaris 2.5. It forms CPU-specific subpools for the commonest data types. This approach avoids multiprocessor locking overhead and improves cache hit rates in a multiprocessor system. This highly sophisticated kernel memory allocator is one of the key reasons why Solaris 2 scales efficiently, better than other operating systems, to large numbers of CPUs.

Figure 14-8. Example Output from crash kmastat
# crash 
dumpfile = /dev/mem, namelist = /dev/ksyms, outfile = stdout
> kmastat
buf buf buf memory #allocations
cache name size avail total in use succeed fail
---------- ----- ----- ----- -------- ------- ----
kmem_magazine_1 8 923 1020 8192 1370 0
kmem_magazine_3 16 397 510 8192 1931 0
kmem_magazine_7 32 394 510 16384 6394 0
....
kmem_alloc_12288 12288 3 14 172032 1849 0
kmem_alloc_16384 16384 2 12 196608 261 0
sfmmu8_cache 232 1663 5304 1277952 27313 0
sfmmu1_cache 64 811 1582 114688 5503 0
seg_cache 28 852 2805 90112 2511092 0
ddi_callback_cache 24 0 0 0 0 0
thread_cache 288 77 168 49152 80659 0
lwp_cache 472 45 136 65536 407 0
cred_cache 96 58 85 8192 245144 0
file_cache 40 256 850 40960 7550686 0
streams_msg_40 112 244 884 106496 28563463 0
streams_msg_88 160 290 336 57344 15292360 0
...
streams_msg_4024 4096 2 2 8192 3485 0
streams_msg_9464 9536 54 54 516096 44775 0
streams_msg_dup 72 88 102 8192 367729 0
streams_msg_esb 72 0 0 0 2516 0
stream_head_cache 152 72 408 65536 21523 0
flk_edges 24 0 0 0 0 0
snode_cache 128 90 660 90112 3530706 0
...
ufs_inode_cache 320 1059 5520 1884160 974640 0
fas0_cache 188 156 168 32768 619863 0
prnode_cache 108 65 72 8192 4746459 0
fnode_cache 160 31 42 8192 278 0
pipe_cache 288 34 75 24576 4456 0
rnode_cache 384 15 21 8192 95 0
lm_vnode 84 0 0 0 0 0
lm_xprt 16 0 0 0 0 0
lm_sysid 96 0 0 0 0 0
lm_client 40 0 0 0 0 0
lm_async 24 0 0 0 0 0
lm_sleep 64 0 0 0 0 0
lm_config 56 143 145 8192 2 0
---------- ----- ----- ----- -------- ------- ----
permanent - - - 409600 3100 0
oversize - - - 2809856 10185 0
---------- ----- ----- ----- -------- ------- ----
Total - - - 14606336 128754134 0
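To monitor per-subsystem memory usage as suggested above, the kmastat rows can be sorted by their “memory in use” column. The sketch below assumes the seven-column row layout shown in Figure 14-8; it is illustrative only, not a supported tool.

```python
def top_kmastat_consumers(lines, n=3):
    # Extract (memory in use, cache name) from rows matching the assumed
    # layout: name, buf size, buf avail, buf total, memory, succeed, fail.
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) == 7 and fields[4].isdigit():
            rows.append((int(fields[4]), fields[0]))
    return [name for mem, name in sorted(rows, reverse=True)[:n]]

# Sample rows taken from the figure above.
sample = [
    "ufs_inode_cache 320 1059 5520 1884160 974640 0",
    "seg_cache 28 852 2805 90112 2511092 0",
    "thread_cache 288 77 168 49152 80659 0",
]
print(top_kmastat_consumers(sample, 2))  # ['ufs_inode_cache', 'seg_cache']
```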


Kernel Lock Profiling with lockstat in Solaris 2.6

The use of this tool is described in “Monitoring Solaris 2.6 Lock Statistics” on page 239.

Kernel Clock Tick Resolution

Many measurements in the kernel occur during the 100 Hz clock tick interrupt, and this rate sets a limit on the time resolution of events. As CPU speed increases, the amount of work done in 10 milliseconds increases greatly. Increasing the clock tick rate for all systems would just add overhead, so a new option in Solaris 2.6 allows the clock tick to be set to 1000 Hz, as shown below. This option is most useful for real-time processing. It increases the resolution of CPU time measurements, but it is better to use microstate accounting, described in “Process Data Sources” on page 416, to obtain really accurate measurements.

set hires_tick=1

Setting Default Limits

The default limits are shown by the sysdef -i command, which lists the values in hexadecimal, as shown in Figure 14-9.

Figure 14-9. Example Systemwide Resource Limits Shown by sysdef
% sysdef -i 
...
Soft:Hard Resource
Infinity:Infinity cpu time
Infinity:Infinity file size
1fefe000:1fefe000 heap size
800000: ff00000 stack size
Infinity:Infinity core file size
40: 400 file descriptors
Infinity:Infinity mapped memory

The hard limits for data size and stack size vary. Some older machines with the Sun-4 MMU can map only 1 Gbyte of virtual address space, so stack size is restricted to 256 Mbytes and data size is restricted to 512 Mbytes. For machines with the SPARC Reference MMU, the maximums are 2 Gbytes each.

To increase the default number of file descriptors per process, you can set the kernel tunables rlim_fd_cur and rlim_fd_max in /etc/system.

The definition of FILE for the stdio library can handle only 256 open files, but raw read/write will work above that limit. The select system call uses a fixed-size bitfield that can cope with only 1024 file descriptors; the alternative is to use poll, which has no limit.

It is dangerous to set rlim_fd_cur to more than 256. Programs that need more file descriptors should either call setrlimit directly or have their own limit set in a wrapper script. If you need to use many file descriptors to open a large number of sockets or other raw files, it is best to force all of them to file descriptors numbered above 256. This lets system functions such as name services, which depend upon stdio file operations, continue to operate using the low-numbered file descriptors.
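The technique of forcing descriptors above 256 can be sketched with fcntl's F_DUPFD operation, which returns the lowest free descriptor at or above the requested floor. The sketch is in Python for brevity; a C program would make the same setrlimit and fcntl calls.

```python
import fcntl
import os
import resource

def move_fd_above(fd, floor=256):
    # Duplicate fd onto the lowest free descriptor >= floor and close the
    # original, leaving the low-numbered descriptors free for stdio use.
    new_fd = fcntl.fcntl(fd, fcntl.F_DUPFD, floor)
    os.close(fd)
    return new_fd

# Make sure the soft limit allows descriptors above 256 (like setrlimit in C).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
if 0 <= soft < 512:
    resource.setrlimit(resource.RLIMIT_NOFILE, (min(512, hard), hard))

fd = os.open("/dev/null", os.O_RDONLY)
high = move_fd_above(fd)
print(high >= 256)
os.close(high)
```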

Mapping Device Nicknames to Full Names in Solaris 2

The boot sequence builds a tree in memory of the hardware devices present in the system; the tree is passed to the kernel and can be viewed with the prtconf command, as described in “The Openboot Device Tree — prtconf and prtdiag” on page 438. This tree is mirrored in the /devices and /dev directories; after hardware changes are made to the system, these directories must be reconfigured with the drvconfig, tapes, and disks commands, which are run automatically whenever you do a boot -r. The file /etc/path_to_inst maps hardware addresses to symbolic device names. An extract from a simple configuration, with the symbolic names added, is shown in Figure 14-10.

When a large number of disks are configured, it is important to know this mapping so that iostat and related commands can be related to the output from df. In Solaris 2.6 and later, this mapping is done for you by the -n option to iostat. For earlier releases, you need to do it yourself or use the SE toolkit disks.se command described in “disks.se” on page 479. The sbus@1 part tells you which SBus is used (an E10000 can have up to 32 separate SBuses); the esp@0 part tells you which SBus slot the esp controller (one of the many types of SCSI controller) is in. The sd@0 part tells you that this is SCSI target address 0. The /dev/dsk/c0t0d0s2 device name indicates SCSI target 0 on SCSI controller 0 and is a symbolic link to a hardware specification similar to that found in /etc/path_to_inst. The extra :c at the end of the name in /devices corresponds to the s2 at the end of the name in /dev. Slice s0 is partition :a, s1 is :b, s2 is :c, and so forth.
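The partition-letter correspondence in the last sentence is a fixed alphabetical mapping; a trivial sketch:

```python
def slice_name(partition_letter):
    # The /devices suffix :a maps to /dev slice s0, :b to s1, :c to s2, etc.
    return "s%d" % "abcdefgh".index(partition_letter)

print(slice_name("c"))  # s2
```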

Figure 14-10. Mapping Solaris 2 Device Nicknames into Full Names
% more /etc/path_to_inst 
...
"/fd@1,f7200000" 0 fd0
"/sbus@1,f8000000/esp@0,800000/sd@3,0" 3 sd3
"/sbus@1,f8000000/esp@0,800000/sd@0,0" 0 sd0
"/sbus@1,f8000000/esp@0,800000/sd@1,0" 1 sd1

% iostat -x 
extended disk statistics
disk r/s w/s Kr/s Kw/s wait actv svc_t %w %b
fd0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
sd0 0.1 0.1 0.4 0.8 0.0 0.0 49.3 0 1
sd1 0.1 0.0 0.8 0.1 0.0 0.0 49.0 0 0
sd3 0.1 0.1 0.6 0.8 0.0 0.0 75.7 0 1

% df -k 
Filesystem kbytes used avail capacity Mounted on
/dev/dsk/c0t3d0s0 19107 13753 3444 80% / sd3
/dev/dsk/c0t3d0s6 56431 46491 4300 92% /usr sd3
/proc 0 0 0 0% /proc
fd 0 0 0 0% /dev/fd[1]
swap 2140 32 2108 1% /tmp
/dev/dsk/c0t3d0s5 19737 17643 124 99% /opt sd3
/dev/dsk/c0t1d0s6 95421 71221 14660 83% /usr/openwin sd1
/dev/dsk/c0t0d0s2 308619 276235 1524 99% /export sd0

[1] /dev/fd is a file descriptor filesystem type, nothing to do with floppy disks!

% ls -l /dev/dsk/c0t0d0s2 
lrwxrwxrwx 1 root 51 Jun 6 15:59 /dev/dsk/c0t0d0s2 ->
../../devices/sbus@1,f8000000/esp@0,800000/sd@0,0:c

A Command Script to Do It for You

The csh/nawk script presented in Figure 14-11 can be used to print out the device-to-nickname mappings. Enter it with three long command lines starting with set, if, and nawk; it doesn't work if you try to use multiple lines or backslash continuation.

Figure 14-11. Whatdev: Device-to-Nickname Mapping Script
#!/bin/csh 
# print out the drive name - st0 or sd0 - given the /dev entry
# first get something like "/iommu/.../.../sd@0,0'
set dev = `/bin/ls -l $1 | nawk '{ n = split($11, a, "/"); split(a[n],b,":"); for(i = 4; i < n; i++) printf("/%s", a[i]); printf("/%s\n", b[1]) }'`
if ( $dev == "" ) exit
# then get the instance number and concatenate with the "sd"
nawk -v dev=$dev '$1 ~ dev { n = split(dev, a, "/"); split(a[n], b, "@"); printf("%s%s\n", b[1], $2) }' /etc/path_to_inst

An example of its use:

% foreach device (/dev/dsk/c*t*d*s2) 
> echo -n $device " ---- "
> whatdev $device
> end
/dev/dsk/c0t3d0s2 ---- sd3
/dev/dsk/c0t5d0s2 ---- sd5