Performance Analysis Tools for Linux Developers: Part 2


Setting performance profiling and analysis goals

By Mark Gray and Julien Carreno
October 23, 2009
URL: http://www.ddj.com/open-source/220900401

Mark Gray is a software development engineer working at Intel on real-time embedded systems for telephony. Julien Carreno is a software architect and senior software developer specializing in embedded real-time applications on Linux.


In Part 1 of this article, we summarized some of the performance tools available to Linux developers on Intel architecture. In Part 2, we cover a set of standard performance profiling and analysis goals and scenarios that demonstrate which tool, or combination of tools, to select for each scenario. In some scenarios, the depth of analysis is also a determining factor in selecting the tool required. With increasingly deeper levels of investigation, we need to change tools to get an increased level of detail and focus from them. This is similar to using a microscope with different magnification lenses: we start from the smallest magnification and gradually increase it as we focus on a specific area.

Methodologies

In any performance analysis or profiling exercise, it is the authors' experience that two critical pieces of information need to be present from the start:

  1. What is my expected system behavior? In other words, how do I expect the system to behave under normal conditions? In a structured project environment, this translates to a very clearly-defined set of requirements at a system level as well as, possibly, at an individual component or application level.
  2. What is my problem statement? Simplistically, this can be one of two possibilities:

    • My system is not behaving according to expectations.
    • My system is behaving as expected, but I want to know what "makes it tick". I want to be able to answer questions such as: "Where are my CPU cycles being spent?", "How much memory am I really using?" This information can be used to understand any inefficiencies in my algorithm or problem areas. This information may also be used to accurately predict how the system will scale to support higher workloads.

When items 1 and 2 above are clear, you have effectively determined "where you are" and "where you want to be". For the purposes of this article, we focus on scenarios in which the system is not behaving according to specifications, rather than measurement of a working system.

From experience, it is critical to apply a structured method at the start of any performance analysis, since any activity with an inappropriate tool can be a complete waste of time. Performance can be broadly affected by issues in three distinct areas: CPU occupancy, memory usage, and IO. As a first step, it is absolutely essential to determine which area your problem is coming from, since the tools mainly focus on one of these three areas to provide any kind of detailed data. Hence, the first step is always to use general tools that provide a high-level view of all three areas simultaneously. Once this has been done, the developer can delve deeper into a specific area using tools with an increasing level of detail and potentially more and more invasiveness. It is advisable not to assume which category the investigated problem falls under and skip the first high-level analysis; such assumptions have proven counter-productive on numerous occasions in the past.

When doing performance analysis on a working system to understand what makes it tick, it is important to take a number of things into account. Avoid any overkill: for example, if only a simple CPU performance measurement of a working system is required, it may be sufficient to use a non-invasive, high-level analysis tool such as ps. The depth of analysis should be determined a priori by all interested parties.

Start at the 10,000 ft View

As stated earlier, the starting point of any analysis should be a set of system-level measurements meant to provide an indication of the system state, most notably:

  • CPU occupancy, total and per logical core
  • Memory usage, snapshot and evolution over time
  • IO, CPU IO waits
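
All three measurements can be gathered in one pass straight from /proc, which is where tools such as top, free, and sar read their data. The following sketch is illustrative only (field positions follow the proc(5) man page), not a replacement for the tools themselves:

```shell
#!/bin/sh
# First-pass snapshot of CPU, memory, and IO state, read directly from /proc.

# CPU: aggregate jiffy counters from the first line of /proc/stat
read -r _ user nice system idle iowait rest < /proc/stat
echo "CPU busy jiffies: $((user + nice + system)), idle: $idle, iowait: $iowait"

# Memory: key totals (in kB) from /proc/meminfo
grep -E '^(MemTotal|MemAvailable|SwapTotal|SwapFree)' /proc/meminfo

# IO: cumulative sectors read/written per block device
awk '{print $3, "sectors_read=" $6, "sectors_written=" $10}' /proc/diskstats
```

Sampling these counters twice and differencing gives the rates that the interactive tools display.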

For our purposes here, it is assumed that we are dealing with a single problem area at a time during our analysis; figuring out what that area is brings us to this step. Scenarios covering analysis of a system with both CPU occupancy and memory usage problems, for example, are not covered here.

Figure 1: top View (Fully-Loaded Single Core System)

Figure 2: top View (Half-Loaded Dual-Core System)

Figure 3: sar System-Wide Increased Memory Usage View

Figure 4: sar IO Wait CPU Usage View

Figure 5: ps View (Loaded System)

Figure 6: iostat View (Loaded System)

Using some of the examples above, and having already applied our methodology of performing a high-level analysis that includes CPU, IO, and memory performance for all the scenarios below, we can see in Figure 1 that our CPU usage is approximately 90%. Our main problem here is CPU occupancy, as the vast majority of cycles are being spent in user space. Our next step should be to examine more closely the applications running on the system. Using ps, in Figure 5, we can see that we have a number of applications running concurrently on the system and that our VoIPapp is by far the biggest CPU user. We should examine our VoIPapp in more detail; see "CPU Bottlenecks".
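
A scriptable equivalent of the ps view in Figure 5 is to sort all processes by CPU usage directly (GNU ps options shown; columns are PID, %CPU, %MEM, and command name):

```shell
# Rank the top CPU consumers, as in Figure 5 (where VoIPapp tops the list)
ps -eo pid,pcpu,pmem,comm --sort=-pcpu | head -n 10
```

Being non-interactive, this form is also convenient for periodic logging from a script.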

In Figure 2, we can see that our overall CPU occupancy is just under 50%; however, we are using 99% of one core and virtually nothing of the second available core. We should examine our threading model; see "Optimizing a Complete System" and "CPU Bottlenecks".
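
The per-core imbalance visible in Figure 2 can also be read from the cpuN lines of /proc/stat, the same source top uses for its per-core display (pressing 1 in top shows the equivalent view). A rough sketch:

```shell
# Per-core busy/idle jiffy counters; one heavily loaded core next to an
# idle one points at a single-threaded bottleneck.
grep '^cpu[0-9]' /proc/stat | while read -r cpu user nice system idle rest; do
    echo "$cpu busy=$((user + nice + system)) idle=$idle"
done
```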

Comparing Part 1, Figure 7 with Figure 3, we can see that, over time, our memory usage is increasing; further measurements may indicate that we have a memory leak that is affecting system behaviour (see "Investigating a Memory Issue"). From Figure 4, we can see that the CPU is spending an inordinate amount of processing time waiting on IO. We should investigate the reason for the high number of IO waits; see "IO Bottleneck Issue". Optionally, we can use iostat to assess the loading of the block devices in the system to quickly determine if they are a factor in the bottleneck. For instance, in Figure 6, it is apparent that during the file copy, the bottleneck is the block device, which is highly loaded.

Optimizing a Complete System

Our first-pass analysis has led us to believe we should look at optimizing at a system level; that is, there are no particular outstanding CPU bottlenecks, IO over-subscribers, or code blocks using inordinate amounts of memory. In embedded systems, the amount of available resources is typically fixed, so for the purpose of this article we will not take into account possible system improvements such as adding more memory or adding an additional disk for more block devices.

When looking at memory usage with free, continuous swapping will show up as a performance issue. If at all possible, the main memory-using application(s) (see "Investigating a Memory Issue" for finding the big memory users) should be analysed for memory-usage reduction (code analysis).
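
Whether the system is dipping into swap at all can be checked without free by reading the swap counters in /proc/meminfo; a SwapFree value well below SwapTotal that keeps shrinking between samples indicates sustained swapping:

```shell
# Swap headroom check; sample repeatedly and watch the trend, not one value
grep -E '^Swap(Total|Free)' /proc/meminfo
```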

When looking for CPU-intensive applications using top in a multi-core environment, one key item to note is the per-core CPU occupancy breakdown that top provides. Identifying the main CPU user on a single core, and making that application multi-threaded to share the load across cores, is a key step in any multi-core system optimization. In the case of multiple heavy CPU-intensive applications, the operating system scheduler will have already distributed the load over multiple cores.

When looking for high IO utilization and bottlenecks using iotop and/or sar, optimizing the applications for more efficient use of the device (transfer sizes, for instance) is most likely the only option in an embedded system where adding devices is not possible.
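
Where sysstat is not installed, a per-device utilization figure comparable to iostat's %util can be approximated by sampling the "time spent doing I/O" counter (field 13 of /proc/diskstats, in milliseconds) twice. The first-listed device and one-second interval below are illustrative choices, not recommendations:

```shell
#!/bin/sh
# Approximate %util for one block device over a sampling interval
util_pct() {
    dev=$1 secs=$2
    before=$(awk -v d="$dev" '$3 == d {print $13}' /proc/diskstats)
    sleep "$secs"
    after=$(awk -v d="$dev" '$3 == d {print $13}' /proc/diskstats)
    # field 13 counts milliseconds with I/O in flight: 10 ms = 1% of a second
    echo $(( (after - before) / (secs * 10) ))
}

dev=$(awk 'NR == 1 {print $3}' /proc/diskstats)   # first listed device
if [ -n "$dev" ]; then
    echo "$dev utilization: $(util_pct "$dev" 1)%"
fi
```

A device pinned near 100% here, while the CPU shows high iowait, is the classic block-device bottleneck signature.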

Investigating a Memory Issue

The first-pass analysis has led to identifying a potential memory issue. We should use free or sar to monitor memory usage at selected intervals to see if there is a consistent increase in system memory usage. Also, take note of swap memory usage during this measurement to determine if swapping is causing a bottleneck. Use top and sort by virtual memory usage to determine which application is using the most memory, whether its memory usage is increasing (a memory leak), and whether any applications are using a lot of swapped memory. In the case of a memory leak, once we have determined which application is leaking memory, we should use valgrind to search for the memory leak locations. Since memory leaks are determined over time, based on multiple measurements, it is important to note that the system must be sufficiently well understood to know when it has reached a stable state in which memory usage is not expected to change. Without this information, a developer may misinterpret normal system operation as a memory leak.
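
A simple way to put numbers on "a consistent increase" is to log available memory at fixed intervals and inspect the trend once the system has reached its stable state. The three-sample, one-second loop below is purely illustrative; real leak hunting needs much longer runs:

```shell
# Log a timestamped available-memory sample per interval; a steady decline
# across many samples (after warm-up) is the signature of a leak.
for i in 1 2 3; do
    printf '%s MemAvailable=%s kB\n' "$(date +%s)" \
        "$(awk '/^MemAvailable/ {print $2}' /proc/meminfo)"
    sleep 1
done
```

Once the leaking process has been identified, running it under valgrind (for example, valgrind --leak-check=full ./myapp, where myapp stands in for the application under test) reports the allocation sites that were never freed.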

Although it may be impossible for an embedded developer to increase main memory to alleviate excessive swapping or disk thrashing, it may be desirable to lock all memory used by an application into main memory so that it does not get swapped out. While using a large amount of swap space, a developer may note (using gprof or LTT) that the wall-clock time required to access various regions of memory is greater than during periods when swap usage is low.

IO Bottleneck Issue

IO bottleneck identification within a full system is arguably the most difficult issue to track down. In networking scenarios where the network device is the bottleneck, the problem cannot be clearly identified without the use of external equipment to generate the appropriate network conditions in and out of the system. However, discussion of performance analysis for networking IO is beyond the scope of this article.

Based on the tools at our disposal, one IO area where we can get sufficient information for analysis is block devices, and more specifically, disk IO. Beginning in the "Start at the 10,000 ft View" section, sar and/or iotop provide us with data indicating that the CPU is waiting on a block device which is 100% loaded. For further investigation, we should use iotop to get a per-process breakdown of IO usage and determine which process is the main device user. Once the top process has been identified, further investigation is possible through the use of VTune to analyse the sections of the application that are contributing to bus/disk utilization.
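
The per-process counters that iotop aggregates are exposed in /proc/<pid>/io (readable for processes you own; the read_bytes and write_bytes fields count traffic that actually reached the block layer). A minimal sketch of such a per-process breakdown:

```shell
#!/bin/sh
# Print block-layer read/write byte counts for each visible process
for pid in $(ps -eo pid=); do
    [ -r "/proc/$pid/io" ] || continue
    rb=$(awk '/^read_bytes/ {print $2}' "/proc/$pid/io")
    wb=$(awk '/^write_bytes/ {print $2}' "/proc/$pid/io")
    printf '%8s read_bytes=%-12s write_bytes=%s\n' "$pid" "$rb" "$wb"
done
```

As with the other counters, two samples a known interval apart turn these totals into per-process IO rates.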

CPU Bottlenecks

As stated in the "Start at the 10,000 ft View" section, we can use top or ps to sort applications by CPU usage and identify the primary CPU users. Then, using VTune on the selected application, we can drill down to module-, function-, and instruction-level code to determine where the hot spots are. Careful analysis of the code (and maybe the assembly code) to understand the bottlenecks should follow, so that algorithms or code can be updated accordingly. Once this is done, the procedure is repeated to further refine the code.
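
Before dropping down to a full profiler, a cheap intermediate step is a per-thread CPU breakdown with ps, which often localizes the hot spot to one thread of the top process (GNU ps shown; -L lists threads, and TID is the thread ID):

```shell
# Top CPU-consuming threads across the whole system
ps -eLo pid,tid,pcpu,comm --sort=-pcpu | head -n 10
```

Knowing the hot thread up front narrows the module- and function-level drill-down that follows.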

Analysis Flow

For the purpose of clarity, and to summarize what we have discussed so far, the following is one possible methodology represented as a flow diagram. This is by no means the only possible method; there are infinite variations, but we hope it is a good indicator of one way to proceed.

Figure 7: Analysis Flow

Conclusion

Throughout this article, we have discussed many of the available tools for performance analysis on Intel architecture and Linux. The tools discussed are by no means exhaustive, as the "Alternative Tools" section indicates. By combining these tools with some basic performance analysis methodologies, we hope that we have provided the newcomer with sufficient information to feel comfortable starting a performance analysis task. For veteran developers and testers, we hope this paper is informative and helps them understand the approach and the tools at their disposal.