The Elements of Cache Programming Style

Chris B. Sears
Google Inc.
Those who cannot remember the past are condemned to repeat it - George Santayana
Introduction
Cache memories work on the carrot and stick principle. The carrot is the Principle of Locality and the stick is Amdahl's Law. The Principle of Locality says that programs tend to cluster their memory references. A memory location referenced once is likely to be referenced again: temporal locality. A memory location near a referenced location is likely to be referenced soon: spatial locality. And Amdahl's Law says that the performance improvement to be gained from using some faster component is limited by the fraction of time the faster component is used. In this case, CPU and cache are fast components and memory is slow.
If your program adheres to the Principle of Locality, it benefits from fast cache memory and runs at processor speed. If it doesn't, it is held accountable to Amdahl's Law and runs at memory speed. Hit rates have to be very high, say 98%, before incremental increases in processor speed are even noticeable.
Amdahl's Law has a special circumstances penalty for multiprocessors [Schimmel94]. Thrashing on a multiprocessor can slow down all of the processors. They each wait for each other waiting for memory, and the leverage a multiprocessor offers works in reverse. Adherence to the Principle of Locality for multiprocessors, but not to the point of False Sharing, isn't just a nicety, it is a necessity.
The object of cache programming style is to increase this locality. It is important to understand the structure and behavior of caches, but it is more important to know the basic properties to take advantage of and the worst cases to avoid. This article goes into details and the summary provides guidelines.
An Example
As a running example, I am going to look at Linux [Maxwell99] and at the scheduler in particular. The idea is to modify data structures and code just slightly, trying to use the cache more effectively. Hopefully I will achieve two goals: a practical tutorial on caches and some performance improvements for Linux.
Instead of talking about cache memory systems in general, I will mostly use my circa 1998 350 MHz Deschutes Pentium II system as a specific example. It has these characteristics:


Storage    Size       Latency   Notes
---------------------------------------------------------------------
register   32 bytes   3 ns      register renaming file
L1         32K        6 ns      on-chip, half Pentium II clock rate
L2         256K       57 ns     off-chip, on-package [Intel99a]
memory     64 MB      162 ns    100 MHz SDRAM, single bank
disk       10 GB      9 ms      DMA IDE
network    whatever   whenever  56K PPP

Figure 1. Storage hierarchy sizes and latencies

These numbers are subject to change. CPU performance improves at about 55%/year and memory improves at about 7%/year. Memory is big, cheap and slow while cache is small, fast and expensive. Double Data Rate SDRAM and Rambus, when available, will improve memory bandwidth but not latency. These improvements will help more predictable applications like multimedia but not less predictable programs such as Linux.
Cache Basics
First, a few words about caches in general. Cache memory fits into the storage hierarchy in terms of both size and speed. Cache line misses, page faults and HTTP requests are the same thing at different levels of this hierarchy. When a Squid proxy doesn't have an object in its cache, it forwards the HTTP request to the origin server. When a CPU requests an address which isn't in memory, a page fault occurs and the page is read from disk. When a CPU requests an address which isn't in cache, the containing cache line is read from memory. LRU, working set, associative, coherency, hashing and prefetching are all techniques and terminology which are used in each level of the storage hierarchy.
In each case, one smaller, faster level in the hierarchy is backed by another bigger, slower level. If performance is limited by excessive use of the slower level, then according to Amdahl's Law, little improvement can be made by just making the faster level faster.
With respect to cache memory [Handy98], the most important thing to understand is the cache line. Typically a cache line is 32 bytes long and it is aligned to a 32 byte offset. First a block of memory, a memory line, is loaded into a cache line. This cost is a cache miss, the latency of memory. Then, after loading, bytes within a cache line can be referenced without penalty as long as it remains in the cache. If the cache line isn't used it will be dropped eventually when another memory line needs to be loaded. If the cache line is modified, it will need to be written before it is dropped.
This is the simplest and most important view of a cache memory. Its lesson is two-fold: pack as much into a cache line as possible and use as few cache lines as possible. Future memory bandwidth increases (DDR and Rambus) will reward this practice. The more complex characteristics of cache, the structure and behavior, are important for understanding and avoiding the worst case cache behavior: thrashing.
Competing for and sharing of cache lines is a good thing, up to a point, when it becomes a bad thing. Ideally a fast cache will have a high hit rate and performance will not be bound to the speed of the memory. But a really bad thing, thrashing, happens when there is too much competition for too few cache lines. This happens in worst case scenarios for data structures. Unfortunately the current profiling tools look at instructions rather than data, which means a programmer must be aware of worst case scenarios for data structures and avoid them. A useful tool for finding a hot spot is cacheprof [Seward].
The Pentium II L1 and L2 Caches
The Pentium II [Shanley97] 32K L1 cache consists of 1024 32-byte cache lines partitioned into instruction and data banks of 512 lines each. It uses the color bits 5-11 to index into an array of sets of cache lines. In parallel, it compares the tag bits 12-31 (12-35 with Pentium III Physical Address Extension) for each of the cache lines in the indexed set. L1 uses a 4-way set associative mapping which divides the 512 lines into 128 sets of 4 cache lines.
Each of these sets is really a least recently used (LRU) list. If there is a match, the matching cache line is used and it is moved to the front of the list. If there isn't a match, the data is fetched from L2, the cache line at the end of the list is replaced and the new entry is put at the front of the list.
Two memory lines of the same color compete for the same set of 4 L1 cache lines. They are of the same color if their color bits (5-11) are the same. Alternatively, they are of the same color if their addresses differ by a multiple of 4096: 2 ^ (7 color bits + 5 offset bits). For example, addresses 64 and 12352 differ by 12288, which is 3*4096. So 64 and 12352 compete for a total of 4 L1 cache lines. But 64 and 12384 differ by 12320, not a multiple of 4096, so they don't compete for the same L1 cache lines.
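As a sanity check of this arithmetic, a trivial helper (my own illustration, not kernel code) can test whether two addresses share an L1 color:

```c
#include <stdint.h>

/* Two memory lines share an L1 set when their color bits (5-11)
 * match, i.e. when their addresses differ by a multiple of 4096.
 * Illustrative helper, not from the kernel. */
static int same_l1_color(uintptr_t a, uintptr_t b)
{
    return ((a >> 5) & 0x7f) == ((b >> 5) & 0x7f);
}
```

With this, same_l1_color(64, 12352) is true while same_l1_color(64, 12384) is false, matching the example above.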
Instructions are also cached. The Pentium II L1 cache is a Harvard, or split, instruction/data cache. This means that instructions and data never compete for the same L1 cache lines. L2 is a unified cache. Unified means that there is a single cache bank and that instructions and data compete for cache lines.
L2 is similar to L1 except larger and much slower. The on-package 256K L2 cache on my Pentium II has 8192 cache lines. It is also 4-way set associative but is unified. There are Pentium IIs with 512K of L2 which increase the set size to 8. Also, there are PIIIs with up to 2 MB of L2. If there is a cache line miss for L2, the cache line is fetched from memory. Two memory lines compete for the same L2 cache lines if they differ by a multiple of 64K: 2 ^ (11 cache color bits + 5 offset bits).
The important things to remember about my Pentium II are:

  •  cache lines are 32 bytes in size and are aligned to 32 byte offsets
  •  memory locations which are offset by multiples of 4K bytes compete for 4 L1 cache lines
  •  memory locations which are offset by multiples of 64K bytes compete for 4 L2 cache lines
  •  L1 has separate cache lines for instructions and data - Harvard
  •  L2 shares cache lines between instructions and data - unified

Variable Alignment
We will start with the simple stuff. It is better to align just about everything to a long word boundary. Linux is written in the gcc programming language, and a careful study of the gcc standards document, "Using and Porting GNU CC" [Stallman00], is therefore necessary: no one embraces and extends quite like Richard Stallman. gcc is particularly helpful with structure fields, which are intelligently and automatically aligned. The ANSI C standard allows for packing or padding according to the implementation.


struct dirent {
    long            d_ino;
    __kernel_off_t  d_off;
    unsigned short  d_reclen;
    char            d_name[256];
};

Figure 2. struct dirent

gcc automatically aligns d_reclen to a long boundary. This works well for unsigned short, but for short on the x86 the compiler must insert sign extension instructions. If you are using a short to save space, consider using an unsigned short. For example, changing the field vm_avl_height into an unsigned short saves 32 bytes of instructions for a typical build. It could just as well be an int.


struct vm_area_struct {
    ...
    short            vm_avl_height;     // unsigned short or int
    ...
};

Figure 3. struct vm_area_struct
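The sign extension difference is visible at the C level through integer promotion; a minimal illustration (function names are mine):

```c
/* A short promoted to int must be sign-extended (movswl on x86),
 * while an unsigned short is zero-extended (movzwl). The signed
 * version is what costs the extra instructions mentioned above. */
static int promote_signed(short h)            { return h; }
static int promote_unsigned(unsigned short h) { return h; }
```

The bit pattern 0xffff reads back as -1 through the signed field but as 65535 through the unsigned one; only the former requires sign extension.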

Strings should be aligned as well. For example, strncmp() can compare two long words at a time, cheap SIMD, if both source and destination are long word aligned. The x86 code generator for egcs 2.95.2 has a nice little bug: it doesn't align short strings at all and aligns long strings to the cache line:


char* short_string = "a_short_string";
char* long_string = "a_long_long_long_long_long_long_long_string";

.LC0:
    .string "a_short_string"            // an unaligned string
    ...
    .align 32
.LC1:                                   // aligned to cache line
    .string "a_long_long_long_long_long_long_long_string"

Figure 4. GCC x86 string disassembly

What is necessary here is to align both strings to long words with .align 4. This uses less space and has better alignment. On a typical Linux build, this saves about 8K.
Cache Alignment of Structures
Arrays and lists of structures offer an opportunity to cache align large amounts of data. If the frequently accessed fields are collected into a single cache line, they can be loaded with a single memory access. This can reduce latency and cache footprint. However, it can also increase cache footprint if large amounts of data are being accessed. In this case, packing efficiency and also cache pollution are more important.
So for arrays, the base of an array should be cache aligned. The size of a structure must be either an integer multiple or an integer divisor of the cache line size. If these conditions hold, then by induction, each element of the array will be cache line aligned or packed. Linked structures are analogous for alignment but don't have the size constraint.
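Both conditions can be sketched with gcc's aligned attribute (the structure and names here are hypothetical, for illustration only):

```c
#include <stdint.h>

/* A 16 byte struct is an integer divisor of the 32 byte line, so
 * two elements pack exactly into each line once the array base is
 * cache aligned. Hypothetical structure for illustration. */
struct hot_entry {
    uint32_t key;
    uint32_t value;
    uint32_t hits;
    uint32_t flags;
};                                      /* sizeof == 16 */

static struct hot_entry table[64] __attribute__((aligned(32)));

/* Which 32 byte line, relative to the base, element i occupies. */
static int element_line(int i)
{
    return (int)(((uintptr_t)&table[i] - (uintptr_t)&table[0]) / 32);
}
```

Elements 0 and 1 share line 0, elements 2 and 3 share line 1, and so on; no element ever straddles a line boundary.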
An array of structures of type mem_map_t is used by the page allocator as a software page table:


/*
 * Try to keep the most commonly accessed fields in single cache lines
 * here (16 bytes or greater).  This ordering should be particularly
 * beneficial on 32-bit processors. ...
 */
typedef struct page {                   // from linux-2.4.0-test2
    struct list_head        list;       // 2,4
    struct address_space*   mapping;    // 1,2
    unsigned long           index;      // 1,2
    struct page*            next_hash;  // 1,2
    atomic_t                count;      // 1,1+1
    unsigned long           flags;      // 1,2
    struct list_head        lru;        // 2,4
    wait_queue_head_t       wait;       // 5,10
    struct page**           pprev_hash; // 1,2
    struct buffer_head*     buffers;    // 1,2
    unsigned long           virtual;    // 1,2
    struct zone_struct*     zone;       // 1,2
} mem_map_t;                            // 18 * 4 ==  72 x86
                                        // 36 * 4 == 144 Alpha

Figure 5. mem_map_t from linux-2.4.0-test2

On a 32-bit Pentium, the size of mem_map_t is 72 bytes. It was 40 bytes in 2.2.16. Since the array allocation code uses sizeof(mem_map_t) to align the array, the base is aligned incorrectly as well. In any case MAP_ALIGN() can be replaced with L1_CACHE_ALIGN(), which uses simpler code:


#define MAP_ALIGN(x) ((((x) % sizeof(mem_map_t)) == 0) ?            \
    (x) : ((x) + sizeof(mem_map_t) - ((x) % sizeof(mem_map_t))))

lmem_map = (struct page *)(PAGE_OFFSET +
    MAP_ALIGN((unsigned long)lmem_map - PAGE_OFFSET));

#define L1_CACHE_ALIGN(x) (((x)+(L1_CACHE_BYTES-1))                 \
    &~(L1_CACHE_BYTES-1))

lmem_map = (struct page*) L1_CACHE_ALIGN((unsigned long)lmem_map);

Figure 6. lmem_map alignment
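A quick check of the macro arithmetic (the constant matches the x86 tree; the wrapper function is mine, for testing only):

```c
#define L1_CACHE_BYTES 32
#define L1_CACHE_ALIGN(x) (((x)+(L1_CACHE_BYTES-1)) & ~(L1_CACHE_BYTES-1))

/* Round x up to the next 32 byte cache line boundary. */
static unsigned long cache_align_up(unsigned long x)
{
    return L1_CACHE_ALIGN(x);
}
```

Adding 31 and masking off the low 5 bits rounds up: 1 becomes 32, 33 becomes 64, and values already on a boundary are unchanged.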

On a 64-bit Alpha, a long is 8 bytes with an 8 byte alignment and sizeof(mem_map_t) is 144 bytes. The flags field doesn't need to be a long; it should be an int. Since atomic_t is also an int and the two fields are adjacent, they would pack into a single long word. The page wait queue head used to be a pointer. Changing it back would save enough to allow cache aligning both the 32-bit and 64-bit versions.
Cache Line Alignment for Different Architectures
It is possible to target and conditionally compile for a particular processor. Linux has an include file, <asm/cache.h>, defining the L1 cache line size, L1_CACHE_BYTES, for the x86 architecture family. The slab allocator [Bonwick94], which allocates small objects from memory pages, uses L1_CACHE_BYTES when a client requests a cache aligned object with the SLAB_HWCACHE_ALIGN flag.


/*
* include/asm-i386/cache.h
*/
#ifndef __ARCH_I386_CACHE_H
#define __ARCH_I386_CACHE_H
/* bytes per L1 cache line */
#if    CPU==586 || CPU==686
#define       L1_CACHE_BYTES  32
#else
#define       L1_CACHE_BYTES  16
#endif
#endif

Figure 7. include/asm-i386/cache.h

If someone got a Red Hat kernel conservatively compiled targeting the 486, then it assumed 16 byte cache lines. It was also wrong for the Athlon. This has been fixed in 2.4 by defining and using the kernel configuration macro CONFIG_X86_L1_CACHE_BYTES.
If you must assume one cache line size when laying out the fields inside of structs intended for portable software, use 32 byte cache lines. For example, mem_map_t could use this. Notice that 32 byte aligned cache lines are also 16 byte aligned. The PowerPC 601 nominally has a 64 byte cache line but it really has two connected 32 byte cache lines. The Sparc64 has a 32 byte L1 and a 64 byte L2 cache line. It is much easier to think of all systems as having 32 byte cache lines and enumerate the exceptions, if any. Alpha and Sparc64 have 32 byte cache lines but the Athlon and Itanium, the exceptions that prove the rule, have 64 byte cache lines. And the IBM S/390 G6 has a 256K L1 cache with 128 byte cache lines.
On the vast majority of processors, 32 byte cache lines are the right thing to do. And most importantly, if you have addressed and avoided the footprint and worst case thrashing scenarios in the 32 byte case, you will have avoided them for the other cases.
Caching and the Linux Scheduler
Linux represents each process with a task_struct, which is allocated as two 4K pages. The task list is a list of the task_structs of all existing processes. The runqueue is a list of the task_structs of all runnable processes. Each time the scheduler needs to find another process to run, it searches the entire runqueue for the most deserving process.
Some folks at IBM [Bryant00] noticed that if there were a couple of thousand threads, scheduling took a significant percentage of the available CPU time. On a uniprocessor machine with a couple of thousand native Java threads, the scheduler alone was taking up more than 25% of the available CPU. This gets worse on a shared memory SMP machine because memory bus contention goes up. This doesn't scale.
It turned out that the goodness() routine in the scheduler referenced several different cache lines in the task_struct. After reorganizing task_struct, goodness() now references only a single cache line, and the CPU cycle count was reduced from 179 cycles to 115 cycles. This is still a lot.
Here is the important cache line, the Linux scheduling loop and the goodness() routine. The scheduler loop iterates through the entire runqueue, evaluates each process with goodness() and finds the best process to run next.


struct task_struct {
    ...
    long             counter;           // critical 2nd cache line
    long             priority;
    unsigned long    policy;
    struct mm_struct *mm, *active_mm;
    int              has_cpu;
    int              processor;
    struct list_head run_list;          // only first long word
    ...
};

tmp = runqueue_head.next;
while (tmp != &runqueue_head) {
    p = list_entry(tmp, struct task_struct, run_list);
    if (can_schedule(p)) {              // running on another CPU
        int weight = goodness(p, this_cpu, prev->active_mm);
        if (weight > c)
            c = weight, next = p;
    }
    tmp = tmp->next;
}

#define PROC_CHANGE_PENALTY  15         // processor affinity

static inline int goodness(struct task_struct *p,
    int this_cpu, struct mm_struct *this_mm)
{
    int weight;
    if (p->policy != SCHED_OTHER) {
        weight = 1000 + p->rt_priority; // realtime processes
        goto out;
    }
    weight = p->counter;
    if (!weight)
        goto out;                       // no quanta left
#ifdef __SMP__
    if (p->processor == this_cpu)
        weight += PROC_CHANGE_PENALTY;  // processor affinity
#endif
    if (p->mm == this_mm)               // same thread class
        weight += 1;                    // as current?
    weight += p->priority;
out:
    return weight;
}

Figure 8. task_struct and scheduler loop

A long runqueue is certainly not the common case, even for heavily loaded servers. This is because event driven programs essentially self schedule with poll(). The contrasting style, threading, is preferred by Java, Apache and TUX. It is ironic that poll() also had scalability problems, on other Unix systems as well [Honeyman99]. Also, the Linux 2.4 x86 kernels increase the maximum number of threads past 4000.
On SMP machines, processes have a scheduling affinity with the last CPU they ran on. The idea is that some of the working set is still in the local cache. But the scheduler has a subtle SMP bug. When a CPU has no processes on the runqueue, the scheduler will assign it a runnable process with an affinity to another CPU. It would be wiser to first dole out more quanta to processes on the runqueue, perhaps only those with an affinity to that CPU. Even then it may be better to idle, particularly with a short runqueue.
Cache Line Prefetching
Modern CPUs aggressively prefetch instructions, but what about data? CPUs don't prefetch data cache lines, but vectorizing compilers do and programs can. Depending on the amount of CPU processing per cache line, you may need to prefetch more than one cache line ahead. If the prefetch is scheduled sufficiently far in advance, it won't matter if the cache line is in memory rather than L2 [Intel99a].
Typically prefetching is used in multimedia kernels and matrix operations where the prefetched address can be easily calculated. Algorithms operating on data structures can use prefetch as well. The same methods apply except that the prefetched address will follow a link rather than an address calculation. Prefetching for data structures is important since memory bandwidth is increasing faster than latency is decreasing. Traversing a data structure is more likely to suffer from a latency problem. Often only a few fields in a structure are used, whereas with multimedia usually every bit is examined.
Prefetching From Memory
If a prefetch instruction can be scheduled 20-25 or so instructions before the cache line will be used, the fetch can completely overlap instruction execution. The exact prefetch scheduling distance is a characteristic of the processor and memory. Superscalar processors execute more than one instruction at a time.
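With a later gcc, this kind of data prefetch can be expressed portably with __builtin_prefetch, which compiles to nothing on targets without a prefetch instruction. A sketch, with the distance of 64 elements being a guess that would need tuning against the scheduling distances above:

```c
#include <stddef.h>

/* Sum an array while prefetching well ahead of the running index,
 * so memory fetches overlap the arithmetic on resident lines. */
static long sum_with_prefetch(const long *p, size_t n)
{
    long total = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (i + 64 < n)
            __builtin_prefetch(&p[i + 64], 0, 3);  /* read, keep cached */
        total += p[i];
    }
    return total;
}

/* Tiny self-check: the prefetch must not change the result. */
static long sum_demo(void)
{
    static long a[100];
    size_t i;
    for (i = 0; i < 100; i++)
        a[i] = (long)i;
    return sum_with_prefetch(a, 100);
}
```

The second and third arguments to __builtin_prefetch are the read/write hint and the temporal locality hint; on the Pentium II they select among the prefetcht0/t1/t2/nta forms.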
Prefetching From L2
If an algorithm is traversing a data structure likely to be in L2, and it can schedule a prefetch 6-10 instructions before the cache line will be used, the fetch can completely overlap instruction execution.
The Linux scheduler loop is a good candidate for cache line prefetching from L2 because goodness() is short and, after the IBM patch, it only touches a single cache line.
Here is a prefetching version of the scheduler. It overlaps the prefetch of the next cache line from L2 with the execution of goodness().


tmp = runqueue_head.next;
while (tmp != &runqueue_head) {
    p = list_entry(tmp, struct task_struct, run_list);
    tmp = tmp->next;
    CacheLine_Prefetch(tmp->next);      // movl xx(%ebx),%eax
    if (can_schedule(p)) {
        int weight = goodness(p, this_cpu, prev->active_mm);
        if (weight > c)
            c = weight, next = p;
    }
}

Figure 9. Prefetching scheduler loop

By the way, moving the tmp pointer chase before the goodness() call ends up using fewer instructions than the original. And with a little effort, the loop could omit tmp as well.


inline void CacheLine_Prefetch(unsigned long addr)
{
    asm volatile("" : : "r" (*(char *)addr));
}

Figure 10. CacheLine_Prefetch()

This little bit of gcc magic is actually architecture-independent assembly code. It basically means load from addr into some temporary register of the compiler's choosing. So technically, I lied. It isn't a prefetch, it's really a preload. A prefetch offers more cache control but has some restrictions.
CacheLine_Prefetch() should be specialized on different architectures to take advantage of the various prefetch instructions. In fact, CacheLine_Prefetch() should be wrapped in conditional compilation logic because it may be inappropriate on certain Early Bronze Age machines. Also, AMD uses a slightly different set of prefetch instructions not strictly based on the MMX. On the Pentium II, this could be:


inline void CacheLine_Prefetch(unsigned long addr)
{
    asm volatile("prefetcht0 (%0)" :: "r" (addr));
}

Figure 11. Pentium II CacheLine_Prefetch()

Caches and Iterating Through Lists
When repeatedly iterating through an array or a list of structures, be careful of cache considerations. As the number of elements increases and approaches the number of cache lines available, thrashing will gradually increase. The gradually increasing thrashing is why this performance problem is hard to find.
The Linux scheduler iterates through the runqueue to find the next process to run. Linux also uses for_each_task() to iterate through each task_struct on the task list and perform some work. Iterating through lists represents potentially large amounts of memory traffic and cache footprint. Here is the for_each_task() iterator macro:


#define for_each_task(p)                                            \
    for (p = &init_task ; (p = p->next_task) != &init_task ; )

Figure 12. for_each_task() iterator

for_each_task() can be combined with CacheLine_Prefetch(). Notice that for_each_task() uses next_task, which isn't in the preferred cache line. This doubles the memory traffic and cache footprint.


#define for_each_task(p)                                            \
    for (p = &init_task ; p = p->next_task,                         \
        CacheLine_Prefetch((unsigned long)p->next_task),            \
        p != &init_task ; )

Figure 13. prefetching for_each_task() iterator

As an example, when all of the processes on the runqueue have used up their scheduling quanta, Linux uses for_each_task() to dole out more:


recalculate:
{
    struct task_struct *p;
    spin_unlock_irq(&runqueue_lock);
    read_lock(&tasklist_lock);
    for_each_task(p)
        p->counter = (p->counter >> 1) + p->priority;
    read_unlock(&tasklist_lock);
    spin_lock_irq(&runqueue_lock);
}
goto repeat_schedule;

Figure 14. Scheduler recalculate loop

As an observation, recalculate should iterate through the runqueue instead of the task list. It is shorter and there is no reason to dole out more quanta to sleeping processes. counter for sleeping processes will grow without bound. When a sleeping process wakes up, it may have a large amount of quanta stored up, disturbing the responsiveness of the other processes. FreeBSD [McKusick96] recomputes the scheduling priority if an awoken process was sleeping for more than a second.
Linux also uses for_each_task() occasionally when allocating a process id in get_pid():


for_each_task(p) {
    if (p->pid == last_pid ||
        p->pgrp == last_pid ||
        p->session == last_pid) {
        ...
    }
}

Figure 15. pid allocation loop

Examining the uses of for_each_task(), it is possible to change the critical task_struct cache line. Some fields are moved out, some need to be added and some need to be made smaller. active_mm is an unnecessary field for purposes of the scheduler loop. The kernel type pid_t is currently a 4 byte int. However, the maximum value for a process id is PID_MAX, which is 0x8000. So pid_t can be changed to an unsigned short. (NB, PID_MAX will probably increase in 2.5.) priority is limited to the range 0..40 and counter is derived from it. policy is restricted to six values. processor is currently an int and NR_CPUS is 32, so changing it to an unsigned short is reasonable.
After these changes, several of the uses of for_each_task(), as well as the process id hash find_task_by_pid, restrict their references to a single cache line.


struct task_struct {
    ...
    unsigned char counter;              // beginning
    unsigned char priority;
    unsigned char policy_has_cpu;
        /* one char to spare */         // one
    unsigned short processor;
    unsigned short pid;                 // two
    unsigned short pgrp;
    unsigned short session;             // three
    struct mm_struct *mm;               // four
    struct task_struct *next_task;      // five
    struct task_struct *pidhash_next;   // six
    struct task_struct *run_next;       // seven
        /* one long word to spare */    // eight
    ...
};

Figure 16. Packed task_struct for scheduler

If possible, squeeze everything the processing loop uses into as few cache lines as possible, preferably just one. In your data structures, if you have to use short instead of int, use short. If it will make it fit, use nibbles and bits. If you can't say it in 32 bytes, perhaps you need to rephrase yourself.
Thrashing
OK, you might want to close your eyes about now because it's going to start getting really gory. Now we look at the worst case scenarios of thrashing. It also gets more complex.
The task_struct is allocated as two pages which are 4K aligned. The L1 cache divides the 32-bit address space among the 128 groups, each with a set of 4 cache lines. If two addresses are separated by a multiple of 4K bytes, then they map to the same cache line set. So for the 4-way set associative L1 cache on a Pentium II, there are 4 cache lines available for the scheduling related cache line for all of the task_structs.
Really that isn't so bad. Asking a task_struct cache line to remain in L1 for a schedule quantum is, I admit, a bit much. But the situation doesn't really improve for L2. A 256K L2 cache will provide only 64 suitable cache lines. The way this works is that a task_struct is 4K page aligned, so in the L2 cache set index, bits 5-11 will be fixed. That leaves 4 bits of possibly unique cache set index, or 16 sets of 4 cache lines, or 64 available cache lines.
Furthermore, L2 is managed LRU. If you are iterating through a runqueue longer than the effective L2 set depth of 64, when you exceed the set size of L2, the cache thrashes and you may as well have no cache. Then, every reschedule, the scheduler is scanning the entire runqueue from memory. Prefetching is rendered useless. Also, the set is not a list of 64 cache lines but really 16 associative subsets of 4 cache lines. Thrashing begins gradually before reaching the set size of 64.
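The 64-line figure can be reproduced by brute force (an illustrative helper of mine, not kernel code): count the distinct L2 set indices reachable from 4K aligned addresses.

```c
/* L2 set index on this Pentium II is bits 5-15 (11 bits, 2048
 * sets). For 4K aligned addresses, set-index bits 5-11 are fixed
 * at zero, so only 16 of the 2048 sets are ever used:
 * 16 sets * 4 ways = 64 suitable cache lines. */
static int distinct_l2_sets(int n)
{
    char seen[2048] = {0};
    int i, count = 0;

    for (i = 0; i < n; i++) {
        unsigned long addr = (unsigned long)i * 4096;  /* 4K aligned */
        int set = (int)((addr >> 5) & 0x7ff);
        if (!seen[set]) {
            seen[set] = 1;
            count++;
        }
    }
    return count;
}
```

No matter how many 4K aligned task_structs are walked, the count saturates at 16 sets.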
But it gets worse. I forgot to tell you about the TLB. Logical addresses are mapped to physical addresses via a two level page table in memory and an on-chip memory management unit.


virtual address
    page group - bits 22-31
    page address - bits 12-21
    page offset - bits 0-11

physical page lookup
    table_level_2 = table_level_1[(vaddr & mask1) >> shift1];
    page          = table_level_2[(vaddr & mask2) >> shift2];

Figure 17. Pentium II VM address structure and translation

These mappings are expensive to compute. The masks and shifts are free but the loads and adds are serial. The Alpha and Itanium use a 3 level page table structure. Some page tables can be so large they must be demand paged. But since the mappings rarely change, they are computed on demand and the results are cached in the TLB. The Pentium II Translation Lookaside Buffer (TLB) uses a Harvard 4-way set associative cache with 64 data entries and 32 instruction entries. The TLB replacement policy is LRU within the set. If the TLB doesn't find a match, a TLB miss takes place. The cost of a TLB miss ranges from a reported 5 cycles on a Xeon to 60 or more cycles. On the PowerPC it is handled in software. The PowerPC TLB miss handler is 29 instructions.
Virtual memory pages are 4K. A given task_struct page will compete for 4 TLB slots. Iterating through a long linked list (64) of 4K aligned pages is really the worst case scenario for the TLB. The iteration will flush the TLB, slowly, suffering TLB misses along the way. But for the purposes of the runqueue, the TLB could just as well have no entries at all. Linux invalidates the TLB at each process context switch. Then, every entry on the runqueue will be a TLB miss regardless of the size of the runqueue. The only case where this doesn't happen is if a process context switches to itself. Even in this case, a long runqueue can thrash and flush the TLB cache. Also, the TLB miss is synchronous with the cache line miss, and a TLB miss will cause the prefetch to be ignored.
Avoid iterating through collections of 4K pages. It is a worst case cache and TLB scenario. Actually it is a member of a dysfunctional family of worst case scenarios. If addresses differ by a power of the cache line size, 64, 128, 256, ..., they compete for cache lines. Each increase in the exponent halves the number of cache lines available, down to the minimum of the cache set size. If addresses differ by a power of the page size, 8192, 16384, ..., they compete for diminishing TLB slots. And while it is possible to add more L2 on some systems, this isn't possible for TLB slots.
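The halving effect is easy to demonstrate numerically (again an illustrative helper, not kernel code): count how many of the 128 L1 sets a power-of-two stride can ever touch.

```c
/* L1 set index is bits 5-11. Doubling a power-of-two stride beyond
 * the 32 byte line size halves the number of sets in play, until a
 * 4096 byte stride funnels every access into a single 4-line set. */
static int l1_sets_touched(unsigned long stride)
{
    char seen[128] = {0};
    int i, count = 0;

    for (i = 0; i < 1024; i++) {
        int set = (int)((((unsigned long)i * stride) >> 5) & 0x7f);
        if (!seen[set]) {
            seen[set] = 1;
            count++;
        }
    }
    return count;
}
```

A 32 byte stride uses all 128 sets, 64 bytes uses 64 sets, 128 bytes uses 32, and a 4096 byte stride leaves just one set of 4 cache lines for everything.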
Pollution
An application can use prefetching to overlap execution with memory fetch. But once referenced, some memory is used only once; it shouldn't evict more important cache lines and shouldn't take up space in the cache. If prefetch determines what cache lines will be needed, the cache control instructions determine what cache lines will be retained. With respect to performance, avoiding cache pollution is as important as prefetch.
As an example, there is a problem with the cache line oriented task_struct. The for_each_task() macro iterates through every task, polluting the L2 cache, flushing good data and loading junk. The junk data will mean further cache misses later.
On the Pentium II there is the non-temporal prefetchnta instruction. This loads a cache line into L1. If it is already in L2, it is loaded from L2. But if it isn't in L2, it isn't loaded into L2. The PowerPC doesn't have this sort of instruction. On that processor, prefetch first and then cache flush a junk cache line after using it. A junk cache line in this case would be a process not on the runqueue. This reduces L2 cache pollution but doesn't avoid it altogether as in the Pentium II prefetchnta example.


inline void CacheLine_Flush(unsigned long addr)
{ /* no cache line flush on the x86 */ }

inline void CacheLine_NT_Prefetch(unsigned long addr)
{
    asm volatile("prefetchnta (%0)" ::"r" (addr));
}

#define for_each_task(p)                                        \
    for (p = &init_task;                                        \
         CacheLine_Flush((unsigned long) p),                    \
         p = p->next_task,                                      \
         CacheLine_NT_Prefetch((unsigned long) p->next_task),   \
         p != &init_task; )

Figure 18. Pollution avoiding prefetching

Interrupt handlers and basic functions such as copying i/o buffer cache pages to and from user space with memcpy() would benefit from non-temporal prefetch, cache flush and streaming store instructions to avoid polluting L2. The Linux 2.4 x86 _mmx_memcpy() prefetches source cache lines. For a Pentium II, it should use prefetchnta for both the source and destination in order to avoid flushing and polluting L2.
On the Pentium II, an initial preload is necessary to prime the TLB for any following prefetch instructions. Otherwise the prefetch will be ignored. This is another argument against iterating through lists or arrays of 4K structures: a preload is necessary to prime the TLB but the preload will pollute the cache and there is no cache line flush instruction on the Pentium II.
As an example, this is a Pentium II user mode cache line block copy for up to a 4K page. This function assumes that the source and destination will not be immediately reused. The prefetchnta instruction marks the source and destination cache lines as non-temporal. A Pentium III version [Intel99b] can use movntq and the streaming instructions. However, the streaming instructions require additional OS support for saving the 8 128-bit Katmai registers at context switch. Patches are available for 2.2.14+ [Ingo00].


void PII_BlockCopy(char* src, char* dst, int count)
{
    char    *limit;

    asm volatile("" :: "r" (*src));               // prime the TLB for prefetch
    asm volatile("" :: "r" (*dst));
    asm volatile("" :: "r" (*(src + count - 1))); // copy may span a page
    asm volatile("" :: "r" (*(dst + count - 1)));

    for (limit = src + count; src < limit; src += 32, dst += 32) {
        asm volatile("prefetchnta (%0)"  :: "r" (src));
        asm volatile("prefetchnta (%0)"  :: "r" (dst));
        asm volatile("movq  0(%0),%%mm0" :: "r" (src));
        asm volatile("movq  8(%0),%%mm1" :: "r" (src));
        asm volatile("movq 16(%0),%%mm2" :: "r" (src));
        asm volatile("movq 24(%0),%%mm3" :: "r" (src));
        asm volatile("movq %%mm0, 0(%0)" :: "r" (dst));
        asm volatile("movq %%mm1, 8(%0)" :: "r" (dst));
        asm volatile("movq %%mm2,16(%0)" :: "r" (dst));
        asm volatile("movq %%mm3,24(%0)" :: "r" (dst));
    }
    asm volatile("emms");                         // empty the MMX state
}

Figure 19. PII_BlockCopy()

Another candidate is memset(). Idle task page clearing with memset() has been tried but it isn't done because of cache pollution. Memory stores fill up cache lines just as memory loads do. But cache pollution can be avoided by prefetchnta of the destination cache line followed by the store. prefetchnta tags the destination cache line as non-temporal ahead of time. Another alternative on the PowerPC is marking the page to be cleared as non-cacheable but this is privileged [Dougan99].
False Sharing
A variation on the theme of thrashing is False Sharing [HennPatt96]. Two variables contained in a single cache line are updated by two different CPUs on a multiprocessor. When the first CPU stores into its variable in its cache line copy, it invalidates the cache line copy in the second CPU. When the second CPU stores into its variable, it reloads the cache line, stores into it and invalidates the cache line copy in the first CPU. This is thrashing. Allocating each variable its own cache line solves this problem.
An example from the scheduler, showing the problem and solution, is the current task array variable:


struct task_struct *current_set[NR_CPUS];   // Linux 2.0.35

static union {                              // Linux 2.2.14
    struct schedule_data {
        struct task_struct *curr;
        cycles_t last_schedule;
    } schedule_data;
    char __pad [SMP_CACHE_BYTES];
} aligned_data [NR_CPUS] __cacheline_aligned = {{{&init_task,0}}};

Figure 20. SMP scheduler global data array
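The same padding trick works for any per-CPU data, in user space as well as in the kernel. A minimal sketch, assuming a 32-byte line; struct padded_counter and NR_CPUS here are illustrative names, not the kernel's:

```c
#include <assert.h>
#include <stddef.h>

#define SMP_CACHE_BYTES 32   /* assumed L1 cache line size */
#define NR_CPUS          4

/* Per-CPU counters padded so each slot owns a whole cache line.
 * A store by one CPU never invalidates another CPU's line. */
struct padded_counter {
    union {
        long count;
        char __pad[SMP_CACHE_BYTES];
    } u;
};

static struct padded_counter counters[NR_CPUS]
        __attribute__((aligned(SMP_CACHE_BYTES)));
```

The union pads each element to exactly one line, and the aligned attribute keeps element 0 from straddling a line boundary, so neighboring CPUs never share a line.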

Page Coloring
Two memory lines separated by modulo 64K compete for just 4 L2 cache lines. Within this 64K span, memory lines do not compete. Breaking this 64K up into 16 4K memory pages, each has a unique page color. Physical memory pages of different color don't compete for cache.
Ideally if two virtual memory pages will be used at the same time, they should be mapped to physical pages of different color [Lynch93]. In particular, simple direct mapped caches only have a single suitable cache line. Pages 0,1,2,3 are in contiguous order and of different page color. Pages 16,1,18,35 also each have different color. Page coloring also improves performance repeatability.
Higher levels of L2 associativity argue against the expense of supporting page coloring in the OS. Linux does not currently support page colored allocation because it would be too expensive. FreeBSD attempts to allocate pages in color order and there has been discussion of this for Linux if an inexpensive page allocation approach can be found.
However, page coloring is supported in one special case: __get_dma_pages(). This kernel function allocates up to 32 physically contiguous pages, therefore pages in color order. A hint flag for the mmap() system call could request this.
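The color arithmetic itself is trivial; the expense is in the allocator bookkeeping. A sketch, assuming 4K pages and the 16 colors of the 64K span described above:

```c
#include <assert.h>

#define PAGE_SHIFT  12        /* 4K pages */
#define NUM_COLORS  16        /* 64K span / 4K page, per the example above */

/* Page color of a physical address: which 4K slice of the 64K span
 * the page occupies.  Pages of equal color compete for the same
 * cache lines; pages of different color do not. */
static unsigned int page_color(unsigned long phys)
{
    return (phys >> PAGE_SHIFT) % NUM_COLORS;
}
```

Pages 0 and 16 (addresses 64K apart) have the same color; consecutive pages always differ.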
Constraints
Memory lines within aligned structures, perhaps aligned within pages, are constrained. The greater the alignment constraint, the fewer eligible cache lines. Constraints cause conflict cache misses.
For example, in the task_struct case the important cache line is competing for 64 entries. This is known as a hot cache line: it has a constraint problem. It would be better for scheduling purposes to prune off the scheduling information and set up a slab allocated cache. The task_struct already has several data structures hanging off of it to support sharing data structures among related threads. This would be one more.
A slab is a page from which objects are allocated and freed. If a slab is full, another is constructed and the object is allocated from it. Iterating through dynamic structures allocated from slabs suffers fewer TLB misses because the structures are packed into pages.
For larger structures, frequently accessed fields are often clumped together, usually at the beginning of the structure, causing a constraint problem. Since slabs are page aligned, the slab allocator balances the cache load transparent to its clients. An offset which is a multiple of the cache line size is added as a bias.
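The bias arithmetic can be sketched as follows. slab_color() is a hypothetical simplification of what a Bonwick-style allocator [Bonwick94] computes, assuming 4K slabs and 32-byte lines; it is not the kernel's actual code:

```c
#include <assert.h>

#define CACHE_LINE  32
#define SLAB_SIZE   4096      /* one page per slab, assumed */

/* Byte offset of the first object in the nth slab: advance by one
 * cache line per successive slab, wrapping once the offset would eat
 * more than the slack left over after packing the objects.  The hot
 * first field of each object thus lands in a different cache set in
 * each slab. */
static unsigned int slab_color(unsigned int slab_index,
                               unsigned int object_size,
                               unsigned int objects_per_slab)
{
    unsigned int slack = SLAB_SIZE - object_size * objects_per_slab;
    unsigned int max_colors = slack / CACHE_LINE + 1;

    return (slab_index % max_colors) * CACHE_LINE;
}
```

For 256-byte objects packed 15 to a slab there are 256 slack bytes, so nine distinct colors before the cycle repeats.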
Back To The Future
One solution for the scheduler is fairly simple: for an aligned array of structures, if the size of the structure is an odd multiple of the cache line size, it won't have a constraint problem.
Here is a single cache line version (one is an odd number) of the critical scheduling fields:


struct proc {
    unsigned char       priority;
    unsigned char       policy_has_cpu;
    unsigned short      processor;        // one
    unsigned short      pid;
    unsigned short      pgrp;             // two
    unsigned short      session;          // three - spare short
    struct mm_struct*   mm;               // four
    struct proc*        next_task;        // five
    struct proc*        pidhash_next;     // six
    struct proc*        run_next;         // seven
    struct task_struct* task_struct;      // eight
};

Figure 21. Single cache line task_struct
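The odd multiple rule can be checked arithmetically. In this sketch (cache geometry assumed as earlier: 32-byte lines, 2048 sets), an array stride of three cache lines visits every set before repeating, while a stride of four lines revisits its first set after only a quarter of them:

```c
#include <assert.h>

#define LINE_SIZE 32
#define NUM_SETS  2048   /* assumed: 64K way / 32-byte lines */

/* Cache set of the first line of element i in a cache aligned array
 * of structures of 'size' bytes.  When size is an odd multiple of
 * LINE_SIZE, the stride in lines is coprime with the power-of-two
 * set count, so consecutive elements spread over every set.  An even
 * power-of-two multiple cycles through only a fraction of the sets. */
static unsigned int element_set(unsigned int i, unsigned int size)
{
    return (i * size / LINE_SIZE) % NUM_SETS;
}
```

With a 128-byte (four line) structure, elements 0 and 512 collide; with a 96-byte (three line) structure, the first collision is not until element 2048.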

Or it can be made into a two cache line structure and a few other task_struct fields can be added. If you are familiar with old Unix implementations, the cache line oriented task_struct is the reinvention of the proc structure. Quoting the 1977 Unix Version 6 header file:


/*
* One structure allocated per active
* process.  It contains all data needed
* about the process while the
* process may be swapped out.
* Other per process data (user.h)
* is swapped with the process.
*/

Figure 22. From 1977 Unix Version 6

The Linux scheduler searches the entire runqueue each reschedule. The FreeBSD runqueue is simpler [McKusick96]. It was derived from the 1978 VAX/VMS rescheduling interrupt handler which was all of 28 instructions. From the FreeBSD source file:


/*
* We have NQS (32) run queues per scheduling class.  For the
* normal class, there are 128 priorities scaled onto these
* 32 queues.  New processes are added to the last entry in each
* queue, and processes are selected for running by taking them
* from the head and maintaining a simple FIFO arrangement.
* Realtime and Idle priority processes have an explicit 0-31
* priority which maps directly onto their class queue index.
* When a queue has something in it, the corresponding bit is
* set in the queuebits variable, allowing a single read to
* determine the state of all 32 queues and then a ffs() to find
* the first busy queue.
*/

Figure 23. From FreeBSD

A uniprocessor reschedule is simplicity itself:


static struct rq    queues[32];
static u_int32_t    queuebits;
int                 qi;

qi = (current->priority >> 2);
SetBit(&queuebits, qi);
InsertQueue(&queues[qi], current);

qi = FindFirstSet(queuebits);
if (RemoveQueue(&queues[qi], &current) == Q_EMPTY)
    ClearBit(&queuebits, qi);

Figure 24. Queueing scheduler context switch

A best of breed Linux scheduler would use the FreeBSD runqueue and support the Linux scheduler policies of the goodness() function: soft realtime scheduling, SMP CPU affinity and thread batching.
In Summary, You Are Doomed
Cache programming is a collection of arcana and techniques. What follows is an attempt to organize them and the final chart is an attempt to distill them.
Alignment and Packing
When laying out structures for portable software, assume a 32 byte cache line.
For arrays, the base of an array should be cache aligned. The size of a structure must be either an integer multiple or an integer divisor of the cache line size. If these conditions hold, then by induction each element of the array will be cache line aligned or packed. Linked structures are analogous for alignment but don't have the size constraint.
To specify a cache line alignment for types or variables with gcc, use the aligned attribute:


struct Box {
    int    element;
} __attribute__ ((aligned(SMP_CACHE_BYTES)));

Figure 25. Cache alignment of structure types

In an application, to allocate cache aligned memory, use memalign(32, size) instead of malloc(size).
Data structures change. Write a program which validates packing, alignments and worst case scenarios.
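Such a validation program can be largely compile time. A sketch using the classic negative-array-size trick for static assertions; struct hot is a hypothetical stand-in for whatever structure is under test:

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 32

struct hot {                 /* hypothetical structure under test */
    void *next;
    long  key;
    long  value;
    long  pad;
};

/* Compile-time checks: the build breaks (negative array size) if the
 * structure size stops dividing or multiplying the line evenly, or if
 * two fields that are used together drift onto different lines. */
typedef char hot_size_ok[(CACHE_LINE % sizeof(struct hot) == 0 ||
                          sizeof(struct hot) % CACHE_LINE == 0) ? 1 : -1];
typedef char hot_same_line[(offsetof(struct hot, key) / CACHE_LINE ==
                            offsetof(struct hot, value) / CACHE_LINE) ? 1 : -1];
```

Because the checks are typedefs, they cost nothing at run time; a careless field addition fails the build instead of silently spilling a hot structure onto a second cache line.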
Cache Footprint
Reduce your cache footprint. Paraphrasing the Elements of Style [Strunk99]:

  •         Omit needless words.
  •         Keep related words together.
  •         Do not break cache lines in two.
  •         Avoid a succession of loose cache lines.
  •         Make the cache line the unit of composition.

When repeatedly iterating through a large array or list of structures, be careful of cache considerations. As the number of elements increases and approaches the number of cache lines available, cache misses will gradually increase to the point of thrashing.
Ignoring page color issues for now, large memory scans approaching the size of the cache flush the cache. Large memory copies approaching half of the cache flush the cache. Both pollute the cache with junk.
Avoid polluting the cache with junk data. If a cache line won't be reused for some time, flush it. If streaming through a lot of data, non-temporally prefetch source and destination to avoid polluting the L2 cache. Otherwise you may pollute the cache with junk data entailing further cache misses.
Interrupt handlers should avoid cache pollution by using non-temporalprefetch, cache flush and streaming store instructions.
Shared libraries allow many processes to share the same instruction memory. This creates more opportunities for cache sharing and less cache competition.
Prefetch
If you are working on one cache line and then another, prefetch the next cache line before doing the work on the first. This overlaps the work and the prefetch. Prefetch hides memory latency.
The prefetch scheduling distance for memory is about 20-25 instructions. The prefetch scheduling distance for L2 is about 6-10 instructions. These are very rough guides.
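With GCC, the portable __builtin_prefetch() makes it easy to experiment with the distance without writing inline asm. A sketch, assuming 32-byte lines and a distance of eight lines; the right distance is machine dependent and must be measured:

```c
#include <stddef.h>

#define LINE_SIZE      32
#define PREFETCH_AHEAD 8    /* lines ahead of the current one; tune this */
#define LONGS_PER_LINE (LINE_SIZE / sizeof(long))

/* Sum an array, issuing one prefetch per cache line, well ahead of the
 * line being summed, so the memory fetch overlaps the arithmetic.
 * Prefetching past the end of the array is harmless: prefetch of an
 * invalid address never faults. */
static long sum_with_prefetch(const long *a, size_t n)
{
    long   total = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (i % LONGS_PER_LINE == 0)
            __builtin_prefetch(a + i + PREFETCH_AHEAD * LONGS_PER_LINE, 0, 0);
        total += a[i];
    }
    return total;
}
```

The third argument of __builtin_prefetch is the temporal locality hint; 0 corresponds to the non-temporal case discussed earlier.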
Worst Case Scenarios
Avoid worst case scenarios: for associative cache memory, if addresses differ by a power of the cache line size, 64, 128, 256, ..., they are constrained to use fewer cache lines. Each increase in the exponent halves the number of cache lines available, down to the minimum of the cache set size. Similarly, on an associative TLB, if addresses differ by a power of the page size, 8192, 16384, ..., they are constrained to use fewer TLB slots with each increase in exponent.
False Sharing - avoid updating a cache line shared between two or more CPUs.
Constraints - bias the allocation of small objects by a multiple cache line offset to avoid hot cache lines. The slab allocator does this. For an array of structures, use odd multiples of the cache line size.
Coloring - if necessary, allocate physical memory pages in color order. They won't compete for cache.
If you need uniquely colored pages in the kernel or in a module, use __get_dma_pages(gfp_mask, order) to allocate up to 32 pages, usually 128K. They will be physically contiguous and therefore in color order.
General

Cache Policy   Cache Miss   Description                             Suggestion
-------------------------------------------------------------------------------------------
load           compulsory   first access                            align, pack and prefetch
placement      conflict     access mismatched to cache design       avoid worst cases
replacement    capacity     working set is larger than cache size   avoid cache pollution
store          combine      single store or large streaming store   all of the above, combine writes,
                                                                    use non-temporal instructions


Acknowledgements
Scott Maxwell, David Sifry and Dan Sears suffered through and improved earlier drafts.
References
[Bonwick94] Jeff Bonwick, The Slab Allocator: An Object-Caching Kernel Memory Allocator, Usenix Summer 1994 Technical Conference, 1994.
[Bryant00] Ray Bryant and Bill Hartner, Java technology, threads, and scheduling in Linux: Patching the kernel scheduler for better Java performance, IBM Developer Works, January 2000.
[Dougan99] Cort Dougan, Paul Mackerras and Victor Yodaiken, Optimizing the Idle Task and Other MMU Tricks, Third Symposium on Operating Systems Design and Implementation, 1999.
[Handy98] Jim Handy, The Cache Memory Book, 2nd edition, Academic Press, 1998.
[HennPatt96] John Hennessy and David Patterson, Computer Architecture, 2nd edition, Morgan Kaufmann, 1996.
[Honeyman99] Peter Honeyman et al, The Linux Scalability Project, University of Michigan CITI-99-4, 1999.
[Ingo00] Ingo Molnar, http://people.redhat.com/mingo/mmx-patches/
[Intel99a] Intel Architecture Optimization Reference Manual, developer.intel.com, 1999.
[Intel99b] Block Copy Using Intel Pentium III Processor Streaming SIMD Extensions, developer.intel.com, 1999.
[Lynch93] William L. Lynch, The Interaction of Virtual Memory and Cache Memory, Stanford CSL-TR-93-587, 1993.
[Maxwell99] Scott Maxwell, Linux Core Kernel Commentary, Coriolis, 1999.
[McKusick96] Kirk McKusick et al, The Design and Implementation of the 4.4BSD Operating System, Addison Wesley, 1996.
[Schimmel94] Curt Schimmel, UNIX Systems for Modern Architectures: Symmetric Multiprocessing and Caching for Kernel Programmers, Addison Wesley, 1994.
[Seward] Julian Seward, Cacheprof, www.cacheprof.org
[Shanley97] Tom Shanley, Pentium Pro and Pentium II System Architecture, 2nd edition, Mindshare, 1997.
[Stallman00] Richard Stallman, Using and Porting GNU CC: for version 2.95, Free Software Foundation, 2000.
[Strunk99] Strunk and White, The Elements of Style, 4th edition, Allyn and Bacon, 1999.