The Slab Allocator:
An Object-Caching Kernel Memory Allocator
Jeff Bonwick
Sun Microsystems
Abstract
This paper presents a comprehensive design over-
view of the SunOS 5.4 kernel memory allocator.
This allocator is based on a set of object-caching
primitives that reduce the cost of allocating complex
objects by retaining their state between uses. These
same primitives prove equally effective for manag-
ing stateless memory (e.g. data pages and temporary
buffers) because they are space-efficient and fast.
The allocator’s object caches respond dynamically
to global memory pressure, and employ an object-
coloring scheme that improves the system’s overall
cache utilization and bus balance. The allocator
also has several statistical and debugging features
that can detect a wide range of problems throughout
the system.
1. Introduction
The allocation and freeing of objects are among the
most common operations in the kernel. A fast ker-
nel memory allocator is therefore essential. How-
ever, in many cases the cost of initializing and
destroying the object exceeds the cost of allocating
and freeing memory for it. Thus, while improve-
ments in the allocator are beneficial, even greater
gains can be achieved by caching frequently used
objects so that their basic structure is preserved
between uses.
The paper begins with a discussion of object
caching, since the interface that this requires will
shape the rest of the allocator. The next section
then describes the implementation in detail. Section
4 describes the effect of buffer address distribution
on the system’s overall cache utilization and bus
balance, and shows how a simple coloring scheme
can improve both. Section 5 compares the
allocator’s performance to several other well-known
kernel memory allocators and finds that it is
generally superior in both space and time. Finally,
Section 6 describes the allocator’s debugging
features, which can detect a wide variety of prob-
lems throughout the system.
2. Object Caching
Object caching is a technique for dealing with
objects that are frequently allocated and freed. The
idea is to preserve the invariant portion of an
object’s initial state — its constructed state —
between uses, so it does not have to be destroyed
and recreated every time the object is used. For
example, an object containing a mutex only needs
to have mutex_init() applied once — the first
time the object is allocated. The object can then be
freed and reallocated many times without incurring
the expense of mutex_destroy() and
mutex_init() each time. An object’s embedded
locks, condition variables, reference counts, lists of
other objects, and read-only data all generally qual-
ify as constructed state.
Caching is important because the cost of con-
structing an object can be significantly higher than
the cost of allocating memory for it. For example,
on a SPARCstation-2 running a SunOS 5.4 develop-
ment kernel, the allocator presented here reduced
the cost of allocating and freeing a stream head
from 33 microseconds to 5.7 microseconds. As the
table below illustrates, most of the savings was due
to object caching:
        Stream Head Allocation + Free Costs (μsec)
    ____________________________________________________
                  construction    memory        other
    allocator     + destruction   allocation    init.
    ____________________________________________________
    old               23.6            9.4        1.9
    new                0.0            3.8        1.9
    ____________________________________________________
Caching is particularly beneficial in a mul-
tithreaded environment, where many of the most
frequently allocated objects contain one or more
embedded locks, condition variables, and other con-
structible state.
The design of an object cache is
straightforward:
To allocate an object:

    if (there's an object in the cache)
        take it (no construction required);
    else {
        allocate memory;
        construct the object;
    }

To free an object:

    return it to the cache (no destruction required);

To reclaim memory from the cache:

    take some objects from the cache;
    destroy the objects;
    free the underlying memory;
An object’s constructed state must be initial-
ized only once — when the object is first brought
into the cache. Once the cache is populated, allo-
cating and freeing objects are fast, trivial operations.
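The cache design above can be sketched in C as follows. This is an illustrative, single-threaded sketch only; the names (cache_alloc, cache_free, cache_reap, demo_ctor) are hypothetical and not part of the allocator's interface:

```c
#include <stdlib.h>

struct obj {
    struct obj *next;            /* freelist linkage while cached */
    /* ... object fields and constructed state ... */
};

static struct obj *cache_head;   /* cached, already-constructed objects */
static int constructions;        /* demo: counts one-time constructions */

static void demo_ctor(struct obj *o) { (void)o; constructions++; }
static void demo_dtor(struct obj *o) { (void)o; }

/* To allocate: take a cached object if one exists, else construct. */
static struct obj *
cache_alloc(void (*constructor)(struct obj *))
{
    struct obj *o = cache_head;
    if (o != NULL) {             /* cache hit: no construction */
        cache_head = o->next;
        return o;
    }
    o = malloc(sizeof (*o));     /* cache miss: allocate memory ... */
    if (o != NULL)
        constructor(o);          /* ... and construct it, once */
    return o;
}

/* To free: return the object to the cache; no destruction. */
static void
cache_free(struct obj *o)
{
    o->next = cache_head;
    cache_head = o;
}

/* To reclaim: destroy cached objects and free their memory. */
static void
cache_reap(void (*destructor)(struct obj *))
{
    while (cache_head != NULL) {
        struct obj *o = cache_head;
        cache_head = o->next;
        destructor(o);
        free(o);
    }
}
```

Note that a second allocation after a free reuses the cached object without invoking the constructor again, which is the entire point of the technique.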
2.1. An Example
Consider the following data structure:
struct foo {
    kmutex_t   foo_lock;
    kcondvar_t foo_cv;
    struct bar *foo_barlist;
    int        foo_refcnt;
};
Assume that a foo structure cannot be freed until
there are no outstanding references to it
(foo_refcnt == 0) and all of its pending bar
events (whatever they are) have completed
(foo_barlist == NULL). The life cycle of a
dynamically allocated foo would be something like
this:
    foo = kmem_alloc(sizeof (struct foo), KM_SLEEP);
    mutex_init(&foo->foo_lock, ...);
    cv_init(&foo->foo_cv, ...);
    foo->foo_refcnt = 0;
    foo->foo_barlist = NULL;

    use foo;

    ASSERT(foo->foo_barlist == NULL);
    ASSERT(foo->foo_refcnt == 0);
    cv_destroy(&foo->foo_cv);
    mutex_destroy(&foo->foo_lock);
    kmem_free(foo);
Notice that between each use of a foo object we
perform a sequence of operations that constitutes
nothing more than a very expensive no-op. All of
this overhead (i.e., everything other than ‘‘use foo’’
above) can be eliminated by object caching.
2.2. The Case for Object Caching in the
Central Allocator
Of course, object caching can be implemented
without any help from the central allocator — any
subsystem can have a private implementation of the
algorithm described above. However, there are
several disadvantages to this approach:
(1) There is a natural tension between an object
cache, which wants to keep memory, and the
rest of the system, which wants that memory
back. Privately-managed caches cannot handle
this tension sensibly. They have limited
insight into the system’s overall memory needs
and no insight into each other’s needs. Simi-
larly, the rest of the system has no knowledge
of the existence of these caches and hence has
no way to ‘‘pull’’ memory from them.
(2) Since private caches bypass the central alloca-
tor, they also bypass any accounting mechan-
isms and debugging features that allocator may
possess. This makes the operating system
more difficult to monitor and debug.
(3) Having many instances of the same solution to
a common problem increases kernel code size
and maintenance costs.
Object caching requires a greater degree of coopera-
tion between the allocator and its clients than the
standard kmem_alloc(9F)/kmem_free(9F)
interface allows. The next section develops an
interface to support constructed object caching in
the central allocator.
2.3. Object Cache Interface
The interface presented here follows from two
observations:
(A) Descriptions of objects (name, size, alignment,
constructor, and destructor) belong in the
clients — not in the central allocator. The
allocator should not just ‘‘know’’ that
sizeof (struct inode) is a useful pool
size, for example. Such assumptions are brittle
[Grunwald93A] and cannot anticipate the needs
of third-party device drivers, streams modules
and file systems.
(B) Memory management policies belong in the
central allocator — not in its clients. The
clients just want to allocate and free objects
quickly. They shouldn’t have to worry about
how to manage the underlying memory
efficiently.
It follows from (A) that object cache creation must
be client-driven and must include a full specification
of the objects:
(1) struct kmem_cache *kmem_cache_create(
        char *name,
        size_t size,
        int align,
        void (*constructor)(void *, size_t),
        void (*destructor)(void *, size_t));
Creates a cache of objects, each of size size,
aligned on an align boundary. The align-
ment will always be rounded up to the
minimum allowable value, so align can be
zero whenever no special alignment is required.
name identifies the cache for statistics and
debugging. constructor is a function that
constructs (that is, performs the one-time ini-
tialization of) objects in the cache; destruc-
tor undoes this, if applicable. The construc-
tor and destructor take a size argument so
that they can support families of similar
caches, e.g. streams messages.
kmem_cache_create returns an opaque
descriptor for accessing the cache.
Next, it follows from (B) that clients should need
just two simple functions to allocate and free
objects:
(2) void *kmem_cache_alloc(
        struct kmem_cache *cp,
        int flags);
Gets an object from the cache. The object will
be in its constructed state. flags is either
KM_SLEEP or KM_NOSLEEP, indicating
whether it’s acceptable to wait for memory if
none is currently available.
(3) void kmem_cache_free(
        struct kmem_cache *cp,
        void *buf);
Returns an object to the cache. The object
must still be in its constructed state.
Finally, if a cache is no longer needed the client can
destroy it:
(4) void kmem_cache_destroy(
        struct kmem_cache *cp);
Destroys the cache and reclaims all associated
resources. All allocated objects must have
been returned to the cache.
This interface allows us to build a flexible allocator
that is ideally suited to the needs of its clients. In
this sense it is a ‘‘custom’’ allocator. However, it
does not have to be built with compile-time
knowledge of its clients as most custom allocators
do [Bozman84A, Grunwald93A, Margolin71], nor
does it have to keep guessing as in the adaptive-fit
methods [Bozman84B, Leverett82, Oldehoeft85].
Rather, the object-cache interface allows clients to
specify the allocation services they need on the fly.
2.4. An Example
This example demonstrates the use of object cach-
ing for the ‘‘foo’’ objects introduced in Section 2.1.
The constructor and destructor routines are:
void
foo_constructor(void *buf, size_t size)
{
    struct foo *foo = buf;

    mutex_init(&foo->foo_lock, ...);
    cv_init(&foo->foo_cv, ...);
    foo->foo_refcnt = 0;
    foo->foo_barlist = NULL;
}

void
foo_destructor(void *buf, size_t size)
{
    struct foo *foo = buf;

    ASSERT(foo->foo_barlist == NULL);
    ASSERT(foo->foo_refcnt == 0);
    cv_destroy(&foo->foo_cv);
    mutex_destroy(&foo->foo_lock);
}
To create the foo cache:
    foo_cache = kmem_cache_create("foo_cache",
        sizeof (struct foo), 0,
        foo_constructor, foo_destructor);
To allocate, use, and free a foo object:
foo = kmem_cache_alloc(foo_cache, KM_SLEEP);
use foo;
kmem_cache_free(foo_cache, foo);
This makes foo allocation fast, because the
allocator will usually do nothing more than fetch an
already-constructed foo from the cache.
foo_constructor and foo_destructor
will be invoked only to populate and drain the
cache, respectively.
The example above illustrates a beneficial
side-effect of object caching: it reduces the
instruction-cache footprint of the code that uses
cached objects by moving the rarely-executed con-
struction and destruction code out of the hot path.
3. Slab Allocator Implementation
This section describes the implementation of the
SunOS 5.4 kernel memory allocator, or ‘‘slab allo-
cator,’’ in detail. (The name derives from one of
the allocator’s main data structures, the slab. The
name stuck within Sun because it was more distinc-
tive than ‘‘object’’ or ‘‘cache.’’ Slabs will be dis-
cussed in Section 3.2.)
The terms object, buffer, and chunk will be
used more or less interchangeably, depending on
how we’re viewing that piece of memory at the
moment.
3.1. Caches
Each cache has a front end and a back end, which are
designed to be as decoupled as possible:

    [Figure: cache structure — the back end
    (kmem_cache_grow, kmem_cache_reap) exchanges memory
    with the VM system; the front end (kmem_cache_alloc,
    kmem_cache_free) exchanges objects with clients.]
The front end is the public interface to the
allocator. It moves objects to and from the cache,
calling into the back end when it needs more
objects.
The back end manages the flow of real
memory through the cache. The influx routine
(kmem_cache_grow()) gets memory from the
VM system, makes objects out of it, and feeds those
objects into the cache. The outflux routine
(kmem_cache_reap()) is invoked by the VM
system when it wants some of that memory back —
e.g., at the onset of paging. Note that all back-end
activity is triggered solely by memory pressure.
Memory flows in when the cache needs more
objects and flows back out when the rest of the sys-
tem needs more pages; there are no arbitrary limits
or watermarks. Hysteresis control is provided by a
working-set algorithm, described in Section 3.4.
The slab allocator is not a monolithic entity,
but rather is a loose confederation of independent
caches. The caches have no shared state, so the
allocator can employ per-cache locking instead of
protecting the entire arena (kernel heap) with one
global lock. Per-cache locking improves scalability
by allowing any number of distinct caches to be
accessed simultaneously.
Each cache maintains its own statistics —
total allocations, number of allocated and free
buffers, etc. These per-cache statistics provide
insight into overall system behavior. They indicate
which parts of the system consume the most
memory and help to identify memory leaks. They
also indicate the activity level in various subsys-
tems, to the extent that allocator traffic is an accu-
rate approximation. (Streams message allocation is
a direct measure of streams activity, for example.)
The slab allocator is operationally similar to
the ‘‘CustoMalloc’’ [Grunwald93A], ‘‘QuickFit’’
[Weinstock88], and ‘‘Zone’’ [VanSciver88] alloca-
tors, all of which maintain distinct freelists of the
most commonly requested buffer sizes. The
Grunwald and Weinstock papers each demonstrate
that a customized segregated-storage allocator —
one that has a priori knowledge of the most com-
mon allocation sizes — is usually optimal in both
space and time. The slab allocator is in this
category, but has the advantage that its customiza-
tions are client-driven at run time rather than being
hard-coded at compile time. (This is also true of
the Zone allocator.)
The standard non-caching allocation routines,
kmem_alloc(9F) and kmem_free(9F), use
object caches internally. At startup, the system
creates a set of about 30 caches ranging in size
from 8 bytes to 9K in roughly 10-20% increments.
kmem_alloc() simply performs a
kmem_cache_alloc() from the nearest-size
cache. Allocations larger than 9K, which are rare,
are handled directly by the back-end page supplier.
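The nearest-size lookup can be sketched as a simple table scan. The cache sizes and function names below are illustrative only; the actual SunOS table holds about 30 sizes from 8 bytes to 9K:

```c
#include <stddef.h>

/* Illustrative size-class table; the real allocator uses ~30 caches
 * from 8 bytes to 9K in roughly 10-20% increments. */
#define NCACHES 8
static const size_t cache_size[NCACHES] = {
    8, 16, 32, 64, 128, 256, 512, 1024
};

/* Index of the smallest cache whose objects can hold `size` bytes,
 * or -1 if the request exceeds the largest cache (in which case the
 * request would go to the back-end page supplier). */
static int
nearest_cache(size_t size)
{
    int i;

    for (i = 0; i < NCACHES; i++)
        if (size <= cache_size[i])
            return i;
    return -1;
}
```

kmem_alloc() would then simply index a cache array with the result and call kmem_cache_alloc() on it.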
3.2. Slabs
The slab is the primary unit of currency in the slab
allocator. When the allocator needs to grow a
cache, for example, it acquires an entire slab of
objects at once. Similarly, the allocator reclaims
unused memory (shrinks a cache) by relinquishing a
complete slab.
A slab consists of one or more pages of virtu-
ally contiguous memory carved up into equal-size
chunks, with a reference count indicating how many
of those chunks have been allocated. The benefits
of using this simple data structure to manage the
arena are somewhat striking:
(1) Reclaiming unused memory is trivial. When
the slab reference count goes to zero the associated
pages can be returned to the VM system. Thus a
simple reference count replaces the complex trees,
bitmaps, and coalescing algorithms found in most
other allocators [Knuth68, Korn85, Standish80].
(2) Allocating and freeing memory are fast,
constant-time operations. All we have to do is
move an object to or from a freelist and update a
reference count.
(3) Severe external fragmentation (unused
buffers on the freelist) is unlikely. Over time,
many allocators develop an accumulation of small,
unusable buffers. This occurs as the allocator splits
existing free buffers to satisfy smaller requests. For
example, the right sequence of 32-byte and 40-byte
allocations may result in a large accumulation of
free 8-byte buffers — even though no 8-byte buffers
are ever requested [Standish80]. A segregated-
storage allocator cannot suffer this fate, since the
only way to populate its 8-byte freelist is to actually
allocate and free 8-byte buffers. Any sequence of
32-byte and 40-byte allocations — no matter how
complex — can only result in population of the 32-
byte and 40-byte freelists. Since prior allocation is
a good predictor of future allocation [Weinstock88]
these buffers are likely to be used again.
The other reason that slabs reduce external fragmen-
tation is that all objects in a slab are of the same
type, so they have the same lifetime distribution.*
The resulting segregation of short-lived and long-
lived objects at slab granularity reduces the likeli-
hood of an entire page being held hostage due to a
single long-lived allocation [Barrett93, Hanson90].
____________________________________
* The generic caches that back kmem_alloc() are a
notable exception, but they constitute a relatively small fraction
of the arena in SunOS 5.4 — all of the major consumers of
memory now use kmem_cache_alloc().
(4) Internal fragmentation (per-buffer wasted
space) is minimal. Each buffer is exactly the right
size (namely, the cache’s object size), so the only
wasted space is the unused portion at the end of the
slab. For example, assuming 4096-byte pages, the
slabs in a 400-byte object cache would each contain
10 buffers, with 96 bytes left over. We can view
this as the equivalent of 9.6 bytes of wasted space per
400-byte buffer, or 2.4% internal fragmentation.
In general, if a slab contains n buffers, then the
internal fragmentation is at most 1/n; thus the allo-
cator can actually control the amount of internal
fragmentation by controlling the slab size. How-
ever, larger slabs are more likely to cause external
fragmentation, since the probability of being able to
reclaim a slab decreases as the number of buffers
per slab increases. The SunOS 5.4 implementation
limits internal fragmentation to 12.5% (1/8), since
this was found to be the empirical sweet-spot in the
trade-off between internal and external fragmenta-
tion.
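The arithmetic above is easy to check. The helper names below are hypothetical; they assume one-page (4096-byte) slabs and ignore the slab-data overhead discussed in the next section:

```c
/* Buffers that fit in one slab, assuming the whole slab is
 * carved into equal-size chunks (slab-data overhead ignored). */
static unsigned
bufs_per_slab(unsigned slab_bytes, unsigned obj_bytes)
{
    return slab_bytes / obj_bytes;
}

/* Unused bytes at the end of the slab — the only internal
 * fragmentation, since every buffer is exactly the object size. */
static unsigned
slab_waste(unsigned slab_bytes, unsigned obj_bytes)
{
    return slab_bytes % obj_bytes;
}
```

With n buffers per slab, the waste is always less than one object, so the wasted fraction is at most 1/n, matching the bound in the text.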
3.2.1. Slab Layout — Logical
The contents of each slab are managed by a
kmem_slab data structure that maintains the slab’s
linkage in the cache, its reference count, and its list
of free buffers. In turn, each buffer in the slab is
managed by a kmem_bufctl structure that holds
the freelist linkage, buffer address, and a back-
pointer to the controlling slab. Pictorially, a slab
looks like this (bufctl-to-slab back-pointers not
shown):
    [Figure: logical slab layout — one or more pages divided
    into buffers plus any unused space at the end; a
    kmem_bufctl for each buffer carries the freelist linkage,
    and the kmem_slab structure carries the next/prev slab
    linkage in the cache.]
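The two structures might look like the following in C. The paper does not give the exact SunOS 5.4 declarations, so the layouts and field names here are illustrative, following the description above:

```c
struct kmem_slab;                    /* forward declaration */

/* One per buffer: freelist linkage, buffer address, and a
 * back-pointer to the controlling slab. */
struct kmem_bufctl {
    struct kmem_bufctl *bc_next;     /* next free buffer */
    void               *bc_addr;     /* address of the buffer */
    struct kmem_slab   *bc_slab;     /* controlling slab */
};

/* One per slab: linkage in the cache's slab list, the slab's
 * freelist, and its reference count. */
struct kmem_slab {
    struct kmem_slab   *s_next;      /* next slab in cache */
    struct kmem_slab   *s_prev;      /* previous slab in cache */
    struct kmem_bufctl *s_freelist;  /* free buffers in this slab */
    unsigned            s_refcnt;    /* buffers currently allocated */
};
```

When s_refcnt drops to zero the slab's pages can be returned to the VM system, which is the "trivial reclaim" property described above.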
3.2.2. Slab Layout for Small Objects
For objects smaller than 1/8 of a page, a slab is
built by allocating a page, placing the slab data at
the end, and dividing the rest into equal-size
buffers:
    [Figure: small-object slab layout — a single page holding
    equal-size buffers, any unused space, and the kmem_slab
    data at the end of the page.]
Each buffer serves as its own bufctl while on the
freelist. Only the linkage is actually needed, since
everything else is computable. These are essential
optimizations for small buffers — otherwise we
would end up allocating almost as much memory
for bufctls as for the buffers themselves.
The freelist linkage resides at the end of the
buffer, rather than the beginning, to facilitate debug-
ging. This is driven by the empirical observation
that the beginning of a data structure is typically
more active than the end. If a buffer is modified
after being freed, the problem is easier to diagnose
if the heap structure (freelist linkage) is still intact.
The allocator reserves an additional word for
constructed objects so that the linkage doesn’t
overwrite any constructed state.
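Locating the linkage word can be sketched as simple pointer arithmetic. This is an illustrative helper, not an actual allocator routine, and it ignores the extra reserved word mentioned above:

```c
#include <stddef.h>

/* For a free small-object buffer, the freelist pointer occupies
 * the last word of the buffer itself. */
static void **
freelist_linkage(void *buf, size_t bufsize)
{
    return (void **)((char *)buf + bufsize - sizeof (void *));
}
```

Everything else about the buffer (its slab, its cache) is computable from its address, which is why no separate bufctl is needed.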
3.2.3. Slab Layout for Large Objects
The above scheme is efficient for small objects, but
not for large ones. It could fit only one 2K buffer
on a 4K page because of the embedded slab data.
Moreover, with large (multi-page) slabs we lose the
ability to determine the slab data address from the
buffer address. Therefore, for large objects the
physical layout is identical to the logical layout.
The required slab and bufctl data structures come
from their own (small-object!) caches. A per-cache
self-scaling hash table provides buffer-to-bufctl
conversion.
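Such a buffer-to-bufctl lookup might be sketched as follows. This toy version uses a fixed-size chained table with hypothetical names; the real per-cache table is self-scaling:

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal bufctl for illustration: hash chain plus buffer address. */
struct bufctl {
    struct bufctl *bc_next;   /* hash-chain linkage */
    void          *bc_addr;   /* buffer address (the hash key) */
};

#define HASH_BUCKETS 64       /* fixed size; real table grows itself */
static struct bufctl *hash_table[HASH_BUCKETS];

static unsigned
hash_addr(const void *buf)
{
    /* Discard low bits, which are zero for aligned buffers. */
    return (unsigned)(((uintptr_t)buf >> 3) & (HASH_BUCKETS - 1));
}

static void
bufctl_insert(struct bufctl *bcp)
{
    unsigned h = hash_addr(bcp->bc_addr);

    bcp->bc_next = hash_table[h];
    hash_table[h] = bcp;
}

static struct bufctl *
bufctl_lookup(const void *buf)
{
    struct bufctl *bcp;

    for (bcp = hash_table[hash_addr(buf)]; bcp != NULL; bcp = bcp->bc_next)
        if (bcp->bc_addr == buf)
            return bcp;
    return NULL;
}
```

kmem_cache_free() on a large object would use such a lookup to recover the bufctl, and through it the controlling slab.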
3.3. Freeli