The Slab Allocator:
An Object-Caching Kernel Memory Allocator
Jeff Bonwick
Sun Microsystems
Abstract
This paper presents a comprehensive design over-
view of the SunOS 5.4 kernel memory allocator.
This allocator is based on a set of object-caching
primitives that reduce the cost of allocating complex
objects by retaining their state between uses. These
same primitives prove equally effective for manag-
ing stateless memory (e.g. data pages and temporary
buffers) because they are space-efficient and fast.
The allocator’s object caches respond dynamically
to global memory pressure, and employ an object-
coloring scheme that improves the system’s overall
cache utilization and bus balance. The allocator
also has several statistical and debugging features
that can detect a wide range of problems throughout
the system.
1. Introduction
The allocation and freeing of objects are among the
most common operations in the kernel. A fast ker-
nel memory allocator is therefore essential. How-
ever, in many cases the cost of initializing and
destroying the object exceeds the cost of allocating
and freeing memory for it. Thus, while improve-
ments in the allocator are beneficial, even greater
gains can be achieved by caching frequently used
objects so that their basic structure is preserved
between uses.
The paper begins with a discussion of object
caching, since the interface that this requires will
shape the rest of the allocator. The next section
then describes the implementation in detail. Section
4 describes the effect of buffer address distribution
on the system’s overall cache utilization and bus
balance, and shows how a simple coloring scheme
can improve both. Section 5 compares the
allocator’s performance to several other well-known
kernel memory allocators and finds that it is
generally superior in both space and time. Finally,
Section 6 describes the allocator’s debugging
features, which can detect a wide variety of prob-
lems throughout the system.
2. Object Caching
Object caching is a technique for dealing with
objects that are frequently allocated and freed. The
idea is to preserve the invariant portion of an
object’s initial state — its constructed state —
between uses, so it does not have to be destroyed
and recreated every time the object is used. For
example, an object containing a mutex only needs
to have mutex_init() applied once — the first
time the object is allocated. The object can then be
freed and reallocated many times without incurring
the expense of mutex_destroy() and
mutex_init() each time. An object’s embedded
locks, condition variables, reference counts, lists of
other objects, and read-only data all generally qual-
ify as constructed state.
Caching is important because the cost of con-
structing an object can be significantly higher than
the cost of allocating memory for it. For example,
on a SPARCstation-2 running a SunOS 5.4 develop-
ment kernel, the allocator presented here reduced
the cost of allocating and freeing a stream head
from 33 microseconds to 5.7 microseconds. As the
table below illustrates, most of the savings was due
to object caching:
        Stream Head Allocation + Free Costs (μsec)
    ____________________________________________________
                  construction    memory        other
    allocator     + destruction   allocation    init.
    ____________________________________________________
    old               23.6            9.4        1.9
    new                0.0            3.8        1.9
    ____________________________________________________
Caching is particularly beneficial in a mul-
tithreaded environment, where many of the most
frequently allocated objects contain one or more
embedded locks, condition variables, and other con-
structible state.
The design of an object cache is
straightforward:
To allocate an object:

    if (there's an object in the cache)
        take it (no construction required);
    else {
        allocate memory;
        construct the object;
    }

To free an object:

    return it to the cache (no destruction required);

To reclaim memory from the cache:

    take some objects from the cache;
    destroy the objects;
    free the underlying memory;
An object’s constructed state must be initial-
ized only once — when the object is first brought
into the cache. Once the cache is populated, allo-
cating and freeing objects are fast, trivial operations.
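The cache design above can be sketched in C as follows. This is an illustrative, single-threaded sketch only; the names (cache_alloc, cache_free, cache_reap, demo_ctor) are hypothetical and not part of the allocator's interface:

```c
#include <stdlib.h>

struct obj {
    struct obj *next;            /* freelist linkage while cached */
    /* ... object fields and constructed state ... */
};

static struct obj *cache_head;   /* cached, already-constructed objects */
static int constructions;        /* demo: counts one-time constructions */

static void demo_ctor(struct obj *o) { (void)o; constructions++; }
static void demo_dtor(struct obj *o) { (void)o; }

/* To allocate: take a cached object if one exists, else construct. */
static struct obj *
cache_alloc(void (*constructor)(struct obj *))
{
    struct obj *o = cache_head;
    if (o != NULL) {             /* cache hit: no construction */
        cache_head = o->next;
        return o;
    }
    o = malloc(sizeof (*o));     /* cache miss: allocate memory ... */
    if (o != NULL)
        constructor(o);          /* ... and construct it, once */
    return o;
}

/* To free: return the object to the cache; no destruction. */
static void
cache_free(struct obj *o)
{
    o->next = cache_head;
    cache_head = o;
}

/* To reclaim: destroy cached objects and free their memory. */
static void
cache_reap(void (*destructor)(struct obj *))
{
    while (cache_head != NULL) {
        struct obj *o = cache_head;
        cache_head = o->next;
        destructor(o);
        free(o);
    }
}
```

Note that a second allocation after a free reuses the cached object without invoking the constructor again, which is the entire point of the technique.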
2.1. An Example
Consider the following data structure:
struct foo {
    kmutex_t   foo_lock;
    kcondvar_t foo_cv;
    struct bar *foo_barlist;
    int        foo_refcnt;
};
Assume that a foo structure cannot be freed until
there are no outstanding references to it
(foo_refcnt == 0) and all of its pending bar
events (whatever they are) have completed
(foo_barlist == NULL). The life cycle of a
dynamically allocated foo would be something like
this:
    foo = kmem_alloc(sizeof (struct foo), KM_SLEEP);
    mutex_init(&foo->foo_lock, ...);
    cv_init(&foo->foo_cv, ...);
    foo->foo_refcnt = 0;
    foo->foo_barlist = NULL;

    use foo;

    ASSERT(foo->foo_barlist == NULL);
    ASSERT(foo->foo_refcnt == 0);
    cv_destroy(&foo->foo_cv);
    mutex_destroy(&foo->foo_lock);
    kmem_free(foo);
Notice that between each use of a foo object we
perform a sequence of operations that constitutes
nothing more than a very expensive no-op. All of
this overhead (i.e., everything other than ‘‘use foo’’
above) can be eliminated by object caching.
2.2. The Case for Object Caching in the
Central Allocator
Of course, object caching can be implemented
without any help from the central allocator — any
subsystem can have a private implementation of the
algorithm described above. However, there are
several disadvantages to this approach:
(1) There is a natural tension between an object
cache, which wants to keep memory, and the
rest of the system, which wants that memory
back. Privately-managed caches cannot handle
this tension sensibly. They have limited
insight into the system’s overall memory needs
and no insight into each other’s needs. Simi-
larly, the rest of the system has no knowledge
of the existence of these caches and hence has
no way to ‘‘pull’’ memory from them.
(2) Since private caches bypass the central alloca-
tor, they also bypass any accounting mechan-
isms and debugging features that allocator may
possess. This makes the operating system
more difficult to monitor and debug.
(3) Having many instances of the same solution to
a common problem increases kernel code size
and maintenance costs.
Object caching requires a greater degree of coopera-
tion between the allocator and its clients than the
standard kmem_alloc(9F)/kmem_free(9F)
interface allows. The next section develops an
interface to support constructed object caching in
the central allocator.
2.3. Object Cache Interface
The interface presented here follows from two
observations:
(A) Descriptions of objects (name, size, alignment,
constructor, and destructor) belong in the
clients — not in the central allocator. The
allocator should not just ‘‘know’’ that
sizeof (struct inode) is a useful pool
size, for example. Such assumptions are brittle
[Grunwald93A] and cannot anticipate the needs
of third-party device drivers, streams modules
and file systems.
(B) Memory management policies belong in the
central allocator — not in its clients. The
clients just want to allocate and free objects
quickly. They shouldn’t have to worry about
how to manage the underlying memory
efficiently.
It follows from (A) that object cache creation must
be client-driven and must include a full specification
of the objects:
(1) struct kmem_cache *kmem_cache_create(
        char *name,
        size_t size,
        int align,
        void (*constructor)(void *, size_t),
        void (*destructor)(void *, size_t));
Creates a cache of objects, each of size size,
aligned on an align boundary. The align-
ment will always be rounded up to the
minimum allowable value, so align can be
zero whenever no special alignment is required.
name identifies the cache for statistics and
debugging. constructor is a function that
constructs (that is, performs the one-time ini-
tialization of) objects in the cache; destruc-
tor undoes this, if applicable. The construc-
tor and destructor take a size argument so
that they can support families of similar
caches, e.g. streams messages.
kmem_cache_create returns an opaque
descriptor for accessing the cache.
Next, it follows from (B) that clients should need
just two simple functions to allocate and free
objects:
(2) void *kmem_cache_alloc(
        struct kmem_cache *cp,
        int flags);
Gets an object from the cache. The object will
be in its constructed state. flags is either
KM_SLEEP or KM_NOSLEEP, indicating
whether it’s acceptable to wait for memory if
none is currently available.
(3) void kmem_cache_free(
        struct kmem_cache *cp,
        void *buf);
Returns an object to the cache. The object
must still be in its constructed state.
Finally, if a cache is no longer needed the client can
destroy it:
(4) void kmem_cache_destroy(
        struct kmem_cache *cp);
Destroys the cache and reclaims all associated
resources. All allocated objects must have
been returned to the cache.
This interface allows us to build a flexible allocator
that is ideally suited to the needs of its clients. In
this sense it is a ‘‘custom’’ allocator. However, it
does not have to be built with compile-time
knowledge of its clients as most custom allocators
do [Bozman84A, Grunwald93A, Margolin71], nor
does it have to keep guessing as in the adaptive-fit
methods [Bozman84B, Leverett82, Oldehoeft85].
Rather, the object-cache interface allows clients to
specify the allocation services they need on the fly.
2.4. An Example
This example demonstrates the use of object cach-
ing for the ‘‘foo’’ objects introduced in Section 2.1.
The constructor and destructor routines are:
void
foo_constructor(void *buf, size_t size)
{
    struct foo *foo = buf;

    mutex_init(&foo->foo_lock, ...);
    cv_init(&foo->foo_cv, ...);
    foo->foo_refcnt = 0;
    foo->foo_barlist = NULL;
}

void
foo_destructor(void *buf, size_t size)
{
    struct foo *foo = buf;

    ASSERT(foo->foo_barlist == NULL);
    ASSERT(foo->foo_refcnt == 0);
    cv_destroy(&foo->foo_cv);
    mutex_destroy(&foo->foo_lock);
}
To create the foo cache:
    foo_cache = kmem_cache_create("foo_cache",
        sizeof (struct foo), 0,
        foo_constructor, foo_destructor);
To allocate, use, and free a foo object:
foo = kmem_cache_alloc(foo_cache, KM_SLEEP);
use foo;
kmem_cache_free(foo_cache, foo);
This makes foo allocation fast, because the
allocator will usually do nothing more than fetch an
already-constructed foo from the cache.
foo_constructor and foo_destructor
will be invoked only to populate and drain the
cache, respectively.
The example above illustrates a beneficial
side-effect of object caching: it reduces the
instruction-cache footprint of the code that uses
cached objects by moving the rarely-executed con-
struction and destruction code out of the hot path.
3. Slab Allocator Implementation
This section describes the implementation of the
SunOS 5.4 kernel memory allocator, or ‘‘slab allo-
cator,’’ in detail. (The name derives from one of
the allocator’s main data structures, the slab. The
name stuck within Sun because it was more distinc-
tive than ‘‘object’’ or ‘‘cache.’’ Slabs will be dis-
cussed in Section 3.2.)
The terms object, buffer, and chunk will be
used more or less interchangeably, depending on
how we’re viewing that piece of memory at the
moment.
3.1. Caches
Each cache has a front end and a back end, which are
designed to be as decoupled as possible:

    [Figure: cache structure — the back end
    (kmem_cache_grow, kmem_cache_reap) exchanges memory
    with the VM system; the front end (kmem_cache_alloc,
    kmem_cache_free) exchanges objects with clients.]
The front end is the public interface to the
allocator. It moves objects to and from the cache,
calling into the back end when it needs more
objects.
The back end manages the flow of real
memory through the cache. The influx routine
(kmem_cache_grow()) gets memory from the
VM system, makes objects out of it, and feeds those
objects into the cache. The outflux routine
(kmem_cache_reap()) is invoked by the VM
system when it wants some of that memory back —
e.g., at the onset of paging. Note that all back-end
activity is triggered solely by memory pressure.
Memory flows in when the cache needs more
objects and flows back out when the rest of the sys-
tem needs more pages; there are no arbitrary limits
or watermarks. Hysteresis control is provided by a
working-set algorithm, described in Section 3.4.
The slab allocator is not a monolithic entity,
but rather is a loose confederation of independent
caches. The caches have no shared state, so the
allocator can employ per-cache locking instead of
protecting the entire arena (kernel heap) with one
global lock. Per-cache locking improves scalability
by allowing any number of distinct caches to be
accessed simultaneously.
Each cache maintains its own statistics —
total allocations, number of allocated and free
buffers, etc. These per-cache statistics provide
insight into overall system behavior. They indicate
which parts of the system consume the most
memory and help to identify memory leaks. They
also indicate the activity level in various subsys-
tems, to the extent that allocator traffic is an accu-
rate approximation. (Streams message allocation is
a direct measure of streams activity, for example.)
The slab allocator is operationally similar to
the ‘‘CustoMalloc’’ [Grunwald93A], ‘‘QuickFit’’
[Weinstock88], and ‘‘Zone’’ [VanSciver88] alloca-
tors, all of which maintain distinct freelists of the
most commonly requested buffer sizes. The
Grunwald and Weinstock papers each demonstrate
that a customized segregated-storage allocator —
one that has a priori knowledge of the most com-
mon allocation sizes — is usually optimal in both
space and time. The slab allocator is in this
category, but has the advantage that its customiza-
tions are client-driven at run time rather than being
hard-coded at compile time. (This is also true of
the Zone allocator.)
The standard non-caching allocation routines,
kmem_alloc(9F) and kmem_free(9F), use
object caches internally. At startup, the system
creates a set of about 30 caches ranging in size
from 8 bytes to 9K in roughly 10-20% increments.
kmem_alloc() simply performs a
kmem_cache_alloc() from the nearest-size
cache. Allocations larger than 9K, which are rare,
are handled directly by the back-end page supplier.
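The nearest-size lookup can be sketched as a simple table scan. The cache sizes and function names below are illustrative only; the actual SunOS table holds about 30 sizes from 8 bytes to 9K:

```c
#include <stddef.h>

/* Illustrative size-class table; the real allocator uses ~30 caches
 * from 8 bytes to 9K in roughly 10-20% increments. */
#define NCACHES 8
static const size_t cache_size[NCACHES] = {
    8, 16, 32, 64, 128, 256, 512, 1024
};

/* Index of the smallest cache whose objects can hold `size` bytes,
 * or -1 if the request exceeds the largest cache (in which case the
 * request would go to the back-end page supplier). */
static int
nearest_cache(size_t size)
{
    int i;

    for (i = 0; i < NCACHES; i++)
        if (size <= cache_size[i])
            return i;
    return -1;
}
```

kmem_alloc() would then simply index a cache array with the result and call kmem_cache_alloc() on it.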
3.2. Slabs
The slab is the primary unit of currency in the slab
allocator. When the allocator needs to grow a
cache, for example, it acquires an entire slab of
objects at once. Similarly, the allocator reclaims
unused memory (shrinks a cache) by relinquishing a
complete slab.
A slab consists of one or more pages of virtu-
ally contiguous memory carved up into equal-size
chunks, with a reference count indicating how many
of those chunks have been allocated. The benefits
of using this simple data structure to manage the
arena are somewhat striking:
(1) Reclaiming unused memory is trivial. When
the slab reference count goes to zero the associated
pages can be returned to the VM system. Thus a
simple reference count replaces the complex trees,
bitmaps, and coalescing algorithms found in most
other allocators [Knuth68, Korn85, Standish80].
(2) Allocating and freeing memory are fast,
constant-time operations. All we have to do is
move an object to or from a freelist and update a
reference count.
(3) Severe external fragmentation (unused
buffers on the freelist) is unlikely. Over time,
many allocators develop an accumulation of small,
unusable buffers. This occurs as the allocator splits
existing free buffers to satisfy smaller requests. For
example, the right sequence of 32-byte and 40-byte
allocations may result in a large accumulation of
free 8-byte buffers — even though no 8-byte buffers
are ever requested [Standish80]. A segregated-
storage allocator cannot suffer this fate, since the
only way to populate its 8-byte freelist is to actually
allocate and free 8-byte buffers. Any sequence of
32-byte and 40-byte allocations — no matter how
complex — can only result in population of the 32-
byte and 40-byte freelists. Since prior allocation is
a good predictor of future allocation [Weinstock88]
these buffers are likely to be used again.
The other reason that slabs reduce external fragmen-
tation is that all objects in a slab are of the same
type, so they have the same lifetime distribution.*
The resulting segregation of short-lived and long-
lived objects at slab granularity reduces the likeli-
hood of an entire page being held hostage due to a
single long-lived allocation [Barrett93, Hanson90].
____________________________________
* The generic caches that back kmem_alloc() are a
notable exception, but they constitute a relatively small fraction
of the arena in SunOS 5.4 — all of the major consumers of
memory now use kmem_cache_alloc().
(4) Internal fragmentation (per-buffer wasted
space) is minimal. Each buffer is exactly the right
size (namely, the cache’s object size), so the only
wasted space is the unused portion at the end of the
slab. For example, assuming 4096-byte pages, the
slabs in a 400-byte object cache would each contain
10 buffers, with 96 bytes left over. We can view
this as the equivalent of 9.6 bytes of wasted space per
400-byte buffer, or 2.4% internal fragmentation.
In general, if a slab contains n buffers, then the
internal fragmentation is at most 1/n; thus the allo-
cator can actually control the amount of internal
fragmentation by controlling the slab size. How-
ever, larger slabs are more likely to cause external
fragmentation, since the probability of being able to
reclaim a slab decreases as the number of buffers
per slab increases. The SunOS 5.4 implementation
limits internal fragmentation to 12.5% (1/8), since
this was found to be the empirical sweet-spot in the
trade-off between internal and external fragmenta-
tion.
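The arithmetic above is easy to check. The helper names below are hypothetical; they assume one-page (4096-byte) slabs and ignore the slab-data overhead discussed in the next section:

```c
/* Buffers that fit in one slab, assuming the whole slab is
 * carved into equal-size chunks (slab-data overhead ignored). */
static unsigned
bufs_per_slab(unsigned slab_bytes, unsigned obj_bytes)
{
    return slab_bytes / obj_bytes;
}

/* Unused bytes at the end of the slab — the only internal
 * fragmentation, since every buffer is exactly the object size. */
static unsigned
slab_waste(unsigned slab_bytes, unsigned obj_bytes)
{
    return slab_bytes % obj_bytes;
}
```

With n buffers per slab, the waste is always less than one object, so the wasted fraction is at most 1/n, matching the bound in the text.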
3.2.1. Slab Layout — Logical
The contents of each slab are managed by a
kmem_slab data structure that maintains the slab’s
linkage in the cache, its reference count, and its list
of free buffers. In turn, each buffer in the slab is
managed by a kmem_bufctl structure that holds
the freelist linkage, buffer address, and a back-
pointer to the controlling slab. Pictorially, a slab
looks like this (bufctl-to-slab back-pointers not
shown):
    [Figure: logical slab layout — one or more pages divided
    into buffers plus any unused space at the end; a
    kmem_bufctl for each buffer carries the freelist linkage,
    and the kmem_slab structure carries the next/prev slab
    linkage in the cache.]
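The two structures might look like the following in C. The paper does not give the exact SunOS 5.4 declarations, so the layouts and field names here are illustrative, following the description above:

```c
struct kmem_slab;                    /* forward declaration */

/* One per buffer: freelist linkage, buffer address, and a
 * back-pointer to the controlling slab. */
struct kmem_bufctl {
    struct kmem_bufctl *bc_next;     /* next free buffer */
    void               *bc_addr;     /* address of the buffer */
    struct kmem_slab   *bc_slab;     /* controlling slab */
};

/* One per slab: linkage in the cache's slab list, the slab's
 * freelist, and its reference count. */
struct kmem_slab {
    struct kmem_slab   *s_next;      /* next slab in cache */
    struct kmem_slab   *s_prev;      /* previous slab in cache */
    struct kmem_bufctl *s_freelist;  /* free buffers in this slab */
    unsigned            s_refcnt;    /* buffers currently allocated */
};
```

When s_refcnt drops to zero the slab's pages can be returned to the VM system, which is the "trivial reclaim" property described above.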
3.2.2. Slab Layout for Small Objects
For objects smaller than 1/8 of a page, a slab is
built by allocating a page, placing the slab data at
the end, and dividing the rest into equal-size
buffers:
    [Figure: small-object slab layout — a single page holding
    equal-size buffers, any unused space, and the kmem_slab
    data at the end of the page.]
Each buffer serves as its own bufctl while on the
freelist. Only the linkage is actually needed, since
everything else is computable. These are essential
optimizations for small buffers — otherwise we
would end up allocating almost as much memory
for bufctls as for the buffers themselves.
The freelist linkage resides at the end of the
buffer, rather than the beginning, to facilitate debug-
ging. This is driven by the empirical observation
that the beginning of a data structure is typically
more active than the end. If a buffer is modified
after being freed, the problem is easier to diagnose
if the heap structure (freelist linkage) is still intact.
The allocator reserves an additional word for
constructed objects so that the linkage doesn’t
overwrite any constructed state.
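Locating the linkage word can be sketched as simple pointer arithmetic. This is an illustrative helper, not an actual allocator routine, and it ignores the extra reserved word mentioned above:

```c
#include <stddef.h>

/* For a free small-object buffer, the freelist pointer occupies
 * the last word of the buffer itself. */
static void **
freelist_linkage(void *buf, size_t bufsize)
{
    return (void **)((char *)buf + bufsize - sizeof (void *));
}
```

Everything else about the buffer (its slab, its cache) is computable from its address, which is why no separate bufctl is needed.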
3.2.3. Slab Layout for Large Objects
The above scheme is efficient for small objects, but
not for large ones. It could fit only one 2K buffer
on a 4K page because of the embedded slab data.
Moreover, with large (multi-page) slabs we lose the
ability to determine the slab data address from the
buffer address. Therefore, for large objects the
physical layout is identical to the logical layout.
The required slab and bufctl data structures come
from their own (small-object!) caches. A per-cache
self-scaling hash table provides buffer-to-bufctl
conversion.
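Such a buffer-to-bufctl lookup might be sketched as follows. This toy version uses a fixed-size chained table with hypothetical names; the real per-cache table is self-scaling:

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal bufctl for illustration: hash chain plus buffer address. */
struct bufctl {
    struct bufctl *bc_next;   /* hash-chain linkage */
    void          *bc_addr;   /* buffer address (the hash key) */
};

#define HASH_BUCKETS 64       /* fixed size; real table grows itself */
static struct bufctl *hash_table[HASH_BUCKETS];

static unsigned
hash_addr(const void *buf)
{
    /* Discard low bits, which are zero for aligned buffers. */
    return (unsigned)(((uintptr_t)buf >> 3) & (HASH_BUCKETS - 1));
}

static void
bufctl_insert(struct bufctl *bcp)
{
    unsigned h = hash_addr(bcp->bc_addr);

    bcp->bc_next = hash_table[h];
    hash_table[h] = bcp;
}

static struct bufctl *
bufctl_lookup(const void *buf)
{
    struct bufctl *bcp;

    for (bcp = hash_table[hash_addr(buf)]; bcp != NULL; bcp = bcp->bc_next)
        if (bcp->bc_addr == buf)
            return bcp;
    return NULL;
}
```

kmem_cache_free() on a large object would use such a lookup to recover the bufctl, and through it the controlling slab.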
3.3. Freeli