Compressed Sensing
David L. Donoho, Member, IEEE
Abstract—Suppose $x$ is an unknown vector in $\mathbf{R}^m$ (a digital
image or signal); we plan to measure $n$ general linear functionals
of $x$ and then reconstruct. If $x$ is known to be compressible
by transform coding with a known transform, and we reconstruct
via the nonlinear procedure defined here, the number of
measurements $n$ can be dramatically smaller than the size $m$.
Thus, certain natural classes of images with $m$ pixels need only
$n = O(m^{1/4} \log^{5/2}(m))$ nonadaptive nonpixel samples for
faithful recovery, as opposed to the usual $m$ pixel samples.
More specifically, suppose $x$ has a sparse representation in
some orthonormal basis (e.g., wavelet, Fourier) or tight frame
(e.g., curvelet, Gabor)—so the coefficients belong to an $\ell_p$ ball
for $0 < p \le 1$. The $N$ most important coefficients in that
expansion allow reconstruction with $\ell_2$ error $O(N^{1/2-1/p})$. It is
possible to design $n = O(N \log(m))$ nonadaptive measurements
allowing reconstruction with accuracy comparable to that attainable
with direct knowledge of the $N$ most important coefficients.
Moreover, a good approximation to those $N$ important coefficients
is extracted from the $n$ measurements by solving a linear
program—Basis Pursuit in signal processing. The nonadaptive
measurements have the character of “random” linear combinations
of basis/frame elements. Our results use the notions of
optimal recovery, of $n$-widths, and information-based complexity.
We estimate the Gel'fand $n$-widths of $\ell_p$ balls in high-dimensional
Euclidean space in the case $0 < p \le 1$, and give a criterion
identifying near-optimal subspaces for Gel'fand $n$-widths. We
show that “most” subspaces are near-optimal, and show that
convex optimization (Basis Pursuit) is a near-optimal way to
extract information derived from these near-optimal subspaces.
Index Terms—Adaptive sampling, almost-spherical sections of
Banach spaces, Basis Pursuit, eigenvalues of random matrices,
Gel'fand $n$-widths, information-based complexity, integrated
sensing and processing, minimum $\ell_1$-norm decomposition,
optimal recovery, Quotient-of-a-Subspace theorem, sparse solution
of linear equations.
I. INTRODUCTION
AS our modern technology-driven civilization acquires and exploits ever-increasing amounts of data, “everyone” now
knows that most of the data we acquire “can be thrown away”
with almost no perceptual loss—witness the broad success of
lossy compression formats for sounds, images, and specialized
technical data. The phenomenon of ubiquitous compressibility
raises very natural questions: why go to so much effort to ac-
quire all the data when most of what we get will be thrown
away? Can we not just directly measure the part that will not
end up being thrown away?
In this paper, we design compressed data acquisition proto-
cols which perform as if it were possible to directly acquire just
the important information about the signals/images—in effect,
not acquiring that part of the data that would eventually just be
“thrown away” by lossy compression. Moreover, the protocols
are nonadaptive and parallelizable; they do not require knowl-
edge of the signal/image to be acquired in advance—other than
knowledge that the data will be compressible—and do not at-
tempt any “understanding” of the underlying object to guide
an active or adaptive sensing strategy. The measurements made
in the compressed sensing protocol are holographic—thus, not
simple pixel samples—and must be processed nonlinearly.
In specific applications, this principle might enable dra-
matically reduced measurement time, dramatically reduced
sampling rates, or reduced use of analog-to-digital converter
resources.
A. Transform Compression Background
Our treatment is abstract and general, but depends on one spe-
cific assumption which is known to hold in many settings of
signal and image processing: the principle of transform sparsity.
We suppose that the object of interest is a vector $x \in \mathbf{R}^m$, which
can be a signal or image with $m$ samples or pixels, and that there
is an orthonormal basis $(\psi_i : i = 1, \ldots, m)$ for $\mathbf{R}^m$ which can
be, for example, an orthonormal wavelet basis, a Fourier basis,
or a local Fourier basis, depending on the application. (As explained
later, the extension to tight frames such as curvelet or
Gabor frames comes for free.) The object has transform coefficients
$\theta_i = \langle x, \psi_i \rangle$, and these are assumed sparse in the sense
that, for some $0 < p < 2$ and for some $R > 0$:

$$\|\theta\|_p \equiv \Big( \sum_i |\theta_i|^p \Big)^{1/p} \le R. \tag{I.1}$$
Such constraints are actually obeyed on natural classes of sig-
nals and images; this is the primary reason for the success of
standard compression tools based on transform coding [1]. To
fix ideas, we mention two simple examples of $\ell_p$ constraint.
• Bounded Variation model for images. Here image brightness
is viewed as an underlying function $f(x, y)$ on the
unit square $[0,1]^2$, which obeys (essentially)

$$\int_{[0,1]^2} |\nabla f| \, dx \, dy \le R.$$

The digital data of interest consist of $m = n^2$ pixel samples
of $f$ produced by averaging over $1/n \times 1/n$ pixels.
We take a wavelet point of view; the data are seen as a superposition
of contributions from various scales. Let $x^{(j)}$
denote the component of the data at scale $j$, and let $(\psi_i^{(j)})$
denote the orthonormal basis of wavelets at scale $j$, containing
$3 \cdot 4^j$ elements. The corresponding coefficients
$\theta^{(j)}$ obey $\|\theta^{(j)}\|_1 \le 4R$.
• Bump Algebra model for spectra. Here a spectrum (e.g.,
mass spectrum or magnetic resonance spectrum) is
modeled as digital samples $f(i/m)$, $i = 1, \ldots, m$, of an underlying
function $f$ on the real line which is a superposition of
so-called spectral lines of varying positions, amplitudes,
and linewidths. Formally

$$f(t) = \sum_{i=1}^{\infty} a_i \, g\!\left( \frac{t - t_i}{s_i} \right).$$

Here the parameters $t_i$ are line locations, $a_i$ are amplitudes/polarities,
and $s_i$ are linewidths, and $g$ represents a
lineshape, for example the Gaussian $g(t) = e^{-t^2/2}$, although other profiles
could be considered. We assume the constraint
$\sum_i |a_i| \le R$, which in applications represents an energy
or total mass constraint. Again we take a wavelet
viewpoint, this time specifically using smooth wavelets.
The data can be represented as a superposition of contributions
from various scales. Let $x^{(j)}$ denote the component
of the spectrum at scale $j$, and let $(\psi_i^{(j)})$ denote
the orthonormal basis of wavelets at scale $j$, containing
$2^j$ elements. The corresponding coefficients again obey
$\|\theta^{(j)}\|_1 \le c \cdot R \cdot 2^{-j/2}$ [2]. (A small numerical sketch of this model follows this list.)
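As a concrete illustration of the Bump Algebra model, the following minimal Python sketch (our own illustration; the line parameters are hypothetical choices, not taken from the paper) synthesizes digital samples of a spectrum as a superposition of Gaussian lineshapes:

```python
import numpy as np

def bump_spectrum(m, locations, amplitudes, linewidths):
    """Sample f(t) = sum_i a_i * g((t - t_i) / s_i) at t = 1/m, ..., m/m,
    using the Gaussian lineshape g(t) = exp(-t^2 / 2)."""
    t = np.arange(1, m + 1) / m
    f = np.zeros(m)
    for t_i, a_i, s_i in zip(locations, amplitudes, linewidths):
        f += a_i * np.exp(-0.5 * ((t - t_i) / s_i) ** 2)
    return f

# Hypothetical spectrum with three lines; sum(|a_i|) = 2.2 plays the role of R.
f = bump_spectrum(m=1024,
                  locations=[0.2, 0.5, 0.8],
                  amplitudes=[1.0, -0.5, 0.7],
                  linewidths=[0.01, 0.03, 0.005])
```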
While in these two examples the $\ell_1$ constraint appeared, other
$\ell_p$ constraints with $p < 1$ can appear naturally as well;
see below. For some readers, the use of $\ell_p$ norms with $p < 1$
may seem initially strange; it is now well understood that the
$\ell_p$ norms with such small $p$ are natural mathematical measures of
sparsity [3], [4]. As $p$ decreases below $1$, more and more sparsity
is being required. Also, from this viewpoint, an $\ell_2$ constraint
based on $p = 2$ requires no sparsity at all.
Note that in each of these examples, we also allowed for sep-
arating the object of interest into subbands, each one of which
obeys an $\ell_p$ constraint. In practice, in the following we stick
with the view that the object of interest is a coefficient vector
obeying the constraint (I.1), which may mean, from an applica-
tion viewpoint, that our methods correspond to treating various
subbands separately, as in these examples.
The key implication of the $\ell_p$ constraint is sparsity of the
transform coefficients. Indeed, we have trivially that, if $\theta_{(N)}$
denotes the vector $\theta$ with everything except the $N$ largest coefficients
set to $0$, then

$$\|\theta - \theta_{(N)}\|_2 \le \zeta_{2,p} \cdot \|\theta\|_p \cdot (N+1)^{1/2 - 1/p}, \qquad N = 0, 1, 2, \ldots \tag{I.2}$$

for $p < 2$, with a constant $\zeta_{2,p}$ depending only on
$p \in (0,2)$. Thus, for example, to approximate $\theta$ with $\ell_2$ error $\epsilon$, we
need to keep only the $N \asymp \epsilon^{2p/(p-2)}$ biggest terms in $\theta$.
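To make the thresholding step concrete, here is a minimal numpy sketch (our own illustration, not from the paper) of the best $N$-term approximation and the error decay in (I.2):

```python
import numpy as np

def best_N_term(theta, N):
    """Zero out everything except the N largest-magnitude coefficients."""
    theta_N = np.zeros_like(theta)
    keep = np.argsort(np.abs(theta))[-N:]   # indices of the N biggest entries
    theta_N[keep] = theta[keep]
    return theta_N

# A compressible coefficient vector: power-law decay |theta_(i)| = i^(-1/p)
# puts theta (essentially) at the edge of an ell_p ball.
m, p = 10_000, 0.75
rng = np.random.default_rng(1)
theta = rng.choice([-1.0, 1.0], size=m) * np.arange(1.0, m + 1) ** (-1.0 / p)

for N in (10, 100, 1000):
    err = np.linalg.norm(theta - best_N_term(theta, N))
    print(N, err)   # decays roughly like N^(1/2 - 1/p), as in (I.2)
```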
B. Optimal Recovery/Information-Based Complexity
Background
Our question now becomes: if $x$ is an unknown signal whose
transform coefficient vector $\theta(x)$ obeys (I.1), can we make a
reduced number $n \ll m$ of measurements which will allow
faithful reconstruction of $x$? Such questions have been discussed
(for other types of assumptions about $x$) under the names of
Optimal Recovery [5] and Information-Based Complexity [6];
we now adopt their viewpoint, and partially adopt their nota-
tion, without making a special effort to be really orthodox. We
use “OR/IBC” as a generic label for work taking place in those
fields, admittedly being less than encyclopedic about various
scholarly contributions.
We have a class $\mathcal{X}$ of possible objects of interest, and are
interested in designing an information operator $I_n : \mathcal{X} \mapsto \mathbf{R}^n$
that samples $n$ pieces of information about $x$, and an algorithm
$A_n : \mathbf{R}^n \mapsto \mathbf{R}^m$ that offers an approximate reconstruction of $x$.
Here the information operator takes the form

$$I_n(x) = (\langle \xi_1, x \rangle, \ldots, \langle \xi_n, x \rangle)$$

where the $\xi_i$ are sampling kernels, not necessarily sampling
pixels or other simple features of the signal; however, they are
nonadaptive, i.e., fixed independently of $x$. The algorithm $A_n$ is
an unspecified, possibly nonlinear reconstruction operator.
We are interested in the $\ell_2$ error of reconstruction and in the
behavior of optimal information and optimal algorithms. Hence,
we consider the minimax $\ell_2$ error as a standard of comparison

$$E_n(\mathcal{X}) = \inf_{A_n, I_n} \sup_{x \in \mathcal{X}} \|x - A_n(I_n(x))\|_2.$$
So here, all possible methods of nonadaptive linear sampling are
allowed, and all possible methods of reconstruction are allowed.
In our application, the class $\mathcal{X}$ of objects of interest is the set
of objects $x = \Psi \theta$ where $\theta = \theta(x)$ obeys (I.1) for a given $p$
and $R$. Denote then

$$\mathcal{X}_{p,m}(R) = \{ x : \|\theta(x)\|_p \le R \}.$$

Our goal is to evaluate $E_n(\mathcal{X}_{p,m}(R))$ and to have practical
schemes which come close to attaining it.
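In computational terms, nonadaptive information is nothing more than multiplication by a fixed $n \times m$ matrix whose rows are the sampling kernels $\xi_i$, and an algorithm is any map from $\mathbf{R}^n$ back to $\mathbf{R}^m$. A minimal Python sketch of this interface (our own framing; the Gaussian kernels and least-squares reconstruction are placeholder choices, not the construction of this paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 64, 1024

# Nonadaptive information operator: n fixed sampling kernels xi_i,
# applied as inner products <xi_i, x>; nothing here depends on x.
Xi = rng.standard_normal((n, m))

def information(x):
    return Xi @ x

# An "algorithm" is any (possibly nonlinear) map R^n -> R^m. The minimum
# ell_2-norm solution below is a naive placeholder, not the paper's method.
def naive_reconstruction(y):
    return np.linalg.lstsq(Xi, y, rcond=None)[0]
```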
C. Four Surprises
Here is the main quantitative phenomenon of interest for this
paper.
Theorem 1: Let $(n, m_n)$ be a sequence of problem sizes with
$n \to \infty$, $n < m_n$, and $m_n \sim A n^{\gamma}$, $\gamma > 1$, $A > 0$. Then for
$0 < p \le 1$ there is $C_p = C_p(A, \gamma) > 0$ so that

$$E_n(\mathcal{X}_{p,m_n}(R)) \le C_p \cdot R \cdot (n/\log(m_n))^{1/2 - 1/p}. \tag{I.3}$$
We find this surprising in four ways. First, compare (I.3) with
(I.2). We see that the forms are similar, under the calibration
$N = n/\log(m_n)$. In words: the quality of approximation to $\theta$
which could be gotten by using the $N$ biggest transform coefficients
can be gotten by using the $n \asymp N \log(m_n)$ pieces of
nonadaptive information provided by $I_n$. The surprise is that
we would not know in advance which transform coefficients are
likely to be the important ones in this approximation, yet the
optimal information operator $I_n$ is nonadaptive, depending at
most on the class $\mathcal{X}_{p,m}(R)$ and not on the specific object. In
some sense this nonadaptive information is just as powerful as
knowing the $N$ best transform coefficients.
This seems even more surprising when we note that for objects
$x \in \mathcal{X}_{p,m}(R)$, the transform representation is the optimal
one: no other representation can do as well at characterizing
$x$ by a few coefficients [3], [7]. Surely then, one imagines,
the sampling kernels $\xi_i$ underlying the optimal information
operator must be simply measuring individual transform coef-
ficients? Actually, no: the information operator is measuring
very complex “holographic” functionals which in some sense
mix together all the coefficients in a big soup. Compare (VI.1)
below. (Holography is a process where a three-dimensional
(3-D) image generates by interferometry a two-dimensional
(2-D) transform. Each value in the 2-D transform domain is
influenced by each part of the whole 3-D object. The 3-D
object can later be reconstructed by interferometry from all
or even a part of the 2-D transform domain. Leaving aside
the specifics of 2-D/3-D and the process of interferometry, we
perceive an analogy here, in which an object is transformed to a
compressed domain, and each compressed domain component
is a combination of all parts of the original object.)
Another surprise is that, if we enlarged our class of information
operators to allow adaptive ones, e.g., operators in which
certain measurements are made in response to earlier measurements,
we could scarcely do better. Define the minimax error
under adaptive information $E_n^{\mathrm{Adapt}}$ allowing adaptive operators

$$I_n^{\mathrm{Adapt}}(x) = (\langle \xi_1, x \rangle, \langle \xi_2^{(x)}, x \rangle, \ldots, \langle \xi_n^{(x)}, x \rangle)$$

where, for $i \ge 2$, each kernel $\xi_i^{(x)}$ is allowed to depend on the
information $\langle \xi_j^{(x)}, x \rangle$, $j < i$, gathered at previous stages.
Formally setting

$$E_n^{\mathrm{Adapt}}(\mathcal{X}) = \inf_{A_n, I_n^{\mathrm{Adapt}}} \sup_{x \in \mathcal{X}} \|x - A_n(I_n^{\mathrm{Adapt}}(x))\|_2$$

we have the following.
Theorem 2: For $0 < p \le 1$, there is $C_p > 0$ so that for all $R > 0$

$$E_n(\mathcal{X}_{p,m}(R)) \le 2^{1/p} \cdot E_n^{\mathrm{Adapt}}(\mathcal{X}_{p,m}(R)).$$
So adaptive information is of minimal help—despite the quite
natural expectation that an adaptive method ought to be able
to iteratively somehow “localize” and then “close in” on the “big
coefficients.”
An additional surprise is that, in the already-interesting case
$p = 1$, Theorems 1 and 2 are easily derivable from known results
in OR/IBC and approximation theory! However, the derivations
are indirect; so although they have what seem to the author to be
fairly important implications, very little seems known at present
about good nonadaptive information operators or about concrete
algorithms matched to them.
Our goal in this paper is to give direct arguments which cover
the case $0 < p \le 1$ of highly compressible objects, to give
direct information about near-optimal information operators and
about concrete, computationally tractable algorithms for using
this near-optimal information.
D. Geometry and Widths
From our viewpoint, the phenomenon described in The-
orem 1 concerns the geometry of high-dimensional convex and
nonconvex “balls.” To see the connection, note that the class
$\mathcal{X}_{p,m}(R)$ is the image, under orthogonal transformation, of
an $\ell_p$ ball. If $p = 1$ this is convex and symmetric about the
origin, as well as being orthosymmetric with respect to the axes
provided by the wavelet basis; if $p < 1$, this is again symmetric
about the origin and orthosymmetric, while not being convex,
but still star-shaped.
To develop this geometric viewpoint further, we consider two
notions of $n$-width; see [5].
Definition 1.1: The Gel'fand $n$-width of $\mathcal{X}$ with respect to
the $\ell_2$ norm is defined as

$$d^n(\mathcal{X}; \ell_2) = \inf_{V_n} \sup\{ \|x\|_2 : x \in V_n^{\perp} \cap \mathcal{X} \}$$

where the infimum is over $n$-dimensional linear subspaces $V_n$ of
$\mathbf{R}^m$, and $V_n^{\perp}$ denotes the orthocomplement of $V_n$ with respect
to the standard Euclidean inner product.
In words, we look for a subspace $V_n$ such that “trapping” $x$
in its orthocomplement $V_n^{\perp}$ causes $\|x\|_2$ to be small. Our interest in Gel'fand
$n$-widths derives from an equivalence between optimal recovery
for nonadaptive information and such $n$-widths, well known in
the case $p = 1$ [5], and in the present setting extending as
follows.

Theorem 3: For $0 < p \le 1$ and $R > 0$

$$E_n(\mathcal{X}_{p,m}(R)) = d^n(\mathcal{X}_{p,m}(R); \ell_2), \qquad p = 1 \tag{I.4}$$

$$d^n(\mathcal{X}_{p,m}(R); \ell_2) \le E_n(\mathcal{X}_{p,m}(R)) \le 2^{1/p} \cdot d^n(\mathcal{X}_{p,m}(R); \ell_2), \qquad 0 < p < 1. \tag{I.5}$$
Thus the Gel'fand $n$-widths either exactly or nearly equal the
value of optimal information. Ultimately, the bracketing with
constant $2^{1/p}$ will be for us just as good as equality, owing to
the unspecified constant factors in (I.3). We will typically only
be interested in near-optimal performance, i.e., in obtaining $E_n$
to within constant factors.
It is relatively rare to see the Gel'fand $n$-widths studied
directly [8]; more commonly, one sees results about the
Kolmogorov $n$-widths.
Definition 1.2: Let $\mathcal{X} \subset \mathbf{R}^m$ be a bounded set. The
Kolmogorov $n$-width of $\mathcal{X}$ with respect to the $\ell_2$ norm is defined
as

$$d_n(\mathcal{X}; \ell_2) = \inf_{V_n} \sup_{x \in \mathcal{X}} \inf_{y \in V_n} \|x - y\|_2$$

where the infimum is over $n$-dimensional linear subspaces $V_n$
of $\mathbf{R}^m$.

In words, $d_n$ measures the quality of approximation of $\mathcal{X}$
possible by $n$-dimensional subspaces $V_n$.
In the case $p = 1$, there is an important duality relationship
between Kolmogorov widths and Gel'fand widths which allows
us to infer properties of $d^n$ from published results on $d_n$. To
state it, let $d_n(\mathcal{X}; \ell_q)$ be defined in the obvious way, based on
approximation in $\ell_q$ rather than $\ell_2$ norm. Also, for given $p \ge 1$
and $q \ge 1$, let $p'$ and $q'$ be the standard dual indices
$p' = p/(p-1)$, $q' = q/(q-1)$. Also, let $b_{p,m}$ denote the standard unit
ball of $\ell_p^m$. Then [8]

$$d^n(b_{p,m}; \ell_q) = d_n(b_{q',m}; \ell_{p'}). \tag{I.6}$$

In particular

$$d^n(b_{1,m}; \ell_2) = d_n(b_{2,m}; \ell_{\infty}).$$
The asymptotic properties of the left-hand side have been de-
termined by Garnaev and Gluskin [9]. This follows major work
by Kashin [10], who developed a slightly weaker version of this
result in the course of determining the Kolmogorov -widths of
Sobolev spaces. See the original papers, or Pinkus’s book [8]
for more details.
Theorem 4: (Kashin, Garnaev, and Gluskin (KGG)) For
all $0 < n < m$

$$d^n(b_{1,m}; \ell_2) \asymp \min\!\left(1, \ \sqrt{\frac{\log(m/n) + 1}{n}}\,\right).$$
Theorem 1 now follows in the case $p = 1$ by applying KGG
with the duality formula (I.6) and the equivalence formula (I.4).
The case $p < 1$ of Theorem 1 does not allow use of duality
and the whole range $0 < p \le 1$ is approached differently in this
paper.
E. Mysteries …
Because of the indirect manner by which the KGG result im-
plies Theorem 1, we really do not learn much about the phenom-
enon of interest in this way. The arguments of Kashin, Garnaev,
and Gluskin show that there exist near-optimal $n$-dimensional
subspaces for the Kolmogorov widths; they arise as the null
spaces of certain matrices with entries $\pm 1$ which are known to
exist by counting the number of matrices lacking certain properties,
the total number of matrices with $\pm 1$ entries, and comparing.
The interpretability of this approach is limited.
The implicitness of the information operator is matched
by the abstractness of the reconstruction algorithm. Based on
OR/IBC theory we know that the so-called central algorithm
is optimal. This “algorithm” asks us to consider, for given
information $y_n = I_n(x)$, the collection of all objects $x \in \mathcal{X}$ which
could have given rise to the data $y_n$:

$$\hat{\mathcal{X}}_{y_n} = \{ x \in \mathcal{X} : I_n(x) = y_n \}.$$

Defining now the center of a set $S$,

$$\mathrm{center}(S) = \underset{c}{\arg\min}\ \sup_{x \in S} \|x - c\|_2,$$

the central algorithm is

$$\hat{x}_n^{c} = \mathrm{center}\big(\hat{\mathcal{X}}_{y_n}\big)$$

and it obeys, when the information $I_n$ is optimal,

$$\sup_{x \in \mathcal{X}} \|x - \hat{x}_n^{c}\|_2 = E_n(\mathcal{X});$$

see Section III below.
This abstract viewpoint unfortunately does not translate into
a practical approach (at least in the case of the $\mathcal{X}_{p,m}(R)$,
$0 < p \le 1$). The set $\hat{\mathcal{X}}_{y_n}$ is a section of the $\ell_p$ ball
$\mathcal{X}_{p,m}(R)$, and finding the center of this section does not correspond
to a standard tractable computational problem. Moreover,
this assumes we know $p$ and $R$, which would typically not be
the case.
F. Results
Our paper develops two main types of results.
• Near-Optimal Information. We directly consider the
problem of near-optimal subspaces for Gel'fand $n$-widths
of $\mathcal{X}_{p,m}(R)$, and introduce three structural conditions
(CS1–CS3) on an $n$-by-$m$ matrix which imply that its
nullspace is near-optimal. We show that the vast majority
of $n$-subspaces of $\mathbf{R}^m$ are near-optimal, and random
sampling yields near-optimal information operators with
overwhelmingly high probability.
• Near-Optimal Algorithm. We study a simple nonlinear reconstruction
algorithm: simply minimize the $\ell_1$ norm of
the coefficients subject to satisfying the measurements.
This has been studied in the signal processing literature
under the name Basis Pursuit; it can be computed by linear
programming. We show that this method gives near-optimal
results for all $0 < p \le 1$.
In short, we provide a large supply of near-optimal infor-
mation operators and a near-optimal reconstruction procedure
based on linear programming, which, perhaps unexpectedly,
works even for the nonconvex case $0 < p < 1$.
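For completeness, we recall the textbook reduction of $\ell_1$ minimization to a linear program (standard material, not specific to the construction of this paper): splitting the coefficient vector into nonnegative parts $\theta = u - v$, $u, v \ge 0$, gives

$$\min_{\theta} \|\theta\|_1 \ \text{ subject to } \ A\theta = y \quad\Longleftrightarrow\quad \min_{u, v \ge 0} \mathbf{1}^T (u + v) \ \text{ subject to } \ A(u - v) = y$$

since at any optimum $u$ and $v$ have disjoint supports, so that $\mathbf{1}^T(u+v) = \|u - v\|_1$.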
For a taste of the type of result we obtain, consider a specific
information/algorithm combination.
• CS Information. Let $\Phi$ be an $n \times m$ matrix generated by
randomly sampling the columns, with different columns
independent and identically distributed (i.i.d.) random
uniform on the sphere $S^{n-1}$. With overwhelming probability for
large $n$, $\Phi$ has properties CS1–CS3 discussed in detail
in Section II-A below; assume we have achieved such a
favorable draw. Let $\Psi$ be the $m \times m$ basis matrix with
basis vector $\psi_i$ as the $i$th column. The CS Information
operator is the $n \times m$ matrix $\Phi \Psi^T$.

• $\ell_1$-minimization. To reconstruct from CS Information, we
solve the convex optimization problem

$$(L_1) \qquad \min \|\Psi^T x\|_1 \ \text{ subject to } \ y_n = \Phi \Psi^T x.$$

In words, we look for the object $\hat{x}_1$ having coefficients
with smallest $\ell_1$ norm that is consistent with the information
$y_n$. (An end-to-end numerical sketch follows this list.)
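The following Python sketch (our own illustration; the dimensions are hypothetical and we take $\Psi = I$ so that $\theta = x$) draws a matrix with columns i.i.d. uniform on the sphere $S^{n-1}$ and solves $(L_1)$ via the linear-programming reformulation recalled above:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m, k = 80, 256, 8

# Columns i.i.d. uniform on the sphere S^{n-1}: normalize Gaussian vectors.
Phi = rng.standard_normal((n, m))
Phi /= np.linalg.norm(Phi, axis=0)

# A k-sparse test object (Psi = I, so the coefficients theta equal x).
theta = np.zeros(m)
support = rng.choice(m, size=k, replace=False)
theta[support] = rng.standard_normal(k)
y = Phi @ theta                      # the n measurements

# Basis Pursuit: min ||theta||_1 s.t. Phi theta = y, as an LP in (u, v) >= 0
# with theta = u - v and objective 1^T (u + v).
c = np.ones(2 * m)
A_eq = np.hstack([Phi, -Phi])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None))
theta_hat = res.x[:m] - res.x[m:]
print(np.linalg.norm(theta_hat - theta))   # near 0: recovery up to tolerance
```

With these proportions ($n = 80$ measurements, $k = 8$ nonzeros, $m = 256$ unknowns), the recovery is typically exact up to solver tolerance.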
To evaluate the quality of an information operator $I_n$, set

$$E_n(I_n, \mathcal{X}) = \inf_{A_n} \sup_{x \in \mathcal{X}} \|x - A_n(I_n(x))\|_2.$$

To evaluate the quality of a combined algorithm/information
pair $(A_n, I_n)$, set

$$E_n(A_n, I_n, \mathcal{X}) = \sup_{x \in \mathcal{X}} \|x - A_n(I_n(x))\|_2.$$
Theorem 5: Let $(n, m_n)$ be a sequence of problem sizes
obeying $n < m_n \sim A n^{\gamma}$, $A > 0$, $\gamma \ge 1$; and let $I_n$ be a
corresponding sequence of operators deriving from CS matrices
with underlying parameters $\eta$ and $\rho$ (see Section II below). Let
$0 < p \le 1$. There exists $C = C(p, A, \gamma) > 0$ so that $I_n$
is near-optimal:

$$E_n(I_n, \mathcal{X}_{p,m}(R)) \le C \cdot E_n(\mathcal{X}_{p,m}(R))$$

for $R > 0$, $n > n_0$. Moreover, the algorithm $A_{1,n}$ delivering
the solution to $(L_1)$ is near-optimal:

$$E_n(A_{1,n}, I_n, \mathcal{X}_{p,m}(R)) \le C \cdot E_n(\mathcal{X}_{p,m}(R))$$

for $R > 0$, $n > n_0$.
Thus, for large $n$, we have a simple description of near-optimal
information and a tractable near-optimal reconstruction
algorithm.
G. Potential Applications
To see the potential implications, recall first the Bump Algebra
model for spectra. In this context, our result says that,
for a spectrometer based on the information operator $I_n$ in Theorem
5, it is really only necessary to take $n = O(\sqrt{m}\,\log(m))$
measurements to get an accurate reconstruction of such spectra,
rather than the nominal $m$ measurements. Ho