Density-Based Multifeature Background
Subtraction with Support Vector Machine
Bohyung Han, Member, IEEE, and
Larry S. Davis, Fellow, IEEE
Abstract—Background modeling and subtraction is a natural technique for object
detection in videos captured by a static camera, and also a critical preprocessing
step in various high-level computer vision applications. However, there have not
been many studies concerning useful features and binary segmentation
algorithms for this problem. We propose a pixelwise background modeling and
subtraction technique using multiple features, where generative and discriminative
techniques are combined for classification. In our algorithm, color, gradient, and
Haar-like features are integrated to handle spatio-temporal variations for each
pixel. A pixelwise generative background model is obtained for each feature
efficiently and effectively by Kernel Density Approximation (KDA). Background
subtraction is performed in a discriminative manner using a Support Vector
Machine (SVM) over background likelihood vectors for a set of features. The
proposed algorithm is robust to shadows, illumination changes, and spatial
variations of the background. We compare the performance of the algorithm with other density-
based methods using several different feature combinations and modeling
techniques, both quantitatively and qualitatively.
Index Terms—Background modeling and subtraction, Haar-like features, support
vector machine, kernel density approximation.
1 INTRODUCTION
THE identification of regions of interest is typically the first step in
many computer vision applications, including event detection,
visual surveillance, and robotics. A general object detection
algorithm may be desirable, but it is extremely difficult to properly
handle unknown objects or objects with significant variations in
color, shape, and texture. Therefore, many practical computer
vision systems assume a fixed camera environment, which makes
the object detection process much more straightforward; a back-
ground model is trained with data obtained from empty scenes
and foreground regions are identified using the dissimilarity
between the trained model and new observations. This procedure
is called background subtraction.
Various background modeling and subtraction algorithms have
been proposed [1], [2], [3], [4], [5] which are mostly focused on
modeling methodologies, but potential visual features for effective
modeling have received relatively little attention. The study of new
features for background modeling may overcome or reduce the
limitations of typically used features, and the combination of
several heterogeneous features can improve performance, espe-
cially when they are complementary and uncorrelated. There have
been several studies using texture for background modeling to
handle spatial variations in the scenes; they employ filter
responses, whose computation is typically very costly. Instead of
complex filters, we select efficient Haar-like features [6] and
gradient features to alleviate potential errors in background
subtraction caused by shadow, illumination changes, and spatial
and structural variations.
Model-based approaches involving probability density functions
are common in background modeling and subtraction, and we
employ Kernel Density Approximation (KDA) [3], [7], where a
density function is represented with a compact weighted sum of
Gaussians whose number, weights, means, and covariances are
determined automatically based on the mean-shift mode-finding
algorithm. In our framework, each visual feature is modeled by
KDA independently and every density function is 1D. By utilizing
the properties of the 1D mean-shift mode-finding procedure, the
KDA can be implemented efficiently because we need to compute
the convergence locations for only a small subset of data.
When the background is modeled with probability density
functions, the probabilities of foreground and background pixels
should be discriminative, but this is not always the case. Specifically, the
background probabilities between features may be inconsistent
due to illumination changes, shadow, and foreground objects
similar in features to the background. Also, some features are
highly correlated, e.g., RGB color features. So, we employ a Support
Vector Machine (SVM) for nonlinear classification, which mitigates
the inconsistency and the correlation problem among features. The
final classification between foreground and background is based
on the outputs of the SVM.
There are three important aspects of our algorithm—integration
of multiple features, efficient 1D density estimation by KDA, and
foreground/background classification by SVM. These are coordi-
nated tightly to improve background subtraction performance. An
earlier version of this research appeared in [8]; the current paper
includes more comprehensive analysis of the feature sets and
additional experiments.
2 PREVIOUS WORK
The main objective of background subtraction is to obtain an
effective and efficient background model for foreground object
detection. In the early years, simple statistics, such as frame
differences and median filtering, were used to detect foreground
objects. Some techniques utilized a combination of local statistics [9]
or vector quantization [10] based on intensity or color at each pixel.
More advanced background modeling methods are density-
based, where the background model for each pixel is defined by a
probability density function based on the visual features observed
at the pixel during a training period. Wren et al. [11] modeled the
background in YUV color space using a Gaussian distribution for
each pixel. However, this method cannot handle multimodal
density functions, so it is not robust in dynamic environments.
A mixture of Gaussians is another popular density-based
method which is designed for dealing with multiple backgrounds.
Three Gaussian components representing the road, shadow, and
vehicle are employed to model the background in traffic scenes in
[12]. An adaptive Gaussian mixture model is proposed in [2] and
[13], where a maximum of K Gaussian components for each pixel
are allowed but the number of Gaussians is determined dynami-
cally. Also, variants of incremental EM algorithms have been
employed to deal with real-time adaptation constraints of back-
ground modeling [14], [15]. However, a major challenge in the
mixture model is the absence or weakness of strategies to
determine the number of components; it is also generally difficult
to add or remove components to/from the mixture [16]. Recently,
more elaborate and recursive update techniques have been discussed in
[4] and [5]. However, none of the Gaussian mixture models have
any principled way to determine the number of Gaussians.
Therefore, most real-time applications rely on models with a fixed
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 34, NO. 5, MAY 2012 1017
. B. Han is with the Department of Computer Science and Engineering,
POSTECH, Pohang 790-784, Korea. E-mail: bhhan@postech.ac.kr.
. L.S. Davis is with the Department of Computer Science, University of
Maryland, College Park, MD 20742. E-mail: lsd@cs.umd.edu.
Manuscript received 28 Sept. 2009; revised 19 Dec. 2010; accepted 27 Aug.
2011; published online 8 Dec. 2011.
Recommended for acceptance by G.D. Hager.
For information on obtaining reprints of this article, please send e-mail to:
tpami@computer.org, and reference IEEECS Log Number
TPAMI-2009-09-0649.
Digital Object Identifier no. 10.1109/TPAMI.2011.243.
0162-8828/12/$31.00 © 2012 IEEE Published by the IEEE Computer Society
number of components or apply ad hoc strategies to adapt the
number of mixtures over time [2], [4], [5], [13], [17].
Kernel density estimation is a nonparametric density estimation
technique that has been successfully applied to background
subtraction [1], [18]. Although it is a powerful representation for
general density functions, it requires many samples for accurate
estimation of the underlying density functions and is computa-
tionally expensive, so it is not appropriate for real-time applica-
tions, especially when high-dimensional features are involved.
Most background subtraction algorithms are based on pixel-
wise processing, but multilayer approaches have also been introduced in
[19] and [20], where background models are constructed at the
pixel, region, and frame levels and information from each layer is
combined for discriminating foreground and background. In [21],
several modalities based on different features and algorithms are
integrated and a Markov Random Field (MRF) is employed for the
inference procedure. The co-occurrence of visual features within
neighborhood pixels is used for robust background subtraction in
dynamic scenes in [22].
Some research on background subtraction has focused more on
features than the algorithm itself. Various visual features may be
used to model backgrounds, including intensity, color, gradient,
motion, texture, and other general filter responses. Color and
intensity are probably the most popular features for background
modeling, but several attempts have been made to integrate other
features to overcome their limitations. There are a few algorithms
based on motion cues [18], [23]. Texture and gradients have also
been successfully integrated for background modeling [24], [25]
since they are relatively invariant to local variations and illumina-
tion changes. In [26], spectral, spatial, and temporal features are
combined, and background/foreground classification is per-
formed based on the statistics of the most significant and frequent
features. Recently, a feature selection technique was proposed for
background subtraction [27].
3 BACKGROUND MODELING AND SUBTRACTION
ALGORITHM
This section describes our background modeling and subtraction
method based on the 1D KDA using multiple features. KDA is a
flexible and compact density estimation technique [7], and we
present a faster method to implement KDA for 1D data. For
background subtraction, we employ the SVM, which takes a
vector of probabilities obtained from multiple density functions
as an input.
3.1 Multiple Feature Combination
The most popular features for background modeling and subtrac-
tion are probably pixelwise color (or intensity) since they are directly
available from images and reasonably discriminative. Although it is
natural to monitor color variations at each pixel for background
modeling, color features have several significant limitations:
. They are not invariant to illumination changes and
shadows.
. Multidimensional color features are typically correlated
and joint probability modeling may not be advantageous
in practice.
. They rely on local information only and cannot handle
structural variations in neighborhoods.
We integrate color, gradient, and Haar-like features together to
alleviate the disadvantages of pixelwise color modeling. The
gradient features are more robust to illumination variations than
color or intensity features and are able to model local statistics
effectively. So, gradient features are occasionally used in back-
ground modeling problems [26], [27]. The strength of Haar-like
features lies in their simplicity and the ability to capture
neighborhood information. Each Haar-like feature is weak by
itself, but the collection of weak features has significant classifica-
tion power [6], [28]. The integration of these features is expected to
improve the accuracy of background subtraction. We have
11 features altogether, RGB color, two gradient features (horizontal
and vertical), and six Haar-like features. The Haar-like features
employed in our implementation are illustrated in Fig. 1. The
Haar-like features are extracted from 9 × 9 rectangular regions at
each location in the image, while the gradient features are
extracted with 3 × 3 Sobel operators. The fourth and fifth Haar-
like features are similar to the gradient features, but differ in filter
design, especially scale.
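As an illustration, the 11-dimensional feature stack described above can be assembled roughly as follows. The paper's exact Haar-like masks are those of Fig. 1, so the box patterns below are plausible stand-ins rather than the authors' precise kernels, and the function name is ours.

```python
import numpy as np
from scipy.ndimage import convolve

def pixel_features(gray, rgb):
    """Sketch of the 11 per-pixel features: 3 RGB values, 2 Sobel
    gradients, and 6 Haar-like responses over 9x9 windows.
    The Haar box patterns are illustrative approximations of Fig. 1."""
    feats = [rgb[..., c].astype(float) for c in range(3)]

    # 3x3 Sobel operators for horizontal and vertical gradients.
    sx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    g = gray.astype(float)
    feats.append(convolve(g, sx))
    feats.append(convolve(g, sx.T))

    # Haar-like features: +1/-1 box patterns applied as 9x9 filters.
    half = np.ones((9, 9)); half[:, 4:] = -1            # left vs. right
    thirds = np.ones((9, 9)); thirds[:, 3:6] = -2       # vertical center band
    quad = np.ones((9, 9)); quad[:4, 4:] = -1; quad[4:, :4] = -1  # diagonal
    center = -np.ones((9, 9)); center[3:6, 3:6] = 8     # center-surround
    for kernel in (half, half.T, thirds, thirds.T, quad, center):
        feats.append(convolve(g, kernel))

    return np.stack(feats, axis=-1)   # shape (h, w, 11)
```

A zero image yields an all-zero feature stack, and the first three channels pass the color planes through unchanged, which makes the function easy to sanity-check.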
3.2 Background Modeling by KDA
The background probability of each pixel for each feature is
modeled with a Gaussian mixture density function.1 There are
various methods to implement this idea, and we adopt KDA,
where the density function for each pixel is represented with a
compact and flexible mixture of Gaussians. The KDA is a density
approximation technique based on mixture models, where mode
locations (local maxima) are detected automatically by the mean-
shift algorithm and a single Gaussian component is assigned to
each detected mode. The covariance for each Gaussian is
computed by curvature fitting around the associated mode. More
details about KDA can be found in [7]; here we summarize how to
build a density function by KDA for the 1D case.
Suppose that the training data for a sequence are composed of
$n$ frames and $x_{F,i}$ ($i = 1, \ldots, n$) is the $i$th value for feature $F$ at a
certain location. Note that the index for pixel location is omitted for
simplicity. For each feature, we first construct a 1D density
function at each pixel by kernel density estimation based on a
Gaussian kernel as follows:

$$\hat{f}_F(x) = \frac{1}{\sqrt{2\pi}} \sum_{i=1}^{n} \frac{\kappa_{F,i}}{\sigma_{F,i}} \exp\left( -\frac{(x - x_{F,i})^2}{2\sigma_{F,i}^2} \right), \qquad (1)$$

where $\sigma_{F,i}$ and $\kappa_{F,i}$ are the bandwidth and weight of the $i$th kernel,
respectively.
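For reference, a direct (if naive) evaluation of the kernel density in (1) can be sketched as follows; the function name and array conventions are ours, not the paper's.

```python
import numpy as np

def kde_1d(x, samples, sigmas, weights):
    """Evaluate the 1D Gaussian kernel density of (1) at points x.
    samples (x_{F,i}), sigmas (bandwidths sigma_{F,i}), and weights
    (kappa_{F,i}) are per-kernel arrays; the weights are assumed to
    sum to one so the density integrates to one."""
    x = np.atleast_1d(x)[:, None]                    # shape (m, 1)
    z = (x - samples[None, :]) / sigmas[None, :]     # standardized residuals
    dens = (weights / (np.sqrt(2 * np.pi) * sigmas)) * np.exp(-0.5 * z ** 2)
    return dens.sum(axis=1)                          # shape (m,)
```

Evaluated on a fine grid that covers the sample support, the Riemann sum of the returned values should be close to one, which is a quick correctness check.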
As described above, the KDA finds local maxima in the
underlying density function (1), and a mode-based representation
of the density is obtained by estimating all the parameters for a
compact Gaussian mixture. The original density function is
simplified by KDA as
$$\tilde{f}_F(x) = \frac{1}{\sqrt{2\pi}} \sum_{i=1}^{m_F} \frac{\tilde{\kappa}_{F,i}}{\tilde{\sigma}_{F,i}} \exp\left( -\frac{(x - \tilde{x}_{F,i})^2}{2\tilde{\sigma}_{F,i}^2} \right), \qquad (2)$$

where $\tilde{x}_{F,i}$ is a convergence location of the mean-shift mode-finding
procedure, and $\tilde{\sigma}_{F,i}$ and $\tilde{\kappa}_{F,i}$ are the estimated standard
deviation and weight of the Gaussian assigned to each mode,
respectively. Note that the number of components in (2), $m_F$, is
generally much smaller than $n$.
One remaining issue is bandwidth selection, i.e., the initialization of
$\sigma_{F,i}$ to create the original density function in (1). Although there
is no optimal bandwidth selection method for nonparametric density
estimation, we initialize the kernel bandwidth based on the median
absolute deviation over the temporal neighborhood of the corresponding
pixel, which is almost identical to the method proposed in [1].
Specifically, the initial kernel bandwidth of the $i$th sample for
feature $F$ is given by

$$\sigma_{F,i} = \max\Big( \sigma_{\min},\ \operatorname*{med}_{t=i-t_w,\ldots,i+t_w} |x_{F,t} - x_{F,t-1}| \Big), \qquad (3)$$

where $t_w$ is the temporal window size and $\sigma_{\min}$ is the
predefined minimum kernel bandwidth to avoid too narrow a
kernel. The motivation for this strategy is that multiple objects
may be projected onto the same pixel, but feature values from
the same object would be observed for a period of time; the
absolute difference between two consecutive frames, therefore,
does not jump too frequently.

Fig. 1. Haar-like features for our background modeling.

1. Histogram-based modeling is appropriate for 1D data. However, the
construction of a histogram at each pixel requires significant memory, and
quantization is not straightforward due to the wide range of Haar-like
feature values. Also, a histogram is typically less stable than a continuous
density function when only a small number of samples is available.
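The bandwidth initialization of (3) can be sketched as below; the default values of `tw` and `sigma_min` are illustrative, not values prescribed by the paper.

```python
import numpy as np

def init_bandwidths(x, tw=5, sigma_min=1.0):
    """Initialize kernel bandwidths in the spirit of (3): the median
    absolute difference of consecutive frames within a temporal window
    around each sample, floored at sigma_min. x is the length-n series
    of one feature at one pixel."""
    n = len(x)
    d = np.abs(np.diff(x))              # |x_t - x_{t-1}|, length n - 1
    sig = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - tw), min(n - 1, i + tw)
        med = np.median(d[lo:hi]) if hi > lo else 0.0
        sig[i] = max(sigma_min, med)
    return sig
```

On a constant series the median difference is zero and every bandwidth falls back to `sigma_min`, exactly the degenerate case the floor is meant to handle.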
Although KDA handles multimodal density functions for each
feature, it is still not sufficient to handle long-term background
variations. We must update background models periodically or
incrementally, which is done by Sequential Kernel Density
Approximation (SKDA) [7]. The density function at time step $t+1$
is obtained as the weighted average of the density function at the
previous time step and a new observation, which is given by

$$\hat{f}^{\,t+1}_F(x) = (1 - r)\,\hat{f}^{\,t}_F(x) + \frac{r\,\tilde{\kappa}^{\,t+1}_{F,i}}{\sqrt{2\pi}\,\tilde{\sigma}^{\,t+1}_{F,i}} \exp\left( -\frac{\big(x - \tilde{x}^{\,t+1}_{F,i}\big)^2}{2\big(\tilde{\sigma}^{\,t+1}_{F,i}\big)^2} \right), \qquad (4)$$
where r is a learning rate (forgetting factor). To prevent the
number of components from increasing indefinitely by this
procedure, KDA is applied at each time step. The detailed
technique and analysis are described in [7]. Since we are dealing
with only 1D vectors, the sequential modeling is also much simpler
and similar to the one introduced in Section 3.3; it is sufficient to
compute the new convergence points of the mode locations
immediately before and after the new data in the density function
at each time step.2 The computation time of SKDA depends
principally on the number of iterations for the data points in (4),
and can vary at each time step.
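A single sequential step in the spirit of (4) can be sketched as follows, under our own naming; the full SKDA recursion of [7] additionally re-applies KDA at each step to merge components, which is omitted here for brevity.

```python
import numpy as np

def skda_step(weights, means, sigmas, x_new, sigma_new, r=0.01):
    """One incremental mixture update: down-weight the existing
    Gaussian components by (1 - r) and append the new observation
    as a Gaussian with weight r, keeping the total weight at one.
    The subsequent mode-finding/merging pass is omitted."""
    weights = np.append((1.0 - r) * weights, r)
    means = np.append(means, x_new)
    sigmas = np.append(sigmas, sigma_new)
    return weights, means, sigmas
```

Because the old weights are scaled by exactly $(1-r)$ and the new kernel receives weight $r$, the mixture weights continue to sum to one after every step.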
3.3 Optimization in One Dimension
We are only interested in 1D density functions, where the
convergence point of each sample can be obtained much faster than
in the general case. Given 1D samples $x_i$ and $x_j$ ($i, j = 1, \ldots, n$),
the density function created by kernel density estimation has the
following property:

$$(x_i \le x_j \le \hat{x}_i) \,\vee\, (x_i \ge x_j \ge \hat{x}_i) \;\Rightarrow\; \hat{x}_i = \hat{x}_j, \qquad (5)$$

where $\hat{x}_i$ and $\hat{x}_j$ are the convergence locations of $x_i$ and $x_j$,
respectively. In other words, all the samples located between a
sample and its associated mode converge to the same mode
location. So, every convergence location in the underlying density
function can be found without running the actual mode-finding
procedure for all the sample points, which is the most expensive
part in the computation of KDA. This section describes a simple
and efficient method that finds all the convergence points in a
single linear scan of the samples using the above property.
We sort the sample points in ascending order, and start
performing mean-shift mode finding from the smallest sample.
When the current sample moves in the gradient ascent direction by
the mean-shift algorithm in the underlying density function and
passes another sample’s location during the iterative procedure,
we note that the convergence point of the current sample must be
the same as the convergence location of the sample just passed,
terminate the current sample’s mean-shift process, and move on to
the next smallest sample, where we begin the mean-shift process
again.3 If a mode is found during the mean-shift iterations, its
location is stored and the next sample is considered. After
finishing the scan of all samples, each sample is associated with
a mode and the mode-finding procedure is complete. This strategy
is simple, but improves the speed of mean-shift mode-finding
significantly, especially when many samples are involved. Fig. 2
illustrates the impact of 1D optimization for the mode-finding
procedure, where a huge difference in computation time between
two algorithms is evident.
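A sketch of this linear-scan shortcut is given below, under our own naming; the crossing test and pointer resolution are our interpretation of the procedure described above, not the authors' code.

```python
import numpy as np

def modes_1d(samples, sigmas, weights, tol=1e-6, max_iter=500):
    """Mean-shift mode finding for a 1D Gaussian KDE with the
    linear-scan shortcut of Section 3.3: samples are processed in
    ascending order, and a run stops as soon as its trajectory
    crosses another sample, since by (5) both share the same mode.
    Returns the mode location per sample (in sorted order)."""
    order = np.argsort(samples)
    xs, sg, wt = samples[order], sigmas[order], weights[order]
    n = len(xs)
    conv = np.full(n, np.nan)      # converged mode location per sample
    link = np.full(n, -1)          # or a pointer to a crossed sample

    def shift(y):
        # One mean-shift step under the Gaussian kernel.
        k = wt / sg * np.exp(-0.5 * ((y - xs) / sg) ** 2)
        return np.sum(k * xs) / np.sum(k)

    for i in range(n):
        y = xs[i]
        for _ in range(max_iter):
            y_new = shift(y)
            lo, hi = min(y, y_new), max(y, y_new)
            # Did the trajectory pass another sample's location?
            crossed = np.where((xs >= lo) & (xs <= hi) & (np.arange(n) != i))[0]
            if crossed.size:
                link[i] = crossed[0]   # share that sample's mode
                break
            if abs(y_new - y) < tol:
                conv[i] = y_new        # converged to a mode
                break
            y = y_new
        else:
            conv[i] = y

    def resolve(i):
        hops = 0
        while link[i] != -1 and hops < n:  # follow pointer chains
            i = link[i]; hops += 1
        return conv[i]
    return np.array([resolve(i) for i in range(n)])
```

On a clearly bimodal sample set, the samples of each cluster resolve to a common mode, while only a fraction of them ever run mean-shift to convergence.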
3.4 Foreground and Background Classification
After background modeling, each pixel is associated with k 1D
Gaussian mixtures, where k is the number of features integrated.
Background/foreground classification for a new frame is per-
formed using these distributions. The background probability of a
feature value is computed by (2), and k probability values are
obtained from each pixel, which are represented by a k-dimen-
sional vector. Such k-dimensional vectors are collected from
annotated foreground and background pixels, and we denote
them by $\mathbf{y}_j$ ($j = 1, \ldots, N$), where $N$ is the number of data points.
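This discriminative stage can be sketched as below, with scikit-learn's `SVC` standing in for the paper's SVM (the paper does not tie itself to a particular implementation); the per-feature likelihood vectors here are synthetic placeholders rather than values from real video.

```python
import numpy as np
from sklearn.svm import SVC

# Each pixel yields a k-vector of per-feature background likelihoods
# (computed from (2)); an RBF-kernel SVM is trained on vectors from
# annotated background (label 1) and foreground (label 0) pixels.
rng = np.random.default_rng(0)
k, N = 11, 400
bg = rng.beta(5, 2, size=(N // 2, k))    # background: mostly high likelihoods
fg = rng.beta(2, 5, size=(N // 2, k))    # foreground: mostly low likelihoods
y_train = np.r_[np.ones(N // 2), np.zeros(N // 2)]

svm = SVC(kernel='rbf', C=1.0, gamma='scale')
svm.fit(np.vstack([bg, fg]), y_train)

# Classify a new pixel's likelihood vector (label 1.0 encodes background
# in this sketch).
label = svm.predict(rng.beta(5, 2, size=(1, k)))
```

The nonlinear RBF kernel is what lets the classifier absorb the feature correlation and inconsistency discussed above, rather than averaging or voting over the k probabilities.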
In most density-based background subtraction algorithms, the
probabilities associated with each pixel are combined in a
straightforward way, either by computing the average probability
or by voting for the classification. However, such simple methods
may not work well under many real-world situations due to
feature dependency and nonlinearity. For example, pixels in
shadow may have a low background probability in color modeling
unless shadows are explicitly modeled as transformations of color
variables, but a high background probability in texture modeling.
Also, the foreground color of a pixel