To appear in the ACM SIGGRAPH conference proceedings
Content-Preserving Warps for 3D Video Stabilization
Feng Liu
Michael Gleicher
University of Wisconsin-Madison
Hailin Jin
Aseem Agarwala
Adobe Systems, Inc.
Abstract
We describe a technique that transforms a video from a hand-held
video camera so that it appears as if it were taken with a directed
camera motion. Our method adjusts the video to appear as if it were
taken from nearby viewpoints, allowing 3D camera movements to
be simulated. By aiming only for perceptual plausibility, rather than
accurate reconstruction, we are able to develop algorithms that can
effectively recreate dynamic scenes from a single source video. Our
technique first recovers the original 3D camera motion and a sparse
set of 3D, static scene points using an off-the-shelf structure-from-
motion system. Then, a desired camera path is computed either
automatically (e.g., by fitting a linear or quadratic path) or inter-
actively. Finally, our technique performs a least-squares optimiza-
tion that computes a spatially-varying warp from each input video
frame into an output frame. The warp is computed to both follow the
sparse displacements suggested by the recovered 3D structure, and
avoid deforming the content in the video frame. Our experiments
on stabilizing challenging videos of dynamic scenes demonstrate
the effectiveness of our technique.
1 Introduction
While digital still photography has progressed to the point where
most amateurs can easily capture high-quality images, the quality
gap between professional and amateur-level video remains remark-
ably wide. One of the biggest components of this gap is camera
motion. Most casual video is shot hand-held, yielding footage that
is difficult to watch, even if video stabilization is used to remove
high-frequency jitter. In contrast, some
of the most striking camera movements in professional produc-
tions are “tracking shots” [Kawin 1992], where cameras are moved
along smooth, simple paths. Professionals achieve such motion
with sophisticated equipment, such as cameras mounted on rails
or steadicams, that are too cumbersome or expensive for amateurs.
In this paper, we describe a technique that allows users to transform
their hand-held videos to have the appearance of an idealized
camera motion, such as a tracking shot, as a post-processing step.
Given a video sequence from a single video camera, our algorithm
can simulate any camera motion that is reasonably close to the cap-
tured one. We focus on creating canonical camera motions, such as
linear or parabolic paths, because such paths have a striking effect
and are difficult to create without extensive equipment. Our method
can also perform stabilization using low-pass filtering of the origi-
nal camera motion to give the appearance of a steadicam. Given a
desired output camera path, our method then automatically warps
the input sequence so that it appears to have been captured along
the specified path.
1 http://www.cs.wisc.edu/graphics/Gallery/WarpFor3DStabilization/
While existing video stabilization algorithms are successful at re-
moving small camera jitters, they typically cannot produce the more
aggressive changes required to synthesize idealized camera mo-
tions. Most existing methods operate purely in 2D; they apply full-
frame 2D warps (e.g., affine or projective) to each image that best
remove jitter from the trajectory of features in the video. These 2D
methods are fundamentally limited in two ways: first, a full-frame
warp cannot model the parallax that is induced by a translational
shift in viewpoint; second, there is no connection between the 2D
warp and a 3D camera motion, making it impossible to describe
desired camera paths in 3D. We therefore consider a 3D approach.
Image-based rendering methods can be used to perform video sta-
bilization in 3D by rendering what a camera would have seen along
the desired camera path [Buehler et al. 2001a]. However, these tech-
niques are currently limited to static scenes, since they render a
novel viewpoint by combining content from multiple video frames,
and therefore multiple moments in time.
Our work is the first technique that can perform 3D video stabi-
lization for dynamic scenes. In our method, dynamic content and
other temporal properties of video are preserved because each out-
put frame is rendered as a warp of a single input frame. This con-
straint implies that we must perform accurate novel view interpo-
lation from a single image, which is extremely challenging [Hoiem
et al. 2005]. Performing this task for a non-rigid dynamic scene
captured by a single camera while maintaining temporal coherence
is even harder; in fact, to the best of our knowledge it has never
been attempted. An accurate solution would require solving several
challenging computer vision problems, such as video layer separa-
tion [Chuang et al. 2002], non-rigid 3D tracking [Torresani et al.
2008], and video hole-filling [Wexler et al. 2004]. In this paper
we provide a technique for novel view interpolation that avoids
these challenging vision problems by relaxing the constraint of a
physically-correct reconstruction. For our application, a perceptu-
ally plausible result is sufficient: we simply want to provide the
illusion that the camera moves along a new but nearby path. In prac-
tice, we find our technique is effective for video stabilization even
though our novel views are not physically accurate and would not
match the ground truth.
Our method takes advantage of recent advances in two areas of re-
search: shape-preserving image deformation [Igarashi et al. 2005],
which deforms images according to user-specified handles while
minimizing the distortion of local shape; and content-aware im-
age resizing [Avidan and Shamir 2007; Wolf et al. 2007], which
changes the size of images while preserving salient image content.
Both of these methods minimize perceivable image distortion by
optimally distributing the deformation induced by user-controlled
edits across the 2D domain. We apply this same principle to image
warps for 3D video stabilization, though in our case we optimally
distribute the distortion induced by a 3D viewpoint change rather
than user-controlled deformation. Since the change in viewpoint re-
quired by video stabilization is typically small, we have found that
this not-physically-correct approach to novel view interpolation is
sufficient even for challenging videos of dynamic scenes.
Our method consists of three stages. First, it recovers the 3D camera
motion and a sparse set of 3D, static scene points using an off-the-
shelf structure-from-motion (SFM) system. Second, the user inter-
actively specifies a desired camera path, or chooses one of three
camera path options: linear, parabolic, or a smoothed version of
the original; our algorithm then automatically fits a camera path to
the input. Finally, our technique performs a least-squares optimiza-
tion that computes a spatially-varying warp from each input video
frame into an output frame. The warp is computed to both follow the
sparse displacements suggested by the recovered 3D structure, and
minimize distortion of local shape and content in the video frames.
The result is not accurate, in the sense that it will not reveal the dis-
occlusions or non-Lambertian effects that an actual viewpoint shift
should yield; however, for the purposes of video stabilization, we
have found that these inaccuracies are difficult to notice in casual
viewing. As we show in our results, our method is able to con-
vincingly render a range of output camera paths that are reasonably
close to the input path, even for highly dynamic scenes.
2 Related Work
Two-dimensional video stabilization techniques have matured to the
point that they are commonly implemented in on-camera hardware
and run in real time [Morimoto and Chellappa 1997]. This
approach can be sufficient if the user only wishes to damp unde-
sired camera shake, if the input camera motion consists mostly of
rotation with very little translation, or if the scene is planar or very
distant. However, in the common case of a camera moving through
a three-dimensional scene, there is typically a large gap between 2D
video stabilization and professional-quality camera paths.
The idea of transforming hand-held videos to appear as if they were
taken as a proper tracking shot was first realized by Gleicher and
Liu [2008]. Their approach segments videos and applies idealized
camera movements to each. However, this approach is based on
full-frame 2D warping, and therefore suffers (as all 2D approaches)
from two fundamental limitations: it cannot reason about the move-
ment of the physical camera in 3D, and it is limited in the amount
of viewpoint change for scenes with non-trivial depth complexity.
The 3D approach to video stabilization was first described by
Buehler et al. [2001a]. In 3D video stabilization, the 3D camera
motion is tracked using structure-from-motion [Hartley and Zisser-
man 2000], and a desired 3D camera path is fit to the hand-held
input path. With this setup, video stabilization can be reduced to
the classic image-based rendering (IBR) problem of novel view in-
terpolation: given a collection of input video frames, synthesize the
images which would have been seen from viewpoints along the de-
sired camera path. Though the novel viewpoint interpolation prob-
lem is challenging and ill-posed, recent sophisticated techniques
have demonstrated high-quality video stabilization results [Fitzgib-
bon et al. 2005; Bhat et al. 2007]. However, the limitation to static
scenes renders these approaches impractical, since most of us shoot
video of dynamic content, e.g., people.
Image warping and deformation techniques have a long his-
tory [Gomes et al. 1998]. Recent efforts have focused on defor-
mation controlled by a user who pulls on various handles [Igarashi
et al. 2005; Schaefer et al. 2006] while minimizing distortion of
local shape, as measured by the local deviation from conformal or
rigid transformations. These methods, which build on earlier work
in as-rigid-as-possible shape interpolation [Alexa et al. 2000], are
able to minimize perceivable distortion much more effectively than
traditional space-warp methods [Beier and Neely 1992] or standard
scattered data interpolation [Bookstein 1989]. Our method applies
this principle in computing spatially-varying warps induced by the
recovered 3D scene structure. A related image deformation problem
is to change the size or aspect ratio of an image without distorting
salient image structure. Seam Carving [Avidan and Shamir 2007]
exploited the fact that less perceptually salient regions in an image
can be deformed more freely than salient regions, and was later ex-
tended to video [Rubinstein et al. 2008]. However, the discrete algo-
rithm behind Seam Carving requires removing one pixel from each
image row or column, which limits its application to general image
warping. Others have explored more continuous formulations [Gal
et al. 2006; Wolf et al. 2007; Wang et al. 2008], which deform a
quad mesh placed on the image according to the salience (or user-
marked importance) found within each quad; we take this approach
in designing our deformation technique.
A limitation of our approach is that it requires successful computa-
tion of video structure-from-motion. However, this step has become
commonplace in the visual effects industry, and commercial 3D
camera trackers like Boujou2 and Syntheyes3 are widely used. We
use the free and publicly available Voodoo camera tracker4, which
has been used in a number of recent research systems [van den
Hengel et al. 2007; Thormählen and Seidel 2008]. Finally, there are a
number of orthogonal issues in video stabilization that we do not
address [Matsushita et al. 2006], such as removing motion blur, and
full-frame video stabilization that avoids the loss of information at
the video boundaries via hole-filling (we simply crop our output).
These techniques could be combined with our method to yield a
complete video stabilization solution.
3 Traditional video stabilization
We begin by describing the current technical approaches to video
stabilization in more detail, and showing their results on the exam-
ple sequence in Video Figure 1 (since many of the issues we discuss
can only be understood in an animated form, we will refer to a set
of video figures that are included as supplemental materials and on
the project web site1).
3.1 2D stabilization
Traditional 2D video stabilization proceeds in three steps. First, a
2D motion model, such as an affine or projective transformation,
is estimated between consecutive frames. Second, the parameters
of this motion model are low-pass filtered across time. Third, full-
frame warps computed between the original and filtered motion
models are applied to remove high-frequency camera shake. Video
Figures 2 and 3 show two results of this approach, created using our
implementation of Matsushita et al. [2006] (we do not perform the
inpainting or deblurring steps, and the two videos contain different
degrees of motion smoothing).
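The three-step pipeline above can be sketched concretely. The following toy example (our own construction, not the authors' code; all names are hypothetical) restricts the motion model to per-frame 2D translations, smooths them with a Gaussian low-pass filter, and takes the corrective warp for each frame as the difference between the smoothed and original parameters:

```python
import numpy as np

def smooth_path(params, sigma=2.0):
    """Low-pass filter per-frame motion parameters with a Gaussian kernel."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x**2 / (2 * sigma**2))
    kernel /= kernel.sum()
    # Replicate the endpoints so the filtered path is not pulled toward zero.
    padded = np.pad(params, ((radius, radius), (0, 0)), mode="edge")
    return np.stack([np.convolve(padded[:, i], kernel, mode="valid")
                     for i in range(params.shape[1])], axis=1)

def stabilizing_warps(params):
    """Corrective translation per frame: smoothed minus original."""
    return smooth_path(params) - params

# Toy hand-held trajectory: a slow pan plus high-frequency jitter.
rng = np.random.default_rng(0)
frames = 100
pan = np.stack([np.linspace(0, 50, frames), np.zeros(frames)], axis=1)
jitter = rng.normal(scale=3.0, size=(frames, 2))
corrections = stabilizing_warps(pan + jitter)
```

Applying each correction as a full-frame translation damps the jitter but, as discussed above, cannot model parallax; practical 2D stabilizers fit affine or projective models rather than pure translations.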
While 2D stabilization can significantly reduce camera shake, it
cannot simulate an idealized camera path similar to what can be
found in professional tracking shots. Since the 2D method has no
knowledge of the 3D trajectory of the input camera, it cannot rea-
son in 3D about what the output camera path should be, and what
the scene would have looked like from this path. Instead, it must
make do with fitting projective transformations (which are poor ap-
proximations for motion through a 3D scene) and low-pass filtering
them. Strong low-pass filtering (Video Figure 3) can lead to visible
distortions of the video content, while weak filtering (Video Figure
2) only damps shake; neither can simulate directed camera motions.
3.2 3D stabilization
The 3D approach to video stabilization is more powerful, though
also more computationally complex. Here, the actual 3D trajectory
of the original camera is first estimated using standard structure-
from-motion [Hartley and Zisserman 2000]; this step also results in
2 http://www.2d3.com
3 http://ssontech.com
4 http://www.digilab.uni-hannover.de
Figure 1: A crop of a video frame created using novel view inter-
polation. While the static portions of the scene appear normal, the
moving people suffer from ghosting.
Figure 2: A crop of a video frame created using generic sparse
data interpolation. The result does not contain ghosting, but distorts
structures such as the window and pole highlighted with red arrows.
a sparse 3D point cloud describing the 3D geometry of the scene.
Second, a desired camera path is fit to the original trajectory (we de-
scribe several approaches to computing such a path in Section 4.3).
Finally, an output video is created by rendering the scene as it would
have been seen from the new, desired camera trajectory.
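For the path-fitting step, a minimal sketch (our own, with hypothetical names; the paper's actual path options are described in Section 4.3) fits each coordinate of the recovered camera centers to a low-degree polynomial in the frame index, so degree 1 yields a linear path and degree 2 a parabolic one:

```python
import numpy as np

def fit_camera_path(positions, degree=1):
    """Fit each coordinate of the recovered camera centers (N x 3) to a
    polynomial in the frame index; degree 1 = linear, 2 = parabolic."""
    t = np.arange(len(positions))
    return np.column_stack([
        np.polyval(np.polyfit(t, positions[:, i], degree), t)
        for i in range(positions.shape[1])])

# Toy recovered trajectory: a straight dolly move plus hand-held jitter.
rng = np.random.default_rng(1)
s = np.linspace(0.0, 1.0, 60)
truth = np.column_stack([s, 0.2 * s, np.zeros_like(s)])
noisy = truth + rng.normal(scale=0.02, size=truth.shape)
path = fit_camera_path(noisy, degree=1)  # idealized linear camera path
```

A smoothed (rather than idealized) path would instead low-pass filter the recovered centers, analogous to the steadicam-style option mentioned in Section 1.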
There are a number of techniques for rendering novel views of a
scene; in Video Figure 4 we show a video stabilization result cre-
ated using the well-known unstructured lumigraph rendering algo-
rithm [Buehler et al. 2001b]. The result is remarkably stable. How-
ever, like all novel view interpolation algorithms, each output frame
is rendered as a blend of multiple input video frames. Therefore, dy-
namic scene content suffers from ghosting (we show a still frame
example of this ghosting in Figure 1).
One approach to handling dynamic scene content would be to iden-
tify the dynamic objects, matte them out, use novel view interpo-
lation to synthesize the background, re-composite, and fill any re-
maining holes. However, each of these steps is a challenging prob-
lem, and the probability that all would complete successfully is low.
Therefore, in the next section we introduce the constraint that each
output video frame be rendered only from the content in its corre-
sponding input video frame.
4 Our approach
Our approach begins similarly to the 3D stabilization technique just
described; we recover the original 3D camera motion and sparse
3D point cloud using structure-from-motion, and specify a desired
output camera motion in 3D (in this section we assume the output
path is given; our approach for computing one is described in Sec-
tion 4.3). Then, rather than synthesize novel views using multiple
input video frames, we use both the sparse 3D point cloud and the
content of the video frames as a guide in warping each input video
frame into its corresponding output video frame.
More specifically, we compute an output video sequence from the
input video such that each output video frame I_t is a warp of its
corresponding input frame Î_t (since we treat each frame independently,
we will omit the t subscript from now on). As guidance we
have a sparse 3D point cloud which we can project into both the
input and output cameras, yielding two sets of corresponding 2D
points: P in the output, and P̂ in the input. The k-th pair of projected
points yields a 2D displacement P_k − P̂_k that can guide the
warp from input to output. The problem remaining is to create a
dense warp guided by this sparse set of displacements. This warp,
which can use the displacements as either soft or hard constraints,
should maintain the illusion of a natural video by maintaining tem-
poral coherence and not distorting scene content. We first consider
two simple warping solutions, the first of which is not successful,
and the second of which is moderately successful.
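The projected displacements can be made concrete with a small sketch (camera matrices and scene points are hypothetical, chosen by us). Note that points at different depths receive different displacements; this parallax is exactly what a single full-frame warp cannot reproduce:

```python
import numpy as np

def project(P, X):
    """Project N x 3 world points X through a 3x4 camera matrix P."""
    Xh = np.hstack([X, np.ones((len(X), 1))])   # homogeneous coordinates
    x = Xh @ P.T
    return x[:, :2] / x[:, 2:3]                 # perspective divide

# Shared intrinsics; the output camera is the input camera translated
# slightly along x, i.e., a nearby stabilized viewpoint.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
P_in = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P_out = K @ np.hstack([np.eye(3), np.array([[-0.1], [0.0], [0.0]])])

# Three static scene points at depths 5, 10, and 4.
X = np.array([[0.0, 0.0, 5.0], [1.0, -0.5, 10.0], [-2.0, 1.0, 4.0]])
displacements = project(P_out, X) - project(P_in, X)  # P_k - P̂_k per point
```

Here the horizontal displacement of each point is inversely proportional to its depth, so near and far points disagree on how the frame should move.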
The first solution is to use generic sparse data interpolation to yield
a dense warp from the sparse input. In Video Figure 5 we show a
result computed by simply triangulating the sparse points and us-
ing barycentric coordinates to interpolate the displacements inside
the triangles; the displacements are therefore treated as hard con-
straints. The result has a number of problems. Most significantly,
important scene structures are distorted (we show a still example in
Figure 2). These distortions typically occur near occlusions, which
are the most challenging areas for novel view interpolation. Also,
problems occur near the frame boundaries because extrapolation
outside the hull of the points is challenging (for this example, we
do not perform extrapolation). Finally, treating the displacements
as hard constraints leads to temporal incoherence since the recon-
structed 3D points are not typically visible for the entire video. Pop-
ping and jittering occur when the corresponding displacements ap-
pear and disappear over time. In this example, we use a very short
segment of video and only include points that last over the entire du-
ration of the video; however, the problem is unavoidable in longer
sequences. Our approach for preserving temporal coherence, which
is only applicable if displacements are used as soft constraints, is
described in Section 4.1.4.
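The triangulation-and-barycentric scheme just described can be sketched for a single triangle (pure numpy; the points and displacements are illustrative values of our own). Because vertex displacements are reproduced exactly, they act as hard constraints:

```python
import numpy as np

def barycentric(p, tri):
    """Barycentric coordinates of 2D point p inside triangle tri (3 x 2)."""
    a, b, c = tri
    l1, l2 = np.linalg.solve(np.column_stack([b - a, c - a]), p - a)
    return np.array([1.0 - l1 - l2, l1, l2])

def interpolate_displacement(p, tri_pts, tri_disp):
    """Blend the three vertex displacements with barycentric weights."""
    return barycentric(p, tri_pts) @ tri_disp

# One triangle of projected scene points and their displacements P_k - P̂_k.
tri_pts = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
tri_disp = np.array([[1.0, 0.0], [3.0, 0.0], [1.0, 2.0]])
d = interpolate_displacement(np.array([5.0, 5.0]), tri_pts, tri_disp)
```

When a tracked point disappears, the triangles incident on it (and thus the dense warp around them) change discontinuously, which is the popping artifact described above.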
The second alternative is to fit a full-frame warp to the sparse dis-
placements, such as a homography (thereby treating the displace-
ments as a soft constraint). We show a somewhat successful result
of this technique in Video Figure 6. This method can achieve good
results if the depth variation in the scene is not large, or if the de-
sired camera path is very close to the original. We show a less suc-
cessful result of this technique in Video Figure 7. In the general
case, a homography is too constrained a model to sufficiently fit
the desired displacements. This deficiency can result in undesired
distortion (we show an individual frame example in Figure 3), and
temporal wobbling. However, this approach is the best of the
alternatives we have considered up to now.
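Fitting a homography to the sparse displacements can be sketched with the standard direct linear transform (DLT), solved in least squares via the SVD (a generic sketch, not the authors' implementation):

```python
import numpy as np

def fit_homography(src, dst):
    """Least-squares homography via the direct linear transform (DLT):
    the right singular vector of smallest singular value minimizes |Ah|."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def apply_homography(H, pts):
    ph = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return ph[:, :2] / ph[:, 2:3]

# Projected points P̂ in the input frame and their displaced positions P in
# the output frame; here the displacements form a consistent translation,
# so a single homography can satisfy them exactly.
src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 3.0]])
dst = src + np.array([2.0, -1.0])
H = fit_homography(src, dst)
```

With depth-dependent (parallax) displacements the least-squares fit instead leaves residuals at the points, which appear as the distortion and wobble described above.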
The first solution described above is too flexible; it exactly sat-
isfies the sparse displacements, but does n