Exploiting the Manhattan-world Assumption for
Extrinsic Self-calibration of Multi-modal Sensor Networks
Marcel Brückner∗ Joachim Denzler
Chair for Computer Vision, Friedrich Schiller University of Jena
Ernst-Abbe-Platz 2, 07743 Jena, Germany
{marcel.brueckner, joachim.denzler}@uni-jena.de
Abstract
Many new applications are enabled by combining a
multi-camera system with a Time-of-Flight (ToF) camera,
which is able to simultaneously record intensity and depth
images. Classical approaches for self-calibration of a
multi-camera system fail to calibrate such a system due to
the very different image modalities. In addition, the typical
environments of multi-camera systems are man-made and
consist primarily of low-textured objects. However, at
the same time they satisfy the Manhattan-world assumption.
We formulate the multi-modal sensor network calibration
as a Maximum a Posteriori (MAP) problem and solve it by
minimizing the corresponding energy function. First we es-
timate two separate 3D reconstructions of the environment:
one using the pan-tilt unit mounted ToF camera and one
using the multi-camera system. We exploit the Manhattan-
world assumption and estimate multiple initial calibration
hypotheses by registering the three dominant orientations of
planes. These hypotheses are used as prior knowledge for a
subsequent MAP estimation aiming to align edges that are
parallel to these dominant directions. To our knowledge,
this is the first self-calibration approach that is able to cali-
brate a ToF camera with a multi-camera system. Quantita-
tive experiments on real data demonstrate the high accuracy
of our approach.
1. Introduction
Multi-camera systems can be found in many man-made
environments. Various computer vision applications like
object tracking or 3D reconstruction use such multi-camera
systems. Most of these systems consist of classical CCD
cameras recording 2D (color) images. A calibrated multi-
camera system can be used to extract 3D information from
the camera images. However, this (wide baseline) 3D
reconstruction is computationally expensive and works only
for textured objects.
∗Marcel Brückner would like to thank the Carl Zeiss Foundation
(Carl-Zeiss-Stiftung) for supporting his research.
Figure 1. A multi-sensor system consisting of a ToF camera (solid
circle and bottom right) and several classical cameras (dashed
circles). Each camera is mounted on a pan-tilt unit.
In recent years a new type of camera has become increasingly
popular: the Time-of-Flight (ToF) camera [18] (Figure 1,
bottom right). This type of camera uses modulated
infrared light to simultaneously record intensity (grayscale)
and depth images at a frequency of about 20 Hz. In contrast
to stereo cameras, it is able to extract depth information
even from untextured objects. Another advantage is that the
depth is measured in metric units, which provides the correct
scale for the 3D reconstruction. Drawbacks of these cameras
are their low resolution (200 × 200 or lower) and that they
are not able to record color information. The combination
of a ToF camera with a classical multi-camera system over-
comes the drawbacks of both. Many computer vision ap-
plications benefit from such a multi-sensor system (Figure
1). To this end, an accurate calibration between the multi-
camera system and the ToF camera is necessary.
The state of the art for calibrating such multi-sensor systems
relies on a calibration pattern [11, 15, 19] or some other
artificial calibration object [4, 10]. From
a practical point of view, however, a pure self-calibration
is most appealing. Self-calibration in this context means
that no artificial landmarks or user interaction are necessary.
2011 IEEE International Conference on Computer Vision
978-1-4577-1102-2/11/$26.00 ©2011 IEEE
Figure 2. Intensity (left) and depth image (middle) recorded by a
ToF camera. Right: Same scene recorded by a CCD camera.
The cameras estimate their rotation and position only from
the images they record.
The appearance of an object in the ToF images depends
strongly on its material and color. Figure 2 shows an inten-
sity and depth image recorded by a ToF camera and an im-
age of the same scene recorded by a classical CCD camera.
Note that the stripes on the sweater on the left are not visi-
ble in the ToF intensity image. Another important property
is that the depth of dark objects like the black folder in the
middle is measured incorrectly. Due to these very different
image modalities, the extraction of point correspondences
with state of the art methods results in (almost) no correct
point correspondences. Hence, point correspondence based
approaches for multi-camera calibration [1, 2, 17] fail to
calibrate the multi-sensor system.
Another difficulty for point correspondence based ap-
proaches is the fact that most multi-sensor systems are
placed in man-made environments (e.g. building interiors),
which primarily consist of low-textured objects. However, these
environments also share the property that most of the sur-
faces are piecewise planar and aligned to three orthogonal
dominant directions. Environments and objects with this
property satisfy the so-called Manhattan-world assumption
[5]. In this paper we present a method that exploits this as-
sumption to create hypotheses for the calibration between a
ToF camera and a calibrated multi-camera system. We then
use these hypotheses to formulate the multi-modal sensor
network calibration as a MAP problem and solve it by min-
imizing the corresponding energy function. We assume that
the ToF camera is able to record its environment e.g. by
being mounted on a pan-tilt unit (Figure 1). To our knowl-
edge, this is the first self-calibration approach which is able
to calibrate a ToF camera with a multi-camera system.
The Manhattan-world assumption has also been ex-
ploited by Furukawa et al. [6, 7] for dense 3D reconstruction
of planar, non-textured surfaces like buildings or building
interiors. Note that even though we are adapting some of
their ideas we are aiming to solve a totally different prob-
lem. They use a calibrated multi-camera system to estimate
a dense 3D reconstruction of planar, non-textured surfaces.
In contrast we aim to estimate the transformation between
the coordinate systems of a ToF camera and a multi-camera
system.
The remainder of this paper is structured as follows. We
state the problem and give an overview of our method in
Section 2. Section 3 shows how to estimate the initial hy-
potheses. The MAP estimation of the final calibration is
described in Section 4. Section 5 presents our experiments
and results. Conclusions are given in Section 6.
2. Problem Statement and Method Overview
The starting point of our approach is a calibrated
multi-camera system [1, 2, 17] and a ToF camera. They both
have their own coordinate system. We assume that the ToF
camera is able to change its point of view to record images
from its environment. In our experiments we mount the ToF
camera on a pan-tilt unit, but it could also be mounted e.g.
on some mobile robot. The setup that we have in mind is
a multi-sensor system consisting of pan-tilt unit mounted
CCD and ToF cameras, similar to the one in Figure 1.
Normally, to add a camera to an already calibrated
multi-camera system, the rotation R and
the translation t need to be estimated. However, as the ToF
camera is able to measure metric distances, we formulate
the calibration as a similarity transformation from the
coordinate system of the multi-camera system to that of the
ToF camera. This allows us to correct the typically unknown
scale of the multi-camera calibration [1, 17]. A similarity
transformation
$$S \stackrel{\text{def}}{=} \begin{pmatrix} a^s R^s & t^s \\ 0^T & 1 \end{pmatrix} \tag{1}$$
consists of a rotation $R^s$, a translation $t^s$ and a scale factor
$a^s$. Written in this matrix form it can be used to transform
homogeneous 3D coordinates from one coordinate system
to the other.
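As a minimal sketch of how (1) acts on a point, the following plain-Python snippet builds the 4 × 4 matrix from a scale, a rotation and a translation and applies it to a 3D point in homogeneous coordinates. The function names and the example values are illustrative assumptions and are not part of the original method description.

```python
def similarity_matrix(scale, rotation, translation):
    """Build the 4x4 matrix S = [[a*R, t], [0, 1]] as nested lists."""
    rows = [[scale * rotation[r][c] for c in range(3)] + [translation[r]]
            for r in range(3)]
    rows.append([0.0, 0.0, 0.0, 1.0])
    return rows


def apply_similarity(S, point):
    """Transform a 3D point by S using homogeneous coordinates (x, y, z, 1)."""
    homogeneous = list(point) + [1.0]
    out = [sum(S[r][c] * homogeneous[c] for c in range(4)) for r in range(4)]
    return [out[0] / out[3], out[1] / out[3], out[2] / out[3]]


# Hypothetical example: rotate 90 degrees about z, scale by 2, shift by (1, 0, 0).
R_z90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
S = similarity_matrix(2.0, R_z90, [1.0, 0.0, 0.0])
transformed = apply_similarity(S, [1.0, 0.0, 0.0])
```

In this example the point (1, 0, 0) is first rotated to (0, 1, 0), then scaled to (0, 2, 0) and finally shifted to (1, 2, 0).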
Our approach aims to estimate this similarity transforma-
tion. To this end we exploit the Manhattan-world assump-
tion [5]. It assumes that most surfaces in the surrounding
world of the camera are piecewise planar and aligned to
three orthogonal dominant directions. This assumption is
satisfied for the typical environment of multi-camera sys-
tems like building interiors or urban scenes.
Figure 3 gives a schematic overview of our proposed
method. It consists of two steps. First we estimate two
3D reconstructions of the complete environment: one us-
ing the multi-camera system and another one using the ToF
camera (Section 3.1). For each of these 3D reconstructions
we estimate the three orthogonal dominant directions by
computing a histogram over the surface normal directions
(Section 3.2). Given these dominant directions we can cal-
culate hypotheses for the similarity transformation between
the two coordinate systems.
In the second step we use these hypotheses as prior for
a MAP estimation (Section 4) that aims to align 3D points
of the multi-camera system that lie on edges parallel to the
dominant directions with edges in the ToF intensity images
that are parallel to the same dominant direction.
Figure 3. Overview of the proposed method. Each step is explained in the indicated Section.
Note that we use a superscript f for all ToF camera re-
lated terms and a superscript c for terms related to the multi-
camera system throughout the paper.
3. Estimating the Initial Candidates
3.1. 3D Reconstruction of the Environment
The input of our approach consists of n intensity images
$\mathcal{I}^f \stackrel{\text{def}}{=} \{I^f_1, \dots, I^f_i, \dots, I^f_n\}$ and n depth images
$\mathcal{D}^f \stackrel{\text{def}}{=} \{D^f_1, \dots, D^f_i, \dots, D^f_n\}$ of the environment recorded by
the ToF camera. For each of these images the pinhole matrix
$K^f_i$ and the relative pose, consisting of a rotation $R^f_i$ and a
translation $t^f_i$, are assumed to be known. With these data it
is possible to estimate a 3D point cloud. The position of a
3D point is calculated by
$$X^f \stackrel{\text{def}}{=} \frac{D^f_i(x^f)}{\left\| (K^f_i)^{-1} x^f \right\|_2} (R^f_i)^T (K^f_i)^{-1} x^f - (R^f_i)^T t^f_i , \tag{2}$$
where $D^f_i(x^f)$ is the measured depth in the i-th depth image
at image coordinate $x^f$. This 3D point estimation is done
for each image coordinate $x^f$ in each depth image $D^f_i \in \mathcal{D}^f$.
We estimate the surface normal $n^f$ of each 3D point
$X^f$ by local plane fitting.
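The back-projection in (2) can be sketched in a few lines of plain Python. The sketch assumes a pinhole matrix without skew, described by hypothetical parameters fx, fy, cx and cy, and it treats the measured depth as the radial distance along the viewing ray, as the normalization in (2) suggests. All names are illustrative.

```python
import math


def backproject(depth, pixel, fx, fy, cx, cy, R, t):
    """Back-project a pixel with radial depth into the common ToF frame.

    Computes X = depth / ||K^-1 x|| * R^T K^-1 x - R^T t for a pinhole
    matrix K with focal lengths (fx, fy) and principal point (cx, cy).
    """
    u, v = pixel
    ray = [(u - cx) / fx, (v - cy) / fy, 1.0]      # K^-1 x for homogeneous x
    norm = math.sqrt(sum(c * c for c in ray))
    cam = [depth * c / norm for c in ray]          # point in the camera frame
    diff = [cam[k] - t[k] for k in range(3)]
    # R^T (cam - t): row r of R^T is column r of R.
    return [sum(R[k][r] * diff[k] for k in range(3)) for r in range(3)]


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Pixel at the principal point, 2 m away, camera at the origin.
center_point = backproject(2.0, (80.0, 60.0), 100.0, 100.0, 80.0, 60.0,
                           identity, [0.0, 0.0, 0.0])
# Off-center pixel whose viewing ray is (1, 0, 1) before normalization.
side_point = backproject(math.sqrt(2.0), (180.0, 60.0), 100.0, 100.0, 80.0, 60.0,
                         identity, [0.0, 0.0, 0.0])
```

Applying this function to every pixel of every depth image would yield the dense ToF point cloud described above.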
From the multi-camera system we get m images
$\mathcal{I}^c = \{I^c_1, \dots, I^c_m\}$ with known relative poses and pinhole
matrices. Given these data we estimate a 3D reconstruction using
the multi-view stereo approach of Furukawa and Ponce [9].
This results in a set of 3D points $\mathcal{P}^c$. The approach also
estimates the surface normal $n^c$ of each 3D point $X^c \in \mathcal{P}^c$.
The modalities of the two 3D reconstructions are very
different. The images in Figure 3 give an impression of
these. While the ToF reconstruction is very dense but quite
noisy, the reconstruction using the multi-camera system is
very accurate but since 3D points can only be estimated near
textured objects it is also quite sparse.
3.2. Estimating the Dominant Directions
The surfaces in a Manhattan-world are aligned to three
orthogonal dominant directions. For each of our 3D recon-
structions we want to estimate these directions. Similar to
[6], we first build two histograms $h^c$ and $h^f$ of surface
normal directions n (with $\|n\|_2 = 1$) over a unit hemisphere.
Since we are only interested in the directions of the sur-
face normals, we do not need a histogram over a complete
sphere. Hence the normal directions −n and n are mapped
to the same histogram bin.
Note that the reconstructions of many realistic scenes
contain only two orthogonal dominant directions (see the
example histograms in Figure 3). This is why we search the
histograms for the two unit normals $\hat{n}_q$, $\hat{n}_r$ given by
$$(\hat{n}_q, \hat{n}_r) = \operatorname*{argmax}_{n_q, n_r} \left( h(n_q) + h(n_r) \right) \quad \text{with } n_q^T n_r = 0 . \tag{3}$$
These yield the three orthogonal dominant directions
$d_1 \stackrel{\text{def}}{=} \hat{n}_q$, $d_2 \stackrel{\text{def}}{=} \hat{n}_r$ and $d_3 \stackrel{\text{def}}{=} \hat{n}_q \times \hat{n}_r$. At the end
of this procedure, we have the three dominant directions of
the ToF camera reconstruction $\mathcal{V}^f \stackrel{\text{def}}{=} \{d^f_1, d^f_2, d^f_3\}$ and of
the multi-camera reconstruction $\mathcal{V}^c \stackrel{\text{def}}{=} \{d^c_1, d^c_2, d^c_3\}$.
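The search in (3) can be illustrated with a small brute-force sketch over histogram bins. It assumes that each bin is represented by a unit normal and a count, and it accepts a pair as orthogonal when the absolute dot product is below a tolerance. The tolerance, the data layout and the function name are assumptions made for illustration.

```python
from itertools import combinations


def dominant_directions(bins, tol=1e-6):
    """Find the orthogonal bin pair with the largest summed count.

    bins is a list of (unit_normal, count) tuples. Returns (d1, d2, d3)
    with d3 = d1 x d2, following the construction after equation (3).
    """
    best, best_score = None, -1
    for (nq, hq), (nr, hr) in combinations(bins, 2):
        dot = sum(a * b for a, b in zip(nq, nr))
        if abs(dot) <= tol and hq + hr > best_score:
            best, best_score = (nq, nr), hq + hr
    if best is None:
        raise ValueError("no orthogonal pair of bins found")
    nq, nr = best
    cross = [nq[1] * nr[2] - nq[2] * nr[1],
             nq[2] * nr[0] - nq[0] * nr[2],
             nq[0] * nr[1] - nq[1] * nr[0]]
    return nq, nr, cross


# Hypothetical histogram: two strong walls, one oblique surface, a weak floor.
example_bins = [([1.0, 0.0, 0.0], 50), ([0.0, 1.0, 0.0], 40),
                ([0.6, 0.8, 0.0], 45), ([0.0, 0.0, 1.0], 5)]
d1, d2, d3 = dominant_directions(example_bins)
```

Here the oblique bin has a high count but no strong orthogonal partner, so the two wall normals win and the third direction follows from the cross product even though the floor is barely observed.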
The estimated directions are basically the same but es-
timated in different coordinate systems. There are exactly
24 rotations $\{\tilde{R}^s_1, \dots, \tilde{R}^s_k, \dots, \tilde{R}^s_{24}\}$ that align the three
orthogonal directions of one set with the directions of the
other set (also considering negative directions).
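The count of 24 can be checked by enumeration: every candidate assigns each axis of one set to a signed axis of the other, and only the assignments with determinant +1 are proper rotations. The sketch below enumerates these signed permutation matrices in a canonical frame. Expressing them between the two estimated direction sets, for example as $D^f P_k (D^c)^T$ with the dominant directions stacked as columns of $D^f$ and $D^c$, is an assumption about one possible implementation.

```python
from itertools import permutations, product


def determinant3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def axis_aligned_rotations():
    """Enumerate all signed permutation matrices with determinant +1."""
    rotations = []
    for perm in permutations(range(3)):
        for signs in product((1, -1), repeat=3):
            m = [[0, 0, 0] for _ in range(3)]
            for row in range(3):
                m[row][perm[row]] = signs[row]
            if determinant3(m) == 1:
                rotations.append(m)
    return rotations


candidate_rotations = axis_aligned_rotations()
```

Of the 6 × 8 = 48 signed permutations, exactly half are reflections, which leaves the 24 rotations mentioned above.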
We now form a set of 24 initial similarity transformations
$\{\tilde{S}_1, \dots, \tilde{S}_k, \dots, \tilde{S}_{24}\}$ using the rotations $\tilde{R}^s_k$ and
calculating the translation and scale by minimizing the distance
between the depth measured by the ToF camera and the depth
of the transformed 3D point set $\mathcal{P}^c$
$$\operatorname*{argmin}_{t^s, a^s} \sum_{X^c \in \mathcal{P}^c} \min_i \left| D^f_i\left(\pi^S_i(X^c)\right) - \left\| c^S_i(X^c) \right\|_2 \right| . \tag{4}$$
The function
$$c^S_i(X^c) \stackrel{\text{def}}{=} R^f_i \left( a^s R^s X^c + t^s \right) + t^f_i \tag{5}$$
transforms a 3D point $X^c$ from the coordinate system of the
multi-camera system to the coordinate system of the i-th
image of the ToF camera using the similarity transformation
S. The function
$$\pi^S_i(X^c) \stackrel{\text{def}}{=} K^f_i \, c^S_i(X^c) \tag{6}$$
projects a 3D point $X^c$ from the multi-camera coordinate
system into the i-th image of the ToF camera using the
similarity transformation S. We use the downhill simplex
method [13] for the optimization of (4).
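The two helper functions (5) and (6) and a single residual term of (4) could look roughly as follows in plain Python. This is a sketch under simplifying assumptions: matrices are nested lists, the measured depth is passed in directly instead of being looked up in a depth image, and all names are hypothetical.

```python
import math


def mat_vec(M, v):
    """Multiply a 3x3 matrix (nested lists) with a 3-vector."""
    return [sum(M[r][c] * v[c] for c in range(3)) for r in range(3)]


def to_tof_frame(X_c, a_s, R_s, t_s, R_i, t_i):
    """Equation (5): c_i^S(X^c) = R_i (a_s R_s X^c + t_s) + t_i."""
    common = [a_s * p + q for p, q in zip(mat_vec(R_s, X_c), t_s)]
    return [p + q for p, q in zip(mat_vec(R_i, common), t_i)]


def project(X_c, K_i, a_s, R_s, t_s, R_i, t_i):
    """Equation (6): K_i c_i^S(X^c), dehomogenized to pixel coordinates."""
    h = mat_vec(K_i, to_tof_frame(X_c, a_s, R_s, t_s, R_i, t_i))
    return h[0] / h[2], h[1] / h[2]


def depth_residual(measured_depth, X_c, a_s, R_s, t_s, R_i, t_i):
    """One summand of (4): |measured depth - distance of the transformed point|."""
    cam = to_tof_frame(X_c, a_s, R_s, t_s, R_i, t_i)
    return abs(measured_depth - math.sqrt(sum(c * c for c in cam)))


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K = [[100.0, 0.0, 80.0], [0.0, 100.0, 60.0], [0.0, 0.0, 1.0]]
zero = [0.0, 0.0, 0.0]
# A point 1 unit in front of the camera, with an (unknown) scale of 2.
pixel = project([0.0, 0.0, 1.0], K, 2.0, identity, zero, identity, zero)
residual = depth_residual(2.0, [0.0, 0.0, 1.0], 2.0, identity, zero, identity, zero)
```

With the correct scale of 2 the point lands on the principal point and the depth residual vanishes, which is the behavior the minimization in (4) rewards.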
One might assume that an energy minimization similar
to (4) (optimizing also the rotation) should suffice to esti-
mate the correct similarity transformation. Unfortunately
this is not the case. This is due to the different modalities
of the two 3D data sets. The 3D reconstruction obtained
by the multi-camera system contains only a few 3D points
on planar surfaces; most points lie along object edges.
Exactly at these object edges the depth measurement of ToF
cameras is very inaccurate. The accuracy also suffers from
a systematic depth measurement error of the ToF camera
[11, 15]. Furthermore, the cost function of (4) is not reliable
for deciding which of the similarity transformations $\tilde{S}_k$ is
the correct one.
4. Estimating the Similarity Transformation
We avoid using the depth images of the ToF camera in
our final step. Instead we use the 24 similarity transforma-
tion hypotheses as prior for a Maximum a Posteriori (MAP)
estimation. This MAP estimation aims to align 3D points of
the multi-camera system, which are on edges parallel to the
dominant directions, to edges in the ToF intensity images,
which are parallel to the same dominant direction.
4.1. Edges Parallel to Dominant Directions
Similar to [6], we use the estimated dominant directions
$\mathcal{V}^f$ and $\mathcal{V}^c$ to find edges in the camera images $\mathcal{I}^f$ and $\mathcal{I}^c$
that are parallel to one of these directions. We search the
images for image coordinates lying on image edges [3] that
are parallel to one of the dominant directions $d \in \mathcal{V}$. An
image edge at image point x is parallel to a dominant
direction d if it passes through x and the vanishing point
$v_d$ of the direction d. All the image points $x^f_{i,d}$ from the
ToF intensity images that fulfill these constraints for the
direction d are stored in the set $\mathcal{J}^f_d$. For each of these
image points the index i of its image is known. The set
$\mathcal{J}^f_{\mathcal{V}} \stackrel{\text{def}}{=} \{\mathcal{J}^f_{d_1}, \mathcal{J}^f_{d_2}, \mathcal{J}^f_{d_3}\}$ contains the image point sets of
the three directions. Figure 4 shows a ToF intensity image
and the extracted edges. Each edge direction is colored
differently.
Figure 4. Intensity image of the ToF camera and the resulting edge
image. Each color encodes a different dominant direction.
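One way to test whether an edge pixel is compatible with a dominant direction is sketched below. It uses the standard relation that the vanishing point of a 3D direction d in an image with pinhole matrix K and rotation R is the homogeneous point K R d, and it compares the local edge tangent with the line from the pixel to that vanishing point. The angular threshold, the availability of a per-pixel edge tangent and all function names are assumptions made for illustration.

```python
import math


def vanishing_point(K, R, d):
    """Homogeneous vanishing point v = K R d of a 3D direction d."""
    Rd = [sum(R[r][c] * d[c] for c in range(3)) for r in range(3)]
    return [sum(K[r][c] * Rd[c] for c in range(3)) for r in range(3)]


def edge_points_to(v, x, tangent, max_angle_deg=5.0):
    """True if the edge tangent at pixel x is aligned with the line x -> v."""
    # v_xy - v_w * x is parallel to the line from x to v, also when v_w = 0.
    to_v = (v[0] - v[2] * x[0], v[1] - v[2] * x[1])
    n1 = math.hypot(*to_v)
    n2 = math.hypot(*tangent)
    if n1 == 0.0 or n2 == 0.0:
        return False
    cos_angle = abs(to_v[0] * tangent[0] + to_v[1] * tangent[1]) / (n1 * n2)
    return cos_angle >= math.cos(math.radians(max_angle_deg))


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K = [[100.0, 0.0, 80.0], [0.0, 100.0, 60.0], [0.0, 0.0, 1.0]]
v_x = vanishing_point(K, identity, [1.0, 0.0, 0.0])   # at infinity, horizontal
v_z = vanishing_point(K, identity, [0.0, 0.0, 1.0])   # at the principal point
horizontal_ok = edge_points_to(v_x, (30.0, 40.0), (1.0, 0.0))
vertical_ok = edge_points_to(v_x, (30.0, 40.0), (0.0, 1.0))
radial_ok = edge_points_to(v_z, (100.0, 60.0), (1.0, 0.0))
```

Working with the homogeneous vanishing point avoids a special case for directions parallel to the image plane, whose vanishing points lie at infinity.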
The same procedure is repeated for the images of the
multi-camera system $\mathcal{I}^c$. But instead of storing the image
points that lie on an edge parallel to one of the dominant
directions d, we store the 3D points that are projected
onto one of these edges in the point set $\mathcal{P}^c_d$. The set
$\mathcal{P}^c_{\mathcal{V}} \stackrel{\text{def}}{=} \{\mathcal{P}^c_{d_1}, \mathcal{P}^c_{d_2}, \mathcal{P}^c_{d_3}\}$ contains the 3D point sets of the
three directions.
4.2. MAP Estimation
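Given $\mathcal{J}^f_{\mathcal{V}}$ and $\mathcal{P}^c_{\mathcal{V}}$ we want to estimate the similarity
transformation S. In order to increase the robustness against
noise and outliers we formulate a Maximum a Posteriori
(MAP) problem
$$\operatorname*{argmax}_{S} \; p\left(S \mid \mathcal{J}^f_{\mathcal{V}}, \mathcal{P}^c_{\mathcal{V}}\right) \sim p\left(\mathcal{J}^f_{\mathcal{V}}, \mathcal{P}^c_{\mathcal{V}} \mid S\right) p(S) , \tag{7}$$
where $p(\mathcal{J}^f_{\mathcal{V}}, \mathcal{P}^c_{\mathcal{V}} \mid S)$ is the likelihood and $p(S)$ is the prior.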
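We define the likelihood as
$$p\left(\mathcal{J}^f_{\mathcal{V}}, \mathcal{P}^c_{\mathcal{V}} \mid S\right) \stackrel{\text{def}}{\sim} \prod_{d \in \mathcal{V}} \; \prod_{X^c_d \in \mathcal{P}^c_d} \; \max_{x^f_{i,d} \in \mathcal{J}^f_d} e^{-\lambda \, d_f\left(x^f_{i,d}, X^c_d\right)} \tag{8}$$
where λ is the parameter of the exponential distribution and
$$d_f\left(x^f_{i,d}, X^c_d\right) \stackrel{\text{def}}{=} \min\left( d_c, \left\| x^f_{i,d} - \pi^S_i(X^c_d) \right\|_2 \right) \tag{9}$$
is the minimum of the length of the image diagonal
$d_c$ in pixels and the Euclidean distance between the projection
of $X^c_d$ and the image point $x^f_{i,d}$, both lying on an edge
parallel to the same dominant direction d. This distance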
function increases the robustness against outliers. The like-
lihood (8) is high if each 3D point is projected on an image
Figure 5. Example images of the two rooms used for the exper-
iments: a low textured office (left) and a laboratory with several
occlusions and ambiguities (right).
edge with the same dominant direction. Note that the as-
signment of the dominant directions is given by the rotation
of the similarity transformation.
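Taking the negative logarithm of (8) turns the product into a sum, and the maximum over edge pixels becomes a minimum over truncated distances, so the likelihood term of the energy is λ times the summed distance of each projected 3D point to its nearest edge pixel of the same direction. The following sketch evaluates this term. It assumes a single ToF image, already projected 3D points and dictionaries keyed by direction index, all of which are simplifications introduced for illustration.

```python
import math


def truncated_distance(x, projected, diagonal):
    """Equation (9): Euclidean pixel distance, capped at the image diagonal."""
    return min(diagonal, math.dist(x, projected))


def negative_log_likelihood(edge_pixels, projections, diagonal, lam=1.0):
    """Energy term -log of (8), up to an additive constant.

    edge_pixels[d]: ToF edge pixels assigned to dominant direction d.
    projections[d]: projected 3D edge points assigned to direction d.
    """
    energy = 0.0
    for d, points in projections.items():
        for p in points:
            nearest = min((truncated_distance(x, p, diagonal)
                           for x in edge_pixels.get(d, [])),
                          default=diagonal)
            energy += lam * nearest
    return energy


# Hypothetical data for one direction in a 160 x 120 image (diagonal 200).
edges = {0: [(10.0, 10.0), (50.0, 50.0)]}
projected = {0: [(13.0, 14.0), (500.0, 500.0)]}
energy = negative_log_likelihood(edges, projected, diagonal=200.0, lam=1.0)
```

The first point contributes its true distance of 5 pixels, while the outlier far outside the image contributes only the capped value of 200, which illustrates how the truncation limits the influence of outliers.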
The prior should be high if the distance of S to one of
the initial 24 similarity transformations $\tilde{S}_k$ is small. The
distance between two similarity transformations is
defined by three distances: the rotation distance $d_R(S, \tilde{S}_k)$
in degrees, the translation distance $d_t(S, \tilde{S}_k)$ in meters and
the relative scale difference $d_a(S, \tilde{S}_k)$. We model the prior
$$p(S) \stackrel{\text{def}}{=} \max_k \; e^{-\lambda_R d_R(S, \tilde{S}_k)} \, e^{-\lambda_t d_t(S, \tilde{S}_k)} \, e^{-\lambda_a d_a(S, \tilde{S}_k)} \tag{10}$$
as the product of three exponential distributions, where λR,
λt and λa are the parameters of these distributions.
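A sketch of the corresponding prior energy is given below. It assumes that the prior is governed by the closest hypothesis, so that the negative logarithm is the smallest weighted distance over all k. The rotation distance is taken as the angle of the relative rotation, the translation distance as the Euclidean distance and the scale difference as |a − a_k| / a_k. These concrete definitions and the tuple layout are assumptions, since the text only names the three distances. The default weights are the values reported in Section 5.1.

```python
import math


def rotation_angle_deg(Ra, Rb):
    """Angle in degrees of the relative rotation Ra^T Rb."""
    trace = sum(Ra[k][r] * Rb[k][r] for k in range(3) for r in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def prior_energy(S, hypotheses, lam_R=0.01, lam_t=0.025, lam_a=10.0):
    """-log p(S) up to a constant; S and each hypothesis are (a, R, t) tuples."""
    a, R, t = S

    def energy(hypothesis):
        a_k, R_k, t_k = hypothesis
        return (lam_R * rotation_angle_deg(R, R_k)
                + lam_t * math.dist(t, t_k)
                + lam_a * abs(a - a_k) / a_k)

    return min(energy(h) for h in hypotheses)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
R_z90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
origin = [0.0, 0.0, 0.0]
hypotheses = [(1.0, identity, origin), (1.0, R_z90, origin)]
# Candidate that matches the first hypothesis except for a 10 % larger scale.
energy = prior_energy((1.1, identity, origin), hypotheses)
```

In this example the first hypothesis yields an energy of about 1.0 from the scale term alone, whereas the second adds 0.01 · 90 for the rotation, so the first one determines the prior.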
In order to find the optimum of this MAP estimation,
we initialize an optimization at each of the initial similarity
transformations. The optimization is done using the down-
hill simplex method [13]. The similarity transformation
with the highest probability is the final MAP estimate.
5. Experiments and Results
5.1. Experimental Setup
In our experiments we use a PMDTechnologies
PMD[vision] 19K ToF camera (resolution 160 × 120) that
is mounted on a Directed Perception PTU-46-17.5 pan-tilt
unit (Figure 1, bottom right). The poses $R^f_i$, $t^f_i$ of the ToF
images are provided by the pan-tilt unit.
The CCD images are recorded with two statically
mounted AVT Pike cameras (resolution 1388 × 1038) and a
handheld Canon 500D camera (resolution 2352 × 1568). For
the calibration of the CCD images we use the Bundler soft-
ware of Snavely [16]. 3D reconstruction is done using the
PMVS software of Furukawa and Ponce [8]. For the ground
truth calibration between the two static cameras and the ToF
camera we use the calibration pattern based MultiCamera-
Calibration software of Schiller [14]. This software is also
used to estimate the intrinsic parameters of the ToF cam-
era. We use the point correspondences extracted from the
calibration pattern to calculate the reprojection error of our
calibration. The calibration errors presented in Section 5.2
are between the ToF camera and the two static cameras.
Our method is tested on a total of 12 calibrations using
different camera setups and two different rooms. Figure 5
shows some example images of the two rooms. The rooms
are typical office/laboratory environments. They differ in
size, inventory and amount of texture. Depending on the
camera setup, 90 to 160 images are recorded; about half of
these are ToF images.
Note that in none of our experiments do the ToF and
multi-camera 3D reconstructions cover the complete room or
exactly the same part of it. However, the 3D reconstructions
need to provide enough information to estimate the domi-
nant directions. This is why in our experiments at least two
walls of the room are included in the 3D reconstructions.
In our experiments we use 1005 bins for the normal his-
tograms and the parameter of the exponential distribution
(8) is set to λ = 1.0. The three parameters of the prior
(10) are set to λR = 0.01, λt = 0.025 and λa = 10.0.
The choice of these parameters is not