
Exploiting the Manhattan-world Assumption for Extrinsic Self-calibration of Multi-modal Sensor Networks

Marcel Brückner∗, Joachim Denzler
Chair for Computer Vision, Friedrich Schiller University of Jena
Ernst-Abbe-Platz 2, 07743 Jena, Germany
{marcel.brueckner, joachim.denzler}@uni-jena.de

Abstract

Many new applications are enabled by combining a multi-camera system with a Time-of-Flight (ToF) camera, which is able to simultaneously record intensity and depth images. Classical approaches for self-calibration of a multi-camera system fail to calibrate such a system due to the very different image modalities. In addition, the typical environments of multi-camera systems are man-made and consist primarily of low-textured objects. However, at the same time they satisfy the Manhattan-world assumption.

We formulate the multi-modal sensor network calibration as a Maximum a Posteriori (MAP) problem and solve it by minimizing the corresponding energy function. First we estimate two separate 3D reconstructions of the environment: one using the pan-tilt unit mounted ToF camera and one using the multi-camera system. We exploit the Manhattan-world assumption and estimate multiple initial calibration hypotheses by registering the three dominant orientations of planes. These hypotheses are used as prior knowledge for a subsequent MAP estimation that aims to align edges parallel to these dominant directions. To our knowledge, this is the first self-calibration approach that is able to calibrate a ToF camera with a multi-camera system. Quantitative experiments on real data demonstrate the high accuracy of our approach.

1. Introduction

Multi-camera systems can be found in many man-made environments. Various computer vision applications like object tracking or 3D reconstruction use such multi-camera systems. Most of these systems consist of classical CCD cameras recording 2D (color) images.
A calibrated multi-camera system can be used to extract 3D information from the camera images. However, this (wide baseline) 3D reconstruction is computationally expensive and works only for textured objects.

∗Marcel Brückner would like to thank the Carl Zeiss Foundation (Carl-Zeiss-Stiftung) for supporting his research.

Figure 1. A multi-sensor system consisting of a ToF camera (solid circle and bottom right) and several classical cameras (dashed circles). Each camera is mounted on a pan-tilt unit.

In recent years a new type of camera has become increasingly popular: the Time-of-Flight (ToF) camera [18] (Figure 1, bottom right). This type of camera uses modulated infrared light to simultaneously record intensity (grayscale) and depth images at a frequency of about 20 Hz. In contrast to stereo cameras, it is able to extract depth information even from untextured objects. Another advantage is that the depth is measured in metric units, which provides the correct scale for the 3D reconstruction. Drawbacks of these cameras are their low resolution (200 × 200 or lower) and their inability to record color information. The combination of a ToF camera with a classical multi-camera system overcomes the drawbacks of both. Many computer vision applications benefit from such a multi-sensor system (Figure 1). To this end, an accurate calibration between the multi-camera system and the ToF camera is necessary.

The state of the art for calibrating such multi-sensor systems consists of approaches that use a calibration pattern [11, 15, 19] or some other artificial calibration object [4, 10]. From a practical point of view, however, a pure self-calibration is most appealing. Self-calibration in this context means that no artificial landmarks or user interaction are necessary.

2011 IEEE International Conference on Computer Vision
978-1-4577-1102-2/11/$26.00 ©2011 IEEE

Figure 2. Intensity (left) and depth image (middle) recorded by a ToF camera.
Right: Same scene recorded by a CCD camera.

The cameras estimate their rotation and position only from the images they record.

The appearance of an object in the ToF images depends strongly on its material and color. Figure 2 shows an intensity and depth image recorded by a ToF camera and an image of the same scene recorded by a classical CCD camera. Note that the stripes on the sweater on the left are not visible in the ToF intensity image. Another important property is that the depth of dark objects, like the black folder in the middle, is measured incorrectly. Due to these very different image modalities, the extraction of point correspondences with state-of-the-art methods results in (almost) no correct point correspondences. Hence, point-correspondence-based approaches for multi-camera calibration [1, 2, 17] fail to calibrate the multi-sensor system.

Another difficulty for point-correspondence-based approaches is the fact that most multi-sensor systems are placed in man-made environments (e.g. building interiors), which primarily consist of low-textured objects. However, these environments also share the property that most of the surfaces are piecewise planar and aligned to three orthogonal dominant directions. Environments and objects with this property satisfy the so-called Manhattan-world assumption [5].

In this paper we present a method that exploits this assumption to create hypotheses for the calibration between a ToF camera and a calibrated multi-camera system. We then use these hypotheses to formulate the multi-modal sensor network calibration as a MAP problem and solve it by minimizing the corresponding energy function. We assume that the ToF camera is able to record its environment, e.g. by being mounted on a pan-tilt unit (Figure 1). To our knowledge, this is the first self-calibration approach that is able to calibrate a ToF camera with a multi-camera system.

The Manhattan-world assumption has also been exploited by Furukawa et al.
[6, 7] for dense 3D reconstruction of planar, non-textured surfaces like buildings or building interiors. Note that even though we adapt some of their ideas, we aim to solve a totally different problem. They use a calibrated multi-camera system to estimate a dense 3D reconstruction of planar, non-textured surfaces. In contrast, we aim to estimate the transformation between the coordinate systems of a ToF camera and a multi-camera system.

The remainder of this paper is structured as follows. We state the problem and give an overview of our method in Section 2. Section 3 shows how to estimate the initial hypotheses. The MAP estimation of the final calibration is described in Section 4. Section 5 presents our experiments and results. Conclusions are given in Section 6.

2. Problem Statement and Method Overview

The starting point of our approach is a calibrated multi-camera system [1, 2, 17] and a ToF camera. Each has its own coordinate system. We assume that the ToF camera is able to change its point of view to record images of its environment. In our experiments we mount the ToF camera on a pan-tilt unit, but it could also be mounted, e.g., on a mobile robot. The setup that we have in mind is a multi-sensor system consisting of pan-tilt unit mounted CCD and ToF cameras, similar to the one in Figure 1.

Normally, if one wants to add a camera to an already calibrated multi-camera system, the rotation $R$ and the translation $t$ need to be estimated. However, as the ToF camera is able to measure metric distances, we formulate the calibration as a similarity transformation from the coordinate system of the multi-camera system to that of the ToF camera. This allows us to correct the typically unknown scale of the multi-camera calibration [1, 17]. A similarity transformation

$$S \overset{\text{def}}{=} \begin{pmatrix} a^s R^s & t^s \\ \mathbf{0}^T & 1 \end{pmatrix} \tag{1}$$

consists of a rotation $R^s$, a translation $t^s$ and a scale factor $a^s$.
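As a minimal sketch of Eq. (1), the following Python snippet builds the 4 × 4 matrix $S$ from a scale, a rotation and a translation and applies it to a 3D point via homogeneous coordinates. The function names and the example values are illustrative assumptions and do not come from the original method description.

```python
def similarity_matrix(a, R, t):
    """Build the 4x4 matrix of Eq. (1): upper-left block a*R, last column t."""
    S = [[a * R[r][c] for c in range(3)] + [t[r]] for r in range(3)]
    S.append([0.0, 0.0, 0.0, 1.0])
    return S


def transform_point(S, X):
    """Apply S to a 3D point X given in inhomogeneous coordinates."""
    Xh = list(X) + [1.0]  # homogeneous coordinates
    Yh = [sum(S[r][c] * Xh[c] for c in range(4)) for r in range(4)]
    return [Yh[0] / Yh[3], Yh[1] / Yh[3], Yh[2] / Yh[3]]


# Illustrative values (assumed): 90 degree rotation about the z-axis,
# scale factor 2 and a shift of one unit along x.
R_z90 = [[0.0, -1.0, 0.0],
         [1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0]]
S = similarity_matrix(2.0, R_z90, [1.0, 0.0, 0.0])
Y = transform_point(S, [1.0, 0.0, 0.0])  # equals a * R * X + t
```

Keeping scale, rotation and translation in one matrix means that a chain of coordinate changes reduces to matrix products.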
Written in the matrix form of (1), the similarity transformation can be used to transform homogeneous 3D coordinates from one coordinate system to the other.

Our approach aims to estimate this similarity transformation. To this end we exploit the Manhattan-world assumption [5]. It assumes that most surfaces in the world surrounding the camera are piecewise planar and aligned to three orthogonal dominant directions. This assumption is satisfied for the typical environments of multi-camera systems, like building interiors or urban scenes.

Figure 3 gives a schematic overview of our proposed method. It consists of two steps. First we estimate two 3D reconstructions of the complete environment: one using the multi-camera system and another one using the ToF camera (Section 3.1). For each of these 3D reconstructions we estimate the three orthogonal dominant directions by computing a histogram over the surface normal directions (Section 3.2). Given these dominant directions we can calculate hypotheses for the similarity transformation between the two coordinate systems. In the second step we use these hypotheses as a prior for a MAP estimation (Section 4) that aims to align 3D points of the multi-camera system that lie on edges parallel to the dominant directions with edges in the ToF intensity images that are parallel to the same dominant direction.

Figure 3. Overview of the proposed method. Each step is explained in the indicated section.

Note that we use a superscript $f$ for all ToF camera related terms and a superscript $c$ for terms related to the multi-camera system throughout the paper.

3. Estimating the Initial Candidates

3.1. 3D Reconstruction of the Environment

The inputs of our approach are $n$ intensity images $\mathcal{I}^f \overset{\text{def}}{=} \{I^f_1, \dots, I^f_i, \dots, I^f_n\}$ and $n$ depth images $\mathcal{D}^f \overset{\text{def}}{=} \{D^f_1, \dots, D^f_i, \dots, D^f_n\}$ of the environment recorded by the ToF camera.
For each of these images the pinhole matrix $K^f_i$ and the relative pose, consisting of a rotation $R^f_i$ and a translation $t^f_i$, are assumed to be known. With these data it is possible to estimate a 3D point cloud. The position of a 3D point is calculated by

$$X^f \overset{\text{def}}{=} \frac{D^f_i(x^f)}{\left\| {K^f_i}^{-1} x^f \right\|_2} \, {R^f_i}^T {K^f_i}^{-1} x^f - {R^f_i}^T t^f_i , \tag{2}$$

where $D^f_i(x^f)$ is the measured depth in the $i$-th depth image at image coordinate $x^f$. This 3D point estimation is done for each image coordinate $x^f$ in each depth image $D^f_i \in \mathcal{D}^f$. We estimate the surface normal $n^f$ of each 3D point $X^f$ by local plane fitting.

From the multi-camera system we get $m$ images $\mathcal{I}^c = \{I^c_1, \dots, I^c_m\}$ with known relative poses and pinhole matrices. Given these data we estimate a 3D reconstruction using the multi-view stereo approach of Furukawa and Ponce [9]. This results in a set of 3D points $\mathcal{P}^c$. The approach also estimates the surface normal $n^c$ of each 3D point $X^c \in \mathcal{P}^c$.

The modalities of the two 3D reconstructions are very different. The images in Figure 3 give an impression of these. While the ToF reconstruction is very dense but quite noisy, the reconstruction using the multi-camera system is very accurate, but since 3D points can only be estimated near textured objects it is also quite sparse.

3.2. Estimating the Dominant Directions

The surfaces in a Manhattan-world are aligned to three orthogonal dominant directions. For each of our 3D reconstructions we want to estimate these directions. Similar to [6], we first build two histograms $h^c$ and $h^f$ of surface normal directions $n$ (with $\|n\|_2 = 1$) over a unit hemisphere. Since we are only interested in the directions of the surface normals, we do not need a histogram over a complete sphere. Hence the normal directions $-n$ and $n$ are mapped to the same histogram bin.

Note that the reconstructions of many realistic scenes contain only two orthogonal dominant directions (see the example histograms in Figure 3).
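The mapping of $-n$ and $n$ to the same bin can be sketched as follows. The regular azimuth/polar grid and all function names are assumptions made for illustration; the text above does not specify how the histogram bins are laid out.

```python
import math
from collections import Counter


def fold_to_hemisphere(n):
    """Return n or -n so that antipodal normals share one representative."""
    x, y, z = n
    flip = z < 0 or (z == 0 and (y < 0 or (y == 0 and x < 0)))
    if flip:
        return (0.0 - x, 0.0 - y, 0.0 - z)  # "0.0 - v" avoids negative zeros
    return (x + 0.0, y + 0.0, z + 0.0)      # "+ 0.0" turns -0.0 into 0.0


def hemisphere_bin(n, n_azimuth=36, n_polar=9):
    """Map a unit normal to an (azimuth, polar) bin on the upper hemisphere.

    The angular grid is an assumed binning scheme, not the one of the paper.
    """
    x, y, z = fold_to_hemisphere(n)
    azimuth = math.atan2(y, x) % (2.0 * math.pi)  # in [0, 2*pi)
    polar = math.acos(max(-1.0, min(1.0, z)))     # in [0, pi/2]
    a = min(int(azimuth / (2.0 * math.pi) * n_azimuth), n_azimuth - 1)
    p = min(int(polar / (math.pi / 2.0) * n_polar), n_polar - 1)
    return a, p


def normal_histogram(normals):
    """Count how many unit normals fall into each hemisphere bin."""
    return Counter(hemisphere_bin(n) for n in normals)


# Two antipodal normals and one orthogonal normal (illustrative input).
hist = normal_histogram([(0.0, 0.0, 1.0), (0.0, 0.0, -1.0), (1.0, 0.0, 0.0)])
```

A regular angular grid produces bins of unequal area near the pole, so a real implementation would probably prefer a more uniform tessellation of the hemisphere.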
Since often only two dominant directions are observed, we search the histograms for the two unit normals $\hat{n}_q, \hat{n}_r$ that satisfy

$$(\hat{n}_q, \hat{n}_r) = \operatorname*{argmax}_{n_q, n_r} \left( h(n_q) + h(n_r) \right) \quad \text{with} \quad n_q^T n_r = 0 . \tag{3}$$

These define the three orthogonal dominant directions $d_1 \overset{\text{def}}{=} \hat{n}_q$, $d_2 \overset{\text{def}}{=} \hat{n}_r$ and $d_3 \overset{\text{def}}{=} \hat{n}_q \times \hat{n}_r$. At the end of this procedure, we have the three dominant directions of the ToF camera reconstruction $\mathcal{V}^f \overset{\text{def}}{=} \{d^f_1, d^f_2, d^f_3\}$ and of the multi-camera reconstruction $\mathcal{V}^c \overset{\text{def}}{=} \{d^c_1, d^c_2, d^c_3\}$.

The two sets describe essentially the same directions, but estimated in different coordinate systems. There are exactly 24 rotations $\{\tilde{R}^s_1, \dots, \tilde{R}^s_k, \dots, \tilde{R}^s_{24}\}$ that align the three orthogonal directions of one set with the directions of the other set (also considering negative directions).

We now form a set of 24 initial similarity transformations $\{\tilde{S}_1, \dots, \tilde{S}_k, \dots, \tilde{S}_{24}\}$ using the rotations $\tilde{R}^s_k$ and calculating the translation and scale by minimizing the distance between the depth measured by the ToF camera and the depth of the transformed 3D point set $\mathcal{P}^c$:

$$\operatorname*{argmin}_{t^s, a^s} \sum_{X^c \in \mathcal{P}^c} \min_i \left| D^f_i\!\left( \pi^S_i(X^c) \right) - \left\| c^S_i(X^c) \right\|_2 \right| . \tag{4}$$

The function

$$c^S_i(X^c) \overset{\text{def}}{=} R^f_i \left( a^s R^s X^c + t^s \right) + t^f_i \tag{5}$$

transforms a 3D point $X^c$ from the coordinate system of the multi-camera system to the coordinate system of the $i$-th image of the ToF camera using the similarity transformation $S$. The function

$$\pi^S_i(X^c) \overset{\text{def}}{=} K^f_i \, c^S_i(X^c) \tag{6}$$

projects a 3D point $X^c$ from the multi-camera coordinate system into the $i$-th image of the ToF camera using the similarity transformation $S$. We use the downhill simplex method [13] for the optimization of (4).

One might assume that an energy minimization similar to (4) (optimizing also the rotation) should suffice to estimate the correct similarity transformation. Unfortunately this is not the case, due to the different modalities of the two 3D data sets. The 3D reconstruction obtained by the multi-camera system contains only few 3D points on planar surfaces; most points lie along object edges.
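The 24 candidate rotations mentioned above can be enumerated with signed permutation matrices. The sketch below is one possible construction, under the assumption that both direction sets are stored as the columns of right-handed orthonormal matrices (which holds when $d_3 = d_1 \times d_2$). All function names are illustrative.

```python
from itertools import permutations, product


def det3(M):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))


def matmul(A, B):
    """Product of two 3x3 matrices."""
    return [[sum(A[r][k] * B[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]


def transpose(M):
    return [list(col) for col in zip(*M)]


def signed_permutations():
    """All signed 3x3 permutation matrices with determinant +1."""
    result = []
    for perm in permutations(range(3)):
        for signs in product((1.0, -1.0), repeat=3):
            P = [[0.0] * 3 for _ in range(3)]
            for col, row in enumerate(perm):
                P[row][col] = signs[col]
            if det3(P) > 0:  # discard reflections
                result.append(P)
    return result


def candidate_rotations(Dc, Df):
    """Rotations R with R * Dc = Df * P, i.e. every direction of Dc is
    mapped onto plus or minus one of the directions of Df."""
    Dc_T = transpose(Dc)
    return [matmul(matmul(Df, P), Dc_T) for P in signed_permutations()]


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rotations = candidate_rotations(identity, identity)
```

Of the 48 signed permutations of three axes, half are reflections, which leaves the 24 proper rotations.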
At object edges, where most of the multi-camera 3D points lie, the depth measurement of ToF cameras is very inaccurate. The accuracy also suffers from a systematic depth measurement error of the ToF camera [11, 15]. Furthermore, the cost function of (4) is not reliable for deciding which of the similarity transformations $\tilde{S}_k$ is the correct one.

4. Estimating the Similarity Transformation

We avoid using the depth images of the ToF camera in our final step. Instead we use the 24 similarity transformation hypotheses as a prior for a Maximum a Posteriori (MAP) estimation. This MAP estimation aims to align 3D points of the multi-camera system that lie on edges parallel to the dominant directions with edges in the ToF intensity images that are parallel to the same dominant direction.

4.1. Edges Parallel to Dominant Directions

Similar to [6], we use the estimated dominant directions $\mathcal{V}^f$ and $\mathcal{V}^c$ to find edges in the camera images $\mathcal{I}^f$ and $\mathcal{I}^c$ that are parallel to one of these directions. We search the images for image coordinates lying on image edges [3] that are parallel to one of the dominant directions $d \in \mathcal{V}$. An image edge at image point $x$ is parallel to a dominant direction $d$ if it passes through $x$ and the vanishing point $v_d$ of the direction $d$. All the image points $x^f_{i,d}$ from the ToF intensity images that fulfill these constraints for the direction $d$ are stored in the set $\mathcal{J}^f_d$. For each of these image points the index $i$ of its image is known. The set $\mathcal{J}^f_{\mathcal{V}} \overset{\text{def}}{=} \{\mathcal{J}^f_{d_1}, \mathcal{J}^f_{d_2}, \mathcal{J}^f_{d_3}\}$ contains the image point sets of the three directions. Figure 4 shows a ToF intensity image and the extracted edges. Each edge direction is colored differently.

Figure 4. Intensity image of the ToF camera and the resulting edge image. Each color encodes a different dominant direction.

The same procedure is repeated for the images of the multi-camera system $\mathcal{I}^c$.
For these images, however, we do not store the image points that lie on an edge parallel to one of the dominant directions $d$; instead we store the 3D points that are projected onto one of these edges in the point set $\mathcal{P}^c_d$. The set $\mathcal{P}^c_{\mathcal{V}} \overset{\text{def}}{=} \{\mathcal{P}^c_{d_1}, \mathcal{P}^c_{d_2}, \mathcal{P}^c_{d_3}\}$ contains the 3D point sets of the three directions.

4.2. MAP Estimation

Given $\mathcal{J}^f_{\mathcal{V}}$ and $\mathcal{P}^c_{\mathcal{V}}$ we want to estimate the similarity transformation $S$. In order to increase the robustness against noise and outliers we formulate a Maximum a Posteriori (MAP) problem

$$\operatorname*{argmax}_S \; p\!\left( S \mid \mathcal{J}^f_{\mathcal{V}}, \mathcal{P}^c_{\mathcal{V}} \right) \sim p\!\left( \mathcal{J}^f_{\mathcal{V}}, \mathcal{P}^c_{\mathcal{V}} \mid S \right) p(S) , \tag{7}$$

where $p(\mathcal{J}^f_{\mathcal{V}}, \mathcal{P}^c_{\mathcal{V}} \mid S)$ is the likelihood and $p(S)$ is the prior. We define the likelihood as

$$p\!\left( \mathcal{J}^f_{\mathcal{V}}, \mathcal{P}^c_{\mathcal{V}} \mid S \right) \overset{\text{def}}{\sim} \prod_{d \in \mathcal{V}} \; \prod_{X^c_d \in \mathcal{P}^c_d} \; \max_{x^f_{i,d} \in \mathcal{J}^f_d} e^{-\lambda \, d_f\left( x^f_{i,d}, X^c_d \right)} , \tag{8}$$

where $\lambda$ is the parameter of the exponential distribution and

$$d_f\!\left( x^f_{i,d}, X^c_d \right) \overset{\text{def}}{=} \min\!\left( d_c, \left\| x^f_{i,d} - \pi^S_i(X^c_d) \right\|_2 \right) \tag{9}$$

is the minimum of the length of the image diagonal $d_c$ in pixels and the Euclidean distance between the projection of $X^c_d$ and the image point $x^f_{i,d}$, both lying on an edge parallel to the same dominant direction $d$. This distance function increases the robustness against outliers. The likelihood (8) is high if each 3D point is projected onto an image edge with the same dominant direction. Note that the assignment of the dominant directions is given by the rotation of the similarity transformation.

Figure 5. Example images of the two rooms used for the experiments: a low-textured office (left) and a laboratory with several occlusions and ambiguities (right).

The prior should be high if the distance of $S$ to one of the 24 initial similarity transformations $\tilde{S}_k$ is small. The distance between two similarity transformations is defined by three distances: the rotation distance $d_R(S, \tilde{S}_k)$ in degrees, the translation distance $d_t(S, \tilde{S}_k)$ in meters and the relative scale difference $d_a(S, \tilde{S}_k)$.
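The likelihood of (8) and (9) is convenient to evaluate in the log domain, because the maximum over $e^{-\lambda d_f}$ becomes a minimum over the truncated distances. The following sketch assumes a hypothetical callable `project(i, X)` that stands in for $\pi^S_i$ and already returns 2D pixel coordinates; the data layout and all names are illustrative, and charging the full image diagonal when a direction has no edge pixels is an additional assumption.

```python
import math


def truncated_distance(x, y, diag):
    """Eq. (9): Euclidean pixel distance, capped at the image diagonal."""
    return min(diag, math.hypot(x[0] - y[0], x[1] - y[1]))


def negative_log_likelihood(edge_pixels, points, project, lam, diag):
    """Negative logarithm of Eq. (8), up to an additive constant.

    edge_pixels[d]: list of (i, x) pairs, image index and 2D edge pixel.
    points[d]:      list of 3D points lying on edges of direction d.
    project(i, X):  assumed callable returning the 2D projection of X into
                    ToF image i under the current similarity transformation.
    """
    total = 0.0
    for d, pts in points.items():
        for X in pts:
            # max over exp(-lam * dist) equals exp(-lam * min dist)
            best = min((truncated_distance(x, project(i, X), diag)
                        for i, x in edge_pixels[d]), default=diag)
            total += lam * best
    return total


# Toy data (assumed): one direction, one image, orthographic "projection".
toy_project = lambda i, X: (X[0], X[1])
toy_edges = {0: [(0, (0.0, 0.0)), (0, (10.0, 0.0))]}
toy_points = {0: [(3.0, 4.0, 1.0), (10.0, 0.0, 1.0)]}
nll = negative_log_likelihood(toy_edges, toy_points, toy_project, 1.0, 200.0)
```

Capping the distance at the image diagonal bounds the cost that a single outlier point can contribute.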
We model the prior

$$p(S) \overset{\text{def}}{=} \min_k \; e^{-\lambda_R d_R(S, \tilde{S}_k)} \, e^{-\lambda_t d_t(S, \tilde{S}_k)} \, e^{-\lambda_a d_a(S, \tilde{S}_k)} \tag{10}$$

as the product of three exponential distributions, where $\lambda_R$, $\lambda_t$ and $\lambda_a$ are the parameters of these distributions.

In order to find the optimum of this MAP estimation, we initialize an optimization at each of the initial similarity transformations. The optimization is done using the downhill simplex method [13]. The similarity transformation with the highest probability is the final MAP estimate.

5. Experiments and Results

5.1. Experimental Setup

In our experiments we use a PMDTechnologies PMD[vision] 19K ToF camera (resolution 160 × 120) that is mounted on a Directed Perception PTU-46-17.5 pan-tilt unit (Figure 1, bottom right). The poses $R^f_i, t^f_i$ of these ToF images are provided by the pan-tilt unit.

The CCD images are recorded with two statically mounted AVT Pike cameras (resolution 1388 × 1038) and a handheld Canon 500D camera (resolution 2352 × 1568). For the calibration of the CCD images we use the Bundler software of Snavely [16]. 3D reconstruction is done using the PMVS software of Furukawa and Ponce [8]. For the ground truth calibration between the two static cameras and the ToF camera we use the calibration pattern based MultiCameraCalibration software of Schiller [14]. This software is also used to estimate the intrinsic parameters of the ToF camera. We use the point correspondences extracted from the calibration pattern to calculate the reprojection error of our calibration. The calibration errors presented in Section 5.2 are between the ToF camera and the two static cameras.

Our method is tested on a total of 12 calibrations using different camera setups and two different rooms. Figure 5 shows some example images of the two rooms. The rooms are typical office/laboratory environments. They differ in size, inventory and amount of texture. Depending on the camera setup, 90 to 160 images are recorded, about half of which are ToF images.
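The three distances that enter the prior (10) can be sketched as follows for a single hypothesis $\tilde{S}_k$. The representation of a similarity transformation as a (scale, rotation, translation) triple, the function names and the exact form of the relative scale difference are assumptions; the default weights are the values of $\lambda_R$, $\lambda_t$ and $\lambda_a$ reported for the experiments.

```python
import math


def rotation_distance_deg(R1, R2):
    """Angle in degrees of the relative rotation R1^T R2."""
    trace = sum(R1[k][r] * R2[k][r] for k in range(3) for r in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def weighted_distance(S, S_k, lam_R=0.01, lam_t=0.025, lam_a=10.0):
    """Negative log of one product term of Eq. (10) for hypothesis S_k.

    S and S_k are (scale, rotation, translation) triples. The relative scale
    difference |a - a_k| / a_k is an assumed definition.
    """
    a, R, t = S
    a_k, R_k, t_k = S_k
    d_R = rotation_distance_deg(R, R_k)                         # degrees
    d_t = math.sqrt(sum((u - v) ** 2 for u, v in zip(t, t_k)))  # meters
    d_a = abs(a - a_k) / a_k                                    # relative
    return lam_R * d_R + lam_t * d_t + lam_a * d_a


# Illustrative values (assumed): a hypothesis at the origin and an estimate
# that differs by 90 degrees, 5 m and 10 % in scale.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
R_z90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
S_hyp = (1.0, identity, [0.0, 0.0, 0.0])
S_est = (1.1, R_z90, [3.0, 4.0, 0.0])
cost = weighted_distance(S_est, S_hyp)
```

With these weights a 10 % scale change costs roughly as much as a 100 degree rotation, so the sketch constrains the scale far more tightly than the rotation.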
Note that in none of our experiments do the ToF and multi-camera 3D reconstructions cover the complete room or exactly the same part of it. However, the 3D reconstructions need to provide enough information to estimate the dominant directions. This is why at least two walls of the room are included in the 3D reconstructions in our experiments.

In our experiments we use 1005 bins for the normal histograms, and the parameter of the exponential distribution in (8) is set to $\lambda = 1.0$. The three parameters of the prior (10) are set to $\lambda_R = 0.01$, $\lambda_t = 0.025$ and $\lambda_a = 10.0$. The choice of these parameters is not
