
Univerza v Ljubljani

Fakulteta za računalništvo in informatiko

Peter Peer

Gradnja globinskih panoramskih slik

v realnem času z uporabo standardnih kamer

Doktorska disertacija

Mentor: prof. dr. Franc Solina

Ljubljana, 2003


University of Ljubljana

Faculty of Computer and Information Science

Peter Peer

Real Time Panoramic Depth Imaging Using Standard Cameras

Doctoral Dissertation

Supervisor: Prof. Dr. Franc Solina

Ljubljana, 2003


Real Time Panoramic Depth Imaging Using Standard Cameras

Peter Peer

Supervisor: Prof. Dr. Franc Solina

Abstract

Computer vision is a special kind of scientific challenge, as we are all users of our own vision systems. Our vision is definitely the source of the major part of the information we acquire and process each second. Stereo vision is perhaps an even greater challenge, since our own vision system is a stereo one and it performs a complex task, supplying us with 3D information about our surroundings in a very effective way.

Making machines see is a difficult problem. On one side we have the psychological aspects of human visual perception, which try to explain how visual information is processed in the human brain. On the other side we have technical solutions, which try to imitate human vision. Normally, it all starts with capturing digital images that store the basic information about the scene in a similar way to how humans see it.

But this information represents only the beginning of a difficult process. By itself it does not reveal to the machine the information about the objects on the scene, their color, distances etc. For humans, visual recognition is an easy task, but the processing methods of the human brain are still a mystery to us.

One part of human visual perception is estimating the distances to the objects on the scene. This information is also needed by robots if we want them to be completely autonomous.

In this dissertation we present a stereo panoramic depth imaging system.

The basic system is mosaic-based, which means that we use a single standard rotating camera and assemble the captured images into a multiperspective panoramic image. Due to the offset of the camera's optical center from the rotational center of the system we are able to capture the motion parallax effect, which enables the stereo reconstruction. The camera rotates on a circular path with the step defined by an angle equivalent to one pixel column of the captured image. To find the corresponding points on a stereo pair of panoramic images, the epipolar geometry needs to be determined. It can be shown that the epipolar geometry is very simple if we perform the reconstruction based on a symmetric pair of stereo panoramic images. We get a symmetric pair of stereo panoramic images when we take symmetric columns on the left and on the right side of the captured image center column.


This system, however, cannot generate a panoramic stereo pair in real time. That is why we have suggested a real time extension of the system, based on simultaneously using many standard cameras. We have not physically built the real time sensor, but we have performed simulations to establish the quality of its results.

Both systems have been comprehensively analysed and compared. The analyses revealed a number of interesting properties of the systems. Given the accuracy of the basic system, we can definitely use it for autonomous robot localization and navigation tasks. The assumptions made in the real time extension of the basic system have proved to be correct, but the accuracy of the new sensor generally deteriorates in comparison to the basic sensor.

Generally speaking, the dissertation can serve as a guide to the design of panoramic depth imaging sensors and related issues.

Key words

computer vision, stereo vision, reconstruction, depth image, multiperspective panoramic image, mosaicing, motion parallax effect, standard camera, real time, depth sensor


Contents

Abstract v

List of Figures x

List of Tables xii

1 Introduction 1

1.1 Description of the narrow scientific area . . . 2

1.2 Description of the problem . . . 3

1.3 Structure of the dissertation . . . 4

2 Basic System 5

2.1 Introduction . . . 6

2.1.1 Motivation . . . 6

2.1.2 Basics about the system . . . 6

2.1.3 Structure of the chapter . . . 7

2.2 Panoramic cameras . . . 7

2.3 Related work . . . 9

2.4 System geometry . . . 11

2.5 Epipolar geometry . . . 15

2.6 Stereo reconstruction . . . 17

2.7 Analysis of the system’s capabilities . . . 20

2.7.1 Time complexity of panoramic image creation . . . 20

2.7.2 Influence of parameters r, ϕ and θ0 on the reconstruction accuracy . . . 21

2.7.3 Constraining the search space on the epipolar line . . . 22

2.7.4 Meaning of the one-pixel error in estimation of the angle θ . . . 25

2.7.5 Definition of the maximal reliable depth value . . . 28

2.7.6 Contribution of the vertical reconstruction . . . 29

2.7.7 Influence of using different cameras . . . 30

2.8 Experimental results . . . 33


2.8.1 Influence of different ϕ values on the reconstruction accuracy — The quantitative evaluation . . . 36

2.8.2 Time analysis of the stereo reconstruction process . . . 37

2.8.3 Influence of different ϕ values on the reconstruction accuracy — The qualitative evaluation . . . 40

2.8.4 Influence of addressing the vertical reconstruction . . . 48

2.8.5 Influence of different θ0 values on the reconstruction accuracy . . . 49

2.8.6 Linear versus non-linear model for estimation of angle ϕ . . . 50

2.8.7 Repeatability of results — Different room . . . 51

2.8.8 Repeatability of results — Different cameras . . . 52

2.8.9 Possibility of systematic error presence in the estimation of r . . . 53

2.8.10 Influence of lens distortion presence on the reconstruction accuracy . . . 55

2.9 Summary . . . 59

3 Real Time Extension 60

3.1 Introduction . . . 61

3.1.1 Motivation . . . 61

3.1.2 Structure of the chapter . . . 62

3.2 Building panoramic images from wider stripes . . . 62

3.2.1 Property of using stripes . . . 63

3.3 Achieving real time . . . 72

3.4 Stereo reconstruction from stripes . . . 74

3.5 Epipolar constraint . . . 76

3.6 Experimental results . . . 77

3.6.1 Reconstruction from non-symmetric pairs of panoramas . . . 78

3.6.2 Reconstruction from stripe panoramas . . . 80

3.6.3 Reconstruction from stripe panoramas — Different room . . 82

3.6.4 Reconstruction from stripe panoramas — Different cameras . . . 84

3.7 Summary . . . 86

4 Conclusion 87

4.1 Dissertation summary . . . 88

4.2 Conclusions . . . 88

4.3 Contributions to science . . . 91

4.4 Future work . . . 92


Appendix 94

A Extended Abstract in Slovenian Language 94

A.1 Uvod . . . 95

A.1.1 Opis ožjega znanstvenega področja . . . 95

A.1.2 Opis problema . . . 95

A.1.3 Zgradba disertacije . . . 95

A.2 Osnovni sistem . . . 95

A.2.1 Uvod . . . 95

A.2.2 Sorodna dela . . . 96

A.2.3 Geometrija sistema . . . 97

A.2.4 Epipolarna geometrija . . . 97

A.2.5 Stereo rekonstrukcija . . . 97

A.2.6 Analiza zmogljivosti sistema . . . 98

A.2.7 Eksperimentalni rezultati . . . 100

A.2.8 Zaključek . . . 100

A.3 Delovanje v realnem času . . . 100

A.3.1 Uvod . . . 100

A.3.2 Gradnja panoramskih slik iz širših trakov . . . 102

A.3.3 Doseganje realnega časa . . . 102

A.3.4 Stereo rekonstrukcija iz trakov . . . 103

A.3.5 Epipolarna omejitev . . . 104

A.3.6 Eksperimentalni rezultati . . . 104

A.3.7 Zaključek . . . 104

A.4 Sklep . . . 105

A.4.1 Prispevki k znanosti . . . 105

Bibliography 107

Acknowledgement 111

Statement 112

List of Figures

2.1 Hardware part of our system. . . 7

2.2 Geometry of our system. . . 12

2.3 The viewing cylinder. . . 13

2.4 Two symmetric pairs of panoramic images. . . 14

2.5 The vertical reconstruction. . . 18

2.6 The relation for determining the radius r. . . 21

2.7 We can effectively constrain the search space on the epipolar line. . . 23

2.8 Constraining the search space on the epipolar line in case of 2ϕ = 29.9625°. . . 24

2.9 The dependence of depth l on the angle θ. . . 25

2.10 The number of possible depth estimates is proportional to the angle ϕ. . . 26

2.11 The contribution of the vertical reconstruction. . . 30

2.12 Different cameras characterized by the horizontal view angle α give panoramic images with different horizontal resolution Wpan. . . 31

2.13 The plan of the reconstructed room. . . 40

2.14 Some stereo reconstruction results. . . 41

2.15 A ground-plan of the reconstructed scene (#1.1). . . 43

2.16 A ground-plan of the reconstructed scene (#1.2). . . 44

2.17 A ground-plan of the reconstructed scene (#2.1). . . 45

2.18 A ground-plan of the reconstructed scene (#2.2). . . 46

2.19 The influence of the lens distortion. . . 55

2.20 The camera model gained after the calibration process. . . 56

3.1 Panoramic images gained using different stripe widths. . . 63

3.2 Property of using stripes: not all the scene points are captured. . . . 64

3.3 The formation of the panoramic image from 14-pixel-column stripes with respect to the light rays. . . 66

3.4 The detail of the drawing presented in Fig. 3.3. . . 67

3.5 The formation of the panoramic image from one-pixel-column stripes with respect to the light rays. . . 68

3.6 The formation of the panoramic image from 14-pixel-column stripes taken from the center of the captured image. . . 69


3.7 The formation of the panoramic image from 14-pixel-column stripes at decreased r. . . 70

3.8 The formation of the panoramic image from 14-pixel-column stripes without the motion parallax effect. . . 71

3.9 The drawing of a real time sensor. . . 73

3.10 The geometric relations for stereo reconstruction from stripe panoramas. . . 75

3.11 The linear relation between AVG% and Ws. . . 81

3.12 The influence of the lens distortion when the panoramas are built from stripes. . . 85

A.1 Tloris geometrije sistema in strojni deli sistema. . . 96

A.2 Lastnost uporabe trakov: določene točke scene niso zajete. . . 102

A.3 Geometrijske relacije za stereo rekonstrukcijo iz trakov. . . 103

List of Tables

2.1 Comparison of different types of panoramic cameras. . . 8

2.2 The one-pixel error ∆l in estimation of the angle θ. . . 27

2.3 The one-pixel error ∆l in estimation of the angle θ for the minimal and maximal possible depth estimation. . . 29

2.4 The one-pixel error ∆l in estimation of the angle θ for different cameras (α). . . 32

2.5 The comparison of results for two different values of ϕ. . . 36

2.6 The comparison of the stereo reconstruction times. . . 37

2.7 The comparison of results without and with addressing the vertical reconstruction. . . 48

2.8 The comparison of results for two different values of θ0. . . 49

2.9 The comparison of results for two different values of ϕ — Linear versus non-linear model for estimation of angle ϕ. . . 50

2.10 Repeatability of results — Different room. . . 51

2.11 Repeatability of results — Different cameras. . . 52

2.12 The comparison of results before and after the optimization of parameter r. . . 54

2.13 The comparison of results obtained without and with the lens distortion correction. . . 58

3.1 The results obtained by processing symmetric and non-symmetric pairs of panoramas. . . 79

3.2 The results obtained with four different widths of the stripes (Ws). . . 80

3.3 Reconstruction from stripe panoramas — Different room. . . 82

3.4 Reconstruction from stripe panoramas — Different cameras. . . 84

A.1 Ilustracija napake (∆l) za en slikovni element pri oceni kota θ. . . 99

A.2 Rezultati eksperimentov (#1). . . 101

A.3 Rezultati eksperimentov (#2). . . 105


To my family


Chapter 1

Introduction


1.1 Description of the narrow scientific area

Over the last 30 years, one of the most interesting areas of research has been building machines that would complement human life with the help of artificial intelligence. This area is full of different challenges, and one among them is to imitate human vision.

By analogy, this discipline is called computer vision. The basic idea is to discover the properties of the 3D world by using only 2D information from an image. A lot of effort has been put into this area of research, which eventually led to progress in areas such as object recognition, picture understanding and 3D reconstruction.

3D reconstruction is an important area, since it enables tasks like modeling, visualization, CAD (Computer Aided Design) model construction, localization, navigation etc. Generally speaking, the 3D shape of objects and environments can be captured in three different ways: using a CMM (Coordinate Measuring Machine), using the Time-of-Flight method, or with the help of optical devices. The latter approach is the most widely used, partly because of favorable price and safety conditions. On the other hand, such a system is the only real computer vision system for 3D reconstruction, since it is based merely on input images. With the help of optical scanners (range finders) we gather 3D data about the object surface by processing 2D images captured with standard cameras. Optical scanners can be divided into two main groups.

Active scanners project light onto the object, which assures effective and reliable 3D information. In many cases active scanners use structured light for reconstruction purposes, i.e. a light pattern is projected onto the object. A disadvantage of such a scanner is that the images have to be taken under strict laboratory conditions, like scanning in complete darkness. On the other side we have passive scanners, which estimate the distance to the object based only on the textural information present in the images, captured completely without contact (physical contact, contact of a laser beam or structured light). Traditional approaches for acquiring depth from such images are based on stereo methods. Under the term stereo reconstruction we understand the generation of depth images from two or more captured images. In this case the reconstruction suffers if the reconstructed object is not well textured. The result of the reconstruction is a depth image. Each depth image stores the estimates of the distances to the objects from one viewpoint.

Currently, we are living in an era of vision research when some shape-from-X problems, for example stereo, have been almost completely solved, and furthermore are being used in industry. Other shape-from-X problems in their original formulations, like shape from motion, have proved to be very difficult, therefore some special cases are being tackled. The remaining shape-from-X problems, like shape from shading and shape from texture, have become less interesting and, what is even more important, less applicable [50].

We wish the input images to have the property that the same points and lines are visible in all images of the scene, which facilitates stereo reconstruction. This is the property of panoramic cameras. Standard cameras have a limited field of view, which is usually smaller than the human field of view. Because of that, people have always tried to generate images with a wider field of view, up to full 360 degree panoramas.

As presented in the next chapter, one way to build panoramic images is by taking one column out of each captured image and mosaicing the columns together. Such panoramic images are called multiperspective panoramic images. The crucial property of two or more multiperspective panoramic images is that they capture the information about the motion parallax effect, since the columns forming the panoramic images are captured from different perspectives.

The main problem we would like to solve in this dissertation is to analyze and determine the properties and the efficiency of a panoramic depth imaging system based on multiperspective panoramic images, and to see whether the results can be used for robot localization and navigation. Only standard equipment is applied in the system construction process. A real time extension of the basic system is simulated to determine the efficiency of the new system in comparison to the basic one.

When we talk about real time in this dissertation, we do not mean it so much from the processing power point of view (to increase the speed or reduce the time), but more from the accuracy point of view. Namely, nowadays there are many practical solutions for increasing processing power, but before we invest in a real system, we have to know whether the accuracy of the system is satisfactory. Nevertheless, the processing power is also briefly addressed in the dissertation.

1.2 Description of the problem

For effective depth reconstruction we need high resolution images. As described in the next chapter, only mosaic-based procedures give high resolution results. Thus these procedures represent a good starting point for the development of our system.

First of all, we are interested in where the efficiency borders of a stereo panoramic depth imaging system based on multiperspective panoramic images lie.

The basic system consists of only one standard camera, which is offset from the system's rotational center. It is rotated around the rotational center in angular steps corresponding to one vertical pixel column of the captured standard image. In this way the best possible accuracy of the depth reconstruction process is achieved, and consequently the results of this system serve as a ground truth for subsequent comparisons.

Therefore the focus of the first part of the dissertation (Chapter 2) is on the exposed research issues. In it we also prove that we can effectively constrain the search space on the epipolar line, that the confidence in the estimated depth is variable and that the system can be used for depth reconstruction of small rooms (having in mind an application to autonomous mobile robot navigation). The relationship between different system parameters is also presented.


The disadvantage of the mosaicing procedures lies in the time needed to capture many images. Therefore we suggest a real time extension of the system based on using many standard cameras simultaneously. Only real time execution ensures the possibility to reconstruct dynamic scenes, which is in many cases of great importance for autonomous mobile robot navigation.

The second part of the dissertation (Chapter 3) explains this solution in detail. It reveals a new panoramic depth sensor and its properties, analyzes the number of standard cameras needed to ensure a good compromise between the speed and the accuracy of the new sensor (the panoramic images are now built from wider stripes and not from only one column of the captured image), and compares the new results to the results of the basic sensor, described in the first part of the dissertation.

As mentioned, the suggested real time panoramic depth sensor can only be built if we use an adequate number of cameras simultaneously. We geometrically prove that such a sensor can be built out of standard cameras, which are available on the market. We have not physically built it, but we have performed simulations to establish the quality of its results.

The in-depth analysis of such a mosaicing approach should reveal whether the system could be used for real time panoramic depth imaging and consequently for autonomous mobile robot navigation.

Since our final goal is to determine the usability of our system for mobile robot navigation, we perform all the tests on real world images, so that the results reflect the applicability of the implemented algorithms in the real world.

1.3 Structure of the dissertation

A basic introduction to the field of computer vision, related to the title of the dissertation, and the problem statement are given in this chapter. In the next chapter (part one of the dissertation) the basic system, based on a camera mounted on a rotational arm so that the optical center of the camera is offset from the vertical axis of rotation, is introduced and evaluated. The descriptions of different panoramic cameras and of the related work are also part of that chapter. In Chapter 3 (the second part of the dissertation) we suggest a real time extension of the basic system, reveal its properties and evaluate its effectiveness, also in comparison to the basic system. The summary of the dissertation is given in the last chapter, along with the conclusions, the contributions to science and the ideas for future work. We end the dissertation with the extended summary in the Slovenian language, which is given in the appendix.


Chapter 2

Basic System


2.1 Introduction

2.1.1 Motivation

Standard cameras have a limited field of view, which is usually smaller than the human field of view. Because of that, people have always tried to generate images with a wider field of view, up to a full 360 degree panorama [16].

Under the term stereo reconstruction we understand the generation of depth images from two or more captured images. A depth image is an image that stores distances to points on the scene. The stereo reconstruction procedure is based on relations between points and lines on the scene and in images of the scene. If we want to get a linear solution of the reconstruction procedure, then the images can interact with the procedure in pairs, triplets or quadruplets, and the relations are named according to the number of images as the epipolar constraint, the trifocal constraint or the quadrifocal constraint [22]. We want the images to have the property that the same points and lines are visible in all images of the scene, which facilitates stereo reconstruction. This is the property of panoramic cameras and it presents our fundamental motivation. The stereo reconstruction in this dissertation is done from two symmetric multiperspective panoramic images.

In this work we address only the issue of how to enlarge the horizontal field of view of images. The vertical field of view of panoramic images can be enlarged by using wide angle camera lenses [44], by using mirrors [25,32] or by moving the camera in the vertical direction as well as in the horizontal direction [16].

If we tried to build two panoramic images simultaneously by using two standard cameras mounted on two rotational robotic arms, we would have problems with non-static scenes. Clearly, one camera would capture the motion of the other camera. So we have decided to use only one camera. In the first part of our work we develop a mosaic-based panoramic depth imaging system using only one standard camera and analyze its performance to see if it can be used for robot localization and navigation in a room.

2.1.2 Basics about the system

In Fig. 2.1 the hardware part of our system can be seen: a color camera is mounted on a rotational robotic arm so that the optical center of the camera is offset from the vertical axis of rotation. The camera is looking outward from the system's rotational center. Panoramic images are generated by repeatedly shifting the rotational arm by an angle which corresponds to a single pixel column of the captured image. By assembling the center columns of these images, we get a mosaic panoramic image. One of the drawbacks of mosaic-based panoramic imaging is that dynamic scenes are not well captured.

Figure 2.1: Hardware part of our system.

It can be shown that the epipolar geometry is very simple if we perform the reconstruction based on a symmetric pair of stereo panoramic images. We get a symmetric pair of stereo panoramic images when we take symmetric columns on the left and on the right hand side of the captured image center column. These columns are assembled into a mosaic stereo pair. The column from the left hand side of the captured image is mosaiced into the right eye panoramic image and the column from the right hand side of the captured image is mosaiced into the left eye panoramic image.
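
To make the mosaicing step concrete, the following minimal Python sketch (our own illustration, not code from the dissertation) assembles a symmetric stereo pair from the sequence of captured images:

    import numpy as np

    def build_symmetric_pair(images, offset):
        # images: one H x W (or H x W x 3) array per rotational step of the arm
        # offset: distance in columns from the image center column; the two
        #         columns at center +/- offset define the angle 2*phi
        center = images[0].shape[1] // 2
        # right-hand column -> left eye panorama, left-hand column -> right eye
        left_eye = np.stack([im[:, center + offset] for im in images], axis=1)
        right_eye = np.stack([im[:, center - offset] for im in images], axis=1)
        return left_eye, right_eye

Each rotational step thus contributes exactly one pixel column to each panorama, so a full panorama requires roughly as many captured images as the panorama is pixels wide.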

2.1.3 Structure of the chapter

In the next section we compare different panoramic cameras with emphasis on mosaicing. In Sec. 2.3 we give an overview of the related work and briefly present the contribution of our work to the discussed subject. Sec. 2.4 describes the geometry of our system, Sec. 2.5 is devoted to the epipolar geometry and Sec. 2.6 describes the procedure of stereo reconstruction. The focus of this chapter is on the analysis of the system's capabilities, given in Sec. 2.7. In Sec. 2.8 we present experimental results. At the very end of this chapter we summarize the main conclusions of the first part of the dissertation.

2.2 Panoramic cameras

Every panoramic camera belongs to one of three main groups of panoramic cameras: catadioptric cameras, dioptric cameras and cameras with moving parts. The basic property of a catadioptric camera is that it consists of a mirror (or mirrors [18]) and a camera. The camera captures the image which is reflected from the mirror. A dioptric camera uses a special type of lens, e.g. a fish-eye lens, which increases the size of the camera's field of view. A panoramic image can also be generated by moving the camera along some path and mosaicing together the images captured at different locations on the path.

Type of panoramic camera | Number of images | Resolution of panoramic images | Real time | References
catadioptric camera      | 1                | low                            | yes       | [15,18,25,28,29,33,52]
dioptric camera          | 1                | low                            | yes       | [3,7]
moving parts             | a lot            | high                           | no        | [1,8,9,10,12,13,14,16,17,19,20,21,23,25,26,27,32,33,35,36,39,43,44]

Table 2.1: Comparison of different types of panoramic cameras with respect to the number of standard images needed to build a panoramic image, the resolution of the panoramic images and the capability of building a panoramic image in real time.

The comparison of different types of panoramic cameras is shown in Tab. 2.1.

All types of panoramic cameras enable 3D reconstruction. A camera has a single viewpoint or projection center if all light rays forming the image intersect in a single point. Cameras with this property are also called central cameras. Rays forming a non-central image do not pass through a single point, but rather intersect a line [10] or a conic [25,39,40,49], do not intersect at all [46], or are bound by other constraints suiting practical or theoretical demands [13,17].

Mosaic-based procedures can be marked as non-central (we do not deal with a single center of projection); they do not execute in real time, but they give high resolution results. High resolution images enable effective depth reconstruction, since by increasing the resolution the number of possible depth estimates also increases. Thus mosaicing is not appropriate for capturing dynamic scenes and consequently not for the reconstruction of dynamic scenes. The systems described in [1,16] are exceptions, because the light rays forming the mosaic panoramic image intersect in the rotational center of the system; these two systems are central systems. The system presented in [30,41,42] could also be treated as a mosaic-based procedure, though its concept for generating panoramic depth images is very different from ours. Because that system is more related to the topic of the second part of the dissertation (Chapter 3), it is presented in Sec. 3.1.1.

Dioptric panoramic cameras with wide angle lenses can be marked as non-central [29]; they build a panoramic image in real time and they give low resolution results. Cameras with wide angle lenses are appropriate for fast capturing of panoramic images and processing of the captured images, e.g. for detection of obstacles or for localization of a mobile robot, but are less appropriate for reconstruction. Please note that we are talking about panoramic cameras here. Generally speaking, dioptric cameras can be central.

Only some of the catadioptric cameras have a single viewpoint. Cameras with a mirror (or mirrors) work in real time and give low resolution results. Only two mirror shapes, namely hyperbolic and parabolic mirrors, can be used to construct a central catadioptric panoramic camera [29,52]. Such panoramic cameras are appropriate for low resolution reconstruction of dynamic scenes and for motion estimation. It is also true that only for panoramic systems with hyperbolic and parabolic mirrors can the epipolar geometry be simply generalized [29,52].

Since dioptric and catadioptric cameras give low resolution results, they are more appropriate for use with view-based systems [59] and less for use with reconstruction systems.

Of course, combinations of different cameras exist: e.g. a combination of the mosaicing camera and the catadioptric camera [25,32] or a combination of the mosaicing camera and the wide angle camera [44]. Their main purpose is to enlarge the camera's vertical field of view.

2.3 Related work

We can generate panoramic images either with the help of special panoramic cameras or with the help of a standard camera and mosaicing of standard images into panoramic images. If we want to generate mosaic 360 degree panoramic images, we have to move the camera on a closed path, which is in most cases a circle.

One of the best known commercial packages for creating mosaic panoramic images is QTVR (QuickTime Virtual Reality). It works on the principle of sewing together a number of standard images captured while rotating the camera [8]. Peleg et al. [27] introduced a method for the creation of mosaiced panoramic images from standard images captured with a handheld video camera. A similar method was suggested by Szeliski and Shum [12], which also does not strictly constrain the camera path but assumes that a great motion parallax effect is not present. All methods mentioned so far are used only for visualization purposes, since the authors did not try to reconstruct the scene.

The crossed-slits (X-slits) projection [53,56,61] uses a similar mosaicing technique with one important difference: the mosaiced strips are sampled from varying positions in the captured images. This makes the generation of virtual walkthroughs possible, i.e. we are again dealing with visualization with the help of image-based rendering or new view synthesis.

Ishiguro et al. [1] suggested a method which enables scene reconstruction. They used a standard camera rotating on a circular path. The scene is reconstructed by mosaicing a panoramic image together from the central columns of the captured images and moving the system to another location, where the task of mosaicing is repeated. The two created panoramic images are then used as the input to a stereo reconstruction procedure. The depth of an object was first estimated using projections in two images captured at different locations of the camera on the camera path. But since their primary goal was to create a global map of the room, they preferred to move the system, attached to the robot, about the room. Clearly, by moving the robot to another location and producing the second panoramic image of a stereo pair in this location, rather than producing a stereo pair in a single location, they enlarged the disparity of the system. But this decision also has a few drawbacks: we cannot estimate the depth for all points on the scene, the time of capturing a stereo pair is longer and we have to search for the corresponding points on sinusoidal epipolar curves. The depth was then estimated from two panoramic images taken at two different locations of the robot in the room.

Peleg and Ben-Ezra [19,26] introduced a method for the creation of stereo panoramic images without actually computing the 3D structure — the depth effect is created in the viewer's brain.

In [20], Shum and Szeliski described two methods used for the creation of panoramic depth images, which use standard procedures for stereo reconstruction. Both methods are based on moving the camera on a circular path. Panoramic images are built by taking one column out of each captured image and mosaicing the columns together. The authors call such panoramic images multiperspective panoramic images. The crucial property of two or more multiperspective panoramic images is that they capture the information about the motion parallax effect, since the columns forming the panoramic images are captured from different perspectives. The authors use such panoramic images as the input to a stereo reconstruction procedure. In [21], Shum et al. proposed a non-central camera called an omnivergent sensor in order to reconstruct scenes with minimal reconstruction error. This sensor is equivalent to the sensor presented in this chapter.

However, multiperspective panoramic images are not something new to the vision community [20]: they are a special case of the multiperspective panoramic images for cel animation [13], a special case of the crossed-slits (X-slits) projection [53,56,61], and they are very similar to the images generated by the multiple-center-of-projection procedure [17], by the manifold projection procedure [27] and by the circular projection procedure [19,26]. The principle of constructing multiperspective panoramic images is also very similar to the linear pushbroom camera principle for creating panoramic images [10].

The papers closest to our work [1,20,21] seem to lack two things: a comprehensive analysis of 1) the system's capabilities and 2) the corresponding points search using the epipolar constraint. Therefore, the focus of this chapter is on these two issues. While in [1] the authors searched for corresponding points by tracking the feature from the column building the first panorama to the column building the second panorama, the authors in [20] used an upgraded plane sweep stereo procedure. A key idea behind the approach in [21] is that it enables optimizing the input to traditional computer vision algorithms for searching the correspondences in order to produce superior results.

Further details about the related work are given in the following sections, where we discuss the specifics of our system.

2.4 System geometry

Let us begin this section with a description of how the stereo panoramic pair is generated. From the captured images on the camera's circular path we always take only two columns, which are equally distant from the middle column. We assume that the middle column we are referring to in this work is the middle column of the captured image, if not mentioned otherwise. The column on the right hand side of the captured image is then mosaiced into the left eye panoramic image and the column on the left hand side of the captured image is mosaiced into the right eye panoramic image. So, we are building each panoramic image from just a single pixel column of each captured image. Thus, we get a symmetric pair of stereo panoramic images, which yields a reconstruction with optimal characteristics (simple epipolar geometry and minimal reconstruction error) [21].

The geometry of our system for creating multiperspective panoramic images is shown in Fig. 2.2. The panoramic images are then used as the input to create panoramic depth images. Point C denotes the system's rotational center, around which the camera is rotated. The offset of the camera's optical center from the rotational center C is denoted by r, describing the radius of the circular path of the camera. The camera is looking outward from the rotational center. The optical center of the camera is marked with O. The column of pixels that is sewn into the panoramic image contains the projection of point P on the scene. The distance from point P to point C is the depth l, while the distance from point P to point O is denoted by d. Further, θ is the angle between the line defined by points C and O and the line defined by points C and P. In the panoramic image the horizontal axis represents the path of the camera. The axis is spanned by µ and defined by point C, a starting point O0, where we start capturing the panoramic image, and the current point O. ϕ denotes the angle between the line defined by point O and the middle column of pixels of the image captured by the physical camera looking outward from the rotational center (the latter column contains the projection of the point Q), and the line defined by point O and the column of pixels that will be mosaiced into the panoramic image (the latter column contains the projection of the point P). The angle ϕ can be thought of as a reduction of the camera's horizontal view angle α.

The geometry of capturing multiperspective panoramic images can be described with a pair of parameters (r, ϕ). By increasing (decreasing) each of them, we increase (decrease) the baseline (2r0 [39], r0 = r · sin ϕ, Fig. 2.2) of our stereo system.

Wei et al. [43] proposed an approach to solving the parameter (r, ϕ) determination problem for a symmetric stereo panoramic camera.



Figure 2.2: Geometry of our system for constructing multiperspective panoramic images. Note that a ground-plan is presented. The optical axis of the camera is kept horizontal.

The image acquisition parameters (r, ϕ) are calculated based on (subjectively) given parameters: the nearest and the furthest distances of the region of interest, the height of the region of interest and the width of the angular disparity interval. They conclude that neither the parameter r nor ϕ can satisfactorily match the application requirements on its own, and they report that a general study of the relations among the parameters is in progress, as they have discovered certain exceptions in experiments that require further research.

The system in Fig. 2.2 is obviously non-central, since the light rays forming the panoramic image do not intersect in one point called the viewpoint, but instead are tangent (for ϕ ≠ 0) to a cylinder with radius r0, called the viewing cylinder (Fig. 2.3). Thus, we are dealing with panoramic images formed by a projection from a number of viewpoints. This means that a captured point on the scene is seen in the panoramic image from one viewpoint only. This is why the panoramic images captured in this way are called multiperspective panoramic images.

For stereo reconstruction we need two images. If we look at only one circle on the viewing cylinder (Fig. 2.2), then we can conclude that our system is equivalent to a system with two cameras.


Figure 2.3: All the light rays forming the panoramic image are tangent to the viewing cylinder.

In our case, two virtual cameras are rotating on a circular path, i.e. the viewing circle, with radius r0. The optical axis of a virtual camera is always tangent to the viewing circle. The panoramic image is generated from only one pixel from the middle column of each image captured by a virtual camera. This pixel is determined by the light ray which describes the projection of a scene point onto the physical camera image plane. If we observe a point P on the scene, we see that the two virtual cameras which see this point form a traditional stereo system of converging cameras.

Obviously, a symmetric pair of panoramic images used in the stereo reconstruction process could also be captured with a bunch of cameras rotating on a circular path with radius r0, where the optical axis of each camera is tangent to the circular path (Fig. 2.3).

Two images differing in the angle of rotation of the physical camera setup (for example, the two image planes marked in Fig. 2.2) are used to simulate a bunch of virtual cameras on the viewing cylinder. Each column of the panoramic image is obtained from a different position of the physical camera on the circular path. In Fig. 2.4 we present two symmetric pairs of panoramic images.

To automatically register captured images directly from the knowledge of the camera's viewing direction, the camera lens' horizontal view angle α and vertical view angle β are required. If we know this information, we can calculate the resolution of one angular degree, i.e. we can calculate how many columns and rows fall within an angle of one degree. The horizontal view angle is especially important in our case, since we move the rotational arm only around its vertical axis.

2ϕ = 29.9625°

2ϕ = 3.6125°

Figure 2.4: Two symmetric pairs of panoramic images generated using different values of the angle ϕ. In Sec. 2.7.1 we explain where these values of the angle ϕ come from. Each symmetric pair of panoramic images comprises the motion parallax effect. This fact enables the stereo reconstruction.

To calculate these two parameters, we use an algorithm described in [16]. It is designed to work with cameras whose zoom settings and other internal camera parameters are unknown. The algorithm is based on the mechanical accuracy of the rotational arm. The basic step of our rotational arm corresponds to an angle of 0.0514285°. In general, this means that if we tried to turn the rotational arm through 360 degrees, we would perform 7000 steps. Unfortunately, the rotational arm that we use cannot turn a full 360 degrees around its vertical axis. The basic idea of the algorithm is to calculate the translation dx (in pixels) between two images captured while the camera is rotated by a known angle dγ in the horizontal direction. Since we know the exact angle by which we move the camera, we can calculate the horizontal view angle of the camera:

α = (W / dx) · dγ , (2.1)

where W is the width of the captured image in pixels and dγ is the known rotation angle.

The major drawback of this method is that it relies on the accuracy of the rotational arm. Because of that, we rechecked the values of the view angles by calibrating the camera using a static camera and a checkerboard pattern [11,31,54]. The input into the calibration procedure is a set of images with a varying position of the pattern in each image. The results obtained were very similar, though the second method should be more reliable, as it reveals more information about the camera model and also uses a sub-pixel accuracy procedure. The latter calibration estimates the focal length, the principal point, the skew coefficient and distortions, to name just the most important parameters for us. It also reveals the errors of all estimated parameters. If we assume that the principal point is in the middle of the captured image, we can calculate the horizontal view angle of the camera from the estimated parameters:

α = 2 arctan( (W/2) / f ) , (2.2)

where f is the estimated focal length.
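
As a numeric cross-check of the two relations, a minimal sketch (our own illustration; angles in degrees, focal length in pixels):

    import math

    def alpha_from_rotation(W, dx, d_gamma):
        # Eq. (2.1): dx is the pixel translation observed for a rotation d_gamma
        return (W / dx) * d_gamma

    def alpha_from_calibration(W, f):
        # Eq. (2.2): f is the calibrated focal length; the principal point is
        # assumed to be in the middle of the captured image
        return 2.0 * math.degrees(math.atan((W / 2.0) / f))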

Distortion parameters are also important, because we also investigate the influence of distortion on the system's results.

In any case, now that we know the value of α, we can calculate the resolution of one angular degree, x0:

x0 = W / α .

This equation enables us to calculate the width of the stripe Ws that will be mosaiced into the panoramic image when the rotational arm moves by an angle θ0:

Ws = x0 · θ0 . (2.3)

From the above equation we can also calculate the angle by which the rotational arm has to move if the stripe is only one pixel column wide.
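
For the first of the cameras listed below (α = 34°, W = 160 pixels) these relations give, as a worked numeric check (our own illustration, no new measurements):

    W, alpha = 160, 34.0            # image width [pixels], horizontal view angle [deg]
    x0 = W / alpha                  # ~4.71 pixel columns per angular degree
    theta0 = alpha / W              # 0.2125 deg per one-pixel column (inverse of Eq. (2.3))
    Ws = x0 * 0.205714              # ~0.97 columns for the arm's actual step angle
    two_phi = (alpha / W) * 141     # Eq. (2.6): 29.9625 deg for columns 141 pixels apart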

We used three different cameras in the experiments:

a camera with the horizontal view angle α = 34° and the vertical view angle β = 25°,

a camera with the horizontal view angle α = 39.72° and the vertical view angle β = 30.54°,

a camera with the horizontal view angle α = 16.53° and the vertical view angle β = 12.55°.

In the process of the panoramic image construction we did not vary these two parameters. From here on, the first camera is used in the calculations and the experiments, if not stated differently.

2.5 Epipolar geometry

Searching for the corresponding points in two images is a difficult problem. Generally speaking, the corresponding point can be anywhere in the second image. That is why we would like to constrain the search space as much as possible. Using the epipolar constraint we reduce the search space from 2D to 1D, i.e. to an epipolar line [4]. In Sec. 2.7.3 we prove that in our system we can effectively reduce the search space even on the epipolar line.

In this section we only illustrate the procedure of the proof that the epipolar lines of the symmetric pair of panoramic images are image rows. This statement is true for our system geometry. For the proof see [20,23,35,51].

The proof in [23] is based on the radius r0 of the viewing cylinder (Figs. 2.2 and 2.3). We can express r0 in terms of the known parameters r and ϕ as:

r0 = r · sin ϕ .

We carry out the proof in three steps: first, we have to derive the projection equation for the line camera, then we have to write the projection equation for a multiperspective panoramic image and, in the final step, we prove the property of the epipolar lines for the case of a symmetric pair of panoramic images. In the first step, we are interested in how a point on the scene is projected onto the camera's image plane [4], which in our case is only one pixel column wide, since we are dealing with a line camera. In the second step, we have to write the relations between the different notations of a point on the scene and of the projection of this point on the panoramic image: the notation of the scene point in Euclidean coordinates of the world coordinate system and in cylindrical coordinates of the world coordinate system, and the notation of the projected point in angular coordinates of the (2D) panoramic image coordinate system and in pixel coordinates of the (2D) panoramic image coordinate system.

When we know the relations between the above-mentioned coordinate systems, we can write the equation for the projection of scene points onto the cylindrical image plane of the panorama. Based on the properties of the angular coordinates of the panoramic image coordinate system, we can in the third step show that the epipolar lines of the symmetric pair of panoramic images are actually rows of the panoramic images. The basic idea for the last step of the proof is as follows: if we are given an image point in one panoramic image, we can express the optical ray defined by this point and the optical center of the camera in the 3D world coordinate system. If we project this optical ray, described in the world coordinate system, onto the second panoramic image, we get the epipolar line corresponding to the given image point in the first panoramic image. After introducing the proper relations valid for the symmetric case into the obtained equation, our hypothesis is confirmed.

The same result can be found in [20], where the authors proved the property of the symmetric pair of panoramic images by directly investigating the presence of the vertical motion parallax effect in panoramic images captured from the same rotational center. The generalization to the non-symmetric case, for the camera looking inward and outward, can be found in [51]. An even more general case, in some respects, where the panoramic images can be captured from different rotational centers, is discussed in [35].

It was shown that the notion of the epipolar geometry, well known for both central perspective cameras [4,22,34] and central catadioptric cameras [28,29,52], can be generalized to some non-central cameras [37,40,46,49]. The epipolar surfaces then extend from planes to double-ruled quadrics: planes, rotational hyperboloids and hyperbolic paraboloids.

2.6 Stereo reconstruction

Let us go back to Fig. 2.2. Using the trigonometric relations evident from the sketch, we can write the equation for the depth estimate l of a point P on the scene. By the basic law of sines for triangles, we have:

r / sin(ϕ − θ) = d / sin θ = l / sin(180° − ϕ) . (2.4)

From this equation we can express the depth estimate l as:

l = r · sin(180° − ϕ) / sin(ϕ − θ) = r · sin ϕ / sin(ϕ − θ) . (2.5)
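
Spelled out in LaTeX, the step from Eq. (2.4) to Eq. (2.5) is a single rearrangement (our own explicit restatement):

    \[
      \frac{r}{\sin(\varphi-\theta)} \;=\; \frac{l}{\sin(180^\circ-\varphi)}
      \quad\Longrightarrow\quad
      l \;=\; r\,\frac{\sin(180^\circ-\varphi)}{\sin(\varphi-\theta)}
        \;=\; r\,\frac{\sin\varphi}{\sin(\varphi-\theta)},
    \]
    since $\sin(180^\circ-\varphi)=\sin\varphi$.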

Eq. (2.5) implies that we can estimate the depth l only if we know three parameters: r, ϕ and θ. r is given. The angle ϕ can be calculated on the basis of the camera's horizontal view angle α (Eq. (2.1)) as:

2ϕ = (α / W) · W′ , (2.6)

where W is the width of the captured image in pixels and W′ is the width of the captured image between the columns forming the symmetric pair of panoramic images, also given in pixels. To calculate the angle θ, we have to find the corresponding points in the panoramic images. Our system works by moving the camera by the angle corresponding to one pixel column of the captured image. If we denote this angle by θ0, we can express the angle θ as:

θ = dx · θ0 / 2 , (2.7)

where dx is the absolute value of the difference between the image coordinates of the corresponding points along the horizontal axis x of the panoramic images.
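
Putting Eqs. (2.5)–(2.7) together, the whole depth computation fits in a few lines; this is our own illustrative sketch, using parameter values quoted later in the chapter (r = 300 mm, θ0 = 0.205714°):

    import math

    def depth(r, two_phi_deg, dx, theta0_deg):
        # Eqs. (2.5)-(2.7): depth l of a scene point from its disparity dx
        phi = math.radians(two_phi_deg / 2.0)
        theta = math.radians(dx * theta0_deg / 2.0)        # Eq. (2.7)
        return r * math.sin(phi) / math.sin(phi - theta)   # Eq. (2.5)

    # depth(300, 29.9625, 1, 0.205714) ~ 302 mm and
    # depth(300, 3.6125, 1, 0.205714) ~ 318 mm, matching the minimal
    # depth estimates l_min quoted in Sec. 2.7.3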

Note that Eq. (2.5) does not contain the focal length f explicitly, but since the relationships between α and f on one side (Eq. (2.2)) and between α and ϕ on the other side (Eq. (2.6)) exist, ϕ also depends upon f (the two models for estimating the angle ϕ, Eqs. (2.6) and (2.8), are discussed in Sec. 2.7.2):

ϕ = arctan( (W′/2) / f ) . (2.8)

Eq. (2.5) estimates the distance l to the perpendicular projection of the scene point P onto the plane defined by the camera's circular (planar) path.


Figure 2.5: Important relations between the system parameters for addressing the vertical reconstruction.

The projection of the scene point P is marked with P′ in Fig. 2.5. Since this estimate is an approximation of the real l, we have to improve it by addressing the vertical reconstruction, i.e. by incorporating the vertical view angle β into Eq. (2.5).

Let us adopt the following notation to introduce the influence of β on the estimation of l: if a variable l or d depends on α only, we mark it as l(α) or d(α) (until now, these variables were marked simply l and d), but if a variable l or d depends on both α and β, we mark it as l(α, β) or d(α, β). According to Fig. 2.5, the distance to the point P on the scene can be calculated as:

l(α, β) = √( l(α)² + Y² ) = √( l(α)² + ( l(α) · tan ω2 )² ) .

Because the value of ω2 is unknown, we have to express it in terms of known parameters. We can do that, since Y can also be written as:

Y = d(α) · tan ω1 .

We can calculate ω1 similarly to the way we calculated ϕ (Eqs. (2.6) and (2.8)):

2ω1 = (β / H) · H1 or ω1 = arctan( (H1/2) / f ) ,


where H is the height of the captured image in pixels and H1 is the height of the captured image between the image row containing the projection of the scene point P and the symmetric row on the other side of the middle row, also given in pixels. d(α) follows from Eq. (2.4):

d(α) = l(α) · sin θ / sin ϕ .

Now we can write the equation for l(α, β) as:

d(α) = l(α)·sinθ sinϕ . Now,we can write the equation forl(α, β) as:

l(α, β) = √( l(α)² + ( l(α) · (sin θ / sin ϕ) · tan ω1 )² ) . (2.9)

From now on, l = l(α), and when l(α, β) is used, this is explicitly stated.

The influence of addressing the vertical reconstruction on the reconstruction accuracy is discussed in Secs. 2.7.6 and 2.8.4.
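
A direct transcription of Eq. (2.9), using the same illustrative Python conventions as the depth sketch above, would be:

    import math

    def depth_vertical(l_alpha, two_phi_deg, dx, theta0_deg, omega1_deg):
        # Eq. (2.9): refine l(alpha) with the vertical angle omega_1
        phi = math.radians(two_phi_deg / 2.0)
        theta = math.radians(dx * theta0_deg / 2.0)
        y = (l_alpha * math.sin(theta) / math.sin(phi)
             * math.tan(math.radians(omega1_deg)))
        return math.hypot(l_alpha, y)   # sqrt(l(alpha)^2 + Y^2)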


2.7 Analysis of the system’s capabilities

2.7.1 Time complexity of panoramic image creation

The biggest disadvantage of our system is that it cannot produce panoramic images in real time, since we create them stepwise by rotating the camera by a very small angle. Because of the mechanical vibrations of the system, we also have to ensure that an image is captured only when the system is completely still. The time that the system needs to create a panoramic image is much too long to allow it to work in real time.

In a single circle around the system's vertical axis our system constructs 11 panoramic images: 5 symmetric pairs and a panoramic image from the middle columns of the captured images. It captures and saves 1501 images with a resolution of 160×120 pixels, where the radius is r = 30 cm and the shift angle is θ0 = 0.205714°. We have chosen the resolution of 160×120 pixels because it represents a good compromise between the overall time complexity of the system and its accuracy, as is shown in the following sections. We cannot capture 360°/θ0 images because of the limitation of the rotational arm. Namely, the rotational arm cannot turn a full 360 degrees around its vertical axis.

The middle column of the captured image was in our case the 80th column. The distances between the columns building up the symmetric pairs of panoramic images were 141, 125, 89, 53 and 17 columns. These numbers include the two columns building up each pair. Consequently, the values of the angle 2ϕ (Eq. (2.6)) are 29.9625° (141 columns), 26.5625° (125 columns), 18.9125° (89 columns), 11.2625° (53 columns) and 3.6125° (17 columns), respectively. (Here we used the camera with the horizontal view angle α = 34°.)

The acquisition process takes a little over 15 minutes on a 350 MHz Intel PII PC.

The steps of the acquisition process are as follows:

1. Move the rotational arm to its initial position.

2. Capture and save the image.

3. Contribute image parts to the panoramic images.

4. Move the arm to the new position.

5. Check in a loop whether the arm is already in the new position. The communication between the program and the arm is written to a file for debugging purposes.

After the program exits the loop, it waits for 300 ms in order to stabilize the arm in the new position.

6. Repeat steps 2 to 5 until the last image is captured.

7. When the last image has been captured, contribute its image parts to the panoramic images and save them.


We could achieve faster execution, since our code is not optimized. For example, we did not optimize the waiting time (300 ms) after the arm reaches a new position. No computations are done in parallel.
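
As a rough consistency check of the reported acquisition time (our own back-of-the-envelope arithmetic, not a measurement):

    n_images = 1501
    total_s = 15.5 * 60                   # "a little over 15 minutes"
    per_image_s = total_s / n_images      # ~0.62 s per capture-contribute-move cycle
    outside_settle_s = per_image_s - 0.3  # ~0.32 s left for capture, save and move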

2.7.2 Influence of parameters r, ϕ and θ0 on the reconstruction accuracy

In order to estimate the depth as precisely as possible, the parameters involved in the calculation also have to be estimated precisely. In this section we present the methods used for the estimation of the parameters r, ϕ and θ0.

θ0 denotes the angle, corresponding to one pixel column of the captured image, by which we rotate the camera. It can be calculated from Eq. (2.3):

θ0 = α / W . (2.10)

For α = 34° and W = 160 pixels we get θ0 = 0.2125°. On the other hand, we know that the accuracy of our rotational arm is ε = 0.0514285°, so the best possible approximate value is θ0 = 0.205714°. Since each column in the panoramic image in reality describes the latter angle θ0, we always use in the calculations the value θ0 = n · ε, n ∈ IN, which is closest to the result obtained from Eq. (2.10). The experiment in Sec. 2.8.5 confirms that this decision is correct. To discriminate between the two values, let us mark them as θ0(α) (Eq. (2.10)) and θ0(ε) (the estimate based on the accuracy of our rotational arm). We use these marks from now on; where only θ0 is given, θ0 = θ0(ε).
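
In other words, θ0(α) is snapped to the nearest integer multiple of the arm's basic step ε; as a short check (our own illustration):

    eps = 0.0514285                  # basic step of the rotational arm [degrees]
    theta0_alpha = 34.0 / 160        # Eq. (2.10): 0.2125 degrees
    n = round(theta0_alpha / eps)    # n = 4
    theta0_eps = n * eps             # 0.205714 degrees, used in all calculations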


Figure 2.6: The relation between the parameters which are important for determining the radius r.

r represents the distance between the rotational center of the system and the optical center of the camera. Since the exact position of the optical center is normally not known (it is not given by the manufacturer), we have to estimate its position. Optical firms with their special equipment would do the best job, but since this has not been an option for us, we have used a simple method, which has proved quite useful (Fig. 2.6): first the camera's horizontal view angle α has been estimated. Then we have captured a few images of the mm grid paper from known distances di, measured from one point on the camera to the paper. The optical axis has been assumed to be perpendicular to the paper surface. From each image we have read the visible width Wi of the paper in mm and used all the now known values (α, di and Wi) to estimate the distance d from the paper to the optical center by manually drawing a geometrically precise relation between the parameters. More distances di have been used to check the consistency of all estimates. In the end, the position of the optical center has been calculated as an average over all estimated values. Because we know the distances di and d, we also know the position of the optical center with respect to the point on the camera from which we have measured the distances di. Finally, we can measure the distance r. Nevertheless, this is a rough estimate of the optical center position, but it can be optimized, as shown in the experiment in Sec. 2.8.9.

ϕ determines the column of each captured image which is mosaiced into the panoramic image. The two models for estimating the angle ϕ (Eqs. (2.6) and (2.8)) differ from one another: the first one is linear, while the second one is not. But since we use cameras with the maximal horizontal view angle α = 39.72°, the biggest possible difference between the two models is only 0.3137° (at the point where the ratio W′/W = 91/160). In the experiments we use such values of ϕ that the difference is very small, i.e. the biggest difference is lower than 0.1°. The experiment in Sec. 2.8.6 shows that we obtain slightly better results with the linear model for a given (estimated) set of parameters. This is why the linear model was used in all the other experiments.

We discuss the angle θ0 and the radius r in relation to the one-pixel error in the estimation of the angle ϕ at the end of Sec. 2.7.4.

2.7.3 Constraining the search space on the epipolar line

Knowing that the width of the panoramic image is much bigger than the width of the captured image, we would have to search for a corresponding point along a very long epipolar line (Fig. 2.7a). Therefore we would like to constrain the search space on the epipolar line as much as possible. This means that the stereo reconstruction procedure executes faster. A side effect is also an increased confidence in the estimated depth.

From Eq. (2.5) we can derive two conclusions which nicely constrain the search space:

1. Theoretically, the minimal possible depth estimate is lmin = r. This holds for θ = 0. However, it is impossible in practice, since the same point on the scene cannot be seen in the column that will be mosaiced in the panorama for the left eye and at the same time in the column that will be mosaiced in the panorama for the right eye. If we observe the horizontal axis of the panoramic image with respect to the direction of the rotation, we can see that every point on the scene that is shown in both panoramic images (Fig. 2.4) is first imaged in the panorama for the left eye and then in the panorama for the right eye. Therefore, we have to wait until the point imaged in the column building up the left eye panorama moves in time to the column building up the right eye panorama. If θ0 denotes the angle by which the camera is shifted, then 2θmin = θ0. In consequence, we have to make at least one basic shift of the camera to enable a scene point projected in the right column of the captured image forming the left eye panorama to be seen in the left column of the captured image forming the right eye panorama.

Figure 2.7: We can effectively constrain the search space on the epipolar line: a) unconstrained length of the epipolar line: 1501 pixels; b) constrained length of the epipolar line: 145 pixels (2ϕ = 29.9625°); c) constrained length of the epipolar line: 17 pixels (2ϕ = 3.6125°).

Based on this fact, we can search for the corresponding point in the right eye panorama starting from the horizontal image coordinate x + 2θmin/θ0 = x + 1 onward, where x is the horizontal image coordinate of the point in the left eye panorama for which we are searching for the corresponding point. We get the value +1 since the shift for the angle θ0 describes the shift of the camera for a single column of the captured image.

In our system, the minimal possible depth estimate lmin depends on the value of the angle ϕ:

lmin(2ϕ = 29.9625°) = 302 mm
...
lmin(2ϕ = 3.6125°) = 318 mm.
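The quoted values of lmin can be reproduced numerically. The sketch below assumes that Eq. (2.5), which is given earlier in the text and not repeated here, has the form l = r · sin ϕ / sin(ϕ − θ); this form is our assumption, but it agrees with both values above:

```python
import math

def depth(r_mm, phi_deg, theta_deg):
    # Assumed form of Eq. (2.5): l = r * sin(phi) / sin(phi - theta)
    phi, theta = math.radians(phi_deg), math.radians(theta_deg)
    return r_mm * math.sin(phi) / math.sin(phi - theta)

r = 300.0                  # radius of the camera path (mm)
theta0 = 0.205714          # one-column rotation angle (degrees)
theta_min = theta0 / 2.0   # from 2 * theta_min = theta0

for two_phi in (29.9625, 3.6125):
    l_min = depth(r, two_phi / 2.0, theta_min)
    print(two_phi, round(l_min))   # prints 302 and 318
```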


Figure 2.8: Constraining the search space on the epipolar line in the case of 2ϕ = 29.9625°. In the left eye panorama (top image) we have denoted the point for which we are searching the corresponding point with a green cross. In the right eye panorama (bottom image) we have used green color to mark the part of the epipolar line on which the corresponding point must lie. The best corresponding point is marked with a red cross. With blue crosses we have marked a number of points which represented the temporary best corresponding point before the point with the maximal correlation was actually found.

2. Theoretically, the depth estimate is not constrained upwards, but from Eq. (2.5) it is evident that the denominator must be non-zero. Practically, this means that for the maximal possible depth estimate lmax the difference ϕ − θmax must lie in the interval (0, θ0/2). We can write this fact as: θmax = n · θ0/2, where n = ϕ div (θ0/2) and ϕ mod (θ0/2) ≠ 0.

If we write the constraint for the last point which can be a corresponding point on the epipolar line, in analogy with the case of determining the starting point, we have to search for the corresponding point in the right eye panorama up to and including the horizontal image coordinate x + 2θmax/θ0 = x + n. Here x is the horizontal image coordinate of the point in the left eye panorama for which we are searching for the corresponding point.

Equivalently, as in the case of the minimal possible depth estimate lmin, the maximal possible depth estimate lmax also depends upon the value of the angle ϕ:

lmax(2ϕ = 29.9625°) = 54687 mm
...
lmax(2ϕ = 3.6125°) = 86686 mm.

In the following sections we show that we cannot trust the depth estimates near the last point of the epipolar line search space, but we have proven that we can effectively constrain the search space.

To illustrate the use of the specified constraints on real data, let us present the following example describing the working process of our system: while the width of the panorama is 1501 pixels, when searching for a corresponding point we have to check only ϕ div (θ0/2) = 145 pixels in the case of 2ϕ = 29.9625° (Figs. 2.7b and 2.8) and only 17 pixels in the case of 2ϕ = 3.6125° (Fig. 2.7c).
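The constrained search range itself is a one-line computation. A sketch with assumed integer pixel coordinates follows; it reproduces the values 145 and 17 quoted above:

```python
def epipolar_search_range(x, phi_deg, theta0_deg):
    """Inclusive pixel range [x + 1, x + n] on the epipolar line in the
    right eye panorama for the point at column x in the left eye panorama."""
    n = int(phi_deg // (theta0_deg / 2.0))   # n = phi div (theta0 / 2)
    return x + 1, x + n

print(epipolar_search_range(0, 29.9625 / 2.0, 0.205714))  # (1, 145)
print(epipolar_search_range(0, 3.6125 / 2.0, 0.205714))   # (1, 17)
```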

From the last paragraph we could conclude that the stereo reconstruction procedure is much faster for a smaller angle ϕ. However, in the next section we show that a smaller angle ϕ unfortunately also has a negative property.

2.7.4 Meaning of the one-pixel error in estimation of the angle θ

Figure 2.9: The dependence of depth l on angle θ (Eq. (2.5); r = 30 cm and two different values of ϕ are used): a) 2ϕ = 29.9625°, b) 2ϕ = 3.6125°. To visualize the one-pixel error in the estimation of the angle θ, we have marked the interval of width θ0/2 = 0.102857° between the vertical lines near the third point.

Let us first define what we mean by the term one-pixel error. As the images are discrete, we would like to know the value of the error in the depth estimation if we miss the right corresponding point by only one pixel. And we would like to have this information for various values of the angle ϕ.
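Using the same assumed form of Eq. (2.5) as in the sketch in Sec. 2.7.3, the one-pixel error can be probed numerically: a one-pixel miss shifts θ by θ0/2, and the resulting difference between the two depth estimates grows quickly as θ approaches ϕ:

```python
import math

def depth(r_mm, phi_deg, theta_deg):
    # Assumed form of Eq. (2.5): l = r * sin(phi) / sin(phi - theta)
    phi, theta = math.radians(phi_deg), math.radians(theta_deg)
    return r_mm * math.sin(phi) / math.sin(phi - theta)

r, theta0 = 300.0, 0.205714
phi = 29.9625 / 2.0                       # using 2*phi = 29.9625 degrees
for n in (10, 100, 140):                  # disparity in pixels
    theta = n * theta0 / 2.0
    one_pixel_error = depth(r, phi, theta + theta0 / 2.0) - depth(r, phi, theta)
    print(n, round(depth(r, phi, theta)), round(one_pixel_error))
```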
