SIAM J Imaging Sci. Author manuscript; available in PMC 2010 July 6.

Published in final edited form as:

SIAM J Imaging Sci. 2010 March 3; 3(1): 110–132.

doi: 10.1137/080741653

PMCID: PMC2897186

NIHMSID: NIHMS160917

Romeil Sandhu: rsandhu@gatech.edu; Anthony Yezzi: anthony.yezzi@ece.gatech.edu; Allen Tannenbaum: tannenba@ece.gatech.edu


In this work, we present an approach to jointly segment a rigid object in a two-dimensional (2D) image and estimate its three-dimensional (3D) pose, using the knowledge of a 3D model. We naturally couple the two processes together into a shape optimization problem and minimize a unique energy functional through a variational approach. Our methodology differs from the standard monocular 3D pose estimation algorithms since it does not rely on local image features. Instead, we use global image statistics to drive the pose estimation process. This confers a satisfying level of robustness to noise and initialization for our algorithm and bypasses the need to establish correspondences between image and object features. Moreover, our methodology possesses the typical qualities of region-based active contour techniques with shape priors, such as robustness to occlusions or missing information, without the need to evolve an infinite dimensional curve. Another novelty of the proposed contribution is to use a unique 3D model surface of the object, instead of learning a large collection of 2D shapes to accommodate the diverse aspects that a 3D object can take when imaged by a camera. Experimental results on both synthetic and real images are provided, which highlight the robust performance of the technique in challenging tracking and segmentation applications.

Two-dimensional (2D) image segmentation and 2D-3D pose estimation are key tasks for numerous computer vision applications and have received a great deal of attention in the past few years. These two fundamental techniques are usually studied separately in the literature. In this work, we combine both approaches in a single variational framework. To appreciate the contribution of this work, we recall some of the results and specifics of both fields.

2D-3D pose estimation aims at determining the pose of a 3D object relative to a calibrated camera from a single or a collection of 2D images. By knowing the mapping between the world coordinates and image coordinates from the camera calibration matrix, and after establishing correspondences between 2D features in the image and their 3D counterparts on the model, it is then possible to solve for the pose transformation parameters (from a set of equations that express these correspondences). The literature concerned with 3D pose estimation is very large, and a complete survey is beyond the scope of this paper. However, most methods can be distinguished by the type of *local* image features used to establish correspondences, such as points [1], lines or segments [2, 3], multipart curve segments [4], or complete contours [5, 6]. Segmentation consists of separating an object from the background in an image. The geometric active contour (GAC) framework, in which a curve is evolved continuously to capture the boundaries of an object, has proved to be quite successful at performing this task. Originally, the method focused on extracting local image features such as edges to perform segmentation; see [7, 8] and the references therein. However, edge-based techniques can suffer from the typical drawbacks that arise from using local image features: high sensitivity to noise or missing information, and a multitude of local minima that result in poor segmentations. Region-based approaches, which use global image statistics inside and outside the contour, were shown to drastically improve the robustness of segmentation results [9, 10, 11, 12]. These techniques are able to deal with various statistics of the object and background such as distinct mean intensities [10], Gaussian distributions [11, 12], or intensity histograms [13, 14, 15], as well as a wide variety of photometric descriptors such as grayscale values, color, or texture [16]. 
Further improvement of the GAC approach consists of learning the shape of objects and constraining the contour evolution to adopt familiar shapes to make up for poor segmentation results obtained in the presence of noise, clutter, or occlusion or when the statistics of the object and background are difficult to distinguish (see, e.g., [17, 18, 19, 20]).

Our goal is to combine the strengths of both techniques (and to try to avoid some of their typical weaknesses) in order to both robustly segment 2D images and estimate the pose of an arbitrary 3D object whose shape is known.

In particular, we use a region-based approach to continuously drive the pose estimation process. This global approach avoids using local image features and, hence, addresses two shortcomings that typically arise from doing so in many 2D-3D pose estimation algorithms. First, finding the correspondence between local features in the image and on the model is a nontrivial task, due, for instance, to their viewpoint dependency—no local correspondences need to be found in our global approach. Second, local image features may not even exist or can be difficult to detect in a reliable and robust fashion in the presence of noise, clutter, or occlusion. Furthermore, simplifying assumptions usually need to be made on the class of shapes that a 2D-3D pose estimation technique can handle. Many approaches are limited to relatively simple shapes that can be described using geometric primitives such as corners, lines, circles, or cylinders. Recent work has focused on free-form objects, which admit a manageable parametric description as in [5]. However, even this type of algebraic approach can become unmanageable for objects of arbitrary and complex shape. Our approach can deal with rigid objects of *arbitrary* shape, represented by a 3D level set [21] or a 3D cloud of points (see Figure 1).

Next, a major shortcoming of the GAC framework using shape priors is that 2D shapes are usually learned to segment 2D images. Hence, a large collection of 2D shapes needs to be learned in order to represent the wide variation in aspect that most natural 3D objects take when projected onto the 2D image plane. Our region-based approach benefits from the knowledge of the object shape that is compactly described by a *unique* 3D model. Acquisition of 3D models can be readily accomplished using range scans [22] or structure from motion approaches [23]. In addition, and in contrast to the GAC framework, the proposed method does not involve the evolution of an infinite dimensional contour to perform segmentation but only solves for the finite dimensional pose parameters (as is common for 2D-3D pose estimation approaches). This results in a much simplified framework that avoids dealing with problems such as infinite dimensional curve representation, evolution, and regularization.

In this paper, we expand the method presented in [24]. Our technique exploits many ideas from recent variational approaches that address the problem of structure from motion and stereo reconstruction from multiple cameras [25, 26, 23]. Originally, the authors in [26, 23] presented a method for reconstructing the 3D shape of an object from multiple 2D views obtained from calibrated cameras. The present contribution aims at performing a somewhat opposite task: given the 3D model of an object, perform the segmentation of 2D images and recover the 3D pose of the object relative to a *single* camera. To the best of our knowledge, this is the first time that the framework of [26, 23] has been adapted and employed in the specific context of segmenting 2D images from a single camera, using the knowledge of a 3D model. The framework in [23] has also recently been extended in [27] to address the problem of multiple camera calibration. In the present work, the camera is assumed to be calibrated. However, this assumption could easily be dropped by also solving for the optimal camera calibration parameters as presented in [27].

We note that, although the use of 3D shape knowledge to perform the 2D segmentation of regions presents obvious advantages, the literature dealing with this type of approach is strikingly thin. An early attempt to solve the problem of viewpoint dependency of the aspect of 3D objects can be found in [28]. In these papers, a region-based active contour approach is proposed that uses a unique shape prior. The prior shape is represented by a generalized cone based on *one* reference view of an object. The unlevel sections of the cone correspond to possible instances of the segmenting contour. Although the method performs well in the presence of variations in aspect of the object due to projective transformations, the method cannot cope with images involving a view of the object that is radically different from the reference view. The closest piece of work to our proposed contribution is probably [29], which has been extended in [30]. In [29], the authors evolve an (infinite dimensional) active contour as well as 3D pose parameters to minimize a joint energy functional encoding both image information and 3D shape knowledge. Our method differs from the aforementioned approach in many crucial aspects. For example, we optimize a *single* energy functional, which allows us to circumvent the need to determine ICP-like^{1} correspondences and to perform costly backprojections between the segmenting contour and the shape model at each iteration. Also, we perform optimization *only* in the finite dimensional space of the Euclidean pose parameters. In addition to being computationally efficient, this allows our technique to be less likely to be trapped in local minima, resulting in robust performances as demonstrated in the experimental part. In [30], the method of [29] is successfully simplified by eliminating the need to evolve an active contour and by performing energy minimization only in the space of 3D pose parameters. 
Thus, the method of [30] and our contribution present some similarities, notably in the use of the classical region-based energy functional introduced in [10] and [11]. However, the approaches to energy minimization and the resulting algorithms are radically different: In [30], an algebraic approach is used that involves establishing correspondences and backprojections between the 3D and 2D worlds, as well as linearizing the resulting system of equations. Consequently, important information about the geometry of the 3D model is lost through the algebraic approach. In contrast, our approach relies on surface differential geometry (see e.g., [31]) to link geometric properties of the model surface and its projection in the image domain. This allows us to derive the partial differential equations necessary to perform energy optimization. The resulting variational approach offers a complete and novel understanding of the problem of 3D pose estimation from 2D images. In addition, the knowledge of the 3D object is exploited to its full extent within our framework. In [32] the authors also successfully performed simultaneous 2D segmentation and 3D pose estimation using an entirely different approach. In their work, a cost function based on a Markov random field (MRF) was optimized using a dynamic graph cut approach (see [33] and the references therein). Also, and in contrast to our work, the 3D knowledge of the shape of an object was encoded via an articulated stick model instead of a 3D surface.

Our technique uses a 3D shape prior in a region-based framework and can thereby be expected to be robust to noise or occlusion. Hence, an obvious application of the proposed approach is the robust tracking of 3D rigid objects in 2D image sequences. Our approach is therefore related to a wealth of methods concerned with the problem of model-based monocular tracking, one crucial difference being that most such approaches use local features in images (see [34] for a recent survey). In particular, in [35] a geometric approach to the 3D pose estimation problem is proposed: The authors use the knowledge of the occluding curve (i.e., the curve delimiting the visible part of the object from the camera) to search for edges in images and convincingly improve tracking performance. Similarly, the occluding curve plays a cornerstone role in our methodology.

This paper is organized as follows: In section 2, we detail our methodology by describing our choice of notation and energy functional, as well as by deriving the energy gradient to solve the problem at hand. Then, we present experimental results for segmentation and tracking tasks that highlight the robustness of our technique to noise, clutter, occlusion, or poor initializations. Finally, we present our conclusions and future work.

We suppose that we have at our disposal the 3D surface model of an object. Our goal is to find the 3D (Euclidean) transformation that needs to be applied to the model so that it coincides with the object of interest in the referential attached to a calibrated camera. To this end, we solve a typical shape optimization problem, in which we seek to segment the object in the 2D image plane with the 2D shape given by the projection of the 3D model for a given 3D transformation. The 3D transformation of the 3D model that results in an optimal segmentation of the object in the 2D image plane is expected to describe the actual position of the 3D object with respect to the camera. Therefore, the shape space (over which segmentation is performed) is the set of all 2D shapes determined by projection from the 3D model. This is a manifold, in which variational segmentation on the 3D transformation parameters can be performed. An overview of the method can be found in Figure 2. We now describe the proposed approach in detail, starting with our choice of notation.

Schema summarizing our segmentation/pose estimation approach from a 3D model. Left: First, the 3D model is transformed (**X** = g(**X**_{0})) and projected onto the 2D image plane (**x** = π(**X**)). The resulting yellow curve is the “silhouette,” **...**

Let **X** = [*X, Y, Z*]* ^{T}* denote the coordinates of a point in ℝ^{3}.

Let *S* be the smooth surface in ℝ^{3} defining the shape of the object of interest. The (outward) unit normal to *S* at each point **X** ∈ *S* will be denoted by **N** = [*N*_{1}*, N*_{2}*, N*_{3}]* ^{T}*. To determine the pose of the object, we look for the rigid transformation **X** = *g*(**X**_{0}), parameterized by the pose parameters *λ _{i}*, that maps the model into the referential of the camera.

Let *R* = *π*(*S*) ⊂ Ω be the region of the image on which the surface *S* projects (i.e., the region of Ω corresponding to imaging *S*). Let *R ^{c}* = Ω\*R* denote the complementary region, corresponding to the background.

In [23], the authors employed an image formation approach to define a cost functional measuring the discrepancy between the photometric properties of the surface *S* (as well as the 3D background) and the pixel intensities of multiple images. The resulting energy involved backprojections to the surface *S* to guarantee the coherence between the measurements obtained from multiple cameras.

In the present work, we are interested in segmenting a *unique* image and we adopt a shape optimization approach, directly inspired from region-based active contours techniques [10, 11, 12, 13, 14]. Many segmentation approaches assume that the pixels corresponding to the object of interest or the background are distinct with respect to a certain grouping criterion. Within the GAC framework, region-based techniques perform segmentation by evolving a closed curve to increase the discrepancy between the statistics of the pixels located in the interior and exterior of the curve. Most region-based algorithms can be distinguished along three typical choices that are combined to separate the object from the background: The choice of the photometric variable (grayscale intensity, color, or texture vector), the choice of the statistical model for the photometric variables (probability density function), and the choice of the measure of similarity among distributions. These techniques minimize energies of the following form:

$$E={\int}_{\text{R}}{r}_{\text{in}}(I(\mathbf{x}),\widehat{c})d\mathrm{\Omega}+{\int}_{{\text{R}}^{\text{c}}}{r}_{\text{out}}(I(\mathbf{x}),\widehat{c})d\mathrm{\Omega},$$

(2.1)

where *r*_{in} and *r*_{out}: 𝒳 × Ω → ℝ are two monotonically decreasing functions measuring the matching quality of the image pixels with a statistical model over the regions *R* and *R ^{c}*, respectively. The space 𝒳 corresponds to the photometric variable chosen to perform segmentation. Hence, depending on the choices made for the photometric variable, the statistical model, and the similarity measure, a wide family of region-based energies can be written in this form.

The energy *E* measures the discrepancy between the statistical properties of the pixels located inside and outside the silhouette (curve *ĉ*) and does not involve any backprojections. Although many measures of statistical similarity (e.g., Bhattacharyya distance as in [13] or mutual information as in [14]) could be chosen to define *E*, we use the log-likelihood function in this paper for simplicity.^{4} Accordingly, one has

$${r}_{\text{in}}=log({P}_{\text{in}})\phantom{\rule{0.38889em}{0ex}}\text{and}\phantom{\rule{0.38889em}{0ex}}{r}_{\text{out}}=log({P}_{\text{out}}),$$

(2.2)

where *P*_{in} and *P*_{out} are the probability density functions (PDFs) of the pixels inside and outside the segmenting curve. We now detail possible choices of PDFs to model pixel statistics.

In [10], a method is proposed to segment images composed of regions of different mean intensities, using GACs. The resulting flow can be shown to be equivalent to comparing the log-likelihood of the Gaussian densities

$${P}_{\text{in}}(I,\widehat{c})=\frac{1}{\sqrt{2\pi {\mathrm{\sum}}_{0}}}{e}^{-{\scriptstyle \frac{{(I-{\mu}_{\text{in}})}^{2}}{2{\mathrm{\sum}}_{0}}}}\phantom{\rule{0.38889em}{0ex}}\text{and}\phantom{\rule{0.38889em}{0ex}}{P}_{\text{out}}(I,\widehat{c})=\frac{1}{\sqrt{2\pi {\mathrm{\sum}}_{0}}}{e}^{-{\scriptstyle \frac{{(I-{\mu}_{\text{out}})}^{2}}{2{\mathrm{\sum}}_{0}}}},$$

(2.3)

where the intensity averages of the pixels located inside and outside the curve *ĉ* are denoted by *μ*_{in} and *μ*_{out}, respectively,^{5} and
${\mathrm{\sum}}_{0}={\scriptstyle \frac{1}{2}}$. The averages *μ*_{in} and *μ*_{out} are computed at each step of the curve evolution as

$${\mu}_{\text{in}}(\widehat{c})=\frac{{\int}_{\text{R}}I(\mathbf{x})d\mathrm{\Omega}}{{A}_{\text{in}}}\phantom{\rule{0.38889em}{0ex}}\text{and}\phantom{\rule{0.38889em}{0ex}}{\mu}_{\text{out}}(\widehat{c})=\frac{{\int}_{{R}^{c}}I(\mathbf{x})d\mathrm{\Omega}}{{A}_{\text{out}}}$$

(2.4)

with *A*_{in}(*ĉ*) = ∫* _{R}d*Ω and *A*_{out}(*ĉ*) = ∫_{*R^{c}*} *d*Ω denoting the areas of the regions inside and outside the curve.

With the above notation and our particular choice of similarity measure and simplifying constant terms, the energy *E* can be defined as

$${r}_{\text{in}}=-{(I(\mathbf{x})-{\mu}_{\text{in}})}^{2}\phantom{\rule{0.38889em}{0ex}}\text{and}\phantom{\rule{0.38889em}{0ex}}{r}_{\text{out}}=-{(I(\mathbf{x})-{\mu}_{\text{out}})}^{2}.$$

(2.5)
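As a minimal numerical sketch of the piecewise-constant model (2.4)–(2.5) (assuming NumPy; the boolean mask standing in for the silhouette region and the synthetic disk image are illustrative, not part of the original method):

```python
import numpy as np

def piecewise_constant_energy(I, R):
    """Energy (2.1) with the residuals (2.5): r = -(I - mu)^2.

    I : 2D grayscale image; R : boolean mask of the silhouette region.
    With the log-likelihood convention (2.2), a larger E means a better fit.
    """
    mu_in = I[R].mean()                      # (2.4), average inside the curve
    mu_out = I[~R].mean()                    # (2.4), average outside
    r_in = -(I[R] - mu_in) ** 2              # (2.5)
    r_out = -(I[~R] - mu_out) ** 2
    return r_in.sum() + r_out.sum()

# Toy example: a bright disk on a dark background.
yy, xx = np.mgrid[0:64, 0:64]
disk = (xx - 32) ** 2 + (yy - 32) ** 2 < 15 ** 2
I = np.where(disk, 1.0, 0.0)

E_good = piecewise_constant_energy(I, disk)                      # exact silhouette
E_bad = piecewise_constant_energy(I, np.roll(disk, 10, axis=1))  # shifted silhouette
assert E_good > E_bad
```

The exact silhouette yields zero residuals; any misplaced silhouette mixes the two intensity populations and scores strictly lower.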

In [11, 39], a method is proposed to segment images composed of regions with distinct Gaussian densities, using the estimates

$${P}_{\text{in}}(I,\widehat{c})=\frac{1}{\sqrt{2\pi {\mathrm{\sum}}_{\text{in}}}}{e}^{-{\scriptstyle \frac{{(I-{\mu}_{\text{in}})}^{2}}{2{\mathrm{\sum}}_{\text{in}}}}}\phantom{\rule{0.38889em}{0ex}}\text{and}\phantom{\rule{0.38889em}{0ex}}{P}_{\text{out}}(I,\widehat{c})=\frac{1}{\sqrt{2\pi {\mathrm{\sum}}_{\text{out}}}}{e}^{-{\scriptstyle \frac{{(I-{\mu}_{\text{out}})}^{2}}{2{\mathrm{\sum}}_{\text{out}}}}},$$

(2.6)

where the variances of the pixels located inside and outside the curve *ĉ* are denoted by Σ_{in} and Σ_{out}, respectively.^{6} The variances Σ_{in} and Σ_{out} are supposed to be distinct. ^{7} The intensity averages *μ*_{in} and *μ*_{out} are computed as above, and the variances Σ_{in} and Σ_{out} are computed at each step of the curve evolution as

$${\mathrm{\sum}}_{\text{in}}(\widehat{c})=\frac{{\int}_{R}{(I(\mathbf{x})-{\mu}_{\text{in}})}^{2}d\mathrm{\Omega}}{{A}_{\text{in}}}\phantom{\rule{0.38889em}{0ex}}\text{and}\phantom{\rule{0.38889em}{0ex}}{\mathrm{\sum}}_{\text{out}}(\widehat{c})=\frac{{\int}_{{R}^{c}}{(I-{\mu}_{\text{out}})}^{2}d\mathrm{\Omega}}{{A}_{\text{out}}}.$$

(2.7)

In this case, with our particular choice of similarity measure and simplifying constant terms, the energy *E* can be defined as

$${r}_{\text{in}}=-log({\mathrm{\sum}}_{\text{in}})-\frac{{(I(\mathbf{x})-{\mu}_{\text{in}})}^{2}}{{\mathrm{\sum}}_{\text{in}}}\phantom{\rule{0.38889em}{0ex}}\text{and}\phantom{\rule{0.38889em}{0ex}}{r}_{\text{out}}=-log({\mathrm{\sum}}_{\text{out}})-\frac{{(I(\mathbf{x})-{\mu}_{\text{out}})}^{2}}{{\mathrm{\sum}}_{\text{out}}}.$$

(2.8)
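A sketch of the full Gaussian model (2.6)–(2.8), again assuming NumPy and an illustrative boolean mask; the example uses regions with equal means but different variances, a situation the piecewise-constant residuals (2.5) cannot separate:

```python
import numpy as np

def gaussian_residuals(I, R):
    """Residuals (2.8) for the Gaussian model with region-wise variances."""
    mu_in, mu_out = I[R].mean(), I[~R].mean()             # (2.4)
    var_in = ((I[R] - mu_in) ** 2).mean()                 # (2.7)
    var_out = ((I[~R] - mu_out) ** 2).mean()
    r_in = -np.log(var_in) - (I - mu_in) ** 2 / var_in    # (2.8)
    r_out = -np.log(var_out) - (I - mu_out) ** 2 / var_out
    return r_in, r_out

# Same mean (0.5) inside and outside, but different variances.
rng = np.random.default_rng(0)
I = rng.normal(0.5, 0.02, (64, 64))
yy, xx = np.mgrid[0:64, 0:64]
disk = (xx - 32) ** 2 + (yy - 32) ** 2 < 15 ** 2
I[disk] = rng.normal(0.5, 0.2, disk.sum())

r_in, r_out = gaussian_residuals(I, disk)
# Inside pixels are, on average, better explained by the "inside" model.
assert r_in[disk].mean() > r_out[disk].mean()
```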

The Gaussian models alluded to above can be too simplistic to accurately separate the object from the background. One solution is to use less constrained models of the distributions of the object and background, e.g., Parzen estimators and generalized histograms. This has been investigated in [13, 14] within the GAC framework, as well as in [15] within a model-based segmentation approach that also aimed at estimating the pose parameters of medical structures. In a similar manner, the PDFs *P*_{in} and *P*_{out} are computed from the silhouette as

$${P}_{\text{in}}(z,\widehat{c})=\frac{{\int}_{R}\mathbf{K}(I(\mathbf{x})-z)d\mathrm{\Omega}}{{A}_{\text{in}}}\phantom{\rule{0.38889em}{0ex}}\text{and}\phantom{\rule{0.38889em}{0ex}}{P}_{\text{out}}(z,\widehat{c})=\frac{{\int}_{{R}^{c}}\mathbf{K}(I(\mathbf{x})-z)d\mathrm{\Omega}}{{A}_{\text{out}}}$$

(2.9)

with **K**(*χ*) typically being a smooth version of the Dirac function, e.g.,
$\mathbf{K}(\chi )={\scriptstyle \frac{1}{\sqrt{2\pi {\sigma}^{2}}}}{e}^{-{\scriptstyle \frac{{\chi}^{2}}{2{\sigma}^{2}}}}$ for a sufficiently small value of *σ*.
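A sketch of the kernel density estimate (2.9), assuming NumPy (the sample array and bandwidth *σ* are illustrative); the area normalization 1/*A* simply becomes a mean over the region's pixels:

```python
import numpy as np

def parzen_pdf(samples, z, sigma=0.05):
    """Smooth histogram (2.9): P(z) = (1/A) * integral of K(I(x) - z) over the region.

    samples : 1D array of intensities drawn from R (or R^c); z : query values.
    K is the Gaussian kernel, i.e. a smoothed version of the Dirac function.
    """
    z = np.atleast_1d(z)[:, None]
    K = np.exp(-(samples[None, :] - z) ** 2 / (2 * sigma ** 2))
    K /= np.sqrt(2 * np.pi * sigma ** 2)
    return K.mean(axis=1)    # the 1/A normalization becomes a mean

# A bimodal region produces a bimodal PDF -- a single Gaussian would miss this.
rng = np.random.default_rng(1)
samples = np.concatenate([rng.normal(0.2, 0.02, 500),
                          rng.normal(0.8, 0.02, 500)])
p = parzen_pdf(samples, [0.2, 0.5, 0.8])
assert p[0] > p[1] and p[2] > p[1]
```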

Following the region-based segmentation paradigm, the energy *E* is expected to be extremal when *R* and *R ^{c}* correspond to the object and background in the image.

The partial differentials of *E* with respect to the pose parameters *λ _{i}*’s can be computed using the chain rule:

$$\frac{dE}{d{\lambda}_{i}}={\int}_{\widehat{c}}({r}_{\text{in}}(I(\mathbf{x}))-{r}_{\text{out}}(I(\mathbf{x})))\langle \frac{\partial \widehat{c}}{\partial {\lambda}_{i}},\widehat{\mathbf{n}}\rangle d\widehat{s}+{\int}_{R}\langle \frac{\partial {r}_{\text{in}}}{\partial \widehat{c}},\frac{\partial \widehat{c}}{\partial {\lambda}_{i}}\rangle d\mathrm{\Omega}+{\int}_{{R}^{c}}\langle \frac{\partial {r}_{\text{out}}}{\partial \widehat{c}},\frac{\partial \widehat{c}}{\partial {\lambda}_{i}}\rangle d\mathrm{\Omega}.$$

(2.10)

The gradient in (2.10) involves the computation of the shape derivative
${\scriptstyle \frac{\partial \widehat{c}}{\partial {\lambda}_{i}}}$, which describes the direction of deformation of the 2D curve (under projection) with respect to the 3D pose parameter. The gradient is composed of three terms. The first is a line integral in which a typical 2D region-based gradient (i.e., *r*_{in} − *r*_{out}; see, e.g., Chan and Vese's model [10]) is weighted by the dot product of the shape derivative with the normal: at each point of the curve, the deformation direction
${\scriptstyle \frac{\partial \widehat{c}}{\partial {\lambda}_{i}}}$ is compared to the normal **n̂**, and this weights the statistical comparison term *r*_{in} − *r*_{out}. The average over the points of the curve determines the optimal direction of variation of the pose parameter *λ _{i}* (i.e., the sign of the derivative
${\scriptstyle \frac{dE}{d{\lambda}_{i}}}$). The two last terms measure the variation of the statistical models themselves (through *r*_{in} and *r*_{out}) as the regions *R* and *R ^{c}* deform.

In the remainder of this section, we first detail each of the three terms in (2.10) for the different statistical models presented above. Then we present further computations to express the gradient as a function of the known terms. Finally, we conclude the section by presenting remarks concerning the gradient and its implementation.

When the regions inside and outside the silhouette are modeled by Gaussian PDFs as in subsection 2.2.1, the second term in (2.10) may be calculated using the chain rule as

$$\begin{array}{l}{\int}_{R}\langle \frac{\partial {r}_{\text{in}}}{\partial \widehat{c}},\frac{\partial \widehat{c}}{\partial {\lambda}_{i}}\rangle d\mathrm{\Omega}={\int}_{R}\langle 2(I(\mathbf{x})-{\mu}_{\text{in}})\frac{\partial {\mu}_{\text{in}}}{\partial \widehat{c}},\frac{\partial \widehat{c}}{\partial {\lambda}_{i}}\rangle d\mathrm{\Omega}\\ =2\langle \frac{\partial {\mu}_{\text{in}}}{\partial \widehat{c}},\frac{\partial \widehat{c}}{\partial {\lambda}_{i}}\rangle {\int}_{R}(I(\mathbf{x})-{\mu}_{\text{in}})d\mathrm{\Omega}\\ =2\langle \frac{\partial {\mu}_{\text{in}}}{\partial \widehat{c}},\frac{\partial \widehat{c}}{\partial {\lambda}_{i}}\rangle [{\mu}_{\text{in}}\xb7{A}_{\text{in}}-{\mu}_{\text{in}}\xb7{A}_{\text{in}}]=0.\end{array}$$

(2.11)
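The cancellation in (2.11) rests on the identity ∫_{*R*} (*I*(**x**) − *μ*_{in}) *d*Ω = *μ*_{in}·*A*_{in} − *μ*_{in}·*A*_{in} = 0, which holds for any region by the definition (2.4) of the mean. A quick numerical confirmation (assuming NumPy; the image and region are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
I = rng.random((64, 64))
yy, xx = np.mgrid[0:64, 0:64]
R = (xx - 30) ** 2 + (yy - 30) ** 2 < 12 ** 2   # any region R works

mu_in = I[R].mean()
residual = (I[R] - mu_in).sum()   # the integral in the last line of (2.11)
assert abs(residual) < 1e-9       # vanishes up to floating-point error
```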

Similarly, the third term in (2.10) can also be shown to collapse. Hence, the partial derivative of (2.10) is simply

$$\frac{dE}{d{\lambda}_{i}}={\int}_{\widehat{c}}({(I(\mathbf{y})-{\mu}_{\text{out}})}^{2}-{(I(\mathbf{y})-{\mu}_{\text{in}})}^{2})\langle \frac{\partial \widehat{c}}{\partial {\lambda}_{i}},\widehat{\mathbf{n}}\rangle d\widehat{s}.$$

(2.12)

When the regions inside and outside the silhouette are modeled by Gaussian PDFs as in subsection 2.2.2, the second term in (2.10) may be computed using the chain rule as

$$\begin{array}{l}{\int}_{R}\langle \frac{\partial {r}_{\text{in}}}{\partial \widehat{c}},\frac{\partial \widehat{c}}{\partial {\lambda}_{i}}\rangle d\mathrm{\Omega}=\underset{=0\phantom{\rule{0.38889em}{0ex}}(\text{see}\phantom{\rule{0.16667em}{0ex}}\text{above})}{\underbrace{{\int}_{R}\langle 2\left(\frac{I(\mathbf{x})-{\mu}_{\text{in}}}{{\mathrm{\sum}}_{\text{in}}}\right)\frac{\partial {\mu}_{\text{in}}}{\partial \widehat{c}},\frac{\partial \widehat{c}}{\partial {\lambda}_{i}}\rangle d\mathrm{\Omega}}}-{\int}_{R}\langle \left(\frac{{\mathrm{\sum}}_{\text{in}}-{(I(x,y)-{\mu}_{\text{in}})}^{2}}{{\mathrm{\sum}}_{\text{in}}^{2}}\right)\frac{\partial {\mathrm{\sum}}_{\text{in}}}{\partial \widehat{c}},\frac{\partial \widehat{c}}{\partial {\lambda}_{i}}\rangle d\mathrm{\Omega}\\ =-\frac{1}{{\mathrm{\sum}}_{\text{in}}^{2}}\langle \frac{\partial {\mathrm{\sum}}_{\text{in}}}{\partial \widehat{c}},\frac{\partial \widehat{c}}{\partial {\lambda}_{i}}\rangle {\int}_{R}({\mathrm{\sum}}_{\text{in}}-{(I(x,y)-{\mu}_{\text{in}})}^{2})d\mathrm{\Omega}\\ =-\frac{1}{{\mathrm{\sum}}_{\text{in}}^{2}}\langle \frac{\partial {\mathrm{\sum}}_{\text{in}}}{\partial \widehat{c}},\frac{\partial \widehat{c}}{\partial {\lambda}_{i}}\rangle \phantom{\rule{0.16667em}{0ex}}({A}_{\text{in}}{\mathrm{\sum}}_{\text{in}}-{A}_{\text{in}}{\mathrm{\sum}}_{\text{in}})=0.\end{array}$$

(2.13)

Similarly, the third term in (2.10) can also be shown to collapse. Hence, the partial derivative of (2.10) is simply

$$\frac{dE}{d{\lambda}_{i}}={\int}_{\widehat{c}}\left(log\left(\frac{{\mathrm{\sum}}_{\text{out}}}{{\mathrm{\sum}}_{\text{in}}}\right)+\frac{{(I(\mathbf{y})-{\mu}_{\text{out}})}^{2}}{{\mathrm{\sum}}_{\text{out}}}-\frac{{(I(\mathbf{y})-{\mu}_{\text{in}})}^{2}}{{\mathrm{\sum}}_{\text{in}}}\right)\langle \frac{\partial \widehat{c}}{\partial {\lambda}_{i}},\widehat{\mathbf{n}}\rangle d\widehat{s}.$$

(2.14)

For generalized histograms as computed in (2.9) and using the chain rule, one can compute the second term of (2.10) as

$${\int}_{R}\langle \frac{\partial {r}_{\text{in}}}{\partial \widehat{c}},\frac{\partial \widehat{c}}{\partial {\lambda}_{i}}\rangle d\mathrm{\Omega}={\int}_{R}\langle \frac{1}{{P}_{\text{in}}(I(\mathbf{x}))}\frac{\partial {P}_{\text{in}}}{\partial \widehat{c}},\frac{\partial \widehat{c}}{\partial {\lambda}_{i}}\rangle d\mathrm{\Omega}.$$

(2.15)

Using the calculus of variations, one may derive that at a particular point **y** ∈ *ĉ*

$$\frac{\partial {P}_{\text{in}}}{\partial \widehat{c}}(z,\widehat{c})=\frac{\mathbf{K}(I(\mathbf{y})-z)-{P}_{\text{in}}(z,\widehat{c})}{{A}_{\text{in}}}\widehat{\mathbf{n}}(\mathbf{y}).$$

Plugging this into (2.15), one gets

$${\int}_{R}\langle \frac{\partial {r}_{\text{in}}}{\partial \widehat{c}},\frac{\partial \widehat{c}}{\partial {\lambda}_{i}}\rangle d\mathrm{\Omega}={\int}_{R}\left({\int}_{\widehat{c}}\frac{\mathbf{K}(I(\mathbf{y})-I(\mathbf{x}))-{P}_{\text{in}}(I(\mathbf{x}))}{{P}_{\text{in}}(I(\mathbf{x}))\xb7{A}_{\text{in}}}\langle \widehat{\mathbf{n}}(\mathbf{y}),\frac{\partial \widehat{c}}{\partial {\lambda}_{i}}(\mathbf{y})\rangle d\widehat{s}\right)d\mathrm{\Omega},$$

(2.16)

where we expressed the fact that the scalar product ⟨·, ·⟩ in the left-hand side is a line integral on *ĉ* (since
${\scriptstyle \frac{\partial {r}_{\text{in}}}{\partial \widehat{c}}}$ and
${\scriptstyle \frac{\partial \widehat{c}}{\partial {\lambda}_{i}}}$ are vector fields on *ĉ*). Swapping integrals (all integrations being done on compact sets), one can write

$${\int}_{R}\langle \frac{\partial {r}_{\text{in}}}{\partial \widehat{c}},\frac{\partial \widehat{c}}{\partial {\lambda}_{i}}\rangle d\mathrm{\Omega}={\int}_{\widehat{c}}{\mathcal{R}}_{\text{in}}(I(\mathbf{y}))\langle \widehat{\mathbf{n}},\frac{\partial \widehat{c}}{\partial {\lambda}_{i}}\rangle d\widehat{s}$$

(2.17)

with

$${\mathcal{R}}_{\text{in}}(z)={\int}_{R}\frac{\mathbf{K}(z-I(\mathbf{x}))-{P}_{\text{in}}(I(\mathbf{x}))}{{A}_{\text{in}}\xb7{P}_{\text{in}}(I(\mathbf{x}))}d\mathrm{\Omega}=\frac{1}{{A}_{\text{in}}}{\int}_{R}\frac{\mathbf{K}(z-I(\mathbf{x}))}{{P}_{\text{in}}(I(\mathbf{x}))}d\mathrm{\Omega}-1.$$

(2.18)
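The correction term (2.18) can be approximated by replacing the area-normalized integrals with means over the pixel samples of *R*. A sketch (assuming NumPy; the sample data, bandwidth, and the quadratic-cost evaluation of *P*_{in} are illustrative simplifications):

```python
import numpy as np

def R_in(z, samples, sigma=0.05):
    """Approximate (2.18): R_in(z) = (1/A_in) * int K(z - I(x)) / P_in(I(x)) dOmega - 1."""
    K = lambda x: np.exp(-x ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    # P_in evaluated at each pixel's own intensity, per (2.9).
    P_in = np.array([K(samples - s).mean() for s in samples])
    return (K(z - samples) / P_in).mean() - 1.0

samples = np.random.default_rng(3).normal(0.5, 0.1, 200)
val = R_in(0.5, samples)
# The integrand K/P_in is nonnegative, so R_in is bounded below by -1.
assert val > -1.0 and np.isfinite(val)
```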

The third term of (2.10) can be computed in a similar fashion, yielding

$${\int}_{{R}^{c}}\langle \frac{\partial {r}_{\text{out}}}{\partial \widehat{c}},\frac{\partial \widehat{c}}{\partial {\lambda}_{i}}\rangle d\mathrm{\Omega}={\int}_{\widehat{c}}{\mathcal{R}}_{\text{out}}(I(\mathbf{y}))\langle \widehat{\mathbf{n}},\frac{\partial \widehat{c}}{\partial {\lambda}_{i}}\rangle d\widehat{s}$$

(2.19)

with

$${\mathcal{R}}_{\text{out}}(z)=1-\frac{1}{{A}_{\text{out}}}{\int}_{{R}^{c}}\frac{\mathbf{K}(z-I(\mathbf{x}))}{{P}_{\text{out}}(I(\mathbf{x}))}d\mathrm{\Omega}.$$

(2.20)

Hence, the partial derivative of (2.10) is simply

$$\frac{dE}{d{\lambda}_{i}}={\int}_{\widehat{c}}\{{r}_{\text{in}}-{r}_{\text{out}}+{\mathcal{R}}_{\text{in}}+{\mathcal{R}}_{\text{out}}\}(I(\mathbf{y}))\langle \frac{\partial \widehat{c}}{\partial {\lambda}_{i}},\widehat{\mathbf{n}}\rangle d\widehat{s}.$$

(2.21)

Note that when the Dirac function is used as the kernel **K** to compute *P*_{in} and *P*_{out} in (2.9), one can show that the terms ${\mathcal{R}}_{\text{in}}$ and ${\mathcal{R}}_{\text{out}}$ collapse (this is done using the sifting property of the Dirac function in (2.18) and (2.20)).

As can be seen from (2.12), (2.14), and (2.21), for each statistical model the partial derivatives ${\scriptstyle \frac{dE}{d{\lambda}_{i}}}$ are of the form

$$\frac{dE}{d{\lambda}_{i}}={\int}_{\widehat{c}}\mathcal{R}(I(\mathbf{y}))\langle \frac{\partial \widehat{c}}{\partial {\lambda}_{i}},\widehat{\mathbf{n}}\rangle d\widehat{s}$$

(2.22)

with $\mathcal{R}$: 𝒳 → ℝ a function depending on the choice of statistical model.
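In a discrete implementation, the line integral (2.22) becomes a sum over samples of the silhouette. A sketch (assuming NumPy; all array names are illustrative, with the shape derivative and normals supplied by the rest of the framework):

```python
import numpy as np

def dE_dlambda(dc_dlambda, normals, R_vals, ds):
    """Discrete version of the line integral (2.22).

    dc_dlambda : (n, 2) shape derivative of the silhouette w.r.t. lambda_i
    normals    : (n, 2) outward unit normals along the silhouette
    R_vals     : (n,)   statistical weights R(I(y)) at the curve samples
    ds         : (n,)   arc-length elements
    """
    return np.sum(R_vals * np.einsum('ij,ij->i', dc_dlambda, normals) * ds)

# Sanity check on a unit circle: with R = 1 and a purely radial deformation,
# the integral reduces to the circle's perimeter 2*pi.
n = 1000
t = np.linspace(0, 2 * np.pi, n, endpoint=False)
normals = np.column_stack([np.cos(t), np.sin(t)])
val = dE_dlambda(normals, normals, np.ones(n), np.full(n, 2 * np.pi / n))
assert np.isclose(val, 2 * np.pi)
```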

This line integral, and in particular the term
${\scriptstyle \frac{\partial \widehat{c}}{\partial {\lambda}_{i}}}$, is difficult to compute directly, since the parameter *λ _{i}* acts on the 3D coordinates of the model, while *ĉ* lives in the 2D image plane. We therefore rewrite the integral in terms of the occluding curve *C* on the surface *S*, whose projection is the silhouette *ĉ* = *π*(*C*).
Using the arc-length *s* of *C* and the
${\scriptstyle \frac{\pi}{2}}$-rotation matrix
$J=\left[\begin{array}{cc}\hfill 0\hfill & \hfill 1\hfill \\ \hfill -1\hfill & \hfill 0\hfill \end{array}\right]$ (ensuring that the normal vector points outwards), one has

$$\langle \frac{\partial \widehat{c}}{\partial {\lambda}_{i}},\widehat{\mathbf{n}}\rangle d\widehat{s}=\langle \frac{\partial \widehat{c}}{\partial {\lambda}_{i}},J\frac{\partial \widehat{c}}{\partial \widehat{s}}\rangle d\widehat{s}=\langle \frac{\partial \pi (C)}{\partial {\lambda}_{i}},J\frac{\partial \pi (C)}{\partial s}\frac{ds}{d\widehat{s}}\rangle d\widehat{s}=\langle \frac{\partial \pi (C)}{\partial {\lambda}_{i}},J\frac{\partial \pi (C)}{\partial s}\rangle ds.$$

(2.23)

Letting $\mathcal{J}$ denote the Jacobian of *π*(**X**) with respect to the spatial coordinates, we have that

$$\mathcal{J}=\frac{1}{{Z}^{2}}\left[\begin{array}{ccc}Z& 0& -X\\ 0& Z& -Y\end{array}\right].$$
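For readers implementing the method, the Jacobian $\mathcal{J}$ of the normalized perspective projection *π*(**X**) = (*X/Z*, *Y/Z*) can be checked numerically. The snippet below is a sketch with hypothetical names, assuming unit focal length; it verifies the closed form above against central differences:

```python
import numpy as np

def project(X):
    """Perspective projection pi(X) = (X/Z, Y/Z) for a normalized camera (f = 1)."""
    x, y, z = X
    return np.array([x / z, y / z])

def projection_jacobian(X):
    """Analytic Jacobian J = (1/Z^2) [[Z, 0, -X], [0, Z, -Y]] from the text."""
    x, y, z = X
    return (1.0 / z**2) * np.array([[z, 0.0, -x],
                                    [0.0, z, -y]])

# Finite-difference check of the analytic Jacobian at a sample point.
X = np.array([0.3, -0.2, 2.0])
J = projection_jacobian(X)
h = 1e-6
J_num = np.column_stack([(project(X + h * e) - project(X - h * e)) / (2 * h)
                         for e in np.eye(3)])
assert np.allclose(J, J_num, atol=1e-6)
```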

From (2.23), one gets

$$\begin{array}{l}\langle \frac{\partial \widehat{c}}{\partial {\lambda}_{i}},\widehat{\mathbf{n}}\rangle d\widehat{s}=\langle \mathcal{J}\frac{\partial \mathbf{X}}{\partial {\lambda}_{i}},J\mathcal{J}\frac{\partial \mathbf{X}}{\partial s}\rangle ds=\langle \frac{\partial \mathbf{X}}{\partial {\lambda}_{i}},{\mathcal{J}}^{T}J\mathcal{J}\frac{\partial \mathbf{X}}{\partial s}\rangle ds\\ =\frac{1}{{Z}^{3}}\langle \frac{\partial \mathbf{X}}{\partial {\lambda}_{i}},\left[\begin{array}{ccc}0& Z& -Y\\ -Z& 0& X\\ Y& -X& 0\end{array}\right]\frac{\partial \mathbf{X}}{\partial s}\rangle ds=\frac{1}{{Z}^{3}}\langle \frac{\partial \mathbf{X}}{\partial {\lambda}_{i}},\frac{\partial \mathbf{X}}{\partial s}\times \mathbf{X}\rangle ds.\end{array}$$

(2.24)

In (2.24), the point **X** belongs to the occluding curve *C*. A necessary condition for a point **X** to belong to the occluding curve is that ⟨**X**, **N**⟩ = 0 (since the associated vector **X**, with origin at the center of the camera, corresponds to the projection/viewing direction and is tangent to the surface *S* at **X**; see Figure 3). The vector
$\mathbf{t}={\scriptstyle \frac{\partial \mathbf{X}}{\partial s}}$ is the tangent to the curve *C* at the point **X**. Since the vectors **t** and **X** belong to the tangent plane to *S* at **X**, one has
${\scriptstyle \frac{\partial \mathbf{X}}{\partial s}}\times \mathbf{X}=\left|\right|\mathbf{X}\left|\right|\mathbf{N}sin(\theta )$, with
$\theta =(\widehat{\mathbf{t},\mathbf{X}})$ the angle between **t** and **X**. For **X** ∈ *C*, we have that

$$\frac{\partial}{\partial s}\langle \mathbf{X},\mathbf{N}\rangle =0=\underset{=0}{\underbrace{\langle \frac{\partial \mathbf{X}}{\partial s},\mathbf{N}\rangle}}+\langle \frac{\partial \mathbf{N}}{\partial s},\mathbf{X}\rangle =\langle d\mathbf{N}(\mathbf{t}),\mathbf{X}\rangle =\text{II}(\mathbf{t},\mathbf{X}).$$

(2.25)

Schema visualizing the occluding curve of a 3D object (dashed line) from the viewpoint of the camera and our notation in the 3D world.

Since the second fundamental form II(**t, X**) = 0, the vectors **t** and **X** are conjugate (see [31]). Hence, using the Euler formula, one can show that *K* sin^{2} *θ* = *κ_{X}κ_{t}*, where *K* is the Gaussian curvature of *S* at **X** and *κ_{X}*, *κ_{t}* are the normal curvatures in the directions **X** and **t**, respectively. Substituting into (2.24) yields

$$\langle \frac{\partial \widehat{c}}{\partial {\lambda}_{i}},\widehat{\mathbf{n}}\rangle d\widehat{s}=\frac{\left|\right|\mathbf{X}\left|\right|}{{Z}^{3}}\sqrt{\frac{{\kappa}_{X}{\kappa}_{t}}{K}}\langle \frac{\partial \mathbf{X}}{\partial {\lambda}_{i}},\mathbf{N}\rangle ds.$$

(2.26)

Thus, the flow becomes a simple line integral on *C*:

$$\frac{dE}{d{\lambda}_{i}}={\int}_{C}\mathcal{R}(I(\pi (\mathbf{X})))\frac{\left|\right|\mathbf{X}\left|\right|}{{Z}^{3}}\sqrt{\frac{{\kappa}_{X}{\kappa}_{t}}{K}}\langle \frac{\partial \mathbf{X}}{\partial {\lambda}_{i}},\mathbf{N}\rangle ds.$$

(2.27)

We now compute the term
$\langle {\scriptstyle \frac{\partial \mathbf{X}}{\partial {\lambda}_{i}}},\mathbf{N}\rangle$ when *λ_{i}* is a translation or rotation parameter:

- For *i* = 1, 2, 3 (i.e., *λ_{i}* is a translation parameter) and $\mathbf{T}=\left[\begin{array}{l}{t}_{x}\hfill \\ {t}_{y}\hfill \\ {t}_{z}\hfill \end{array}\right]=\left[\begin{array}{l}{\lambda}_{1}\hfill \\ {\lambda}_{2}\hfill \\ {\lambda}_{3}\hfill \end{array}\right]$, one has

  $$\begin{array}{l}\langle \frac{\partial \mathbf{X}}{\partial {\lambda}_{i}},\mathbf{N}\rangle =\langle \frac{\partial (\mathbf{R}{\mathbf{X}}_{0}+\mathbf{T})}{\partial {\lambda}_{i}},\mathbf{N}\rangle =\langle \frac{\partial \mathbf{T}}{\partial {\lambda}_{i}},\mathbf{N}\rangle =\langle \left[\begin{array}{c}{\scriptstyle \frac{\partial {\lambda}_{1}}{\partial {\lambda}_{i}}}\\ {\scriptstyle \frac{\partial {\lambda}_{2}}{\partial {\lambda}_{i}}}\\ {\scriptstyle \frac{\partial {\lambda}_{3}}{\partial {\lambda}_{i}}}\end{array}\right],\mathbf{N}\rangle \\ =\langle \left[\begin{array}{c}{\delta}_{1,i}\\ {\delta}_{2,i}\\ {\delta}_{3,i}\end{array}\right],\mathbf{N}\rangle ={N}_{i},\end{array}$$

  (2.28)

  where the Kronecker symbol *δ_{i,j}* was used (*δ_{i,j}* = 1 if *i* = *j* and 0 otherwise).

- For *i* = 4, 5, 6 (i.e., *λ_{i}* is a rotation parameter), and using the expression of the rotation matrix written in exponential coordinates,

  $$\mathbf{R}=exp\left(\left[\begin{array}{ccc}0& -{\lambda}_{6}& {\lambda}_{5}\\ {\lambda}_{6}& 0& -{\lambda}_{4}\\ -{\lambda}_{5}& {\lambda}_{4}& 0\end{array}\right]\right),$$

  one has

  $$\langle \frac{\partial \mathbf{X}}{\partial {\lambda}_{i}},\mathbf{N}\rangle =\langle \frac{\partial \mathbf{R}{\mathbf{X}}_{\mathbf{0}}}{\partial {\lambda}_{i}},\mathbf{N}\rangle =\langle \mathbf{R}\left[\begin{array}{ccc}0& -{\delta}_{3,i}& {\delta}_{2,i}\\ {\delta}_{3,i}& 0& -{\delta}_{1,i}\\ -{\delta}_{2,i}& {\delta}_{1,i}& 0\end{array}\right]{\mathbf{X}}_{0},\mathbf{N}\rangle .$$

  (2.29)
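The two cases above can be computed in a few lines. The sketch below uses 0-based indices (i = 0..2 for translation, 3..5 for rotation) and hypothetical names, assuming **X** = **R X**₀ + **T** as in the text:

```python
import numpy as np

def skew(w):
    """Cross-product matrix [w]_x such that [w]_x v = w x v."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def pose_derivative_dot_normal(i, R, X0, N):
    """<dX/dlambda_i, N> for X = R X0 + T.
    i in {0, 1, 2}: translation -> N_i (the Kronecker-delta case (2.28)).
    i in {3, 4, 5}: rotation in exponential coordinates -> <R [e]_x X0, N> (2.29)."""
    if i < 3:
        return N[i]
    e = np.zeros(3)
    e[i - 3] = 1.0                       # derivative of [omega]_x w.r.t. lambda_i
    return float(N @ (R @ (skew(e) @ X0)))

# Example: with R = I and X0 = e_x, a rotation about z moves X0 along e_z x e_x = e_y.
R = np.eye(3)
X0 = np.array([1.0, 0.0, 0.0])
N = np.array([0.0, 1.0, 0.0])
```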

In (2.27), the computation of the gradients involves the explicit determination of the occluding curve *C*. Intuitively, this curve allows us to understand and take into account how the aspect of the object depends on the point of view. From the definition, one can compute

$$C=\{\mathbf{X}\in {\mathcal{V}}^{+}\cap {\mathcal{V}}^{-}\phantom{\rule{0.38889em}{0ex}}\text{such}\phantom{\rule{0.16667em}{0ex}}\text{that}\phantom{\rule{0.38889em}{0ex}}\pi (\mathbf{X})\in \widehat{c}\},$$

(2.30)

where ${\mathcal{V}}^{+}=\{\mathbf{X}\in S,\ \text{so that (s.t.)}\ \langle \mathbf{X},\mathbf{N}\rangle \ge 0\}$ and ${\mathcal{V}}^{-}=\{\mathbf{X}\in S,\ \text{s.t.}\ \langle \mathbf{X},\mathbf{N}\rangle \le 0\}$.

In practice, the two sets ${\mathcal{V}}^{+}$ and ${\mathcal{V}}^{-}$ can be easily computed from the available data **X** and **N**, using a small value *ε*_{1} instead of 0 in their definitions to ensure the intersection comprises a sufficient number of points:^{8}
${\mathcal{V}}_{{\epsilon}_{1}}^{+}=\{\mathbf{X}\in S,\text{s}.\text{t}.\langle \mathbf{X},\mathbf{N}\rangle \ge -{\epsilon}_{1}\}$ and
${\mathcal{V}}_{{\epsilon}_{1}}^{-}=\{\mathbf{X}\in S,\text{s}.\text{t}.\langle \mathbf{X},\mathbf{N}\rangle \le {\epsilon}_{1}\}$. In the general case of nonconvex shapes, the intersection of the two sets comprises points that project inside the 2D projection *R* of the 3D model (e.g., image (b) in Figure 4) and must be filtered by ensuring that the necessary and sufficient condition to belong to *C*, *π*(**X**) ∈ *ĉ*, is fulfilled. This can be implemented by selecting only points such that ||*π*(**X**) − *ĉ*|| ≤ *ε*_{2}, with *ε*_{2} a chosen (small) parameter. One can obtain *ĉ* by using morphological operations on *R*: *ĉ* ≃ *R*\ℰ(*R*), with ℰ denoting the erosion operation for a chosen kernel [40]. Figure 4 presents different visualizations of an occluding curve computed in this fashion.
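On a sampled surface, the relaxed intersection ${\mathcal{V}}_{{\epsilon}_{1}}^{+}\cap {\mathcal{V}}_{{\epsilon}_{1}}^{-}$ reduces to a simple threshold on ⟨**X**, **N**⟩. The following is a rough sketch with hypothetical names, not the authors' implementation, illustrated on a sphere where the exact occluding curve satisfies ⟨**X**, **N**⟩ = 0:

```python
import numpy as np

def occluding_candidates(X, N, eps1):
    """Mask of points with |<X, N>| <= eps1, i.e., the intersection of the
    relaxed sets V+_eps1 and V-_eps1.  X, N: (n, 3) arrays of surface points
    (camera at the origin) and outward unit normals."""
    dots = np.einsum('ij,ij->i', X, N)   # per-point <X, N>
    return np.abs(dots) <= eps1

# Toy example: a unit sphere centered at distance d = 5 in front of the camera.
rng = np.random.default_rng(0)
N = rng.normal(size=(1000, 3))
N /= np.linalg.norm(N, axis=1, keepdims=True)   # outward normals of the sphere
X = np.array([0.0, 0.0, 5.0]) + N               # the corresponding sphere points
mask = occluding_candidates(X, N, eps1=0.1)     # candidates near the silhouette
assert mask.any() and not mask.all()
```

In a full implementation, these candidates would then be filtered by the image-plane test ||*π*(**X**) − *ĉ*|| ≤ *ε*_{2} described above.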

Understanding the occluding curve. (a) Projection of the 3D object in the 2D image plane. (b) Candidates for the occluding curve (points belonging to ${\mathcal{V}}_{{\epsilon}_{1}}^{+}\cap {\mathcal{V}}_{{\epsilon}_{1}}^{-}$) that need to be filtered with the condition “*π*(**X**) ∈ *ĉ*.” **...**

The term
$\sqrt{{\scriptstyle \frac{{\kappa}_{X}{\kappa}_{t}}{K}}}$ can be computed at each iteration of the algorithm using the principal curvatures and principal directions for each point **X** ∈ *S* and the Euler formula (see [31]; N.B.: the principal directions and curvatures can be precomputed). To save computational time, and noting that
$\sqrt{{\scriptstyle \frac{{\kappa}_{X}{\kappa}_{t}}{K}}}\ge 0$, we used the approximation
$\sqrt{{\scriptstyle \frac{{\kappa}_{X}{\kappa}_{t}}{K}}}\simeq 1$ in our implementation of (2.27), which still decreased the energy *E*. Note that this approximation is poorer when *θ* ≃ 0. However, the condition *θ* = 0 implies that the viewing direction **X** and the tangent to the occluding curve are identical. This occurs only for a finite number of points on the occluding curve for regular surfaces and, thus, can be expected to have little impact on the sign of the derivative
${\scriptstyle \frac{\partial E}{\partial {\lambda}_{i}}}$ (which is an integral over an infinite number of points of the curve *C*). By contradiction, let us suppose that two neighboring points *X*_{1} and *X*_{2} of the occluding curve (as such, *X*_{1} and *X*_{2} must be visible points) verify the condition *θ* = 0 (e.g.,
${\theta}_{1}=(\widehat{\mathbf{t},{\mathbf{X}}_{\mathbf{1}}})=0$). We would then have
$\mathbf{t}\propto \overrightarrow{{X}_{1}{X}_{2}}\propto {\mathbf{X}}_{1}\propto {\mathbf{X}}_{2}$, i.e., *X*_{1}, *X*_{2}, and the camera center would be aligned, which contradicts the fact that both *X*_{1} and *X*_{2} are visible (either *X*_{1} occludes *X*_{2} or *X*_{2} occludes *X*_{1}).
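Putting the pieces together, (2.27) with the approximation $\sqrt{\kappa_X \kappa_t / K} \simeq 1$ reduces to a weighted sum over the sampled points of *C*. The numpy sketch below uses hypothetical names and is not the authors' implementation:

```python
import numpy as np

def pose_gradient_component(points, normals, ds, residuals, dX_dlam):
    """Discretization of the line integral (2.27), with sqrt(kappa_X kappa_t / K) ~= 1:
        dE/dlambda_i ~= sum_k R_k * ||X_k|| / Z_k^3 * <dX_k/dlambda_i, N_k> * ds_k.
    points, normals, dX_dlam: (n, 3) arrays; ds, residuals: (n,) arrays,
    residuals holding the sampled image statistics R(I(pi(X_k)))."""
    Z = points[:, 2]
    weights = residuals * np.linalg.norm(points, axis=1) / Z**3 * ds
    return float(np.sum(weights * np.einsum('ij,ij->i', dX_dlam, normals)))

# Sanity check on a single point X = (0, 0, 1): weight = 2 * 1 / 1 * 1 = 2.
g = pose_gradient_component(np.array([[0.0, 0.0, 1.0]]),
                            np.array([[0.0, 0.0, 1.0]]),
                            np.array([1.0]), np.array([2.0]),
                            np.array([[0.0, 0.0, 1.0]]))
assert abs(g - 2.0) < 1e-12
```

The per-point derivatives ⟨∂**X**/∂*λ_i*, **N**⟩ would come from (2.28) and (2.29).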

We now report experimental results obtained for both synthetic and real datasets. Different 3D models of rigid objects (see Figure 1) were used to perform segmentation and tracking tasks that highlight the robustness of our technique to *initialization*, *noise*, and *missing or imperfect information*. The shapes of the objects, notably the horse, the elephant, and the Van Gogh bust, cannot readily be described in terms of geometric primitives (lines, ellipses, etc.) or even algebraically, and thus they do not satisfy the working hypotheses of standard pose estimation techniques [2, 3, 5, 6].

Figure 5 shows segmentation results (and 3D coordinate recoveries) obtained using our approach for a synthetic color image. Results were obtained running (2.14) until convergence. Figure 6 shows results for diverse natural color images, obtained using (2.21). Despite initializations that are quite far from the truth (e.g., large errors in translation or angular position), accurate segmentations are obtained. Figure 7 shows tracking results obtained for a real sequence, using the flow of (2.12). The sequence is composed of 32 images of a rigid toy horse. The images were taken from discrete positions of a calibrated camera that underwent a complete rotation around the object. The camera “jumps” between successive images, creating large changes in the pose of the object that needs to be recovered (e.g., changes in the angular position of the camera can exceed 15° between frames). Tracking this sequence would be challenging for many 3D pose estimation techniques available in the literature: A number of techniques using local features such as points or edges (e.g., [1, 3]) are likely to be thrown off by the textured/noisy background (false features) and get trapped in local minima. The sequence was tracked with our technique, using a very simple scheme: For each image, initialization was performed using the pose parameters corresponding to the minimum of the energy obtained for the preceding image, and our approach was run until convergence. Despite the difficulties described above, very satisfying tracking performance was observed. This highlights the robustness of the technique to initialization, since the large camera jumps are accommodated and the method is not trapped in local minima. We note that, to save computational time, a down-sampled and smoothed version of the 3D model obtained in [23] was used, which explains why some finer details (e.g., regions of high curvature such as the ears of the horse) are not captured by the segmentation.
This highlights another robustness aspect of the methodology: The 3D model does not need to be perfect to lead to satisfying results. Also, it can be noticed that region-based active contour techniques, such as [10], would lead to reasonably accurate segmentations on this particular sequence. However, these approaches would not determine the pose of the object, which is valuable information for tracking applications.
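The tracking scheme just described (seed each frame with the previous frame's minimizer, then descend until convergence) can be sketched as follows. Here `track`, `energy_gradient`, and the toy quadratic energy are hypothetical stand-ins for the flows of (2.12)/(2.21), not the authors' code:

```python
import numpy as np

def track(frames, lam0, energy_gradient, step=1e-3, iters=200):
    """Gradient descent on the six pose parameters, seeding each frame with
    the minimizer found for the previous one."""
    poses, lam = [], np.asarray(lam0, dtype=float)
    for image in frames:
        for _ in range(iters):                    # run "until convergence"
            g = energy_gradient(image, lam)
            lam = lam - step * g
            if np.linalg.norm(g) < 1e-6:
                break
        poses.append(lam.copy())                  # previous minimum seeds next frame
    return poses

# Toy stand-in energy: a quadratic bowl around a fixed target pose.
target = np.array([1.0, 2.0, 3.0, 0.0, 0.0, 0.0])
poses = track([None] * 3, np.zeros(6), lambda img, lam: lam - target,
              step=0.1, iters=500)
```

With a real energy, `energy_gradient` would evaluate the line integral (2.27) on the current occluding curve.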

Robustness to initialization—segmentation of a synthetic color image. Left: initialization. Middle: intermediate steps of the evolution. Right: final result.

Robustness to initialization—segmentation of natural color images. (a_{n})’s: challenging initializations (e.g., large error in translation or angular positions (green curve)). (b_{n})’s: final results with the proposed approach (green **...**

To test the robustness of our technique to noise, a sequence of 200 images was constructed by continuously transforming the 3D model of the “2D3D” logo and projecting it onto the image plane using the parameters of a simulated calibrated camera (e.g., focal length *f* = 200). The translation parameters, rotation axis, and angle were continuously varied (e.g., the total angle variation over the sequence exceeded 160°) to ensure a large variation of the aspect and position of the object throughout the sequence. From the basic sequence obtained, diverse levels of Gaussian noise were added, with standard deviation ranging from *σ_{n}* = 10% to *σ_{n}* = 100%.
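A noise model of this kind can be sketched as follows. The paper does not spell out how *σ_n* in percent relates to intensities, so the dynamic-range scaling below is our own assumption, and the function name is hypothetical:

```python
import numpy as np

def add_percent_noise(image, sigma_pct, rng=None):
    """Add Gaussian noise whose standard deviation is sigma_pct percent of the
    image dynamic range (our reading of sigma_n = 10%, ..., 100%)."""
    rng = np.random.default_rng() if rng is None else rng
    dyn = float(image.max() - image.min())
    noisy = image + rng.normal(0.0, sigma_pct / 100.0 * dyn, size=image.shape)
    return np.clip(noisy, 0.0, 255.0)          # keep 8-bit intensity range

img = np.linspace(0.0, 255.0, 64).reshape(8, 8)
noisy = add_percent_noise(img, 30, rng=np.random.default_rng(1))
```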

Robustness to noise. Visual tracking results for the sequences involving the “2D3D” logo (green curves). First row: tracked sequence with Gaussian noise of standard deviation σ_{n} = 10%. Second row: tracked sequence for σ **...**

Typical visual results obtained using our approach (flow of (2.12) combined with the tracking scheme alluded to above) are reproduced in Figure 8. For all noise levels, which can be rather large (e.g., in the case *σ_{n}* = 100%, object and background are barely distinguishable), tracking was maintained throughout the whole sequence. Table 1 reproduces the results of the pose estimation procedure, reporting for each image the percent absolute error in the estimated pose parameters.

To test the robustness of our technique to missing information, we created two sequences by adding two different occlusions in the basic sequence featuring the “2D3D” model (see Figure 9). The first occlusion is a gray rectangle that can mask more than 2/3 of the “2D3D” logo. The second occlusion is the word “SHAPE” written in black letters that can mask the object at several places. Gaussian noise of standard deviation 30% was also added to both resulting sequences. Figure 9 presents the results of tracking the sequences of 200 frames with our approach. One notes that despite the occlusions (and noise), accurate segmentations are obtained: In particular, missing letters or parts are accurately localized and reconstructed. Track was maintained throughout both sequences. For the first sequence mean %-absolute error (over the 200 frames) in the transformation parameters was 1.08% for translation (**T**) and 1.57% for rotation (**R**) with standard deviation 0.45% for **T** and 0.75% for **R**. For the second sequence mean %-absolute error was 0.87% for **T** and 1.19% for **R** (standard deviation 0.34% for **T** and 0.53% for **R**).
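The error figures above can be reproduced from per-frame pose estimates with a short helper. `pct_abs_error_stats` and the toy data are hypothetical, shown only to make the reported statistics concrete:

```python
import numpy as np

def pct_abs_error_stats(estimated, truth):
    """Mean and std of the percent absolute error over frames, in the style of
    the figures quoted above (e.g., 1.08% +/- 0.45% for T).
    estimated, truth: (n_frames, n_params) arrays of pose parameters."""
    estimated, truth = np.asarray(estimated), np.asarray(truth)
    per_frame = (100.0 * np.abs((estimated - truth) / truth)).mean(axis=1)
    return per_frame.mean(), per_frame.std()

# A frame-by-frame estimate that is uniformly 1% off gives mean 1%, std 0%.
truth = np.ones((200, 3))
mean_err, std_err = pct_abs_error_stats(truth * 1.01, truth)
```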

Robustness to missing information. Tracking results (green curves) for the “2D3D” sequences with occlusions. First row: sequence with the rectangular occlusion. Second row: sequence with the word “SHAPE” as occlusion. Gaussian **...**

In Figure 10, we used images extracted from the horse sequence and occluded different parts of the horse body (e.g., the legs, which carry valuable information about its angular position). Diverse pose parameters quite far from the truth were used as initializations (e.g., the angular position could be off by more than 30°). Despite the occlusions with various pixel intensities or textures (and poor initializations), very convincing segmentations were obtained. Also, the positions of the object in the camera reference frame were accurately recovered. As can be noticed by comparing with Figure 7, the results in the presence of occlusion are very comparable to those without occlusion.

In Figures 11 and 12, we present segmentation results where the background and object are difficult or impossible to distinguish based on pixel statistics only (due to specular reflections on the object, similar colors in object and background, or occlusions). The results obtained with the (infinite dimensional) active contour flow of [11], which is the region-based segmentation technique underlying our approach, are not satisfying since the contour leaks into the background. Robust results are obtained using our approach.

The experiments of Figures 9, 10, 11, and 12 would pose a major challenge to most region-based active contour techniques, even those using shape priors [17, 19, 20]: Statistics alone are not sufficient to segment the images, and the aspect of the object changes drastically from one image to the other. Hence, a large catalogue of 2D shapes would need to be learned to achieve similar performance using the methods in [17, 19, 20], for instance.

In this section, we present tracking results for three challenging sequences of images. The first two sequences are composed of 250 frames. In addition to a cluttered background, important changes in the size and aspect of the object occur due to camera motion. The third sequence is composed of 450 frames. In this sequence, the object is manually moved, which creates a partial occlusion as well as changes in the background and angular position of the object. Using the flow of (2.21) and our tracking scheme, the three sequences were convincingly tracked in their entirety. Figure 13 presents some of the typical results obtained.

In this work, we presented a region-based approach to the 3D pose estimation problem. This approach differs from other 3D pose estimation algorithms since it does not rely on local image features. Our method allows one to employ global image statistics to drive the pose estimation process. This confers a satisfying level of robustness to noise and initialization to our framework and bypasses the need to establish correspondences between image and object features, contrary to most 3D pose estimation approaches.

Furthermore, the approach possesses the typical qualities of a region-based active contour technique with shape prior, such as robustness to occlusion or missing information, without the need to evolve an infinite dimensional contour. Also, the prior knowledge of the shape of the object is compactly represented by a unique 3D model, instead of a dense catalogue of 2D shapes.

The main advantage of the proposed technique is that it enables one to locate the object not only in 2D images (a task typically handled by GAC approaches) but also in the world (a task typically handled by 2D-3D pose estimation algorithms). This makes the method particularly suitable for tracking applications involving a unique calibrated camera.

A possible direction for future research is to extend the proposed approach to include the knowledge of multiple 3D shapes. In particular, the method in [18] (where evolution of parameters in the shape space is performed in addition to pose parameters) could be adapted to the problem at hand. It is expected that the resulting framework will allow one to learn the possible deformations of the object and lead to robust performances for nonrigid registration and tracking tasks.

This work was supported in part by grants from NSF, AFOSR, ARO, and MURI, as well as by a grant from NIH (NAC P41 RR-13218) through Brigham and Women’s Hospital. This work is part of the National Alliance for Medical Image Computing (NAMIC), funded by the National Institutes of Health through the NIH Roadmap for Medical Research, grant U54 EB005149. Information on the National Centers for Biomedical Computing can be obtained from http://nihroadmap.nih.gov/bioinformatics.

^{1}This refers to the Iterative Closest Point Algorithm.

^{2}More general models of cameras (see [36, 37]) can be straightforwardly handled. We make this assumption here to simplify the presentation.

^{3}One can assume that the center of gravity of *S*_{0} coincides with the camera center and that the rotation is known.

^{4}Moreover, intrinsic behaviors due to a particular choice of similarity measure that can be observed in the GAC framework, where an infinite dimensional curve is evolved, are likely to be less prominent in our particular framework where the shape of the segmenting curve can only be possible silhouettes of the 3D object.

^{5}For grayscale images, *μ*_{O/B} are scalars. For color images, *μ*_{O/B} ∈ ℝ^{3}.

^{6}For grayscale images, Σ_{O/B} is a scalar. For color images, Σ_{O/B} ∈ ℝ^{3×3}. Texture can also be used; see [16].

^{7}The case where Σ_{in} = Σ_{out} is treated as above.

^{8}We refer to the condition ⟨**X**, **N**⟩ = 0, which is rarely exactly met in practice due to the sampling of the 3D surface.

1. Quan L, Lan ZD. Linear n-point camera pose determination. IEEE Trans Pattern Anal Mach Intell. 1999;21:774–780.

2. Dhome M, Richetin M, Lapreste JT. Determination of the attitude of 3D objects from a single perspective view. IEEE Trans Pattern Anal Mach Intell. 1989;11:1265–1278.

3. Marchand E, Bouthemy P, Chaumette F. A 2D-3D model-based approach to real-time visual tracking. Image Vision Comput. 2001;19:941–955.

4. Zerroug M, Nevatia R. Pose estimation of multi-part curved objects. Proceedings of the International Symposium on Computer Vision (ISCV ’95); 1995. p. 431.

5. Rosenhahn B, Perwass C, Sommer G. Pose estimation of free-form contours. Int J Comput Vision. 2005;62:267–289.

6. Drummond T, Cipolla R. Real-time tracking of multiple articulated structures in multiple views. Proceedings of the 6th European Conference on Computer Vision (ECCV); 2000. pp. 20–36.

7. Caselles V, Kimmel R, Sapiro G. Geodesic active contours. Int J Comput Vision. 1997;22:61–79.

8. Kichenassamy S, Kumar S, Olver P, Tannenbaum A, Yezzi A. Conformal curvature flow: From phase transitions to active vision. Arch Rational Mech Anal. 1996;134:275–301.

9. Chun Zhu S, Yuille AL. Region competition: Unifying snakes, region growing, and Bayes/MDL for multiband image segmentation. IEEE Trans Pattern Anal Mach Intell. 1996;18:884–900.

10. Chan T, Vese L. Active contours without edges. IEEE Trans Image Process. 2001;10:266–277. [PubMed]

11. Paragios N, Deriche R. Geodesic active regions: A new paradigm to deal with frame partition problems in computer vision. J Vis Commun Image Represent. 2002;13:249–268.

12. Dambreville S, Yezzi A, Niethammer M, Tannenbaum A. A variational framework combining level-sets and thresholding. Proceedings of the British Machine Vision Conference (BMVC); 2007. pp. 266–280.

13. Michailovich O, Rathi Y, Tannenbaum A. Image segmentation using active contours driven by the Bhattacharyya gradient flow. IEEE Trans Image Process. 2007;16:2787–2801. [PMC free article] [PubMed]

14. Kim J, Fisher J, Yezzi A, Cetin M, Willsky A. Nonparametric methods for image segmentation using information theory and curve evolution. Proceedings of the 2002 IEEE International Conference on Image Processing (ICIP); 2002. pp. 797–800. [PubMed]

15. Rousson M, Cremers D. Efficient kernel density estimation of shape and intensity priors for level set segmentation. Proceedings of the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI); Berlin, Heidelberg: Springer-Verlag; 2005. pp. 757–764. [PubMed]

16. Paragios N, Deriche R. Geodesic active regions for supervised texture segmentation. Proceedings of the IEEE International Conference on Computer Vision (ICCV); 1999. pp. 926–932.

17. Leventon M, Grimson E, Faugeras O. Statistical shape influence in geodesic active contours. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR); 2000. pp. 316–323.

18. Tsai A, Yezzi T, Wells W, Tempany C, Tucker D, Fan A, Grimson E, Willsky A. A shape-based approach to the segmentation of medical imagery using level sets. IEEE Trans Med Imaging. 2003;22:137–153. [PubMed]

19. Cremers D, Kohlberger T, Schnoerr C. Shape statistics in kernel space for variational image segmentation. Pattern Recognition. 2003;36:1929–1943.

20. Dambreville S, Rathi Y, Tannenbaum A. Shape-based approach to robust image segmentation using kernel PCA. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2006. pp. 977–984. [PMC free article] [PubMed]

21. Osher S, Fedkiw R. Level Set Methods and Dynamic Implicit Surfaces. Springer-Verlag; New York: 2003.

22. Turk G, Levoy M. Proceedings of SIGGRAPH. ACM; New York: 1994. Zippered polygon meshes from range images; pp. 311–318.

23. Yezzi A, Soatto S. Structure from motion for scenes without features. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR); 2003. pp. 171–178.

24. Dambreville S, Sandhu R, Yezzi A, Tannenbaum A. Robust 3D pose estimation and efficient 2D region-based segmentation from a 3D shape prior. Proceedings of the European Conference on Computer Vision (ECCV); 2008. pp. 169–182.

25. Faugeras OD, Keriven R. Variational principles, surface evolution PDEs, level set methods, and the stereo problem. IEEE Trans Image Process. 1998;7:336–344. [PubMed]

26. Yezzi A, Soatto S. Stereoscopic segmentation. Int J Comput Vision. 2003;53:31–43.

27. Unal G, Yezzi A, Soatto S, Slabaugh G. A variational approach to problems in calibration of multiple cameras. IEEE Trans Pattern Anal Mach Intell. 2007;29:1322–1338. [PubMed]

28. Riklin-Raviv T, Kiryati N, Sochen N. Prior-based segmentation by projective registration and level sets. Proceedings of the IEEE International Conference on Computer Vision (ICCV); 2005. pp. 204–211.

29. Rosenhahn B, Brox T, Weickert J. Three-dimensional shape knowledge for joint image segmentation and pose tracking. Int J Comput Vision. 2007;73:243–262.

30. Schmaltz C, Rosenhahn B, Brox T, Cremers D, Weickert J, Wietzke L, Sommer G. Pattern Recognition and Image Analysis, Lecture Notes in Comput. Sci. 4478. Springer-Verlag; Berlin: 2007. Region-based pose tracking; pp. 56–63.

31. DoCarmo MP. Differential Geometry of Curves and Surfaces. Prentice–Hall; Englewood Cliffs, NJ: 1976.

32. Bray M, Kohli P, Torr P. PoseCut: Simultaneous segmentation and 3D pose estimation of humans using dynamic graph-cuts. Proceedings of the European Conference on Computer Vision (ECCV); 2006. pp. 642–655.

33. Kohli P, Rihan J, Bray M, Torr P. Simultaneous segmentation and pose estimation of humans using dynamic graph cuts. Int J Comput Vision. 2008;79:285–298.

34. Lepetit V, Fua P. Monocular model-based 3D tracking of rigid objects: A survey. Found Trends Comput Graph Vis. 2005;1:1–89.

35. Li G, Tsin Y, Genc Y. Exploiting occluding contours for real-time 3D tracking: A unified approach. Proceedings of the IEEE International Conference on Computer Vision (ICCV); 2007. pp. 1–8.

36. Forsyth D, Ponce J. Computer Vision: A Modern Approach. Prentice–Hall; Englewood Cliffs, NJ: 2003.

37. Hartley R, Zisserman A. Multiple View Geometry in Computer Vision. Cambridge University Press; Cambridge, UK: 2000.

38. Ma Y, Soatto S, Kosecka J, Sastry S. An Invitation to 3D Vision. Springer-Verlag; New York: 2005.

39. Rousson M, Deriche R. A variational framework for active and adaptative segmentation of vector valued images. Proceedings of the Workshop on Motion and Video Computing, IEEE Computer Society; Washington, DC. 2002.

40. Gonzalez RC, Woods RE. Digital Image Processing. 3. Prentice–Hall; Upper Saddle River, NJ: 2008.
