3D surfaces are important geometric models for many objects of interest in image analysis and Computational Anatomy. In this paper, we describe a Bayesian inference scheme for estimating a template surface from a set of observed surface data. To achieve this, we use the geodesic shooting approach to construct a statistical model for the generation and observation of random surfaces. We develop a mode approximation EM algorithm to compute the maximum a posteriori estimate of the initial momentum μ, which determines the template surface. Experimental results on caudate, thalamus, and hippocampus data are presented.
3D surfaces are important geometric models for many objects of interest in image analysis and Computational Anatomy. For example, they are often used to represent the shape of 3D objects, the surface of human faces, and the boundaries of brain structures or of other human organs. Most data analysis methods in this domain are template-centered, and a proper estimation of a template plays an important role in obtaining high-quality results. This paper is devoted to the description of a statistically supported template estimation method adapted to surface data sets.
Our approach will be to build a generative statistical shape model in which the template is a parameter. We will then estimate it using maximum likelihood. This model relies on the very natural setting in which an observed surface is a noisy version of a random deformation of the template. This is the most generic and basic form of the deformable template paradigm, even if we add a small refinement by including a prior distribution on the template, based on what we will call a hypertemplate. Even though this global scheme is fairly simple, we will see that implementing it in the context of surfaces constitutes a significant theoretical and numerical challenge.
With the exception of the recent work of , this approach significantly differs from what has been mostly proposed in the literature, in which most of the methods compute templates as averages over specific common parametrizations of the surfaces (using, for example, the sphere as a parameter space ). Parametric representations, however, are limited by the fact that, because they are defined a priori and independently for each object, they cannot be assumed to suitably align important features in a given data set of surfaces (i.e., give similar coordinates to similar features in the surfaces). This usually results in oversmoothed template surfaces (which is the equivalent of getting blurry template images in the case of image averaging). In , a similar diffeomorphic transformation model is used, but, as we will see, our Bayesian construction will provide a well-specified template, whereas  needs to rely on topologically unconstrained approximations to end up with a manageable template.
In addition to the references above, there have been several publications addressing the issue of shape averaging over a dataset, although most of them work with 3D volume data or landmark point sets. In several cases, the average is based on metric properties of the space of shapes [3–7], and the template is computed as an intrinsic average, minimizing the sum of square distances to each element of the set (Fréchet mean). Such methods have been implemented in the context of diffeomorphic deformation models (which are also models of interest in the present paper) for landmark matching , for 3D average digital atlas construction , and to quantify variability in heart geometry . Other definitions of the average, adapted to situations in which the data is corrupted by noise, have been proposed [11–13], based on variational approaches (but not relying on a generative statistical model).
Our approach to build a generative statistical shape model is reminiscent of the construction developed in  for linear models of deformations, and in  for large diffeomorphic deformations in 3D volume image averaging. Adapting these ideas to surfaces will however require new algorithms and numerical developments.
In order to present our model, we first need to provide some background material and notation, describing in particular the geodesic shooting equations that we will use to generate deformed surfaces. We will then introduce a random statistical model describing the generation and the observations of random surfaces. We then develop a Mode Approximation EM algorithm for surface averaging, to estimate the template from observations. In the optimization part, we derive and implement a new variational scheme, which is also applicable to surface matching, providing an alternative approach to the one originally proposed in [16, 17]. Finally, we present and discuss experimental results on caudate, thalamus, and hippocampus data.
We will base our random shape model on the so-called EPDiff equation, which describes the evolution of deformable structures (like images, surfaces, or landmarks) under the action of groups of diffeomorphisms. It is a geodesic equation for a Riemannian metric on diffeomorphisms, and describes a momentum conservation law in the associated Hamiltonian system. The reader interested in the theory behind this equation can refer to [18–20], but most of this background will not be needed for the present paper, in which we will only use the specific form of the equations for surface evolution. The term EPDiff comes from its determination as an Euler-Poincaré equation in the group of diffeomorphisms, as introduced in . One of its main interests here is that it provides a numerically stable, easily described, Hamiltonian evolution over diffeomorphisms, which will constitute an ideal support for our shape models.
The EPDiff equations describe the combined time evolution of a diffeomorphism, denoted ϕ(t, ·) and of what can be interpreted as a momentum, denoted p(t, ·). The initial conditions are always ϕ(0, x) = x for ϕ, and some initial value, p0, for p. This initial momentum will be a key component of the statistical model that will be built later on.
Let us start with the simplest form of the equation, which assumes that p0 is a vector-valued function over ℝ³, that is, p0 : ℝ³ → ℝ³. It involves a smoothing kernel, K, defined on ℝ³ × ℝ³, a typical choice being
Letting ∇1K denote the gradient of K with respect to its first variable, the corresponding EPDiff equation is
Here, the notation a · b refers to the usual dot product between vectors in ℝ³. If K is smooth enough, this system has solutions for arbitrarily large t, and the mapping x ↦ ϕ(t, x) is a diffeomorphism at all times.
The interesting fact about these equations is that they can have singular variants that are described in a similar way and have the same existence properties. The simplest way to relate the variants to the previous equation is to replace the Lebesgue measure in the integrals by another, possibly singular, measure. For example, taking a surface S0 in ℝ³, we can use the volume form on S0 as a reference measure and obtain the equations (in which p0 and p(t, ·) need only to be defined on S0)
where dωS0 is the volume form on S0. Note that the first equation is defined for x ∈ ℝ³, but it suffices to use it for x ∈ S0 to obtain an equation for the evolving surface
We can write a discrete form of the equations by replacing dy by a sum of Dirac measures (at points x1,…, xL in ℝ³), which gives, letting al(t) = p(t, xl),
Similarly to (3), the first equation is valid for all x ∈ ℝ³, but it suffices to solve it for x = xl, l = 1,…, L to obtain the evolution of the point set
The evolution of point sets is the most important case in practice, since it is an ODE that can be easily implemented. Assuming a radial kernel K(x, y) = γ(||x−y||²) as in (1), and denoting γkl = γ(||xk−xl||²) and γkl′ = γ′(||xk−xl||²), (5) can be rewritten as
Once the initial position of the vertices, x(0) = (x1(0),…, xL(0)), and the initial momentum, a(0) = (a1(0),…, aL(0)), are provided, the evolution of the point set is uniquely determined.
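For concreteness, the point-set system above can be integrated in a few lines of code. The sketch below is ours, not from the paper: it assumes the Gaussian kernel γ(r²) = exp(−r²/σ²) and a plain forward-Euler scheme (a higher-order integrator would be used in practice).

```python
import numpy as np

def gamma(r2, sigma2=1.0):
    # Gaussian radial kernel gamma(r^2) = exp(-r^2 / sigma^2)
    return np.exp(-r2 / sigma2)

def gamma_prime(r2, sigma2=1.0):
    # Derivative gamma'(r^2)
    return -np.exp(-r2 / sigma2) / sigma2

def shoot(x0, a0, steps=20, sigma2=1.0):
    """Forward-Euler integration, on [0, 1], of the point-set geodesic equations:
       dx_k/dt = sum_l gamma_kl a_l,
       da_k/dt = -2 sum_l gamma'_kl (a_k . a_l) (x_k - x_l)."""
    x, a = x0.astype(float).copy(), a0.astype(float).copy()
    dt = 1.0 / steps
    for _ in range(steps):
        diff = x[:, None, :] - x[None, :, :]            # (L, L, 3): x_k - x_l
        r2 = np.sum(diff ** 2, axis=-1)                 # (L, L) squared distances
        g, gp = gamma(r2, sigma2), gamma_prime(r2, sigma2)
        dot = a @ a.T                                   # (L, L): a_k . a_l
        dx = g @ a                                      # velocity of each vertex
        da = -2.0 * np.sum((gp * dot)[:, :, None] * diff, axis=1)
        x += dt * dx
        a += dt * da
    return x, a
```

A quick sanity check: for a single point the momentum stays constant and the point is simply translated by a(0).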
If a triangulated template surface T with vertices x(T) is given, and we solve, until time t = 1, (5) initialized with x(0) = x(T) and a random initial momentum a(0) = α, the displaced vertices provide a random perturbation of the initial surface that will be denoted by Tα. This is stated in the following definition.
Let T be a triangulated surface with vertices x(T) = (x1(T),…, xL(T)). Let α ∈ (ℝ³)^L be a collection of L vectors in ℝ³. Let (x(t), a(t)) be the solution of (5) with initial condition x(0) = x(T) and a(0) = α. One defines Tα to be the triangulated surface with vertices x(Tα) = x(1) and the same topology as T.
By letting α be random, we build Tα as a random deformation of T. This will form the “ideal”, unobserved, surface, of which only a noisy version is observable (the noise process will be described in the next section).
Following , we will use a Bayesian formulation in which T is itself represented as a random deformation T0,μ : = (T0)μ, where T0 is a fixed surface that we will call the hypertemplate, and μ is a prior initial momentum shooting from T0 to T (same notation as in Definition 1). One of the main interests of using a hypertemplate is to fix the topology of T so that it belongs, by construction, to the same class of objects as T0.
So, if N surfaces are observed, we need to model the probability distribution of the prior momentum, μ (starting at T0), which specifies T = T0,μ, and of the N deformation momenta α(1),…, α(N), which specify the surfaces Tα(1),…, Tα(N). We now provide a statistical model for the joint probability distribution of μ, α(1),…, α(N).
We first introduce some notation. Letting K be the kernel introduced in the previous section to define the geodesic shooting equations, we let ΓT be the 3L × 3L matrix formed with the 3 × 3 blocks K(xk(T), xl(T))Id3. We define, for a triangulated surface T with L vertices x(T), and α ∈ ℝ^{3L},
We define the joint distribution of μ, α(1),…, α(N) on ℝ^{3L} × (ℝ^{3L})^N by
where λ is a fixed parameter regulating the weight on the hypertemplate.
There is a technical difficulty here, which is that one must make sure that this probability can be normalized (Z exists), which requires that the exponential is integrable. That this is true is not straightforward, and we have not been able to find a proof that works with any choice of the kernel K. One way to deal with this is to introduce a constant Aμ (which can be chosen arbitrarily large so that it does not interfere with the algorithms that will follow), and add to the model the constraint that ||μ||T0 is smaller than Aμ. Under such an assumption, one obtains (after integrating out the α's)
This is finite, since, for any given μ, the transformation x(T0) → x(T0,μ) is the restriction of a diffeomorphism to the vertices of T0 (as seen from (5)). This implies that ΓT0,μ is nonsingular, and its determinant is bounded away from 0 when μ is restricted to a compact space.
In fact, the choice Aμ = ∞ can be proved to be acceptable for a large class of kernels. These are kernels for which the smallest eigenvalue of ΓT decreases at a speed which is at most polynomial in the minimal distance, hT, between the vertices. A list of kernels satisfying this property can be found in . For such kernels, we find that (L being fixed) log det ΓT = O(log hT). Sketching the argument, one can prove, using elementary properties of dynamical systems, that hT = O(exp(−C||μ||T0)) for some constant C, so that the log determinant in (9) is linear in ||μ||T0 and Z is well defined, even with Aμ = ∞. For very smooth kernels, including the Gaussian, bounds on the smallest eigenvalue of ΓT are much worse (with a decay which is exponential in −1/hT²), and the previous argument does not work. Since the bounds in  hold uniformly with respect to the number of points, a polynomial bound may still be valid for a fixed L, although we were unable to discover one.
Notice that, conditionally on the template, the momenta α are independent and follow a Gaussian distribution with inverse covariance matrix ΓT. An example of simulated random deformations obtained using such a model is provided in Figure 1.
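Conditionally on T, simulating such a momentum therefore amounts to drawing a centered Gaussian vector whose precision matrix is ΓT. A minimal sketch (our naming, Gaussian kernel assumed; a small jitter is added because ΓT is typically badly conditioned):

```python
import numpy as np

def gram_matrix(x, sigma2=1.0):
    """Gamma_T: the 3L x 3L matrix of 3 x 3 blocks K(x_k, x_l) Id_3,
    here with the Gaussian kernel K(x, y) = exp(-|x - y|^2 / sigma^2)."""
    r2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.kron(np.exp(-r2 / sigma2), np.eye(3))

def sample_momentum(x, sigma2=1.0, seed=None):
    """Draw alpha ~ N(0, Gamma_T^{-1}), i.e. with precision matrix Gamma_T."""
    rng = np.random.default_rng(seed)
    L = len(x)
    G = gram_matrix(x, sigma2) + 1e-8 * np.eye(3 * L)   # jitter for conditioning
    C = np.linalg.cholesky(G)                            # G = C C^T
    # If G = C C^T, then C^{-T} z has covariance C^{-T} C^{-1} = G^{-1}.
    alpha = np.linalg.solve(C.T, rng.standard_normal(3 * L))
    return alpha.reshape(L, 3)
```

Feeding such a sample to the geodesic shooting equations produces one random deformation of the template, as in Figure 1.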
The second part of our generative model is to describe the observation process, which takes an ideal surface Tα generated according to the model above, and returns the noisy observable.
Modeling such a noise process is a tricky issue. Obvious choices (like adding noise to the vertices of Tα) do not work because one cannot assume that the observed surfaces are discretized consistently with the template. In this paper, we will work around this issue by assimilating the observation of a surface to that of a singular Gaussian process.
For this, we consider that surfaces in ℝ³ are not observable directly, but only via their action on test functions, which we will call sensors. We define a sensor to be a smooth vector field w over ℝ³ (typically with small support). Given an oriented surface S, define
where NS is the normal to S. The real number, (S, w) is the measurement of S through the sensor w.
Modeling noisy surfaces then amounts to assuming that, given any w, the measurement (S, w) is a random variable. We will assume that it is Gaussian, and more generally, that, given m sensors, w1,…, wm, the random vector ((S, wj), j = 1,…, m) is Gaussian.
S, via its action on sensors, is therefore modeled as a Gaussian random field. Given an ideal surface Tα, we will assume that its mean is given by
and the process is thereafter uniquely characterized by its covariance operator
We will assume that this covariance is associated to a symmetric operator Lobs so that
(The apparently redundant parameter σ2, which could have been included in Lobs, appears here because it can be easily estimated by the algorithm, with the operator Lobs remaining constant.)
To finalize our model, it remains to describe how S is discretized, that is, to make explicit a finite family of sensors through which S is measured. Let (zj, j ∈ J) form a regular grid of points in Ω. Let γs be a radial basis function (e.g., a Gaussian) and define, for j ∈ J and d = 1, 2, 3,
where ed is the dth vector of the standard basis of ℝ³ (this therefore specifies 3|J| sensors). The resulting observed variables are
where NS(d) = NS · ed is the dth coordinate of NS. These variables are, by assumption, jointly Gaussian, with means mj,d = (Tα, wj,d) and covariance matrix
Assuming that Lobs is translation invariant, the resulting expression is a function of zj − zj′ that we will denote
Let Robs = (r(obs)jj′) be the inverse of the matrix with coefficients (γobs(zj − zj′), j, j′ ∈ J). The log likelihood of the process will include error terms taking the form
Replacing yj,d by its expression in (16), we have
Let us analyze the first term. We have
with the notation
Treating the other terms similarly, we can rewrite the error term in the form
and we will abbreviate this (introducing a notation for the right-hand side) as (1/σ²)||S−Tα||obs². Thus, we can write
We have the following important proposition.
Assume that γs is taken equal to the Green function of Lobs. Then, when the grid J becomes finer and Kobs is given by (22), one has
See the appendix for a proof of this proposition (which requires some background on the theory of Hilbert spaces with a reproducing kernel) and possible extensions. For practical purposes, we will use γobs instead of Kobs in ||·||obs, therefore assuming that the sensors are chosen according to the proposition. It is interesting to notice that the resulting norm in this case is precisely the norm that has been introduced in  to compare surfaces, when they are considered as elements of a reproducing kernel Hilbert space of currents. We refer to  for details on the mathematical construction.
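To make the observation model concrete, here is a sketch of how the measurements y_{j,d} = (S, w_{j,d}) could be computed for a triangulated surface, using one quadrature point per triangle (the face center, with the area-weighted normal) and assuming a Gaussian γs; the function name is ours.

```python
import numpy as np

def sensor_measurements(verts, faces, grid, sigma2=1.0):
    """Approximate y_{j,d} = (S, w_{j,d}) with w_{j,d}(x) = gamma_s(x - z_j) e_d:
    integrate w . N_S over S with one-point quadrature per triangle.
    `grid` is the (|J|, 3) array of sensor locations z_j; returns a (|J|, 3) array."""
    v0, v1, v2 = (verts[faces[:, i]] for i in range(3))
    c = (v0 + v1 + v2) / 3.0                      # face centers
    n = 0.5 * np.cross(v1 - v0, v2 - v0)          # area-weighted normals
    r2 = np.sum((c[:, None, :] - grid[None, :, :]) ** 2, axis=-1)  # (M, |J|)
    k = np.exp(-r2 / sigma2)                      # gamma_s evaluated at c_f - z_j
    return k.T @ n                                # entry (j, d) approximates y_{j,d}
```

For a single triangle in the plane z = 0 with a sensor at its center, only the third component of the measurement is nonzero and equals the triangle's area.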
In the rest of the paper, we develop a parametric procedure to estimate the template T from the observation of i.i.d. surfaces S(1), S(2),…, S(N) generated as described above. This includes in particular N hidden deformation momenta α(1), α(2),…, α(N), such that the complete distribution of observed and unobserved variables is
where we have written for short T = T0,μ.
We now discuss the estimation of the parameter μ, and of the associated template T0,μ, which is the main purpose of this paper. This will be implemented with a mode approximation of the EM algorithm, as described in the next section.
Let Θ = (α(1), α(2),…, α(N)) be the hidden part of the process, representing the collection of initial momenta, and let S = (S(1), S(2),…, S(N)) be the collection of observed surfaces. The complete distribution for the process, including the prior is given by (26).
The EM algorithm is an iterative method that updates a current estimation of μ using the following two E- and M-steps.
E-step: determine the conditional expectation, given the observations S and the current estimate of μ, of the complete log likelihood evaluated at a candidate parameter μ′.
M-step: maximize this expression with respect to μ′ and replace the current estimation of μ by the obtained maximizer.
The conditional expectation can be expanded as
Considering the highly nonlinear relation between α(n) and the deformed surface Tα(n), an explicit computation in (27) is impossible. We will therefore rely on the classical mode approximation of the EM, which replaces the conditional distribution by a Dirac measure at the conditional mode, yielding
This results in a maximum a posteriori procedure that alternately maximizes in μ and in the α(n)'s. Like in , we will refer to it as a mode approximation to the EM (MAEM) rather than a MAP algorithm, in order to emphasize the fact that it is an approximation of the maximum a posteriori procedure relying on the likelihood of the observed data only. As illustrated in , the MAEM can be biased (leading to inexact estimation of the template, even with a very large number of samples), especially when the noise level is high, but it is obviously more feasible than the exact EM. Notice that the MAEM method is a special form of EM algorithm, and as such optimizes a lower bound of the log likelihood of the observed data.
We summarize the two steps of the ith iteration of the MAEM in our case. Suppose μ and the α(n)'s are the current variables to be updated. Then the next iteration is as follows.
MAE step: with μ (and therefore T) fixed, find, for n = 1,…, N, the momentum α̂(n) minimizing
and replace α(n) by α̂(n).
M step: with α(n) fixed, update μ with the minimizer of
We now discuss how each of these steps can be implemented.
Our goal in this section is to optimize (30). We work with fixed n and drop it from the notation to simplify the expressions. The objective function is
This problem is equivalent to the surface matching algorithm considered in , with a slightly different formulation since  optimize an energy with respect to a time-dependent momentum instead of just the initial momentum (i.e., they solved simultaneously the geodesic estimation and the matching problems). These two formulations are equivalent when using continuous time (they produce the same minima), but they yield different results when discretized. In our setting, formulating the problem as in (32) is natural, and focuses on the modeled random variable, α.
We need to compute the variation of E with respect to α. This computation will be useful for the M-step also. We first discuss the discretization of the error term, which follows . Let S be a triangulated surface, with vertices x(S) = (x1(S),…, xL(S)) and faces F(S) = (f1(S),…, fM(S)). Each face is represented by an ordered triple of vertices: f = (x1f, x2f, x3f), and we define the face centers and area-weighted normals by
Then a discrete approximation of ||S−S′||obs2 is
where S is considered as fixed and U is therefore considered as a function of the vertices, x(S′), of S′.
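This discrete norm is straightforward to implement. A sketch, assuming a Gaussian γobs (helper names are ours): centers and area-weighted normals of both surfaces are pooled, with opposite signs for the two surfaces, and the kernel double sum is evaluated.

```python
import numpy as np

def centers_normals(verts, faces):
    """Face centers and area-weighted normals of a triangulated surface."""
    v0, v1, v2 = (verts[faces[:, i]] for i in range(3))
    return (v0 + v1 + v2) / 3.0, 0.5 * np.cross(v1 - v0, v2 - v0)

def current_dist2(vS, fS, vT, fT, sigma2=1.0):
    """Discrete approximation of ||S - T||_obs^2 in the space of currents:
    sum over all face pairs of gamma_obs(c_f - c_g) (N_f . N_g), with sign +1
    for faces of S and -1 for faces of T."""
    cS, nS = centers_normals(vS, fS)
    cT, nT = centers_normals(vT, fT)
    c = np.vstack([cS, cT])
    n = np.vstack([nS, -nT])                       # opposite signs for S and T
    r2 = np.sum((c[:, None, :] - c[None, :, :]) ** 2, axis=-1)
    return float(np.sum(np.exp(-r2 / sigma2) * (n @ n.T)))
```

The distance vanishes when the two surfaces coincide and is positive otherwise, which provides a simple sanity check.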
With this notation, we can write
We want to compute the gradient of E and for this, apply the chain rule to the transformations α → Tα → U(x(Tα)), yielding
where A* is the transpose matrix of A.
The gradient of U with respect to x(S′) has been computed in , and is given as follows. We denote by Fl(S) the set of faces (triangles) that contain a vertex xl in a triangulated surface S. For f ∈ Fl(S), we let el(f) denote the edge opposite xl in f, positively oriented in f. With S′ = Tα, we have
where ag = 1 if g ∈ F(S′) and ag = −1 if g ∈ F(S).
Now let us derive the variation of x(Tα) with respect to the initial momentum, α. We know that x(Tα) = x(1), where x and a evolve according to the system (7)
with x(0) = x(T) and a(0) = α (and γkl, γkl′ are short for γ(||xk−xl||2) and γ′(||xk−xl||2)). Now an infinitesimal variation α → α + δα in the initial condition induces infinitesimal variations a + δa and x + δx over time, and the pair (δx, δa) obeys the following differential system, that can be obtained from a formal differentiation of (7):
with γkl′′ = γ′′(||xk−xl||2).
One can rewrite it in the matrix form:
Solving this system with initial condition δx(0) = 0 and δa(0) = δα provides what we have denoted
One does not need to compute all the coefficients of the matrix dx(Tα)/dα using this equation in order to apply the transpose in (36). This is fortunate because doing so would constitute a computationally demanding effort, given that this matrix is 3L × 3L with L large. The right hand side of (36) can in fact be computed directly using a single dynamical system, given by
This is a simple consequence of the theory of linear differential systems (a proof is provided in the Appendix for completeness). Note that the matrix J(t) depends on the solution of (7) computed with initial conditions x(0) = x(T) and a(0) = α. To emphasize this dependency, we will denote it J(t) = J(T,α)(t) in the following.
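The mechanism at work here is generic: for a linear system δż = J(t)δz, applying the transpose of the time-1 Jacobian to a vector u only requires integrating the adjoint equation η̇ = −J(t)*η backward from η(1) = u. A small self-contained illustration (explicit Euler stepping, our naming):

```python
import numpy as np

def adjoint_transpose_apply(J, u, steps=1000):
    """Compute (dz(1)/dz(0))^T u for the linear system z' = J(t) z by
    integrating eta' = -J(t)^T eta backward in time from eta(1) = u."""
    dt = 1.0 / steps
    eta = u.astype(float).copy()
    for i in reversed(range(steps)):
        t = (i + 0.5) * dt                 # midpoint of the current step
        eta = eta + dt * (J(t).T @ eta)    # explicit Euler step, backward in time
    return eta
```

For a constant J the result should match exp(J)* u, which can be checked on a small example.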
Given this, we see that a variation α → α + δα induces a first-order variation δE of the energy given by
where the T-dot product is ⟨δα, α⟩T = δα · (ΓTα) and ΓT is the matrix formed with 3 × 3 blocks γklId3.
We choose to operate the gradient descent with respect to this dot product and therefore choose a variation proportional to δα = −(2α + ΓT−1ηα(0)). So, the algorithm to compute an optimal α is the following.
This algorithm has to be applied N times (once for each α(n), n = 1,…, N) in the MAE step.
Since the matrix ΓT is typically very badly conditioned, we numerically compute ΓT−1ηα after adding a small positive number to the diagonal of ΓT. The inversion itself is done using conjugate gradient.
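In code, one regularized update could look as follows (a direct solve is shown for clarity where the text uses conjugate gradient; the step size and names are ours):

```python
import numpy as np

def update_momentum(alpha, eta0, gamma_t, step=0.1, ridge=1e-6):
    """One descent step alpha <- alpha - step * (2 alpha + Gamma_T^{-1} eta_alpha(0)),
    with a small ridge added to the diagonal of Gamma_T to compensate for its
    bad conditioning."""
    G = gamma_t + ridge * np.eye(gamma_t.shape[0])
    return alpha - step * (2.0 * alpha + np.linalg.solve(G, eta0))
```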
There are many similarities between the M-step and the E-step variational problems, so we will only sketch the details of the computation here. We need to minimize
Let us consider the variation of each term in the sum (fixing n, that we temporarily drop from the notation). Since
we can write
The function U being defined as before, we see that the derivative of the second term is given by applying the chain rule again, this time in the form
Like in the previous section, the transpose of the differential applied to the gradient of U can be computed by solving a dynamical system backward in time. In fact, it is the same system as with the variation in α, namely,
still initialized with ηx(1) = x(Tα)U and ηα(1) = 0, but the relevant result now is ηx(0). The gradient of is then
Once this is computed, the next step is to compute (reintroducing n in the notation)
This follows a similar procedure, using (44), with T0 instead of T and μ instead of α. This requires solving
initialized with and ημ(1) = 0. The variation of associated to an infinitesimal variation of μ is then
We summarize the M step in the following algorithm.
We finally summarize the surface template estimation algorithm:
Having the hypertemplate T0 and observed surfaces S(1),…, S(N), the goal is to estimate the template T. Let T, μ, α(1),…, α(N) denote the current estimates, with initial guesses T = T0, μ = 0, α(n) = 0. Then, at each iteration, update them with the following steps:
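Stripped of the surface-specific details, the overall procedure is an alternating minimization, which can be sketched generically as below; the two minimizers are problem-specific callables (hypothetical names, standing in for the MAE and M steps).

```python
def maem(update_alpha, update_mu, mu0, alphas0, n_iter=10):
    """Mode-approximation EM loop:
    MAE step: refit each deformation momentum alpha^(n) with mu fixed;
    M step:   refit the template momentum mu with the alpha^(n) fixed."""
    mu, alphas = mu0, list(alphas0)
    for _ in range(n_iter):
        alphas = [update_alpha(mu, a, n) for n, a in enumerate(alphas)]
        mu = update_mu(mu, alphas)
    return mu, alphas
```

On a toy quadratic problem, this loop converges geometrically to the expected fixed point, mirroring the energy decrease observed in the experiments below.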
We applied the algorithm to surface data of the human brain's caudate, thalamus, and hippocampus. All data are courtesy of the Center for Imaging Science at Johns Hopkins University. Each surface has around 5–10 thousand triangle cells. We randomly chose one surface as the hypertemplate and used the others as observed surfaces. In these experiments, we set λ = 1.0 and σ² = 1.0.
We also applied the algorithm to 101 hippocampus surfaces from the BIRN (Biomedical Informatics Research Network) project. In Figure 4, Panels (a)–(h) show 8 examples of the 101 observations, Panel (i) is the hypertemplate, and Panel (j) is the estimated template.
The result is visually satisfying in the sense that the estimated template agrees with a qualitative representation of the population. For example, in the caudate experiment, the estimated template has a noticeably narrower upper tip than the hypertemplate. This captures a population characteristic, since most observed data have narrower upper tips. Notice also that the obtained template does not look smoother than the rest of the population, as templates estimated by averaging over fixed parametrizations typically do.
Figure 5 shows how the energy in (31) changes with the iterations in the caudate experiment. This confirms the effectiveness and convergence of the algorithm. One can see that the energy drops quickly in the first twenty iterations and that the decrease then gradually slows down. After the 35th iteration, the energy changes little and the estimated template remains stable.
In our model, the hypertemplate can be provided by an atlas obtained from other studies, although we here simply choose one of the surfaces in the population. Actually, as Figure 6 shows, different choices for the hypertemplate yield very similar results.
In this paper, we have presented a Bayesian approach for surface template estimation. We have built, for this purpose, a generative statistical model for surfaces: the construction first applies a random initial momentum to the template surface, then assumes an observation process using test functions and noise. The template is itself assumed to be generated as a random deformation of a hypertemplate, completing the Bayesian model. We used a mode approximation EM scheme to estimate the surface template, and introduced for this purpose a novel surface matching algorithm that optimizes with respect to the initial momentum. The procedure has been tested on caudate, thalamus, and hippocampus surface data, showing its effectiveness and convergence, and it was also shown experimentally to be robust to variations in the choice of the hypertemplate.
This work was partially supported by R01-MH056584, R01-EB000975, R01-MH60883, P50-MH071616, P01-AG003991, P01-AG0262761, P41-RR015241, R24-RR021382, and NSF DMS-0456253.
Let us assume that Lobsγs = δ0, that is, that γs is the Green function of the operator Lobs. This assumption implies, in particular, that γs = γobs. Given a smooth function f, denote
Define the vector space
Introduce the RKHS V with scalar product given by
Then fJ can be identified with the orthogonal projection of f on VJ, because r(obs) is the inverse of the Gram matrix (matrix of dot products) of (γobs(·−zj), j ∈ J), which are the generators of VJ, and
Now, let Jm be a sequence of progressively finer grids with resolution ϵm tending to 0 as m tends to infinity. We prove that ||fJm−f||V → 0, for which it suffices to prove that ⋃m VJm is dense in V. This is equivalent to the fact that no nonzero g in V can be orthogonal to all the VJm's. But g orthogonal to VJm is only possible when, for all j ∈ Jm,
Since the zj's form an arbitrarily fine grid in Ω and functions in V are continuous, this implies g = 0.
So, fJ → f in V, which implies, for example, pointwise convergence as soon as V is embedded in the set of continuous functions (that is, if γobs is continuous). This directly implies Proposition 1 in the case γs = γobs, by taking f(x) = γobs(x − y) for a given y.
An extension of this result applies when γs is obtained by applying some linear operator, say A, to γobs. One then has
and passing to the limit in the sum (for which one needs γs V and A continuous on V), one gets .
We here justify the procedure described in Section 4, and prove the following fact: consider the solution, z(t) ∈ ℝ^p, of the ordinary differential equation
where J is a time-dependent operator (a p × p matrix). Then, for any u ∈ ℝ^p, we have
where η is the solution of the differential equation
with η(1) = u.
Let us prove this result. First remark that since z is the solution of a linear equation, it depends linearly on the initial condition, z(0). More precisely, let M(s, t) be the p × p matrix such that
and M(s, s) = Idp. Then z(t) = M(0, t)z(0), so that the claim to be proved is η(0) = M(0,1)*u. Using the identity M(t, s)M(s, t) = Idp, we obtain
so that ∂sM(s, t) = −M(s, t)J(s). Transposing this equation yields ∂sM(s, t)* = −J(s)*M(s, t)*, and we take t = 1.
Thus, if η(s) = M(s,1)*u, we have η(1) = u and ∂sη = −J(s)*η, as announced.