|Home | About | Journals | Submit | Contact Us | Français|
A common problem that affects object alignment algorithms is when they have to deal with objects with unseen intra-class appearance variation. Several variants based on gradient-decent algorithms, such as the Lucas-Kanade (or forward-additive) and inverse-compositional algorithms, have been proposed to deal with this issue by solving for both alignment and appearance simultaneously. In , Baker and Matthews showed that without appearance variation, the inverse-compositional (IC) algorithm was theoretically and empirically equivalent to the forward-additive (FA) algorithm, whilst achieving significant improvement in computational efficiency. With appearance variation, it would be intuitive that a similar benefit of the IC algorithm would be experienced over the FA counterpart. However, to date no such comparison has been performed. In this paper we remedy this situation by performing such a comparison. In this comparison we show that the two algorithms are not equivalent due to the inclusion of the appearance variation parameters. Through a number of experiments on the MultiPIE face database, we show that we can gain greater refinement using the FA algorithm due to it being a truer solution than the IC approach.
The accurate alignment of images is essential to almost all tasks in computer vision. Lucas and Kanade  devised an algorithm to image alignment using a gradient-descent approach, which worked on the basis of iteratively minimising the squared difference between a template and input image. Since then, the Lucas-Kanade algorithm has become one of the most widely used techniques in computer vision. However, an inherent problem with the Lucas-Kanade algorithm is the computational cost associated with the calculating the gradients of the input image at each iteration. In the seminal work of Baker and Matthews , they alleviated this overhead by swapping the role of template and input image, thus allowing for significant savings in computation by pre-computing the gradients from the template. In this overview paper, Baker and Matthews theoretically and empirically showed that their implementation, which they called the inverse-compositional (IC) algorithm, was equivalent to the Lucas-Kanade (or otherwise known as the forward-additive (FA)) algorithm. This result was important as the IC algorithm provided an efficient framework in which tasks like object alignment and tracking could be performed in real-time .
A problem with gradient-descent approaches such as the FA and IC algorithms, are that they have poor generalisation properties when they have to deal with previously unseen intra-class object variation . This is due to the assumption that the template appears within the input image which is problematic. For example for aligning faces (see Figure 1), if we learn a template face from an ensemble of training face images and try to align it to an unseen face, there is a very high probability that the template face and input face will be misaligned. This is because of the discrepancy between the appearance of the template face and the input face image.
In an attempt overcome this inherent mismatch, Black and Jephson  proposed an algorithm that incorporated the appearance variation in the template through an ensemble of known appearance variation images (which were learnt from a stack of images) and unknown appearance parameters. In this work, Black and Jephson utilised the FA framework. Baker et al.  employed a similar strategy, however they employed the IC algorithm. They called this the simultaneous algorithm, as it solved for both alignment and appearance at the same time using the IC framework.
As noted previously, it was shown in  that the IC algorithm was much more computationally efficient than the FA approach, whilst achieving the same alignment performance. As such, it would be intuitive that the simultaneous IC algorithm would enjoy the same performance and efficiency improvement over the simultaneous FA algorithm. However, no such comparison has been made. In an attempt to remedy this situation, in this paper we focus on providing a comparison between these two algorithms. From this comparison, we will show that these assumptions are not valid as the appearance variation parameters are included into different terms for both approaches, thus resulting in different solutions. The appearance parameters also diminish the advantage in efficiency as no pre-computation can take place1.
In this paper, we show that the simultaneous FA algorithm provides a truer solution than the simultaneous IC algorithm as it does not include any second order terms. As a result, we show by using the simultaneous FA approach, we can get finer alignment compared to the simultaneous IC algorithm. Hence, our specific contributions stemming from this paper are:
The standard Lucas-Kanade or forward-additive (FA) alignment algorithm attempts to find the parametric warp p that minimises the sum of the squared error between the template image T and the input imageY 2, such that
where the vector is a concatenation of individual pixel 2D coordinates x within the template image T. One can map the image positions z to a set of new positions z′,
based on the warp parameters p. The warp function is canonically translation (x; p) = x + p, but (x; p) has been extended to numerous other types of warps such as affine [5,3] and piece-wise affine . In this paper we focus on the affine warp, which consists of n = 6 parameters such that p = [p0, …, p5]T. Minimising Equation (1) is a nonlinear optimisation task, so to optimise it we assume that a current estimate of p is known and then iteratively solve for increments to the parameters Δp, where p ← p + Δp. As such, we minimise the following expression
with respect to Δp. It is worth noting that from here on we shall refer to T((z; 0)) as T(z) because (z; 0) is the identity warp as (z; 0) = z. Before we can minimise Equation (3), we need to linearise Y ((z; p + Δp)). We can do this via the Taylor series expansion which yields
From this we can denote the steepest-decent images as
and the error image as
Therefore, substituting in the steepest-decent and error images back into Equation (4), the minimisation with respect to Δp gives
Setting Equation (7) to equal zero and solving for Δp yields
where the Hessian matrix
We iteratively solve for Δp and then update p ← p + Δp until convergence or a maximum number of iterations is met. This non-linear optimisation process is called Gauss-Newton optimisation , but other least-square variants such as Newton and Levenberg-Marquardt optimisations can be used .
As pointed out by Baker et al. , there is a huge computational cost associated with the FA algorithm as the Hessian matrix in Equation (9) needs to be calculated every iteration. To overcome this problem, Baker et al.  found a way of switching the roles of the template and the input image, by compositionally inverting the warp parameters. By doing this, it allows the Hessian to be calculated from the constant template image, which which allows it to be precomputed, resulting in considerable computational savings. This is called the inverse-compositional (IC) algorithm and as mentioned, it is implemented by essentially inverse-composing the warp update, i.e. (z; p) ← ((z; Δp)−1; p). Thus, instead of minimising Equation (4), we attempt to minimise
By employing a similar process to the FA algorithm, we can obtain the Δp which minimises the expression. Updating our current estimate of the warp parameters p is done by evaluating (z; p) ← ((z; Δp)−1; p). So for the IC algorithm,
where the Hessian matrix is
and the steepest-decent images3 are
and the error image is
In , the FA and IC algorithms are shown to obtain the same image alignment performance. We also show this to be the case in the experiments presented in Section 5. However, this is not the case when appearance variation is included. We investigate this in the next section.
The image alignment algorithms described in the previous section both assume that the template appears in the input image. However, as noted back in Fig. 1, this assumption is likely to cause misalignment due to the appearance variation between the template and the input image. To counteract this, Baker et al.  included this linear appearance variation within the framework of the IC algorithm. They achieved this by incorporating the appearance variation within the template by letting , where Ai, i = 1, …, m is a set of known appearance variation images and λi, i = 1, …, m is a set of unknown appearance parameters . The set of known appearance variation images Ai, i = 1, …, m, were usually computed by applying principal component analysis (PCA) to a set of training images and keeping the eigenvectors which correspond to the top 95% of the energy. Therefore, by including the appearance variation we want to minimise
simultaneously with respect to Δp and Δλ = (Δλ1, …, λm)T, and then updating the warp (z; p) ← ((z; Δp)−1; p) and the appearance parameters λ ← λ + Δλ. The appearance parameters are just additively updated as composition has no meaning for them . As the expression in Equation 15 is non-linear, we need to linearise it by performing a first order Taylor expansion on T((z; Δp)) and Ai((z; Δp)). This yields
For this expansion, we can see that the last term is a second-order term which makes solving for Δp and Δλ prohibitive (see Appendix A for details). As such this term is neglected which addressed later on in the next subsection. For the IC algorithm, neglecting this term simplifies the expression to:
From this we can determine the (n + m) × N dimensional steepest-descent images as
where q = [p, λ]T and Δq = [Δp, Δλ]T, and where q is the associated n + m dimensional column vector. The error image can then be computed as
The change in the warp parameters p and λ can therefore be found by solving for the Δq which minimises
with respect to Δq, which results in
where the Hessian matrix is
It should be noted that in the simultaneous IC algorithm, in addition to it being an approximation, the computational benefit of the inverse-composition is lost as the Hessian matrix has to be recomputed every iteration as the appearance parameters are included in the steepest-decent image calculation.
Incorporating the appearance variation within the FA algorithm is performed in the same manner as the IC algorithm, where we include the appearance variation in the template. This changes the original FA minimisation equation in (4) to
As we can see from the above equation, we only have to linearise one term and not two like for the simultaneous IC scenario. Linearising this equation gives us
which results in no second-order terms, which was the case in the simultaneous IC algorithm. Solving this equation by minimising with respect to Δq, gives us
where the Hessian matrix is
and the steepest-decent images are
The error image is described by Equation (19). In the simultaneous FA algorithm, we update both the warp parameters p and appearance parameters λ additively (i.e. p ← p + Δp and λ ← λ + Δλ). Like the other algorithms, we continue this iterative process until convergence or the maximum number of iterations is met. In the next section we show in our experiments that even though the original FA and IC algorithms yield the same alignment performance, the same can not be said for the simultaneous algorithms due to the appearance variables.
All experiments in this paper were conducted on the frontal portion of the MultiPIE face database . A total of 1128 images were employed in this subset of the database taken across 141 subjects. Face images varied substantially in expression (see Fig. 2 for examples) making detection and alignment quite challenging. This data set was separated into subject independent training and testing sets with 560 and 568 images in each set respectively. In our experiments we assume that all warped images are referenced as an 80 × 80 image. We obtained the poor initial alignment by synthetically adding affine noise to the ground-truth coordinates of the face (located at the eyes and nose). We randomly generated affine warps (z; p) in the following manner. We used the top left corner (0, 0), the top right corner (79, 0) and the center bottom pixel (39, 79). We then perturbed these points with a vector generated from white Gaussian noise. The magnitude of this perturbation was controlled to give a desired root mean squared (RMS) pixel error (PE) from the ground-truth coordinates. During learning, the initially misaligned images were defined to have between 5–10 RMS-PE. This range of perturbation was chosen as it approximately reflected the range of alignment error error seen when employing the freely available OpenCV [?] face detector (see Fig. 1 for examples of this perturbation). In all experiments we conducted in this paper our warped training and testing examples are assumed to have zero bias and unit gain.
To gauge what affect solving both for alignment and appearance has on the alignment performance on unseen faces, we compared the simultaneous FA and IC algorithms to just the normal FA and IC algorithms (no appearance variation). To do this we used an alignment convergence curve (ACC) for 5, 7.5 and 10 RMS-PE. These curves are shown in Figures 3, ,44 and and55 respectively. These curves have a threshold distance in RMS-PE on the x-axis and the probability of trials that achieved convergence (i.e., final alignment RMS-PE below the threshold) on the y-axis. A perfect alignment algorithm would receive an ACC that has 1 (or 100%) convergence for all threshold values.
Across all these results, it should first be noted that for the various initial perturbations, both the FA and IC algorithms without appearance variation achieve the same alignment performance (Figures 3(a), 4(a) and 5(a)). This backs up the results by Baker et al. in . However, the interesting results come when the appearance variation parameters are included. Specifically in Fig. 3(b) where adding the appearance variation sees both the FA and IC algorithms improve in alignment convergence over Fig. 3(a), with the simultaneous FA outperforming the IC counterpart by quite a margin (~ 8%). A major reason for this is that due to the simultaneous FA algorithm being a truer solution than the simultaneous IC algorithm (Section III(b)), this results in greater refinement.
In Fig. 4(b), it can be seen that including the appearance variation does improve the alignment convergence over the normal FA and IC algorithms (Fig. 4(a)) as well. However, both the simultaneous FA and IC algorithms achieve approximately the same performance, which is different from the results in Fig. 3(b). A possible explanation of this maybe that even though the simultaneous FA algorithm may provide greater alignment as seen in Fig. 3(b), the fact that the initial error is farther away (~ 7.5 pixels) may mean that this improvement is somewhat diminished as sometimes the algorithm may get caught in local minima. It appears that due formulation of the simultaneous IC algorithm including the appearance variation in the steepest-decent images (compare Equations (18) to (27)), more of a blurring effect occurs within this process, making it less prone of to getting caught in local minima. The trade-off with this though is that closer refinement is difficult to achieve, as seen in Fig. 3(b).
This above hypothesis is somewhat reinforced by the results in Fig. 5(b) with an initial error of 10 RMS-PE. For this experiment, only the simultaneous IC algorithm achieved a improvement in alignment convergence using the appearance variation parameters, with the simultaneous FA algorithm achieving the same as the original FA algorithm. We believe this to be caused again by the algorithm getting caught in local minima. Again, a possible explanation of the simultaneous IC algorithm being superior for when the initial error is large, is that due to it including the appearance variation in the steepest-decent images, it can circumvent this problem.
In this paper we presented a comparison between the simultaneous FA and IC algorithms, which has not been done before. In this comparison we showed that these two algorithms were not equivalent theoretically due to the inclusion of the appearance variation parameters into the template image. As the role of the template is switched between the FA and IC algorithms, this resulting in the simultaneous IC algorithm containing second-order terms, which had to be omitted. This switching of roles also affected the calculation of the steepest-decent images for both algorithms, with the simultaneous IC algorithm having both the gradients of the template and appearance images within them compared to just the template for the simultaneous FA algorithm. Due to this we found through our experiments on the MultiPIE database that the simultaneous FA algorithm performance significantly better when the initial alignment error was low (~ 5 RMS-PE) as the gradients were much finer, while the simultaneous IC gradients were much coarser. This can also account for the results when the initial error was larger, with the results of the simultaneous FA algorithm diminishing as it is more likely to get caught in local minima with the finer gradients. Conversely, with the coarser gradients in the simultaneous IC algorithm, the alignment performance improved.
Section III illustrates the derivation of the Simultaneous Inverse Composition Algorithm for image alignment including appearance variation. The derivation presented involves neglecting a second order term in order to make a solution possible. Here, we expand what was briefly mentioned in Section III by illustrating to the reader the complications which arise by retaining the second order term.
where Δλ = [Δλ1, …, Δλm]T and f (Δp, Δλ) is the second order term
The partial derivatives for all terms in Equation 28 except for the the partial derivatives of f() are shown in Section III. The partial derivatives of f() are shown in Equations 30 and 31 respectively.
As both partial derivatives are still functions of Δp and Δλ, the Δq which minimises Equation 28 requires obtaining Δq which satisfies the equation
Minimising this expression is at best, non-trivial, and is the justification for the removal of the second order term.
1It is worth noting that the simultaneous IC framework does allow for the “project-out” algorithm  which is a very computationally efficient algorithm. However, this algorithm suffers from poor alignment when the appearance variation is large due to simplifying assumptions. As we are interested in the performance, the “project-out” algorithm was not considered in this paper.
2For convenience we employ the notation that ||a−b||2 = (a−b)T(a−b), where a and b are column vectors
3It is worth noting that when we are finding the Jacobian matrix as the Jacobian is found as