Article sections

- Abstract
- 1. Introduction
- 2. MM algorithm
- 3. Multivariate t distribution
- 4. Finite mixture models
- 5. Factor analysis
- 6. Multidimensional scaling
- 7. A one-way random effects model
- 8. Discussion
- References


Scand Stat Theory Appl. Author manuscript; available in PMC 2010 December 15.

Published in final edited form as:

Scand Stat Theory Appl. 2010 December; 37(4): 612–631.

doi: 10.1111/j.1467-9469.2009.00681.x

PMCID: PMC3002233

NIHMSID: NIHMS205487

Hua Zhou, Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA 90095-1766, USA. Email: huazhou@ucla.edu

Maximum likelihood estimation in many classical statistical problems is beset by multimodality. This article explores several variations of deterministic annealing that tend to avoid inferior modes and find the dominant mode. In Bayesian settings, annealing can be tailored to find the dominant mode of the log posterior. Our annealing algorithms involve essentially trivial changes to existing optimization algorithms built on block relaxation or the EM or MM principle. Our examples include estimation with the multivariate *t* distribution, Gaussian mixture models, latent class analysis, factor analysis, multidimensional scaling and a one-way random effects model. In the numerical examples explored, the proposed annealing strategies significantly improve the chances for locating the global maximum.

Multimodality is one of the curses of statistics. The conventional remedies rely on the choice of the initial point in iterative optimization. It is a good idea to choose the initial point to be a suboptimal estimate such as a method of moments estimate. Unfortunately, this tactic does not always work, and statisticians turn in desperation to multiple random starting points. The inevitable result is an inflation of computing times with no guarantee of success.

In combinatorial optimization, simulated annealing often works wonders (Metropolis *et al.*, 1953; Kirkpatrick *et al.*, 1983; Press *et al.*, 1992). This fruitful idea from statistical physics also applies in continuous optimization, but it still entails an enormous number of evaluations of the objective function. In a little-noticed paper, Ueda & Nakano (1998) adapt simulated annealing to deterministic estimation in admixture models. In modifying the standard expectation maximization (EM) algorithm for admixture estimation, they retain the annealing part of simulated annealing and drop the simulation part. Annealing operates by flattening the likelihood surface and gradually warping the substitute surface towards the original surface. By the time all of the modes reappear, the iterates have entered the basin of attraction of the dominant mode.

In this article, we explore several variations of deterministic annealing. All of these involve surface flattening and warping. A tuning parameter controls the process of warping a relatively flat surface with a single or handful of modes into the ultimate bumpy surface of the objective function. Our constructions are admittedly *ad hoc* and tailored to specific problems. Consequently, readers expecting a coherent theory are apt to be disappointed. In our defense, these problems have resisted solution for a long time, and it is unrealistic to craft an overarching theory until we better understand the nature of the enemy. Readers with a mathematical bent will immediately recognize our debt to homotopy theory in topology and central path following in convex optimization.

We specifically advocate four different tactics: (i) degrees of freedom inflation, (ii) noise addition, (iii) admixture annealing, and (iv) dimension crunching. Each of these techniques compares favourably with multiple random starts in a concrete example. In some cases, the intermediate functions constructed in annealing no longer bear a statistical interpretation. This flexibility should be viewed as a positive rather than a negative. We focus on EM algorithms (Dempster *et al.*, 1977; McLachlan & Krishnan, 2008), the closely related MM algorithms (de Leeuw, 1994; Heiser, 1995; Becker *et al.*, 1997; Lange *et al.*, 2000; Hunter & Lange, 2004; Wu & Lange, 2008) and block relaxation algorithms (de Leeuw, 1994) because they are easy to program and consistently drive the objective function uphill or downhill. In our examples, the standard algorithms require only minor tweaking to accommodate annealing. The MM algorithm has some advantages over the EM algorithm in the annealing context as MM algorithms do not require surrogate functions to be likelihoods or log-likelihoods.

Our examples rely on a positive tuning parameter *ν* attaining an ultimate value *ν*^{∞} defining the objective function. The initial *ν*^{0} starts either very high or low. When *ν*^{∞} is finite, after every *s* iterations we replace the current value *ν ^{n}* of *ν* by *rν ^{n}* for a fixed ratio *r*, stopping once *ν* reaches *ν*^{∞}.

Our examples include: (i) estimation with the *t* distribution, (ii) Gaussian mixture models, (iii) latent class analysis, (iv) factor analysis, (v) multidimensional scaling, and (vi) a one-way random effects model. Example (vi) demonstrates the relevance of annealing to maximum *a posteriori* estimation. These well-known problem areas are all plagued by the curse of multimodality. Eliminating inferior modes is therefore of great interest. Our first vignette on the *t* distribution is designed to help the reader visualize the warping effect of annealing. Before turning to specific examples, we briefly review the MM algorithm.

The MM algorithm (de Leeuw, 1994; Heiser, 1995; Becker *et al.*, 1997; Lange *et al.*, 2000; Hunter & Lange, 2004; Wu & Lange, 2009), like the EM algorithm, is a principle for creating algorithms rather than a single algorithm. There are two versions of the MM principle. In maximization the acronym MM stands for iterative minorization-maximization; in minimization it stands for majorization-minimization. Here we deal only with the maximization version. Let *f*(*θ*) be the objective function we seek to maximize. An MM algorithm involves minorizing *f*(*θ*) by a surrogate function *g*(*θ*|*θ ^{n}*) anchored at the current iterate *θ ^{n}*. Minorization combines the two conditions

$$f\left({\theta}^{n}\right)=g({\theta}^{n}\mid {\theta}^{n})$$

(1)

$$f\left(\theta \right)\ge g(\theta \mid {\theta}^{n}),\phantom{\rule{thickmathspace}{0ex}}\theta \ne {\theta}^{n}.$$

(2)

In other words, the surface $\theta \mapsto g(\theta \mid {\theta}^{n})$ lies below the surface $\theta \mapsto f\left(\theta \right)$ and is tangent to it at the point *θ* = *θ ^{n}*. Construction of the minorizing function *g*(*θ*|*θ ^{n}*) constitutes the first M of the algorithm.

In the second M of the algorithm, we maximize the surrogate *g*(*θ*|*θ ^{n}*) rather than *f*(*θ*) and designate the maximizing point as *θ*^{n+1}. The string of inequalities

$$f\left({\theta}^{n+1}\right)\ge g({\theta}^{n+1}\mid {\theta}^{n})\ge g({\theta}^{n}\mid {\theta}^{n})=f\left({\theta}^{n}\right),$$

reflects definitions (1) and (2) and the choice of *θ*^{n+1}. The ascent property is the source of the MM algorithm's numerical stability. Strictly speaking, it depends only on increasing *g*(*θ*|*θ ^{n}*), not on maximizing it.
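
To make the ascent chain concrete, here is a small self-contained Python sketch (our own illustration, not from the article) that maximizes *f*(*θ*) = −Σ_{i} |*x_{i}* − *θ*|, whose maximizers are the sample medians. The standard quadratic bound |*u*| ≤ *u*²/(2*c*) + *c*/2 with *c* = |*u ^{n}*| supplies, after negation, a minorizer of *f* tangent at *θ ^{n}*; maximizing the surrogate gives a weighted-mean update.

```python
import numpy as np

def mm_median(x, theta0, iters=200, eps=1e-12):
    """MM ascent for f(theta) = -sum_i |x_i - theta|; maxima are sample medians."""
    theta = theta0
    trace = [-np.abs(x - theta).sum()]
    for _ in range(iters):
        # weights from the quadratic minorizer, tangent at the current iterate
        w = 1.0 / (np.abs(x - theta) + eps)
        theta = (w * x).sum() / w.sum()      # maximizer of the surrogate
        trace.append(-np.abs(x - theta).sum())
    return theta, np.array(trace)

x = np.array([-20.0, 1.0, 2.0, 3.0])
theta, trace = mm_median(x, theta0=-25.0)
```

Every iterate obeys the ascent inequality, and for these data the iterates settle inside the median interval [1, 2].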

The multivariate *t* distribution is often employed as a robust substitute for the normal distribution in data fitting (Lange *et al.*, 1989). For location vector $\mu \in {\mathbb{R}}^{p}$, a positive definite scale matrix $\Omega \in {\mathbb{R}}^{p\times p}$ and degrees of freedom *α*>0, the multivariate *t* distribution has density

$$f\left(x\right)=\frac{\Gamma \left({\scriptstyle \frac{\alpha +p}{2}}\right)}{\Gamma \left({\scriptstyle \frac{\alpha}{2}}\right){\left(\alpha \pi \right)}^{p\u22152}{\mid \Omega \mid}^{1\u22152}{[1+{\scriptstyle \frac{1}{\alpha}}{(x-\mu )}^{t}{\Omega}^{-1}(x-\mu )]}^{(\alpha +p)\u22152}},\phantom{\rule{thickmathspace}{0ex}}x\in {\mathbb{R}}^{p}.$$

Maximum likelihood estimation of the parameters *μ* and Ω for fixed *α* is challenging because the likelihood function can exhibit multiple modes. Values of *α* below 1 are particularly troublesome. The standard EM algorithm (Lange *et al.*, 1989) updates are

$$\begin{array}{cc}\hfill {\mu}^{n+1}& =\frac{1}{{v}^{n}}\sum _{i=1}^{m}{w}_{i}^{n}{x}_{i},\hfill \\ \hfill {\Omega}^{n+1}& =\frac{1}{m}\sum _{i=1}^{m}{w}_{i}^{n}({x}_{i}-{\mu}^{n+1}){({x}_{i}-{\mu}^{n+1})}^{t},\hfill \end{array}$$

where the superscripts *n* and *n*+1 indicate iteration number, *m* is the sample size, and ${v}^{n}={\sum}_{i=1}^{m}{w}_{i}^{n}$ is the sum of the case weights

$${w}_{i}^{n}=\frac{\alpha +p}{\alpha +{d}_{i}^{n}},\phantom{\rule{1em}{0ex}}{d}_{i}^{n}={({x}_{i}-{\mu}^{n})}^{t}{\left({\Omega}^{n}\right)}^{-1}({x}_{i}-{\mu}^{n}).$$

An alternative faster algorithm (Kent *et al.*, 1994; Meng & van Dyk, 1997) updates Ω by

$${\Omega}^{n+1}=\frac{1}{{v}^{n}}\sum _{i=1}^{m}{w}_{i}^{n}({x}_{i}-{\mu}^{n+1}){({x}_{i}-{\mu}^{n+1})}^{t}.$$

We will use this faster EM algorithm in the subsequent numerical examples. When estimating *μ* with both Ω and *α* fixed, one simply omits the Ω updates.
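
A minimal NumPy sketch of this faster EM (our illustration; the authors distribute Matlab code) follows. The monotone increase of the log-likelihood serves as a basic correctness check.

```python
import numpy as np
from math import lgamma, log, pi

def t_loglik(X, mu, Omega, alpha):
    """Multivariate t log-likelihood with fixed degrees of freedom alpha."""
    m, p = X.shape
    R = X - mu
    d = np.einsum('ij,jk,ik->i', R, np.linalg.inv(Omega), R)
    c = lgamma((alpha + p) / 2) - lgamma(alpha / 2) - (p / 2) * log(alpha * pi)
    return m * c - (m / 2) * log(np.linalg.det(Omega)) \
           - ((alpha + p) / 2) * np.log1p(d / alpha).sum()

def t_em_fast(X, alpha, iters=200):
    """Fast EM: both mu and Omega updates divide by v^n, the sum of case weights."""
    m, p = X.shape
    mu, Omega = X.mean(axis=0), np.cov(X.T, bias=True)
    trace = [t_loglik(X, mu, Omega, alpha)]
    for _ in range(iters):
        R = X - mu
        d = np.einsum('ij,jk,ik->i', R, np.linalg.inv(Omega), R)
        w = (alpha + p) / (alpha + d)            # case weights w_i^n
        v = w.sum()                              # v^n
        mu = (w[:, None] * X).sum(axis=0) / v    # location update
        R = X - mu
        Omega = (w[:, None] * R).T @ R / v       # faster scale update
        trace.append(t_loglik(X, mu, Omega, alpha))
    return mu, Omega, np.array(trace)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
mu, Omega, trace = t_em_fast(X, alpha=1.0)
```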

There are two strategies for flattening the likelihood surface. The first involves degree of freedom inflation. As *α* tends to ∞, the score function for the location parameter *μ* tends to the normal score with a single root equal to the sample mean. EM annealing substitutes *ν* for *α*, starts with a large value of *ν*, and works its way down to *ν*^{∞} = *α*. As the iterations proceed, the EM iterates migrate away from the sample mean towards the dominant mode.

The second annealing strategy adds noise. Given *m* independent observations *x*_{1},..., *x _{m}*, the log-likelihood is

$$\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}L(\mu ,\Omega )=-\frac{m}{2}\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}\mid \Omega \mid -\frac{\alpha +p}{2}\sum _{i=1}^{m}\mathrm{ln}[\alpha +{({x}_{i}-\mu )}^{t}{\Omega}^{-1}({x}_{i}-\mu )]+c,$$

where *c* is an irrelevant constant. In annealing, we maximize the modified log-likelihood

$$\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}L(\mu ,\Omega ,v)=-v\frac{m}{2}\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}\mid \Omega \mid -\frac{\alpha +p}{2}\sum _{i=1}^{m}\mathrm{ln}[\alpha +{({x}_{i}-\mu )}^{t}{\Omega}^{-1}({x}_{i}-\mu )]+c.$$

Taking the positive tuning constant *ν*<1 flattens the likelihood surface, while taking *ν*>1 sharpens it. Under annealing, the standard EM algorithm updates Ω by

$${\Omega}^{n+1}=\frac{1}{vm}\sum _{i=1}^{m}{w}_{i}^{n}({x}_{i}-{\mu}^{n+1}){({x}_{i}-{\mu}^{n+1})}^{t}.$$

(3)

Following the derivation of the alternative EM algorithm in Wu & Lange (2008), one can demonstrate with effort that the alternative EM algorithm updates Ω by

$${\Omega}^{n+1}=\frac{1}{{v}^{\ast}{v}^{n}}\sum _{i=1}^{m}{w}_{i}^{n}({x}_{i}-{\mu}^{n+1}){({x}_{i}-{\mu}^{n+1})}^{t},$$

where *ν** = *να*/[*α*+(1 − *ν*)*p*]. Observe that *ν** is an increasing function of *ν* with limit 1 as *ν* tends to 1. Annealing is achieved by gradually increasing *ν* from a small positive value to the target value 1.

Another way of adding noise is to work with the modified log-likelihood function

$$\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}L(\mu ,\Omega \u2215v)=-\frac{m}{2}\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}\mid \Omega \mid -\frac{\alpha +p}{2}\sum _{i=1}^{m}\mathrm{ln}[\alpha +v{({x}_{i}-\mu )}^{t}{\Omega}^{-1}({x}_{i}-\mu )]+c.$$

The MM principle now dictates making the trivial change

$${w}_{i}^{n}=\frac{\alpha +p}{\alpha +v{d}_{i}^{n}}$$

in the case weights in updating *μ* and Ω. Again annealing is achieved by gradually increasing *ν* from a small positive value to the target value 1.

For ease of comparison, pseudocode for the original EM algorithm and the three annealing EM algorithms (aEM1–3) is given as algorithms 1–4 in the Appendix. The Matlab code for this and all following examples is available from the authors.

We now illustrate annealing by a classical textbook example for the univariate *t* (Arslan *et al.*, 1993; McLachlan & Krishnan, 2008). The data consist of the four observations −20, 1, 2 and 3. The scale Ω = 1 and degrees of freedom *α* = 0.05 are fixed. The bottom right panel of Fig. 1 shows that the log-likelihood function has modes at −19.9932, 1.0862, 1.9975 and 2.9056 for the given scale and degrees of freedom. The global mode $\widehat{\mu}=1.9975$ reflects the successful downweighting of the outlier −20 by the *t* model. The remaining panels of Fig. 1 illustrate the warping of the log-likelihood surface for different values of *ν*.

Table 1 records the progress of the fast EM algorithm and the aEM1 algorithm starting from the bad guess *μ*^{0} = *−*25. For the aEM1 algorithm, we take *r* = 0.5 and *s* = 1 and start the tuning parameter at *ν*^{0} = 100. The EM algorithm gets quickly sucked into the inferior mode −19.9932, whereas the aEM1 algorithm rapidly converges to the global mode. Doubtful readers may object that the poor performance of EM is an artefact of a bad initial value, but starting from the sample mean −3.5 leads to the inferior mode 1.0862. Starting from the sample median 1.5 does lead to the dominant mode in this example. Table 1 does not cover the performance of algorithms aEM2 and aEM3. Algorithm aEM2 collapses to the ordinary EM algorithm in fixing Ω, and algorithms aEM3 and aEM1 perform almost identically.
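
The comparison in Table 1 is easy to reproduce. The sketch below (our Python rendering, with the schedule *ν*^{0} = 100, *r* = 0.5, *s* = 1 stated in the text) iterates the EM location update with Ω = 1 and *α* = 0.05 fixed.

```python
import numpy as np

x = np.array([-20.0, 1.0, 2.0, 3.0])
alpha, omega = 0.05, 1.0     # fixed degrees of freedom and scale

def em_mu(mu, nu):
    """One EM location update with df nu substituted for alpha in the weights."""
    d = (x - mu) ** 2 / omega
    w = (nu + 1) / (nu + d)
    return (w * x).sum() / w.sum()

# Standard EM from the bad start mu0 = -25
mu = -25.0
for _ in range(500):
    mu = em_mu(mu, alpha)      # trapped at the inferior mode near -19.9932

# aEM1: anneal the degrees of freedom with nu0 = 100, r = 0.5, s = 1
mu_a, nu = -25.0, 100.0
while nu > alpha:
    mu_a = em_mu(mu_a, nu)
    nu = max(nu * 0.5, alpha)
for _ in range(500):
    mu_a = em_mu(mu_a, alpha)  # settles at the global mode near 1.9975
```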

Multimodality is not limited to extremely small values of *α*. For the three data points {0, 5, 9}, the likelihood of the Cauchy distribution (*α* = 1 and Ω = 1) has three modes (example 1.9 of Robert & Casella, 2004). The aEM algorithms easily leap across either inferior mode and converge to the middle global mode. For the sake of brevity we omit details.

In analysing the random data from a bivariate *t* distribution displayed in Table 2, we assume *α* = 0.1 is known and both *μ* and Ω are unknown. The histograms of the final log-likelihoods found by the fast EM algorithm and the aEM algorithms from the same 100 random starting points are shown in Fig. 2. A total of 45 runs of the EM algorithm converge to the inferior mode at −130.28, whereas all runs of the aEM algorithms converge to the global mode −129.8. Despite this encouraging performance, annealing is not foolproof. There exist more extreme examples where the aEM algorithms consistently converge to an inferior mode close to the centre of the sample points. This problem merits further research.

In admixture models, the likelihood of *m* independent observations *x*_{1},..., *x _{m}* takes the form

$$L(\pi ,\theta )=\prod _{i=1}^{m}\sum _{j=1}^{d}{\pi}_{j}{f}_{j}({x}_{i}\mid {\theta}_{j}),$$

where *π* = (*π*_{1},..., *π _{d}*) is the vector of admixture proportions and *f _{j}*(*x*|*θ _{j}*) is the density of component *j*.

In the admixture model we can flatten the likelihood surface in two different ways. These give rise to the objective functions

$$\begin{array}{cc}\hfill {L}_{1}(\pi ,\theta ,v)& =\prod _{i=1}^{m}\sum _{j=1}^{d}{\left[{\pi}_{j}{f}_{j}({x}_{i}\mid {\theta}_{j})\right]}^{v},\hfill \\ \hfill {L}_{2}(\pi ,\theta ,v)& =\prod _{i=1}^{m}\sum _{j=1}^{d}{\pi}_{j}{\left[{f}_{j}({x}_{i}\mid {\theta}_{j})\right]}^{v},\hfill \end{array}$$

appropriate for annealing. Here *ν* varies over the interval [0, 1]. The choice *ν* = 0 produces a completely flat surface, and the choice *ν* = 1 recovers the original surface. This suggests that we gradually increase the tuning parameter *ν* from a small value such as 0.05 all the way to 1. At each level of *ν* we execute *s* steps of the EM algorithm or its MM substitute.

In the admixture setting, the MM principle exploits the concavity of the function ln(*t*). Applying Jensen's inequality to ln *L*_{1}(*π*, *θ*, *ν*) yields the minorization

$$\sum _{i=1}^{m}\mathrm{ln}\left\{\sum _{j=1}^{d}{\left[{\pi}_{j}{f}_{j}({x}_{i}\mid {\theta}_{j})\right]}^{v}\right\}\ge \sum _{i=1}^{m}\sum _{j=1}^{d}{w}_{ij}^{n}\phantom{\rule{thinmathspace}{0ex}}\mathrm{ln}\left\{\frac{{\left[{\pi}_{j}{f}_{j}({x}_{i}\mid {\theta}_{j})\right]}^{v}}{{w}_{ij}^{n}}\right\}=v\sum _{i=1}^{m}\sum _{j=1}^{d}{w}_{ij}^{n}[\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}{\pi}_{j}+\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}{f}_{j}({x}_{i}\mid {\theta}_{j})]+{c}^{n},$$

where the weights are

$${w}_{ij}^{n}=\frac{{\left[{\pi}_{j}^{n}{f}_{j}({x}_{i}\mid {\theta}_{j}^{n})\right]}^{v}}{\sum _{k=1}^{d}{\left[{\pi}_{k}^{n}{f}_{k}({x}_{i}\mid {\theta}_{k}^{n})\right]}^{v}}$$

(4)

and *c ^{n}* is an irrelevant constant. Minorization separates the parameters *π* and *θ*. Maximization over *π* yields the update

$${\pi}_{j}^{n+1}=\frac{\sum _{i=1}^{m}{w}_{ij}^{n}}{\sum _{i=1}^{m}\sum _{k=1}^{d}{w}_{ik}^{n}}.$$

The usual manoeuvres yield the MM updates for the *θ _{k}* parameters in standard models. These updates are identical to the standard EM updates except for the differences in weights.

Minorization of ln *L*_{2}(*π*, *θ*, *ν*) follows in exactly the same manner except that the weights become

$${w}_{ij}^{n}=\frac{{\pi}_{j}^{n}{f}_{j}{({x}_{i}\mid {\theta}_{j}^{n})}^{v}}{\sum _{k=1}^{d}{\pi}_{k}^{n}{f}_{k}{({x}_{i}\mid {\theta}_{k}^{n})}^{v}}.$$

(5)

One of the virtues of this MM derivation is that it eliminates the need for normalization of probability densities.

To compare the MM and aMM algorithms, consider a Gaussian mixture model with two components, fixed proportions *π*_{1} = 0.7 and *π*_{2} = 0.3 and fixed standard deviations *σ*_{1} = 0.5 and *σ*_{2} = 1. The means (*μ*_{1}, *μ*_{2}) are the parameters to be estimated. Figure 3 shows the progress of the MM and aMM algorithms based on 500 random Gaussian deviates with *μ*_{1} = 0 and *μ*_{2} = 3. From the poor starting point (*μ*_{1}, *μ*_{2}) = (5, −2), the MM algorithm leads to the inferior local mode (3.2889, 0.0524) whereas the two aMM algorithms successfully converge to the global mode (0.0282, 3.0038). Here we start with *ν*^{0} = 0.1 and after each *s* = 5 iterations multiply *ν* by *r* = 2 until it reaches 1, i.e., every 5 iterations we replace the current value *ν ^{n}* of *ν* by 2*ν ^{n}*, capped at 1.
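
Under the stated assumptions (fixed proportions and standard deviations, simulated data), the comparison can be sketched as follows; the annealed responsibilities raise *π _{j}f _{j}* to the power *ν* as in *L*_{1}.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
z = rng.random(n) < 0.7                      # component memberships
x = np.where(z, rng.normal(0.0, 0.5, n), rng.normal(3.0, 1.0, n))

pi, sigma = np.array([0.7, 0.3]), np.array([0.5, 1.0])   # fixed

def dens(x, mu):
    return np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def update(mu, nu):
    """One (a)MM mean update with annealed responsibilities [pi_j f_j]^nu."""
    a = (pi * dens(x, mu)) ** nu
    w = a / a.sum(axis=1, keepdims=True)
    return (w * x[:, None]).sum(axis=0) / w.sum(axis=0)

# Plain MM/EM (nu = 1) from the poor start (5, -2): caught by the swapped, inferior mode
mu = np.array([5.0, -2.0])
for _ in range(300):
    mu = update(mu, 1.0)

# aMM: nu0 = 0.1, multiplied by r = 2 every s = 5 iterations until nu = 1
mu_a, nu = np.array([5.0, -2.0]), 0.1
while nu < 1.0:
    for _ in range(5):
        mu_a = update(mu_a, nu)
    nu = min(2 * nu, 1.0)
for _ in range(300):
    mu_a = update(mu_a, 1.0)
```

The annealed run ends near the global mode (*μ*_{1}, *μ*_{2}) ≈ (0, 3), while the plain run ends with the labels swapped.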

Latent class analysis (LCA) is a discrete analogue of cluster analysis. It seeks to define clusters and assign subjects to them. For instance, a political party might cluster voters by answers to questions on a survey. The data could reveal conservative and liberal clusters or wealthy and poor clusters. The hidden nature of the latent classes suggests application of the EM algorithm. Unfortunately, maximum likelihood estimation with LCA is again beset by the problem of local modes.

For purposes of illustration, consider the simple latent class model of Goodman (1974) in which there are *d* latent classes and each subject is tested on *b* Bernoulli items. Conditional on the subject's latent class, the *b* tests are independent. The subject's binary response vector *y* = (*y*_{1},..., *y _{b}*) therefore has probability

$$\mathrm{Pr}(Y=y)=\sum _{j=1}^{d}{\pi}_{j}\prod _{k=1}^{b}{\theta}_{jk}^{{y}_{k}}{(1-{\theta}_{jk})}^{1-{y}_{k}},$$

where the *π _{j}* are admixture proportions, and *θ _{jk}* is the success probability for item *k* given membership in class *j*. If *c _{y}* denotes the number of subjects with response vector *y*, then the log-likelihood of the data is

$$\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}L(p,\pi )=\sum _{y}{c}_{y}\phantom{\rule{thinmathspace}{0ex}}\mathrm{ln}\left[\sum _{j=1}^{d}{\pi}_{j}\prod _{k=1}^{b}{\theta}_{jk}^{{y}_{k}}{(1-{\theta}_{jk})}^{1-{y}_{k}}\right].$$

Introducing the weights

$${w}_{yj}^{n}=\frac{{\pi}_{j}^{n}\prod _{k=1}^{b}{\left({\theta}_{jk}^{n}\right)}^{{y}_{k}}{(1-{\theta}_{jk}^{n})}^{1-{y}_{k}}}{\sum _{l=1}^{d}{\pi}_{l}^{n}\prod _{k=1}^{b}{\left({\theta}_{lk}^{n}\right)}^{{y}_{k}}{(1-{\theta}_{lk}^{n})}^{1-{y}_{k}}},$$

one can easily derive the MM updates

$${\pi}_{j}^{n+1}=\frac{\sum _{y}{c}_{y}{w}_{yj}^{n}}{\sum _{l=1}^{d}\sum _{y}{c}_{y}{w}_{yl}^{n}},\phantom{\rule{1em}{0ex}}{\theta}_{jk}^{n+1}=\frac{\sum _{y}{c}_{y}{y}_{k}{w}_{yj}^{n}}{\sum _{y}{c}_{y}{w}_{yj}^{n}}.$$

For annealing, we can define the revised weights (4) and (5) using the densities

$${f}_{j}(y\mid {\theta}_{j})=\prod _{k=1}^{b}{\theta}_{jk}^{{y}_{k}}{(1-{\theta}_{jk})}^{1-{y}_{k}}.$$

The MM updates remain the same except for substitution of the revised weights for the ordinary weights.
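
A compact NumPy sketch of these MM updates on made-up counts (ours, purely illustrative) is shown below; the ascent of the log-likelihood checks the derivation.

```python
import numpy as np
from itertools import product

# Small synthetic example: b = 3 Bernoulli items, d = 2 latent classes
Y = np.array(list(product([0, 1], repeat=3)))          # all response vectors y
c = np.array([20, 5, 6, 12, 4, 11, 10, 32], float)     # counts c_y (made up)

def class_dens(theta):
    # f_j(y | theta_j) = prod_k theta_jk^{y_k} (1 - theta_jk)^{1 - y_k}
    return np.prod(theta[None, :, :] ** Y[:, None, :]
                   * (1 - theta[None, :, :]) ** (1 - Y[:, None, :]), axis=2)

def loglik(pi, theta):
    return (c * np.log(class_dens(theta) @ pi)).sum()

rng = np.random.default_rng(1)
pi = np.array([0.5, 0.5])
theta = rng.uniform(0.2, 0.8, size=(2, 3))
trace = [loglik(pi, theta)]
for _ in range(200):
    f = class_dens(theta)                      # shape (#y, d)
    w = pi * f
    w /= w.sum(axis=1, keepdims=True)          # weights w_{yj}^n
    pi = (c @ w) / c.sum()                     # pi update
    theta = ((c[:, None] * w).T @ Y) / (c @ w)[:, None]   # theta_jk update
    trace.append(loglik(pi, theta))
```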

For a numerical example, we now turn to a classical data set on pathology ratings (section 13.1.2 in Agresti, 2002). Seven pathologists classified each of 118 slides for the presence or absence of carcinoma of the uterine cervix. Assuming *d* = 4 latent classes, we ran both MM and aMM (version 1) starting from the same 100 random starting points; we declared convergence when the relative change of the log-likelihood was less than 10^{−9}. Figure 4 displays the histograms of the converged log-likelihoods for the two algorithms. In 99 out of 100 runs, the aMM converges to what appears to be the global mode. Fewer than one-third of the MM runs converge to this mode, which determines the maximum likelihood estimates of the *π _{j}* and *θ _{jk}*.

Final log-likelihoods found for the pathology data set by MM (left) and aMM (right) using 100 random starting points. The annealing parameters are *ν* = 0.05, *s* = 10 and *r* = 19/20.

Factor analysis models the covariation among the components of a random vector *Y* with *p* components as the sum

$$Y=\mu +FX+U,$$

where *μ* is a constant vector with *p* components, *F* is a *p* × *q* constant matrix, *X* is random vector with *q* components and *U* is a random vector with *p* components. These quantities are termed the mean vector, the factor loading matrix, the factor score and the noise, respectively. Ordinarily, *q* is much smaller than *p*. In addition, the standard model postulates that *X* and *U* are independent Gaussian random vectors with means and variances

$$\begin{array}{cc}\hfill E\left(X\right)& =0,\phantom{\rule{1em}{0ex}}\text{var}\left(X\right)=I\hfill \\ \hfill E\left(U\right)& =0,\phantom{\rule{1em}{0ex}}\text{var}\left(U\right)=D,\hfill \end{array}$$

where *D* is diagonal with *j*th diagonal entry *d _{j}*. Given a random sample *y*_{1},..., *y _{m}* and the log-likelihood

$$\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}L=-\frac{m}{2}\mathrm{ln}\phantom{\rule{thickmathspace}{0ex}}\text{det}(F{F}^{t}+D)-\frac{1}{2}\sum _{k=1}^{m}{({y}_{k}-\mu )}^{t}{(F{F}^{t}+D)}^{-1}({y}_{k}-\mu ),$$

it is clear that the maximum likelihood estimate of *μ* equals the sample mean. Therefore, we eliminate *μ* from the model and assume that the data are centred at **0**.

Estimation of *F* and *D* is afflicted by identifiability issues and the existence of multiple modes. In practice, the latter are more troublesome than the former. We attack the multiple mode problem by flattening the log-likelihood surface. This can be achieved by maximizing

$$\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}L+\frac{v}{2}\mathrm{ln}\phantom{\rule{thickmathspace}{0ex}}\text{det}\phantom{\rule{thickmathspace}{0ex}}D=\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}L+\frac{v}{2}\sum _{j=1}^{p}\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}{d}_{j}$$

for *ν* ∈ [0, *m*). In effect, this inflates the noise component of the model. We progressively adjust *ν* from near *m* to 0.

The EM algorithm is the workhorse of factor analysis, so it is natural to modify it to take into account the added noise. The complete data in the EM algorithm for estimating *F* and *D* are the random vectors (*Y _{k}*, *X _{k}*) for *k* = 1,..., *m*. The E step yields the surrogate function

$$Q(F,D\mid {F}^{n},{D}^{n})=-\frac{m}{2}\sum _{j=1}^{p}\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}{d}_{j}-\frac{1}{2}tr\left[{D}^{-1}(F\Lambda {F}^{t}-F\Gamma -{\Gamma}^{t}{F}^{t}+\Omega )\right].$$

Here, the intermediate vectors and matrices are

$$\Lambda =\sum _{k=1}^{m}[{A}_{k}+{v}_{k}{v}_{k}^{t}],\phantom{\rule{thickmathspace}{0ex}}\Gamma =\sum _{k=1}^{m}{v}_{k}{y}_{k}^{t},\phantom{\rule{thickmathspace}{0ex}}\Omega =\sum _{k=1}^{m}{y}_{k}{y}_{k}^{t},$$

with

$${v}_{k}={F}^{t}{(F{F}^{t}+D)}^{-1}{y}_{k},\phantom{\rule{1em}{0ex}}{A}_{k}=I-{F}^{t}{(F{F}^{t}+D)}^{-1}F.$$

In defining *v _{k}* and *A _{k}*, the parameters *F* and *D* are evaluated at their current iterates *F ^{n}* and *D ^{n}*.

The MM principle suggests that we maximize the surrogate $Q+{\scriptstyle \frac{v}{2}}\phantom{\rule{thinmathspace}{0ex}}\mathrm{ln}\phantom{\rule{thickmathspace}{0ex}}\text{det}\phantom{\rule{thickmathspace}{0ex}}D$ rather than the objective function $\mathrm{ln}\phantom{\rule{thinmathspace}{0ex}}L+{\scriptstyle \frac{v}{2}}\phantom{\rule{thinmathspace}{0ex}}\mathrm{ln}\phantom{\rule{thickmathspace}{0ex}}\text{det}\phantom{\rule{thickmathspace}{0ex}}D$. If one follows the mathematical steps outlined in Lange (2004), then it is straightforward to verify the MM updates

$$\begin{array}{cc}\hfill {F}^{n+1}& ={\Gamma}^{t}{\Lambda}^{-1},\hfill \\ \hfill {d}_{i}^{n+1}& =\frac{1}{m-v}{[{F}^{n+1}\Lambda {\left({F}^{n+1}\right)}^{t}-{F}^{n+1}\Gamma -{\Gamma}^{t}{\left({F}^{n+1}\right)}^{t}+\Omega ]}_{ii}.\hfill \end{array}$$

It is obvious from the form of the update for the noise variance *d _{i}* that the amendment to the likelihood pushes the estimate of *d _{i}* upwards.
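
The updates translate directly into NumPy (our sketch; here `Omega_s` denotes the data matrix Σ_{k} *y _{k}y _{k}^{t}* called Ω in the text, not the *t*-distribution scale of earlier sections). Setting *ν* = 0 recovers the standard EM, whose ascent property we verify.

```python
import numpy as np

rng = np.random.default_rng(2)
m, p, q = 148, 6, 2
F0 = rng.normal(size=(p, q))
d0 = rng.uniform(0.5, 1.5, p)
Y = rng.normal(size=(m, q)) @ F0.T + rng.normal(size=(m, p)) * np.sqrt(d0)
Y -= Y.mean(axis=0)                                 # centre the data at 0
Omega_s = Y.T @ Y                                   # the matrix Omega of the text

def loglik(F, d):
    S = F @ F.T + np.diag(d)
    Si = np.linalg.inv(S)
    return -0.5 * m * np.log(np.linalg.det(S)) - 0.5 * np.einsum('ki,ij,kj->', Y, Si, Y)

def em_step(F, d, nu=0.0):
    S = F @ F.T + np.diag(d)
    B = F.T @ np.linalg.inv(S)                      # q x p
    V = Y @ B.T                                     # rows are the v_k
    A = np.eye(q) - B @ F                           # A_k (identical for all k)
    Lam = m * A + V.T @ V                           # Lambda
    Gam = V.T @ Y                                   # Gamma
    Fn = np.linalg.solve(Lam, Gam).T                # F^{n+1} = Gamma^t Lambda^{-1}
    M = Fn @ Lam @ Fn.T - Fn @ Gam - Gam.T @ Fn.T + Omega_s
    return Fn, np.diag(M) / (m - nu)                # d_i^{n+1}

F, d = rng.normal(size=(p, q)), np.ones(p)
trace = [loglik(F, d)]
for _ in range(100):
    F, d = em_step(F, d, nu=0.0)                    # nu = 0 recovers the standard EM
    trace.append(loglik(F, d))
```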

For a numerical example, we now consider the classic data of Maxwell (1961). There are *p* = 10 variables and *m* = 148 subjects. The variables summarize various psychiatric tests on 148 children: (i) verbal ability, (ii) spatial ability, (iii) reasoning, (iv) numerical ability, (v) verbal fluency, (vi) neuroticism questionnaire, (vii) ways to be different, (viii) worries and anxieties, (ix) interests, and (x) annoyances. Table 5 lists the correlations between the 10 variables. Maxwell (1961) concludes that three factors adequately capture the variation in the data. To illustrate the problem of multiple modes, we assume *q* = 5 factors, giving a total of *p*(*q*+1) = 60 parameters. We ran both EM and aEM on the same 500 random starting points and stopped each run when the relative change of the log-likelihood was less than 10^{−9}. Figure 5 shows the histograms of the converged log-likelihoods found by the two algorithms. In all 500 runs, the aEM algorithm converges to the same mode. Fewer than half of the EM runs converge to this apparently global mode. Our discovery of several inferior modes confirms previous findings (Duan & Simonato, 1993). Table 6 lists the estimates of the noise variances *d _{i}* at the global and two local modes. In this example, we start with *ν*^{0} = 147 and multiply *ν* by *r* = 1/2 every *s* = 5 iterations until it reaches 0.

Converged log-likelihoods found by EM (left) and aEM (right) from 500 random starting points. For aEM, *ν*^{0} = 147, *r* = 1/2 and *s* = 5.

Multidimensional scaling attempts to represent *q* objects as faithfully as possible in *p*-dimensional space given a non-negative weight *w _{ij}* and a non-negative dissimilarity measure *y _{ij}* for each pair of objects *i* and *j*. If *θ _{i}* denotes the position of object *i* in ${\mathbb{R}}^{p}$, then the criterion to be minimized is the stress

$$\begin{array}{cc}\hfill {\sigma}^{2}\left(\theta \right)& =\sum _{1\le i<j\le q}{w}_{ij}{({y}_{ij}-\Vert {\theta}_{i}-{\theta}_{j}\Vert )}^{2}\hfill \\ \hfill & =\sum _{1\le i<j\le q}{w}_{ij}{y}_{ij}^{2}-2\sum _{1\le i<j\le q}{w}_{ij}{y}_{ij}\Vert {\theta}_{i}-{\theta}_{j}\Vert +\sum _{1\le i<j\le q}{w}_{ij}{\Vert {\theta}_{i}-{\theta}_{j}\Vert}^{2},\hfill \end{array}$$

(6)

where $\Vert {\theta}_{i}-{\theta}_{j}\Vert $ is the Euclidean distance between *θ _{i}* and *θ _{j}*.

The stress function tends to have multiple local minima in low dimensions (Groenen & Heiser, 1996). As the number of dimensions increases, most of the inferior modes disappear. In support of this contention, one can mathematically demonstrate that the stress has a unique minimum when *p* = *q*−1 (de Leeuw, 1993; Groenen & Heiser, 1996). In practice, uniqueness can set in well before *p* reaches *q*−1. In dimension crunching, we start optimizing the stress in some space ${\mathbb{R}}^{m}$ with *m*>*p*. The last *m*−*p* components of each *θ _{i}* are gradually subjected to stiffer and stiffer penalties. In the limit as the penalty tuning parameter *ν* tends to ∞, the extra components are squeezed to 0, and optimization is effectively restricted to the original space ${\mathbb{R}}^{p}$.

Because we want to minimize the stress, we first majorize it. In doing so, it is helpful to separate its parameters as well. The middle term in the stress (6) is majorized by the Cauchy–Schwarz inequality

$$-\Vert {\theta}_{i}-{\theta}_{j}\Vert \le -\frac{{({\theta}_{i}-{\theta}_{j})}^{t}({\theta}_{i}^{n}-{\theta}_{j}^{n})}{\Vert {\theta}_{i}^{n}-{\theta}_{j}^{n}\Vert}.$$

To separate the variables in the summands of the third term of the stress, we invoke the convexity of the Euclidean norm $\Vert \cdot \Vert $ and the square function *x*^{2}. These manoeuvres yield

$${\Vert {\theta}_{i}-{\theta}_{j}\Vert}^{2}={\Vert \frac{1}{2}\left[2{\theta}_{i}-({\theta}_{i}^{n}+{\theta}_{j}^{n})\right]-\frac{1}{2}\left[2{\theta}_{j}-({\theta}_{j}^{n}+{\theta}_{j}^{n})\right]\Vert}^{2}\le 2{\Vert {\theta}_{i}-\frac{1}{2}({\theta}_{i}^{n}+{\theta}_{j}^{n})\Vert}^{2}+2{\Vert {\theta}_{j}-\frac{1}{2}({\theta}_{i}^{n}+{\theta}_{j}^{n})\Vert}^{2}.$$

Assuming that *w _{ij}* = *w _{ji}* and *y _{ij}* = *y _{ji}*, these majorizations combine to give the surrogate function

$$2\sum _{i<j}{w}_{ij}\left[{\Vert {\theta}_{i}-\frac{1}{2}({\theta}_{i}^{n}+{\theta}_{j}^{n})\Vert}^{2}-\frac{{y}_{ij}{\theta}_{i}^{t}({\theta}_{i}^{n}-{\theta}_{j}^{n})}{\Vert {\theta}_{i}^{n}-{\theta}_{j}^{n}\Vert}\right]+2\sum _{i<j}{w}_{ij}\left[{\Vert {\theta}_{j}-\frac{1}{2}({\theta}_{i}^{n}+{\theta}_{j}^{n})\Vert}^{2}+\frac{{y}_{ij}{\theta}_{j}^{t}({\theta}_{i}^{n}-{\theta}_{j}^{n})}{\Vert {\theta}_{i}^{n}-{\theta}_{j}^{n}\Vert}\right]=2\sum _{i=1}^{q}\sum _{j\ne i}\left[{w}_{ij}{\Vert {\theta}_{i}-\frac{1}{2}({\theta}_{i}^{n}+{\theta}_{j}^{n})\Vert}^{2}-\frac{{w}_{ij}{y}_{ij}{\theta}_{i}^{t}({\theta}_{i}^{n}-{\theta}_{j}^{n})}{\Vert {\theta}_{i}^{n}-{\theta}_{j}^{n}\Vert}\right]$$

up to an irrelevant constant.

Setting the gradient of the surrogate equal to the **0** vector gives the updates

$${\theta}_{ik}^{n+1}=\frac{\sum _{j\ne i}\left[\frac{{w}_{ij}{y}_{ij}({\theta}_{ik}^{n}-{\theta}_{jk}^{n})}{\Vert {\theta}_{i}^{n}-{\theta}_{j}^{n}\Vert}+{w}_{ij}({\theta}_{ik}^{n}+{\theta}_{jk}^{n})\right]}{2\sum _{j\ne i}{w}_{ij}}$$

for all movable parameters *θ _{ik}*. To perform annealing, we add the penalty $v{\sum}_{i=1}^{q}{\sum}_{j=p+1}^{m}{\theta}_{ij}^{2}$ to the stress function and progressively increase *ν*. The annealed MM updates become

$$\begin{array}{cc}\hfill {\theta}_{ik}^{n+1}& =\frac{\sum _{j\ne i}\left[\frac{{w}_{ij}{y}_{ij}({\theta}_{ik}^{n}-{\theta}_{jk}^{n})}{\Vert {\theta}_{i}^{n}-{\theta}_{j}^{n}\Vert}+{w}_{ij}({\theta}_{ik}^{n}+{\theta}_{jk}^{n})\right]}{2\sum _{j\ne i}{w}_{ij}},\phantom{\rule{1em}{0ex}}1\le k\le p\hfill \\ \hfill {\theta}_{ik}^{n+1}& =\frac{\sum _{j\ne i}\left[\frac{{w}_{ij}{y}_{ij}({\theta}_{ik}^{n}-{\theta}_{jk}^{n})}{\Vert {\theta}_{i}^{n}-{\theta}_{j}^{n}\Vert}+{w}_{ij}({\theta}_{ik}^{n}+{\theta}_{jk}^{n})\right]}{2\left(\sum _{j\ne i}{w}_{ij}+v\right)},\phantom{\rule{1em}{0ex}}p+1\le k\le m.\hfill \end{array}$$

Taking *m* = *q −* 1 is computationally expensive if *q* is large. In this situation, we typically choose *m* much smaller than *q* but still considerably larger than *p*.
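
Both the plain updates and dimension crunching are easy to prototype. The sketch below (ours, with made-up configuration sizes) checks the monotone decrease of the stress at *ν* = 0 and then crushes the trailing *m* − *p* coordinates with a large fixed *ν*.

```python
import numpy as np

rng = np.random.default_rng(3)
q, p, mdim = 8, 2, 4          # q objects, target dimension p, working dimension m > p
X = rng.normal(size=(q, p))
y = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # dissimilarities y_ij
w = 1.0 - np.eye(q)                                          # weights w_ij = 1

def stress(theta):
    d = np.linalg.norm(theta[:, None, :] - theta[None, :, :], axis=2)
    return 0.5 * (w * (y - d) ** 2).sum()       # equals the sum over i < j

def mm_step(theta, nu):
    d = np.linalg.norm(theta[:, None, :] - theta[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)                    # guard against coincident points
    new = np.empty_like(theta)
    for i in range(q):
        num, den = np.zeros(mdim), 0.0
        for j in range(q):
            if j == i:
                continue
            num += w[i, j] * y[i, j] * (theta[i] - theta[j]) / d[i, j] \
                   + w[i, j] * (theta[i] + theta[j])
            den += w[i, j]
        new[i, :p] = num[:p] / (2 * den)            # free coordinates
        new[i, p:] = num[p:] / (2 * (den + nu))     # penalized coordinates
    return new

theta = rng.normal(size=(q, mdim))
trace = [stress(theta)]
for _ in range(200):
    theta = mm_step(theta, nu=0.0)     # nu = 0: plain MM in the working space
    trace.append(stress(theta))
for _ in range(300):
    theta = mm_step(theta, nu=100.0)   # large nu crushes the last mdim - p coordinates
```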

For the US city distance data summarized in Table 7, we ran both the MM and aMM algorithms for multidimensional scaling with *w _{ij}* = 1 and *p* = 2.

Final stress values found by MM (left) and aMM (right) from 100 random starting points. The annealing parameters are *ν*^{0} = 0.001, *r* = 1.1 and *s* = 10.

Our last example is novel in three respects. It is Bayesian, it yields readily to maximum *a posteriori* estimation by block relaxation, and its transition from unimodality to bimodality is fairly well understood mathematically. In the one-way random-effects model described by Liu & Hodges (2003), the data are modelled as *y _{ij}* | *θ _{i}*, *σ*^{2} ~ N(*θ _{i}*, *σ*^{2}) for *j* = 1,..., *r _{i}* and *i* = 1,..., *s*, with random effects *θ _{i}* | *μ*, *τ*^{2} ~ N(*μ*, *τ*^{2}), prior *μ* ~ N(*ν*, *η*^{2}), and inverse gamma priors on *σ*^{2} and *τ*^{2} with parameters (*α*, *β*) and (*γ*, *δ*), respectively.

The joint probability density of the data and parameters ({*y _{ij}*}, {*θ _{i}*}, *μ*, *σ*^{2}, *τ*^{2}) is

$$\prod _{i,j}{\left(2\pi {\sigma}^{2}\right)}^{-1\u22152}{\mathrm{e}}^{-\left({({y}_{ij}-{\theta}_{i})}^{2}\right)\u2215\left(2{\sigma}^{2}\right)}\prod _{i}{\left(2\pi {\tau}^{2}\right)}^{-1\u22152}{\mathrm{e}}^{-\left({({\theta}_{i}-\mu )}^{2}\right)\u2215\left(2{\tau}^{2}\right)}{\left(2\pi {\eta}^{2}\right)}^{-1\u22152}{\mathrm{e}}^{-\left({(\mu -v)}^{2}\right)\u2215\left(2{\eta}^{2}\right)}\times \frac{{\beta}^{\alpha}}{\Gamma \left(\alpha \right)}{\left({\sigma}^{2}\right)}^{-\alpha -1}{\mathrm{e}}^{-\beta \u2215{\sigma}^{2}}\frac{{\delta}^{\gamma}}{\Gamma \left(\gamma \right)}{\left({\tau}^{2}\right)}^{-\gamma -1}{\mathrm{e}}^{-\delta \u2215{\tau}^{2}}.$$

This translates into the log-posterior function

$$P(\left\{{\theta}_{i}\right\},\mu ,{\sigma}^{2},{\tau}^{2}\mid \left\{{y}_{ij}\right\})=-\frac{\sum _{i}{r}_{i}+2\alpha +2}{2}\text{log}\left({\sigma}^{2}\right)-\frac{2\beta +\sum _{i,j}{({y}_{ij}-{\theta}_{i})}^{2}}{2{\sigma}^{2}}-\frac{s+2\gamma +2}{2}\text{log}\left({\tau}^{2}\right)-\frac{2\delta +\sum _{i}{({\theta}_{i}-\mu )}^{2}}{2{\tau}^{2}}-\frac{{(\mu -v)}^{2}}{2{\eta}^{2}}+c,$$

where *c* is an irrelevant constant. Maximum *a posteriori* estimation can be an end in itself or a prelude to a full Bayesian analysis by Markov chain Monte Carlo or Laplace approximation (Rue *et al.*, 2009).

The direct attempt to maximize the log-posterior is almost immediately thwarted. It is much easier to implement block relaxation, which maximizes the objective function over successive parameter subsets. Like the EM or MM algorithms, block relaxation enjoys the ascent property. With the superscripts *k* and *k*+1 denoting iteration numbers, block relaxation operates via the updates

$$\begin{aligned}
\theta_i^{k+1} &= \frac{\sum_{j=1}^{r_i} y_{ij}/(\sigma^2)^k + \mu^k/(\tau^2)^k}{r_i/(\sigma^2)^k + 1/(\tau^2)^k}, \quad i = 1, \dots, s,\\
\mu^{k+1} &= \frac{\sum_{i=1}^{s} \theta_i^{k+1}/(\tau^2)^k + v/\eta^2}{s/(\tau^2)^k + 1/\eta^2},\\
(\sigma^2)^{k+1} &= \frac{2\beta + \sum_{i,j} (y_{ij} - \theta_i^{k+1})^2}{\sum_i r_i + 2\alpha + 2},\\
(\tau^2)^{k+1} &= \frac{2\delta + \sum_{i=1}^{s} (\theta_i^{k+1} - \mu^{k+1})^2}{s + 2\gamma + 2}.
\end{aligned}$$
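As a concrete illustration, one cyclic sweep of these closed-form updates can be sketched in Python. This is a minimal sketch rather than the authors' code; the function name `map_one_way`, the starting values and the convergence test are our own illustrative choices.

```python
import numpy as np

def map_one_way(y, v, eta2, alpha, beta, gamma, delta,
                max_iter=500, tol=1e-10):
    """Block relaxation for the MAP estimate in the one-way
    random-effects model.  `y` is a list of 1-D arrays, one per group;
    the hyperparameters match the priors in the text.  Each sweep
    applies the closed-form updates for the theta_i, mu, sigma^2 and
    tau^2 blocks, so the log-posterior never decreases."""
    y = [np.asarray(g, dtype=float) for g in y]
    r = np.array([g.size for g in y])          # group sizes r_i
    s = len(y)                                 # number of groups
    ybar = np.array([g.mean() for g in y])     # group means
    sse = lambda th: sum(((g - t) ** 2).sum() for g, t in zip(y, th))
    theta, mu = ybar.copy(), ybar.mean()
    sigma2 = tau2 = 1.0
    for _ in range(max_iter):
        old = np.r_[theta, mu, sigma2, tau2]
        # theta_i: precision-weighted average of group mean and mu
        theta = (r * ybar / sigma2 + mu / tau2) / (r / sigma2 + 1.0 / tau2)
        # mu: shrink the mean of the theta_i towards the prior mean v
        mu = (theta.sum() / tau2 + v / eta2) / (s / tau2 + 1.0 / eta2)
        # variance components from the inverse gamma full conditionals
        sigma2 = (2 * beta + sse(theta)) / (r.sum() + 2 * alpha + 2)
        tau2 = (2 * delta + ((theta - mu) ** 2).sum()) / (s + 2 * gamma + 2)
        if np.max(np.abs(np.r_[theta, mu, sigma2, tau2] - old)) < tol:
            break
    return theta, mu, sigma2, tau2
```

Each block update maximizes the log-posterior over its own parameters with the others held fixed, which is what guarantees the ascent property mentioned above.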

In the case of a balanced design where the sample sizes *r _{i}* are equal, Liu & Hodges (2003) systematically study the modality of the log-posterior and determine how it depends on the data and the prior hyperparameters.

As an example, we tested the peak discharge data analysed by Liu & Hodges (2003). With the settings *α* = 8, *β* = 1, *γ* = 10 and *δ* = 0.1, the log-posterior has two modes. We tried ordinary block relaxation and four versions of deterministic annealing from 100 randomly generated points. As Fig. 8 shows, every run of deterministic annealing reliably converges to its targeted mode. It would appear that the two-run tactic is highly successful.

The existence of multiple modes is one of the nagging problems of computational statistics. No one knows how often statistical inference is fatally flawed because a standard optimization algorithm converges to an inferior mode. Although the traditional remedies can mitigate the problem, they come with no guarantees. Bayesian inference is not a refuge either. Markov chain Monte Carlo (MCMC) sampling often gets stuck in the vicinity of an inferior posterior mode, and it may be hard to detect departures from adequate random sampling. For these reasons, any technique for finding the dominant mode of a log-likelihood or a log-posterior function is welcome.

It is probably too much to hope for a panacea. Continuous optimization by simulated annealing comes close, but it imposes an enormous computational burden. The recent marriage of computational statistics and algebraic geometry has considerable promise (Pachter & Sturmfels, 2005). The new field, algebraic statistics, attempts to locate all of the modes of a likelihood function. This is probably too ambitious, and current progress is limited to likelihoods composed of simple polynomials and rational functions.

The EM annealing algorithm of Ueda & Nakano (1998) deserves wider use. In our opinion, the MM principle clarifies its derivation and frees it from the restriction to probability densities. In multidimensional scaling, the tunnelling method of Groenen & Heiser (1996) is a competitor to dimension crunching. It would be worthwhile to undertake a systematic comparison. Several, but not all, of the annealing techniques used for the multivariate *t* distribution extend to other elliptically symmetric families of densities such as the slash family (Lange & Sinsheimer, 1993). We will let readers explore the relevant algorithms at their leisure.

We would be remiss if we did not confess to experimenting with the annealing parameters *ν*^{0}, *r* and *s* to give good performance. We have not been terribly systematic because a broad range of values works well in many problems. Again, this is an area worthy of further investigation. Rigid guidelines are less important than rules of thumb.

In closing, let us emphasize that our purpose has been to introduce basic strategies rather than detailed tactics. Wider application of annealing will require additional devices for flattening function surfaces and moving towards the global mode. Although the MM algorithm is one among many choices, its simplicity and ascent (or descent) property are very attractive. MM algorithms tend to home in quickly on the basin of attraction of the dominant mode. Once an MM algorithm reaches this region, its rate of progress can slow dramatically. Thus, many annealing algorithms have to be accelerated to be fully effective. The challenge for the statistics community is to tackle a wider range of statistical models with multiple modes. This will have to be done piecemeal to sharpen intuition before we can hope to make a general assault on this vexing problem.

The authors thank the referees for their many valuable comments, particularly the suggestion to add section 7. Ravi Varadhan contributed a helpful critique of an initial version of the manuscript. K. Lange was supported by United States Public Health Service grants GM53275 and MH59490.

Initialize: *μ*^{0}, Ω^{0}

repeat

 ${w}_{i}^{n}\leftarrow \frac{\alpha +p}{\alpha +{d}_{i}^{n}},\quad {v}^{n}\leftarrow {\sum}_{i=1}^{m}{w}_{i}^{n}$

 ${\mu}^{n+1}\leftarrow \frac{1}{{v}^{n}}{\sum}_{i=1}^{m}{w}_{i}^{n}{x}_{i}$

 ${\Omega}^{n+1}\leftarrow \frac{1}{{v}^{n}}{\sum}_{i=1}^{m}{w}_{i}^{n}({x}_{i}-{\mu}^{n+1}){({x}_{i}-{\mu}^{n+1})}^{t}$

until convergence occurs

Initialize: *μ*^{0}, Ω^{0}, *ν*^{0} ≫ *α*

repeat

 if mod(*n*, *s*) = 0 then

  *ν*^{n} ← *rν*^{n−1} + (1 − *r*)*α*

 else

  *ν*^{n} ← *ν*^{n−1}

 end if

 ${w}_{i}^{n}\leftarrow \frac{{\nu}^{n}+p}{{\nu}^{n}+{d}_{i}^{n}},\quad {v}^{n}\leftarrow {\sum}_{i=1}^{m}{w}_{i}^{n}$

 ${\mu}^{n+1}\leftarrow \frac{1}{{v}^{n}}{\sum}_{i=1}^{m}{w}_{i}^{n}{x}_{i}$

 ${\Omega}^{n+1}\leftarrow \frac{1}{{v}^{n}}{\sum}_{i=1}^{m}{w}_{i}^{n}({x}_{i}-{\mu}^{n+1}){({x}_{i}-{\mu}^{n+1})}^{t}$

until *ν*^{n} ≈ *α* and convergence occurs
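A sketch of one way to realize this annealed variant in code, under the same illustrative assumptions as before: the constants `nu0`, `r` and `s` are our own choices, with 0 < *r* < 1 so that the inflated degrees of freedom *ν* decay geometrically towards the target *α*.

```python
import numpy as np

def anneal_mm_mvt(X, alpha, nu0=1e3, r=0.9, s=5, max_iter=2000, tol=1e-9):
    """Annealed MM for the multivariate t: start with an inflated
    degrees-of-freedom nu0 >> alpha (a flatter, near-Gaussian surface)
    and every s iterations shrink nu towards alpha via
    nu <- r*nu + (1 - r)*alpha with 0 < r < 1."""
    X = np.asarray(X, dtype=float)
    m, p = X.shape
    mu, Omega, nu = X.mean(axis=0), np.cov(X, rowvar=False), nu0
    for n in range(max_iter):
        if n > 0 and n % s == 0:
            nu = r * nu + (1.0 - r) * alpha    # annealing schedule
        mu_old = mu
        resid = X - mu
        d = np.einsum('ij,jk,ik->i', resid, np.linalg.inv(Omega), resid)
        w = (nu + p) / (nu + d)                # weights at current nu
        v = w.sum()
        mu = (w[:, None] * X).sum(axis=0) / v
        resid = X - mu
        Omega = (w[:, None] * resid).T @ resid / v
        if abs(nu - alpha) < 1e-6 and np.max(np.abs(mu - mu_old)) < tol:
            break
    return mu, Omega
```

The large initial *ν* makes the early iterations behave like Gaussian estimation, which smooths away many spurious modes before the genuine t likelihood takes over.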

Initialize: *μ*^{0}, Ω^{0}, *ν*^{0} ≪ 1

repeat

 if mod(*n*, *s*) = 0 then

  *ν*^{n} ← *rν*^{n−1} + (1 − *r*)

 else

  *ν*^{n} ← *ν*^{n−1}

 end if

 ${w}_{i}^{n}\leftarrow \frac{\alpha +p}{\alpha +{d}_{i}^{n}},\quad {v}^{n}\leftarrow {\sum}_{i=1}^{m}{w}_{i}^{n}$

 ${\nu}^{\ast n}\leftarrow \frac{{\nu}^{n}\alpha}{\alpha +(1-{\nu}^{n})p}$

 ${\mu}^{n+1}\leftarrow \frac{1}{{v}^{n}}{\sum}_{i=1}^{m}{w}_{i}^{n}{x}_{i}$

 ${\Omega}^{n+1}\leftarrow \frac{1}{{\nu}^{\ast n}{v}^{n}}{\sum}_{i=1}^{m}{w}_{i}^{n}({x}_{i}-{\mu}^{n+1}){({x}_{i}-{\mu}^{n+1})}^{t}$

until *ν*^{n} ≈ 1 and convergence occurs

Initialize: *μ*^{0}, Ω^{0}, *ν*^{0} ≪ 1

repeat

 if mod(*n*, *s*) = 0 then

  *ν*^{n} ← *rν*^{n−1} + (1 − *r*)

 else

  *ν*^{n} ← *ν*^{n−1}

 end if

 ${w}_{i}^{n}\leftarrow \frac{\alpha +p}{\alpha +{\nu}^{n}{d}_{i}^{n}},\quad {v}^{n}\leftarrow {\sum}_{i=1}^{m}{w}_{i}^{n}$

 ${\mu}^{n+1}\leftarrow \frac{1}{{v}^{n}}{\sum}_{i=1}^{m}{w}_{i}^{n}{x}_{i}$

 ${\Omega}^{n+1}\leftarrow \frac{1}{{v}^{n}}{\sum}_{i=1}^{m}{w}_{i}^{n}({x}_{i}-{\mu}^{n+1}){({x}_{i}-{\mu}^{n+1})}^{t}$

until *ν*^{n} ≈ 1 and convergence occurs

HUA ZHOU, Department of Human Genetics, University of California, Los Angeles.

KENNETH L. LANGE, Departments of Biomathematics, Human Genetics, and Statistics, University of California, Los Angeles.

- Agresti A. Categorical data analysis. 2nd edn. Wiley-Interscience; New York: 2002.
- Arslan O, Constable P, Kent J. Domains of convergence for the EM algorithm: a cautionary tale in a location estimation problem. Statist. Comput. 1993;3:103–108.
- Becker MP, Yang I, Lange KL. EM algorithms without missing data. Stat. Methods Med. Res. 1997;6:37–53.
- Bouguila N. Clustering of count data using generalized Dirichlet multinomial distributions. IEEE Trans. Knowl. Data Eng. 2008;20:462–474.
- Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. Ser. B. 1977;39:1–38.
- Duan JC, Simonato JG. Multiplicity of solutions in maximum likelihood factor analysis. J. Statist. Comput. Simulation. 1993;47:37–47.
- Goodman LA. Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika. 1974;61:215–231.
- Groenen PJF, Heiser WJ. The tunneling method for global optimization in multidimensional scaling. Psychometrika. 1996;61:529–550.
- Heiser WJ. Convergent computing by iterative majorization: theory and applications in multidimensional data analysis. In: Krzanowski WJ, editor. Recent advances in descriptive multivariate analysis. Clarendon Press; Oxford: 1995. pp. 157–189.
- Hunter DR, Lange KL. A tutorial on MM algorithms. Amer. Statist. 2004;58:30–37.
- Kent JT, Tyler DE, Vardi Y. A curious likelihood identity for the multivariate t-distribution. Comput. Stat. Data Anal. 1994;41:157–170.
- Kirkpatrick S, Gelatt CD, Vecchi MP. Optimization by simulated annealing. Science. 1983;220:671–680.
- Lange KL. Optimization. Springer-Verlag; New York: 2004.
- Lange KL, Sinsheimer JS. Normal/independent distributions and their applications in robust regression. J. Comput. Graph. Statist. 1993;2:175–198.
- Lange KL, Little RJA, Taylor JMG. Robust statistical modeling using the t distribution. J. Amer. Statist. Assoc. 1989;84:881–896.
- Lange KL, Hunter DR, Yang I. Optimization transfer using surrogate objective functions (with discussion). J. Comput. Graph. Statist. 2000;9:1–59.
- de Leeuw J. Fitting distances by least squares. Unpublished manuscript. 1993. Available on http://preprints.stat.ucla.edu.
- de Leeuw J. Block relaxation algorithms in statistics. In: Bock HH, Lenski W, Richter MM, editors. Information systems and data analysis. Springer-Verlag; New York: 1994. pp. 308–325.
- Liu J, Hodges JS. Posterior bimodality in the balanced one-way random effects model. J. Roy. Statist. Soc. Ser. B. 2003;65:247–255.
- Maxwell AE. Recent trends in factor analysis. J. Roy. Statist. Soc. Ser. A. 1961;124:49–59.
- McLachlan GJ, Krishnan T. The EM algorithm and extensions. 2nd edn. Wiley-Interscience; Hoboken, NJ: 2008.
- Meng XL, van Dyk D. The EM algorithm – an old folk-song sung to a fast new tune. J. Roy. Statist. Soc. Ser. B. 1997;59:511–567.
- Metropolis N, Rosenbluth A, Rosenbluth M, Teller A, Teller E. Equations of state calculations by fast computing machines. J. Chem. Phys. 1953;21:1087–1092.
- Pachter L, Sturmfels B. Algebraic statistics and computational biology. Cambridge University Press; Cambridge: 2005.
- Press WH, Teukolsky SA, Vetterling WT, Flannery BP. Numerical recipes in Fortran: the art of scientific computing. 2nd edn. Cambridge University Press; Cambridge: 1992.
- Robert CP, Casella G. Monte Carlo statistical methods. 2nd edn. Springer-Verlag; New York: 2004.
- Rue H, Martino S, Chopin N. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations (with discussion). J. Roy. Statist. Soc. Ser. B. 2009;71:319–392.
- Ueda N, Nakano R. Deterministic annealing EM algorithm. Neural Netw. 1998;11:271–282.
- Wu TT, Lange KL. The MM alternative to EM. Statist. Sci. 2009 in press.
