
Comput Stat Data Anal. Author manuscript; available in PMC 2010 May 15.

Published in final edited form as: Comput Stat Data Anal. 2009 May 15; 53(7): 2563–2572. doi: 10.1016/j.csda.2008.12.005. PMCID: PMC2678729. NIHMSID: NIHMS84585.

Finite mixture modeling, together with the EM algorithm, has been widely used in cluster analysis. Under such methods, the unknown group membership is usually treated as missing data. When the “complete data” (log-)likelihood function does not have an explicit maximizer, the simplicity of the EM algorithm breaks down. Authors including Rai and Matthews (1993), Lange (1995a) and Titterington (1984) have therefore developed modified algorithms. Motivated by research in a large neurobiological project, we propose in this paper a new variant of such modifications and show that it is self-consistent. Moreover, simulations demonstrate that the new variant converges faster than its predecessors.

Finite mixture modeling is a widely used clustering technique for moderate to highly complicated data structures (McLachlan and Basford, 1988). Such methods are model based and generate probabilities for the unknown group membership. The EM algorithm has been widely used to find the parameter estimates for finite mixture models; it benefits in stability and simplicity from the fact that maximizing the “complete data” (log-)likelihood function has an explicit solution. When the “complete data” (log-)likelihood function has no explicit maximizer, another iterative algorithm is needed in the M-step in order to maximize the conditional log-likelihood function given the observed data, which many authors have considered inefficient. Therefore, authors including Rai and Matthews (1993), Lange (1995a) and Titterington (1984) proposed modified algorithms which replace the whole M-step of the EM algorithm with a single iteration of a Newton-type algorithm for maximizing the conditional log-likelihood function. Although Newton-type algorithms converge quadratically, the modified algorithms were shown to converge linearly.

On the other hand, when there are both outcome variables and covariates, different names, including “regression models for conditional normal mixtures” and “regression clustering”, have been given to the same problem. The basic idea is to cluster the subjects according to discrepancies in the regression parameters and, possibly, the covariance parameters. DeSarbo and Cron (1988) defined a regression model for finite normal mixtures with a univariate outcome, while Jones and McLachlan (1992) extended the model to a multivariate setting. Arminger et al. (1999) introduced several likelihood based strategies for parameter estimation, including the EM algorithm and the EM-gradient algorithm (Lange, 1995a).

Motivated by research in a large neurobiological project, in this paper a new variant of the algorithms developed by Rai and Matthews (1993) and Titterington (1984) is proposed for a general finite mixture model. The main goal of our approach is to accelerate the algorithm without sacrificing its numerical simplicity. The rest of this paper is organized as follows. The motivation from the large neurobiological project is discussed in Section 2. Section 3 presents the derivation of the new algorithm. An application to the large neurobiological project is introduced in Section 4. Section 5 establishes the self-consistency of the new algorithm, and numerical simulations investigating the speed of convergence are given in Section 6. Section 7 concludes.

Schizophrenia, a severe and disabling mental disease, has received a tremendous amount of research focus. A considerable amount of basic neurobiological research concerning schizophrenia has been conducted in the Conte Center for the Neuroscience of Mental Disorders in the Department of Psychiatry at the University of Pittsburgh. Differing neurobiological measurements from post-mortem brain tissue of subjects in the Brain Bank Core of the Center are involved in numerous studies. These studies typically attempt to identify neurobiological markers that are differentially expressed in subjects with schizophrenia as compared to normal controls (Konopaske et al., 2006). Consequently, it is of interest to undertake a large neurobiological project to integrate data from multiple Center studies. There is limited literature on previous attempts at such a data synthesis for post-mortem tissue studies in schizophrenia research; two such papers are by Knable et al. (2001, 2002). The ultimate goal of our project is to identify possible heterogeneous groups of subjects with schizophrenia based on the various neurobiological markers, and this requires a series of major methodological steps. Ultimately, the purpose of our long-term research is to provide new insights into the neurobiology of schizophrenia.

The Conte Center databases we need to use involve numerous studies with varying subject populations and differing types of data. The main data issues include repeated measurements, differing matched controls for the same subject with schizophrenia, and scientifically important covariates. Whenever repeated measurements occur, we plan to combine them into a single observation appropriate to that study. Multivariate normal models with structured means and covariance matrices have been developed by Wu and Sampson (2008) to deal with the differing matched controls and covariates.

Although the data are usually assumed to be normally distributed in the context of post-mortem tissue studies in schizophrenia, we derive our new procedure in a general setting with arbitrary component distributions. An application to the data structure in the large neurobiological project is illustrated as an example. This application uses the structured models introduced in Wu and Sampson (2008), which are revisited in Section 4. The actual data we will eventually combine from multiple studies show a considerable degree of missingness, and implementing the clustering algorithm for the Center’s databases will require carefully crafted multiple imputation schemes. Due to the high computational burden of multiple imputation, the time efficiency of the corresponding complete-data clustering algorithms is a central concern.

For a sample of *n* independent multivariate observations **y**_{1}, …, **y**_{n}, we consider a setting of finite mixture models with

$$f({\mathbf{y}}_{i}\mid {z}_{ik}=1)={f}_{ik}({\mathbf{y}}_{i};\mathit{\theta})$$

for *i* = 1, …, *n* and *k* = 1, …, *g*, where *f_{ik}*(·; ·) depends on a parameter vector **θ** and may differ across both subjects and components, and *z_{ik}* is the unobserved indicator that equals 1 if the *i*th subject belongs to the *k*th cluster and 0 otherwise.

In the above settings, the likelihood function for the observed data *Y* = (**y**_{1}, …, **y**_{n})′ is given by

$$L(\mathit{\pi},\mathit{\theta}\mid Y)=\prod _{i=1}^{n}[\sum _{k=1}^{g}{\pi}_{k}{f}_{ik}({\mathbf{y}}_{i};\mathit{\theta})]$$

(1)

where **π** = (π_{1}, …, π_{g})′ is the vector of mixing proportions with π_{1} + ··· + π_{g} = 1. Were the indicators *Z* = {*z_{ik}*} observed, the “complete data” likelihood function would be

$$L(\mathit{\pi},\mathit{\theta}\mid Y,Z)=\prod _{i=1}^{n}\prod _{k=1}^{g}{[{\pi}_{k}{f}_{ik}({\mathbf{y}}_{i};\mathit{\theta})]}^{{z}_{ik}}.$$

Denote **ϑ** = (**π**′, **θ**′)′ for notational ease. The observed likelihood function (1) is then maximized through iterative maximizations of the conditional expectation

$$Q(\mathit{\vartheta}\mid {\mathit{\vartheta}}^{(t)})={E}_{{\mathit{\vartheta}}^{(t)}}[logL(\mathit{\vartheta}\mid Y,Z)\mid Y]=\sum _{i=1}^{n}\sum _{k=1}^{g}{\tau}_{ik}^{(t)}[log{\pi}_{k}+log{f}_{ik}({\mathbf{y}}_{i};\mathit{\theta})]$$

(2)

for *t* = 0, 1, 2, *…*, where

$${\tau}_{ik}^{(t)}=\frac{{\pi}_{k}^{(t)}{f}_{ik}({\mathbf{y}}_{i};{\mathit{\theta}}^{(t)})}{{\sum}_{j=1}^{g}{\pi}_{j}^{(t)}{f}_{ij}({\mathbf{y}}_{i};{\mathit{\theta}}^{(t)})}.$$

The EM algorithm consists of iterations of an E-step, which computes (2), and an M-step, which maximizes (2) with respect to its first argument. The EM algorithm takes advantage of the fact that

$$\frac{\partial Q(\mathit{\vartheta}\mid {\mathit{\vartheta}}^{(t)})}{\partial \mathit{\vartheta}}=0$$

(3)

has a closed form solution. If this is not the case, then the EM algorithm loses its basic merit of simplicity. In fact, it is easy to see that there is always a closed form solution for **π**, given by

$${\pi}_{k}^{(t+1)}=\frac{{\sum}_{i=1}^{n}{\tau}_{ik}^{(t)}}{n}$$

(4)

for *k* = 1, *…*, *g*.
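For concreteness, the E-step quantities τ_{ik}^{(t)} and the closed-form update (4) can be sketched as follows for a mixture of normal densities. This is a minimal Python/numpy illustration; the paper's own implementation is in R, and the function names and the use of a single density family across subjects are ours.

```python
import numpy as np

def mvn_pdf(Y, mean, cov):
    """Multivariate normal density evaluated at each row of Y."""
    d = Y.shape[1]
    diff = Y - mean
    inv = np.linalg.inv(cov)
    # quad[i] = (y_i - mean)' cov^{-1} (y_i - mean)
    quad = np.einsum('ij,jk,ik->i', diff, inv, diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))

def e_step(Y, pi, means, covs):
    """Responsibilities tau[i, k] = pi_k f_ik(y_i) / sum_j pi_j f_ij(y_i)."""
    n, g = Y.shape[0], len(pi)
    dens = np.empty((n, g))
    for k in range(g):
        dens[:, k] = pi[k] * mvn_pdf(Y, means[k], covs[k])
    return dens / dens.sum(axis=1, keepdims=True)

def update_pi(tau):
    """Closed-form update (4): pi_k^(t+1) = sum_i tau_ik^(t) / n."""
    return tau.mean(axis=0)
```

Each row of the responsibility matrix sums to one, so the updated mixing proportions automatically sum to one as well.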

When the closed form solution for **θ** is not available, the EM algorithm requires another iterative algorithm, e.g., Newton-Raphson, in the M-step to solve (3). Many researchers find this second set of iterations inefficient. According to McLachlan and Krishnan (2008, p. 25 and pp. 149–153), Rai and Matthews (1993) proposed the EM1 algorithm, which replaces the entire M-step of the EM algorithm with one iteration of Newton’s method given by

$${\mathit{\theta}}^{(t+1)}={\mathit{\theta}}^{(t)}-{\alpha}^{(t)}{[\frac{{\partial}^{2}Q(\mathit{\vartheta}\mid {\mathit{\vartheta}}^{(t)})}{\partial \mathit{\theta}\partial {\mathit{\theta}}^{\prime}}]}_{{\mathit{\vartheta}}^{(t)}}^{-1}{[\frac{\partial Q(\mathit{\vartheta}\mid {\mathit{\vartheta}}^{(t)})}{\partial \mathit{\theta}}]}_{{\mathit{\vartheta}}^{(t)}}$$

where 0 < *α*^{(t)} ≤ 1 adjusts the step size in each iteration to ensure the monotonicity of *L*(**ϑ**^{(t)}|*Y*) with respect to *t*. A special version of the EM1 algorithm with *α*^{(t)} ≡ 1 was later referred to as the EM-gradient algorithm by Lange (1995a). In addition, Lange (1995a) considered choosing *α*^{(t)} ≡ *α* to adjust the EM-gradient iterations and speed up convergence. This modification then served as the basis of a quasi-Newton acceleration introduced in Lange (1995b), where he noted that the EM-gradient algorithm has almost identical local convergence properties to the EM algorithm.
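A single EM1/EM-gradient update can be written generically. In the Python sketch below (names ours), `grad_Q` and `hess_Q` are assumed to be callables evaluating ∂Q/∂**θ** and ∂²Q/∂**θ**∂**θ**′ at the current iterate.

```python
import numpy as np

def em_gradient_step(theta, grad_Q, hess_Q, alpha=1.0):
    """One EM1 / EM-gradient update: the whole M-step is replaced by a
    single Newton step on Q(. | theta^(t)), scaled by 0 < alpha <= 1."""
    g = grad_Q(theta)     # dQ/dtheta at theta^(t)
    H = hess_Q(theta)     # d^2Q/dtheta dtheta' at theta^(t)
    # theta^(t+1) = theta^(t) - alpha * H^{-1} g, via a linear solve
    return theta - alpha * np.linalg.solve(H, g)
```

For a quadratic Q the step with *α* = 1 lands exactly on the maximizer, mirroring the local behavior of Newton's method.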

As another approach, Titterington (1984) proposed to update ** θ** according to

$${\mathit{\theta}}^{(t+1)}={\mathit{\theta}}^{(t)}-{\alpha}^{(t)}{[{I}_{c}(\mathit{\theta})]}_{{\mathit{\vartheta}}^{(t)}}^{-1}{[\frac{\partial Q(\mathit{\vartheta}\mid {\mathit{\vartheta}}^{(t)})}{\partial \mathit{\theta}}]}_{{\mathit{\vartheta}}^{(t)}}$$

where

$${I}_{c}(\mathit{\theta})={E}_{\mathit{\vartheta}}[\frac{{\partial}^{2}logL(\mathit{\vartheta}\mid Y,Z)}{\partial \mathit{\theta}\partial {\mathit{\theta}}^{\prime}}]$$

is the complete data information matrix with respect to ** θ**. For a variety of models, e.g., mixtures with normal densities,

$${[{I}_{c}(\mathit{\theta})]}_{{\mathit{\vartheta}}^{(t)}}=E{[\frac{{\partial}^{2}Q(\mathit{\vartheta}\mid {\mathit{\vartheta}}^{(t)})}{\partial \mathit{\theta}\partial {\mathit{\theta}}^{\prime}}]}_{{\mathit{\vartheta}}^{(t)}}.$$

(5)

Hence Titterington’s algorithm can be thought of as a scoring version of the EM1 or EM-gradient algorithm. As with the EM1 algorithm, a fractional step achieved by a small enough *α*^{(t)} will ensure an increase of *L*(**ϑ**^{(t)}|*Y*).
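Titterington's step differs from the EM-gradient step only in the matrix it inverts. A sketch under our naming, where `cond_exp_hess(i, k, theta)` stands in for E[∂²log *f_{ik}*(**y**_{i}; **θ**)/∂**θ**∂**θ**′ | *z_{ik}* = 1]:

```python
import numpy as np

def titterington_step(theta, pi, n, cond_exp_hess, grad_Q, alpha=1.0):
    """Scoring-type update: theta - alpha * I_c(theta)^{-1} dQ/dtheta,
    where I_c = sum_i sum_k pi_k E[d^2 log f_ik | z_ik = 1]."""
    p = theta.size
    Ic = np.zeros((p, p))
    for i in range(n):
        for k in range(len(pi)):
            Ic += pi[k] * cond_exp_hess(i, k, theta)
    return theta - alpha * np.linalg.solve(Ic, grad_Q(theta))
```

The conditional expected Hessians are often available in closed form when the Hessian of Q itself is awkward, which is the computational appeal of the scoring version.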

Now let us take a detailed look at the matrices ∂²*Q*(**ϑ**|**ϑ**^{(t)})/∂**θ**∂**θ**′ and *I_c*(**θ**). Direct calculation gives

$$\frac{{\partial}^{2}Q(\mathit{\vartheta}\mid {\mathit{\vartheta}}^{(t)})}{\partial \mathit{\theta}\partial {\mathit{\theta}}^{\prime}}=\sum _{i=1}^{n}\sum _{k=1}^{g}{\tau}_{ik}^{(t)}[\frac{{\partial}^{2}log{f}_{ik}({\mathbf{y}}_{i};\mathit{\theta})}{\partial \mathit{\theta}\partial {\mathit{\theta}}^{\prime}}]$$

and

$${I}_{c}(\mathit{\theta})=\sum _{i=1}^{n}\sum _{k=1}^{g}{\pi}_{k}^{(t)}E[\frac{{\partial}^{2}log{f}_{ik}({\mathbf{y}}_{i};\mathit{\theta})}{\partial \mathit{\theta}\partial {\mathit{\theta}}^{\prime}}\mid {z}_{ik}=1].$$

(6)

Titterington’s algorithm is possibly superior to the EM1 algorithms in terms of speed of convergence when the conditional expectations *E*[∂²log *f_{ik}*(**y**_{i}; **θ**)/∂**θ**∂**θ**′ | *z_{ik}* = 1] have simpler forms than the corresponding observed second derivatives. Comparing the two matrices above, we propose a new algorithm which replaces the weights π_{k}^{(t)} in (6) with the individual posterior probabilities τ_{ik}^{(t)}; that is, **θ** is updated according to

$${\mathit{\theta}}^{(t+1)}={\mathit{\theta}}^{(t)}-{\alpha}^{(t)}{[H(\mathit{\theta})]}_{{\mathit{\vartheta}}^{(t)}}^{-1}{[\frac{\partial Q(\mathit{\vartheta}\mid {\mathit{\vartheta}}^{(t)})}{\partial \mathit{\theta}}]}_{{\mathit{\vartheta}}^{(t)}}$$

(7)

where

$$H(\mathit{\theta})=\sum _{i=1}^{n}\sum _{k=1}^{g}{\tau}_{ik}^{(t)}E[\frac{{\partial}^{2}log{f}_{ik}({\mathbf{y}}_{i};\mathit{\theta})}{\partial \mathit{\theta}\partial {\mathit{\theta}}^{\prime}}\mid {z}_{ik}=1].$$

As can be seen, the new modification shares the same numerical simplicity as Titterington’s algorithm. The self-consistency of the new algorithm is shown in Section 5. In the simulations in Section 6, the new algorithm with *α*^{(t)} ≡ 1 is compared to the EM-gradient algorithm and Titterington’s algorithm in terms of convergence speed and clustering accuracy.
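The corresponding sketch of update (7) differs from Titterington's step only in the weights used to build *H*(**θ**); again `cond_exp_hess(i, k, theta)` is our stand-in for the conditional expected Hessian.

```python
import numpy as np

def new_step(theta, tau, cond_exp_hess, grad_Q, alpha=1.0):
    """Update (7): H(theta) weights the conditional expected Hessians by
    the responsibilities tau_ik^(t) instead of the proportions pi_k^(t)."""
    n, g = tau.shape
    p = theta.size
    H = np.zeros((p, p))
    for i in range(n):
        for k in range(g):
            H += tau[i, k] * cond_exp_hess(i, k, theta)
    return theta - alpha * np.linalg.solve(H, grad_Q(theta))
```

Because τ_{ik}^{(t)} concentrates the weight on the component each subject most plausibly belongs to, *H* tracks the curvature of Q more closely than *I_c* while remaining as cheap to form.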

Although the new algorithm developed in Section 3 is applicable in more general settings, in this section we introduce a special example involving structured models, because these models are critical in analyzing the motivating databases. In addition, this example is well suited to demonstrating some basic advantages of the new algorithm in analyzing complex databases.

As noted by Wu and Sampson (2008), several special considerations arise when we attempt to combine data from multiple studies in the Conte Center. First, by December 31, 2005, about 35 separate post-mortem tissue studies had been conducted. These studies involve different neurobiological measurements, such as neuron counts, neuron somal volume and certain mRNA expression levels, on differing subject populations. As a result, the combined data would involve multiple responses and a considerable degree of missingness. Carefully crafted multiple imputation schemes will be required to deal with the missing data. In this paper, we develop an efficient complete data clustering algorithm which will be implemented later in conjunction with multiple imputations. Second, in order to control for both experimental and demographic variation, every subject with schizophrenia has been matched with a normal control subject in each study based on their ages at death, gender and post-mortem intervals. Paired differences between measurements on the subjects with schizophrenia and the corresponding controls are typically obtained and treated as primary data in the original analyses; this convention is adopted here. Nevertheless, the matched controls for a subject with schizophrenia might differ across studies. To be more explicit, let *S_{ij}* and *C_{ij}* denote the measurements in study *j* on the *i*th subject with schizophrenia and on his or her matched control, respectively. Assuming the measurements on the subjects with schizophrenia are independent of those on the controls,

$$\mathit{Cov}({S}_{ij}-{C}_{ij},{S}_{ik}-{C}_{ik})=\mathit{Cov}({S}_{ij},{S}_{ik})+\mathit{Cov}({C}_{ij},{C}_{ik})\phantom{\rule{0.38889em}{0ex}}\text{for}\phantom{\rule{0.38889em}{0ex}}j\ne k.$$

When *C_{ij}* and *C_{ik}* are measurements on the same control subject, Cov(*C_{ij}*, *C_{ik}*) does not vanish, whereas it does when the matched controls differ across studies; this is the source of the structured covariance matrices described below. Third, scientifically important covariates enter through a structured mean,

$$E[{\mathbf{y}}_{i}\mid {X}_{i}]={X}_{i}\mathit{\beta},$$

where *X_{i}* is the covariate matrix whose rows correspond to the components of **y**_{i} and **β** is the vector of regression parameters.

Wu and Sampson (2008) developed multivariate normal models with structured means and covariance matrices to deal with these special considerations in the Center’s databases. The structured means result from the covariates of the subjects with schizophrenia, whereas the structured covariance matrices are due to the differing control subjects that were matched to the same subjects with schizophrenia in different studies. Consider the responses **y**_{1}, …, **y**_{n} representing the pairwise differences between the subjects with schizophrenia and their matched controls across *p* studies. The covariance matrix of **y**_{i} is modeled as

$${\mathrm{\sum}}_{i}(\mathit{\sigma})=\sum _{j=1}^{p}{\sigma}_{jj}{G}_{jj}+\sum _{1\le j<k\le p}({\sigma}_{jk}+{\sigma}_{jk}^{c}{I}_{i}^{jk}){G}_{jk}$$

where the *G_{jk}*, 1 ≤ *j* ≤ *k* ≤ *p*, are known symmetric design matrices, *I_{i}^{jk}* indicates whether the same control subject was matched to the *i*th subject in studies *j* and *k*, and *σ_{jk}^{c}* is the additional covariance induced by a shared control.

Following the model specification in Wu and Sampson (2008), in this example we consider a finite mixture model with component density functions *f_{ik}*(**y**_{i}; **θ**) = *φ*(**y**_{i}; *X_{i}***β**_{k}, Σ_{i}(**σ**)), where *φ* denotes the multivariate normal density and **θ** = (**β**_{1}′, …, **β**_{g}′, **σ**′)′. Direct calculation gives

$$\frac{\partial Q(\mathit{\vartheta}\mid {\mathit{\vartheta}}^{(t)})}{\partial {\mathit{\beta}}_{k}}=\sum _{i=1}^{n}{\tau}_{ik}^{(t)}{X}_{i}^{\prime}{\mathrm{\sum}}_{i}^{-1}({\mathbf{y}}_{i}-{X}_{i}{\mathit{\beta}}_{k}),\phantom{\rule{0.38889em}{0ex}}k=1,\dots ,g,$$

(8)

$$\frac{\partial Q(\mathit{\vartheta}\mid {\mathit{\vartheta}}^{(t)})}{\partial {\sigma}_{j}}=\frac{1}{2}\sum _{i=1}^{n}tr({\mathrm{\sum}}_{i}^{-1}\frac{\partial {\mathrm{\sum}}_{i}}{\partial {\sigma}_{j}}{\mathrm{\sum}}_{i}^{-1}(\sum _{k=1}^{g}{\tau}_{ik}^{(t)}{C}_{ik}-{\mathrm{\sum}}_{i})),\phantom{\rule{0.38889em}{0ex}}j=1,\dots ,q$$

(9)

where *C_{ik}* = (**y**_{i} − *X_{i}***β**_{k})(**y**_{i} − *X_{i}***β**_{k})′ and *q* is the number of covariance parameters in **σ**. The second derivatives are

$$\begin{array}{l}-\frac{{\partial}^{2}Q(\mathit{\vartheta}\mid {\mathit{\vartheta}}^{(t)})}{\partial {\mathit{\beta}}_{k}\partial {\mathit{\beta}}_{k}^{\prime}}=\sum _{i=1}^{n}{\tau}_{ik}^{(t)}{X}_{i}^{\prime}{\mathrm{\sum}}_{i}^{-1}{X}_{i},\phantom{\rule{0.38889em}{0ex}}1\le k\le g,\\ -\frac{{\partial}^{2}Q(\mathit{\vartheta}\mid {\mathit{\vartheta}}^{(t)})}{\partial {\sigma}_{j}\partial {\sigma}_{l}}=\frac{1}{2}\sum _{i=1}^{n}tr({\mathrm{\sum}}_{i}^{-1}\frac{\partial {\mathrm{\sum}}_{i}}{\partial {\sigma}_{j}}{\mathrm{\sum}}_{i}^{-1}\frac{\partial {\mathrm{\sum}}_{i}}{\partial {\sigma}_{l}}{\mathrm{\sum}}_{i}^{-1}(2\sum _{k=1}^{g}{\tau}_{ik}^{(t)}{C}_{ik}-{\mathrm{\sum}}_{i}))\\ +\sum _{i=1}^{n}tr({\mathrm{\sum}}_{i}^{-1}\frac{{\partial}^{2}{\mathrm{\sum}}_{i}}{\partial {\sigma}_{j}\partial {\sigma}_{l}}{\mathrm{\sum}}_{i}^{-1}(\sum _{k=1}^{g}{\tau}_{ik}^{(t)}{C}_{ik}-{\mathrm{\sum}}_{i})),\phantom{\rule{0.38889em}{0ex}}1\le j,l\le q,\\ -\frac{{\partial}^{2}Q(\mathit{\vartheta}\mid {\mathit{\vartheta}}^{(t)})}{\partial {\mathit{\beta}}_{j}\partial {\mathit{\beta}}_{k}^{\prime}}=0,\phantom{\rule{0.38889em}{0ex}}1\le k\ne j\le g,\\ -\frac{{\partial}^{2}Q(\mathit{\vartheta}\mid {\mathit{\vartheta}}^{(t)})}{\partial {\mathit{\beta}}_{k}\partial {\sigma}_{j}}=\sum _{i=1}^{n}{\tau}_{ik}^{(t)}{X}_{i}^{\prime}{\mathrm{\sum}}_{i}^{-1}\frac{\partial {\mathrm{\sum}}_{i}}{\partial {\sigma}_{j}}{\mathrm{\sum}}_{i}^{-1}({\mathbf{y}}_{i}-{X}_{i}{\mathit{\beta}}_{k}),\phantom{\rule{0.38889em}{0ex}}1\le k\le g,\phantom{\rule{0.38889em}{0ex}}1\le j\le q.\end{array}$$

On the other hand,

$$\begin{array}{l}-{[{I}_{c}(\mathit{\theta})]}_{{\mathit{\beta}}_{k},{\mathit{\beta}}_{k}}=\sum _{i=1}^{n}{\pi}_{k}^{(t)}{X}_{i}^{\prime}{\mathrm{\sum}}_{i}^{-1}{X}_{i},\phantom{\rule{0.38889em}{0ex}}1\le k\le g,\\ -{[{I}_{c}(\mathit{\theta})]}_{{\sigma}_{j},{\sigma}_{l}}=\frac{1}{2}\sum _{i=1}^{n}tr({\mathrm{\sum}}_{i}^{-1}\frac{\partial {\mathrm{\sum}}_{i}}{\partial {\sigma}_{j}}{\mathrm{\sum}}_{i}^{-1}\frac{\partial {\mathrm{\sum}}_{i}}{\partial {\sigma}_{l}}),\phantom{\rule{0.38889em}{0ex}}1\le j,l\le q,\\ -{[{I}_{c}(\mathit{\theta})]}_{{\mathit{\beta}}_{j},{\mathit{\beta}}_{k}}={[{I}_{c}(\mathit{\theta})]}_{{\mathit{\beta}}_{k},{\sigma}_{l}}=0,\phantom{\rule{0.38889em}{0ex}}1\le k\ne j\le g,\phantom{\rule{0.38889em}{0ex}}1\le l\le q.\end{array}$$

Thus, our newly proposed algorithm (7) with *α*^{(t)} ≡ 1 yields updates given by

$${\mathit{\beta}}_{k}^{(t+1)}={[\sum _{i=1}^{n}{\tau}_{ik}^{(t)}{X}_{i}^{\prime}{\mathrm{\sum}}_{i}^{-1}{X}_{i}]}_{{\mathit{\vartheta}}^{(t)}}^{-1}{[\sum _{i=1}^{n}{\tau}_{ik}^{(t)}{X}_{i}^{\prime}{\mathrm{\sum}}_{i}^{-1}{\mathbf{y}}_{i}]}_{{\mathit{\vartheta}}^{(t)}},\phantom{\rule{0.38889em}{0ex}}k=1,\dots ,g,$$

(10)

$$\begin{array}{l}{\mathit{\sigma}}^{(t+1)}={[\sum _{i=1}^{n}tr({\mathrm{\sum}}_{i}^{-1}\frac{\partial {\mathrm{\sum}}_{i}}{\partial {\sigma}_{j}}{\mathrm{\sum}}_{i}^{-1}\frac{\partial {\mathrm{\sum}}_{i}}{\partial {\sigma}_{l}})]}_{{\mathit{\vartheta}}^{(t)}}^{-1}{[\sum _{i=1}^{n}tr({\mathrm{\sum}}_{i}^{-1}\frac{\partial {\mathrm{\sum}}_{i}}{\partial {\sigma}_{l}}{\mathrm{\sum}}_{i}^{-1}(\sum _{k=1}^{g}{\tau}_{ik}^{(t)}{C}_{ik}))]}_{{\mathit{\vartheta}}^{(t)}}\\ +{\mathit{\sigma}}^{(t)}-{[\sum _{i=1}^{n}tr({\mathrm{\sum}}_{i}^{-1}\frac{\partial {\mathrm{\sum}}_{i}}{\partial {\sigma}_{j}}{\mathrm{\sum}}_{i}^{-1}\frac{\partial {\mathrm{\sum}}_{i}}{\partial {\sigma}_{l}})]}_{{\mathit{\vartheta}}^{(t)}}^{-1}{[\sum _{i=1}^{n}tr({\mathrm{\sum}}_{i}^{-1}\frac{\partial {\mathrm{\sum}}_{i}}{\partial {\sigma}_{l}})]}_{{\mathit{\vartheta}}^{(t)}}.\end{array}$$

(11)

When Σ_{i} is linear in the components of **σ**, as it is here, Σ_{l}(∂Σ_{i}/∂σ_{l})σ_{l} = Σ_{i}, so the last two terms of (11) cancel and the update reduces to

$${\mathit{\sigma}}^{(t+1)}={[\sum _{i=1}^{n}tr({\mathrm{\sum}}_{i}^{-1}\frac{\partial {\mathrm{\sum}}_{i}}{\partial {\sigma}_{j}}{\mathrm{\sum}}_{i}^{-1}\frac{\partial {\mathrm{\sum}}_{i}}{\partial {\sigma}_{l}})]}_{{\mathit{\vartheta}}^{(t)}}^{-1}{[\sum _{i=1}^{n}tr({\mathrm{\sum}}_{i}^{-1}\frac{\partial {\mathrm{\sum}}_{i}}{\partial {\sigma}_{l}}{\mathrm{\sum}}_{i}^{-1}(\sum _{k=1}^{g}{\tau}_{ik}^{(t)}{C}_{ik}))]}_{{\mathit{\vartheta}}^{(t)}}.$$

(12)

Clearly, the quantities in (10) are the maximum likelihood estimates of the **β**’s when the individual clustering probabilities and the covariance matrices are assumed to be known. The calculation of (12) is easier than that of the corresponding update in the EM-gradient algorithm and as easy as that of the corresponding update in Titterington’s algorithm.
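Update (10) amounts to a responsibility-weighted generalized least squares fit. A sketch, assuming the τ_{ik}^{(t)}, *X_{i}* and Σ_{i} are already available (the function and argument names are ours):

```python
import numpy as np

def beta_update(tau_k, X_list, Sigma_list, Y):
    """Update (10): beta_k = [sum_i tau_ik X_i' S_i^{-1} X_i]^{-1}
                             [sum_i tau_ik X_i' S_i^{-1} y_i]."""
    p = X_list[0].shape[1]
    A = np.zeros((p, p))
    b = np.zeros(p)
    for t, X, S, y in zip(tau_k, X_list, Sigma_list, Y):
        Si = np.linalg.inv(S)
        A += t * X.T @ Si @ X
        b += t * X.T @ Si @ y
    return np.linalg.solve(A, b)
```

With all weights equal to one and identity covariance matrices this reduces to ordinary least squares, which makes the update easy to sanity-check.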

It was shown by Dempster, Laird, and Rubin (1977) that *L*(**ϑ**^{(t+1)}|*Y*) ≥ *L*(**ϑ**^{(t)}|*Y*) as long as *Q*(**ϑ**^{(t+1)}|**ϑ**^{(t)}) ≥ *Q*(**ϑ**^{(t)}|**ϑ**^{(t)}). So the convergence of the EM algorithm to a local maximum of *L*(**ϑ**|*Y*) is guaranteed under mild regularity conditions. In addition, it can be shown that

$${[\frac{{\partial}^{2}Q(\mathit{\vartheta}\mid {\mathit{\vartheta}}^{(t)})}{\partial \mathit{\theta}\partial {\mathit{\theta}}^{\prime}}]}_{{\mathit{\vartheta}}^{(t)}}<{[\frac{{\partial}^{2}logL(\mathit{\vartheta}\mid Y)}{\partial \mathit{\theta}\partial {\mathit{\theta}}^{\prime}}]}_{{\mathit{\vartheta}}^{(t)}}.$$

So the negative definiteness of [∂²*Q*(**ϑ**|**ϑ**^{(t)})/∂**θ**∂**θ**′]_{**ϑ**^{(t)}} around a local maximum follows from the negative definiteness of [∂²log *L*(**ϑ**|*Y*)/∂**θ**∂**θ**′]_{**ϑ**^{(t)}} there.

The same ascent property can also be established for our newly proposed algorithm. First, the negative definiteness of [*H*(**θ**)]_{**ϑ**^{(t)}} follows from the assumption that each conditional expectation

$$E{[\frac{{\partial}^{2}log{f}_{ik}({\mathbf{y}}_{i};\mathit{\theta})}{\partial \mathit{\theta}\partial {\mathit{\theta}}^{\prime}}\mid {z}_{ik}=1]}_{{\mathit{\vartheta}}^{(t)}}$$

is negative definite. Here we can focus only on the parameter **θ**, because (4), as a solution to (3), automatically leads to an increase in *Q* with respect to **π**. Next, a second-order Taylor expansion gives

$$\begin{array}{l}Q({\mathit{\theta}}^{(t+1)}\mid {\mathit{\theta}}^{(t)})-Q({\mathit{\theta}}^{(t)}\mid {\mathit{\theta}}^{(t)})={[\frac{\partial Q(\mathit{\theta}\mid {\mathit{\theta}}^{(t)})}{\partial \mathit{\theta}}]}_{{\mathit{\theta}}^{(t)}}^{\prime}({\mathit{\theta}}^{(t+1)}-{\mathit{\theta}}^{(t)})\\ +\frac{1}{2}({\mathit{\theta}}^{(t+1)}-{\mathit{\theta}}^{(t)}{)}^{\prime}{[\frac{{\partial}^{2}Q(\mathit{\theta}\mid {\mathit{\theta}}^{(t)})}{\partial \mathit{\theta}\partial {\mathit{\theta}}^{\prime}}]}_{{\mathit{\theta}}^{\ast}}({\mathit{\theta}}^{(t+1)}-{\mathit{\theta}}^{(t)}),\end{array}$$

(13)

where *θ*^{*} is between *θ*^{(}^{t}^{+1)} and *θ*^{(}^{t}^{)}. Now plugging our new update (7) into (13), we have

$$\begin{array}{l}Q({\mathit{\theta}}^{(t+1)}\mid {\mathit{\theta}}^{(t)})-Q({\mathit{\theta}}^{(t)}\mid {\mathit{\theta}}^{(t)})=-{\alpha}^{(t)}{[\frac{\partial Q(\mathit{\theta}\mid {\mathit{\theta}}^{(t)})}{\partial \mathit{\theta}}]}_{{\mathit{\theta}}^{(t)}}^{\prime}{[H(\mathit{\theta})]}_{{\mathit{\theta}}^{(t)}}^{-1}{[\frac{\partial Q(\mathit{\theta}\mid {\mathit{\theta}}^{(t)})}{\partial \mathit{\theta}}]}_{{\mathit{\theta}}^{(t)}}\\ +\frac{1}{2}{({\alpha}^{(t)})}^{2}{[\frac{\partial Q(\mathit{\theta}\mid {\mathit{\theta}}^{(t)})}{\partial \mathit{\theta}}]}_{{\mathit{\theta}}^{(t)}}^{\prime}{[H(\mathit{\theta})]}_{{\mathit{\theta}}^{(t)}}^{-1}{[\frac{{\partial}^{2}Q(\mathit{\theta}\mid {\mathit{\theta}}^{(t)})}{\partial \mathit{\theta}\partial {\mathit{\theta}}^{\prime}}]}_{{\mathit{\theta}}^{\ast}}{[H(\mathit{\theta})]}_{{\mathit{\theta}}^{(t)}}^{-1}{[\frac{\partial Q(\mathit{\theta}\mid {\mathit{\theta}}^{(t)})}{\partial \mathit{\theta}}]}_{{\mathit{\theta}}^{(t)}}\\ ={[\frac{\partial Q(\mathit{\theta}\mid {\mathit{\theta}}^{(t)})}{\partial \mathit{\theta}}]}_{{\mathit{\theta}}^{(t)}}^{\prime}(\frac{1}{2}{({\alpha}^{(t)})}^{2}{[H(\mathit{\theta})]}_{{\mathit{\theta}}^{(t)}}^{-1}{[\frac{{\partial}^{2}Q(\mathit{\theta}\mid {\mathit{\theta}}^{(t)})}{\partial \mathit{\theta}\partial {\mathit{\theta}}^{\prime}}]}_{{\mathit{\theta}}^{\ast}}{[H({\mathit{\theta}}^{(t)})]}_{{\mathit{\theta}}^{(t)}}^{-1}\\ -{\alpha}^{(t)}{[H(\mathit{\theta})]}_{{\mathit{\theta}}^{(t)}}^{-1}){[\frac{\partial Q(\mathit{\theta}\mid {\mathit{\theta}}^{(t)})}{\partial \mathit{\theta}}]}_{{\mathit{\theta}}^{(t)}}.\end{array}$$

The above quantity is in quadratic form. In order to have *Q*(*θ*^{(t+1)}|*θ*^{(t)}) − *Q*(*θ*^{(t)}|*θ*^{(t)}) ≥ 0, it is sufficient that

$$\frac{1}{2}{({\alpha}^{(t)})}^{2}{[H(\mathit{\theta})]}_{{\mathit{\theta}}^{(t)}}^{-1}{[\frac{{\partial}^{2}Q(\mathit{\theta}\mid {\mathit{\theta}}^{(t)})}{\partial \mathit{\theta}\partial {\mathit{\theta}}^{\prime}}]}_{{\mathit{\theta}}^{\ast}}{[H(\mathit{\theta})]}_{{\mathit{\theta}}^{(t)}}^{-1}-{\alpha}^{(t)}{[H(\mathit{\theta})]}_{{\mathit{\theta}}^{(t)}}^{-1}\ge 0.$$

(14)

Since both [*H*(**θ**)]_{**θ**^{(t)}} and [∂²*Q*(**θ**|**θ**^{(t)})/∂**θ**∂**θ**′]_{**θ***} are negative definite, (14) is equivalent to

$${\alpha}^{(t)}I\le {[-\frac{{\partial}^{2}Q(\mathit{\theta}\mid {\mathit{\theta}}^{(t)})}{\partial \mathit{\theta}\partial {\mathit{\theta}}^{\prime}}]}_{{\mathit{\theta}}^{\ast}}^{-1/2}(-2{[H(\mathit{\theta})]}_{{\mathit{\theta}}^{(t)}}){[-\frac{{\partial}^{2}Q(\mathit{\theta}\mid {\mathit{\theta}}^{(t)})}{\partial \mathit{\theta}\partial {\mathit{\theta}}^{\prime}}]}_{{\mathit{\theta}}^{\ast}}^{-1/2}$$

(15)

where *I* is the identity matrix with the same dimension as *H*(**θ**). It is not hard to show that inequality (15) is satisfied whenever *α*^{(t)} is sufficiently small.

As another use, *α*^{(t)} can be adjusted to ensure that the parameter estimates fall in the parameter space; this is usually called step-halving. For example, one concern is that the estimated covariance matrices should be positive definite. By using a small *α*^{(t)} in (7), we are actually shrinking **θ**^{(t+1)} toward **θ**^{(t)}. Since the parameter space is usually an open set, given that **θ**^{(t)} is in the parameter space there will always be an *α*^{(t)} small enough to guarantee that **θ**^{(t+1)} is also in the parameter space. The direction of the inequality in (15) enables us to apply step-halving while guaranteeing an increase in *L*(**ϑ**|*Y*). However, using a small *α*^{(t)} reduces the speed of convergence.
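The step-halving scheme just described can be sketched as a small loop: shrink *α*^{(t)} until the candidate stays in the parameter space and the observed log-likelihood does not decrease. The function and argument names below are ours.

```python
def step_halving(theta, direction, loglik, in_space=lambda th: True,
                 alpha0=1.0, max_halves=20):
    """Halve alpha until theta + alpha*direction is in the parameter
    space and the observed log-likelihood does not decrease."""
    base = loglik(theta)
    alpha = alpha0
    for _ in range(max_halves):
        cand = theta + alpha * direction
        if in_space(cand) and loglik(cand) >= base:
            return cand, alpha
        alpha /= 2.0
    return theta, 0.0  # give up: keep the current iterate
```

Here `in_space` would check, e.g., that the implied covariance matrices remain positive definite, and `direction` is the full Newton-type step from (7).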

Carefully crafted multiple imputation schemes will be required in the last step of our long-term project, when we attempt clustering in the presence of the substantial amount of missing data in the database from the Center’s studies. The computational speed of the clustering component is critical to the long-term project, and we intend to use our new algorithm for this purpose. In this section, we show using simulations that our algorithm is computationally faster than both Titterington’s and the EM-gradient algorithms. Data are simulated to be complete except for the usual unobserved clustering indicators. A fixed value of *α*^{(t)} = 1 is used. All algorithms are coded and run in R.

Data with a structure conforming to the one described in Section 4 are simulated for the clustering analysis. The dimension of the outcomes **y**_{1}, …, **y**_{n} is assumed to be three. According to the component density functions

$${f}_{ik}({\mathbf{y}}_{i},\mathit{\theta})=\phi ({\mathbf{y}}_{i};{X}_{i}{\mathit{\beta}}_{k},{\mathrm{\sum}}_{i}(\mathit{\sigma}))$$

for *i* = 1, …, *n* and *k* = 1, 2, data sets containing data from two clusters are simulated. Each data set contains *n* = 500 subjects, 250 in each cluster. The covariate matrix *X_{i}* takes the form

$${X}_{i}=\left[\begin{array}{ccccccccc}1& {x}_{1}& {x}_{2}& 0& 0& 0& 0& 0& 0\\ 0& 0& 0& 1& {x}_{1}& {x}_{2}& 0& 0& 0\\ 0& 0& 0& 0& 0& 0& 1& {x}_{1}& {x}_{2}\end{array}\right]$$

for *i* = 1, …, *n*. The constant ‘1’ in the *X_{i}*’s represents the overall difference between the subjects with schizophrenia and the controls. The values of *x*_{1} and *x*_{2} are subject-level covariates. Let

$$\begin{array}{l}{\mathit{\beta}}_{1}=({\beta}_{11}^{1},{\beta}_{12}^{1},{\beta}_{13}^{1},{\beta}_{21}^{1},{\beta}_{22}^{1},{\beta}_{23}^{1},{\beta}_{31}^{1},{\beta}_{32}^{1},{\beta}_{33}^{1}{)}^{\prime}\\ =(-100,2,50,-50,2,50,-50,1,50{)}^{\prime}\\ {\mathit{\beta}}_{2}=({\beta}_{11}^{2},{\beta}_{12}^{2},{\beta}_{13}^{2},{\beta}_{21}^{2},{\beta}_{22}^{2},{\beta}_{23}^{2},{\beta}_{31}^{2},{\beta}_{32}^{2},{\beta}_{33}^{2}{)}^{\prime}\\ =(100,-2,50,50,2,50,50,-1,50{)}^{\prime}\end{array}$$

be those parameters for the two clusters, respectively. Explicitly speaking,
$({\beta}_{j1}^{1},{\beta}_{j2}^{1},{\beta}_{j3}^{1}{)}^{\prime}$ are the regression parameters for the *j*th component of **y**_{i} in the first cluster, for *j* = 1, 2, 3, and similarly for the second cluster. The covariance parameters are set as

$$\begin{array}{l}\mathit{\sigma}=({\sigma}_{11},{\sigma}_{22},{\sigma}_{33},{\sigma}_{12},{\sigma}_{13},{\sigma}_{23},{\sigma}_{12}^{c},{\sigma}_{13}^{c},{\sigma}_{23}^{c}{)}^{\prime}\\ =(1000,1500,1000,400,500,600,200,-100,-200{)}^{\prime}\end{array}$$

which creates five possible individual covariance matrices for the outcomes as follows:

$$\begin{array}{l}{\mathrm{\sum}}_{(1)}=\left[\begin{array}{ccc}1000& 400& 500\\ 400& 1500& 600\\ 500& 600& 1000\end{array}\right]{\mathrm{\sum}}_{(2)}=\left[\begin{array}{ccc}1000& 600& 500\\ 600& 1500& 600\\ 500& 600& 1000\end{array}\right]{\mathrm{\sum}}_{(3)}=\left[\begin{array}{ccc}1000& 400& 400\\ 400& 1500& 600\\ 400& 600& 1000\end{array}\right]\\ {\mathrm{\sum}}_{(4)}=\left[\begin{array}{ccc}1000& 400& 500\\ 400& 1500& 400\\ 500& 400& 1000\end{array}\right]{\mathrm{\sum}}_{(5)}=\left[\begin{array}{ccc}1000& 600& 400\\ 600& 1500& 400\\ 400& 400& 1000\end{array}\right].\end{array}$$

In this simulation study, we assume that the five possible matching schemes between subjects with schizophrenia and controls are equally likely. As a result, within each data set and each cluster, each of the above covariance matrices applies to 50 subjects. Although this assumption will most likely be violated in reality, the algorithm can be implemented in the same way when it is.

Direct application of either the EM-gradient algorithm or Titterington’s algorithm to our simulated data is time consuming. The three algorithms are therefore implemented on a random selection of 30 out of the 500 simulated data sets and shown to provide the same parameter estimates. Accordingly, only our new algorithm is used for parameter estimation on the remaining 470 simulated data sets.

First, we examine their computational speed in detail. Each of the three algorithms is implemented from the same starting values to find the parameter estimates. For comparability, the three algorithms are stopped according to the same criterion, namely when the change in the parameter estimates does not exceed a pre-defined limit. We observe that when the algorithms are started near the true parameter values, they converge in almost the same number of steps. However, when the algorithms are started far from the true parameter values, they behave differently. A typical result from one of the thirty selected data sets is shown in Figure 1. The x-axis represents the number of iterations, while the y-axis represents the value of the observed log-likelihood function evaluated at the parameter estimates in each iteration. Some early iteration history for the three algorithms, with low values (large negative values) of the log-likelihood function, is omitted from Figure 1 for readability. It can be seen that the EM-gradient algorithm converges in more than 80 steps and Titterington’s algorithm in about 65 steps, whereas our new algorithm requires only about 25 steps, a substantial advantage over the other two. In addition to the number of iterations before convergence, we also observe that the average time per iteration is 4.67 seconds for the EM-gradient algorithm and about 0.7 seconds for both Titterington’s algorithm and our new algorithm. We recognize, however, that the average time per iteration depends on the coding of the algorithms, the computing software and the hardware configuration.
As Figure 1 shows, the main feature of the new algorithm is that, when started far from the optimum, it requires significantly fewer iterations to locate the region containing a maximum, while its number of steps for the subsequent “local refinement” is comparable to that of the two existing algorithms. In addition, Table 1 reports the average number of steps to convergence and the mean per-iteration time, together with their standard deviations, for the three algorithms over the thirty selected data sets. These results confirm that our new algorithm is computationally more efficient than both the EM-gradient algorithm and Titterington’s algorithm.
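The two summaries reported per algorithm in Table 1 are the sample mean and sample standard deviation over the selected data sets; a minimal sketch (the iteration counts below are invented for illustration, not the paper’s values):

```python
import numpy as np

# Hypothetical per-data-set iteration counts for one algorithm
# (illustrative numbers only; not the results reported in Table 1).
steps_to_convergence = np.array([24, 26, 25, 27, 23])

def summarize(steps):
    """Average number of steps to convergence and its sample standard
    deviation (ddof=1), as tabulated per algorithm in Table 1."""
    return steps.mean(), steps.std(ddof=1)

avg_steps, sd_steps = summarize(steps_to_convergence)
```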

Iteration history of the clustering algorithms: (a) for the new algorithm; (b) for Titterington’s (1984) algorithm; (c) for the EM-gradient algorithm. The mean per iteration time, computed as the average of the computational time required for **...**

Average numbers of steps to convergence and mean per iteration time for the thirty simulated data sets

For the current simulations, two different approaches to choosing starting points are used for the purpose of demonstration. One approach is to choose parameters close to the true parameter values; the other is to start the algorithm from randomly generated clustering indices, i.e., a random starting point. For a single simulated data set, we define the final clustering result to be “successful” if the algorithm clusters more than 95% of the subjects correctly. Over the 500 simulated data sets, starting near the true parameter values yields “successful” clustering results on 100% of the data sets, while starting from a random point yields “successful” results for about 95% of them. For the remaining 5% of the simulated data sets, the algorithm either does not converge (1.4%) or converges (3.6%) to a solution in which the subjects are clustered completely at random. For the data sets with “successful” clustering results when started from random clustering indices, Table 2 summarizes the parameter estimates alongside the true parameter values. It can be seen that the parameter estimation is reasonably accurate when the algorithm finds the correct clusters. In fact, these results are surprisingly good: in practice, no one relies on a single random starting point when there is no information about where to start. To increase the chance of identifying the correct clustering, one can always start the algorithm from multiple starting points and take the solution maximizing the likelihood function as the result.
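The multi-start strategy just described can be sketched as follows; the `fit` and `log_likelihood` interfaces are hypothetical placeholders standing in for one run of the clustering algorithm and the observed-data log-likelihood, and the toy example at the bottom is ours.

```python
import numpy as np

def multi_start(fit, log_likelihood, n_starts=10, seed=0):
    """Run the clustering algorithm from several random starting points and
    keep the solution with the largest observed-data log-likelihood."""
    rng = np.random.default_rng(seed)
    best_theta, best_ll = None, -np.inf
    for _ in range(n_starts):
        theta = fit(rng)            # parameter estimates from one random start
        ll = log_likelihood(theta)  # observed-data log-likelihood of that fit
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta, best_ll

# Toy illustration: each "fit" draws a random value in [0, 2], and the
# "likelihood" prefers solutions near 1.0, so the best start wins out.
theta_hat, ll_hat = multi_start(lambda r: r.uniform(0.0, 2.0),
                                lambda t: -(t - 1.0) ** 2,
                                n_starts=50)
```

With enough random starts, the chance that every run lands in a poor local solution becomes small, which is the rationale for preferring multi-start over a single random initialization.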

In this paper, some special features of finite mixture models, as compared to general missing data problems, are exploited to speed up the parameter estimation algorithms. The EM-gradient algorithm provides updates that fit the data better in each iteration, while Titterington’s algorithm requires less computation per iteration. Our new algorithm combines both advantages and is shown to inherit their desirable local convergence properties. In addition, we show by simulation that our new algorithm converges in fewer iterations than its two predecessors while providing the same parameter estimates, and that its time per iteration is comparable to that of Titterington’s algorithm and lower than that of the EM-gradient algorithm. As discussed, the database of the Center’s studies that our ultimate project will use involves a great degree of missing data in addition to the unobserved clustering indicators. The next steps in our long-term goal of clustering subjects with schizophrenia are to develop specially crafted multiple imputation techniques, apply the newly developed clustering algorithm to the multiply imputed data sets, and finally integrate the multiple clustering results into a single clustering result.

We thank Dr. David Lewis for his insights and for allowing us to use the data obtained in his lab. We also thank the Center researchers, especially Dr. Takanori Hashimoto, for their generous help. This research was supported by NIMH Grants 5P50MH045156-18 and P50 MH084053-01. Finally, we thank the referees for their insightful comments.
- Arminger G, Stein P, Wittenberg J. Mixtures of conditional mean-and covariance-structure models. Psychometrika. 1999;64 (4):475–494.
- Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B. 1977;39 (1):1–38.
- DeSarbo WS, Corn LW. A maximum likelihood methodology for clusterwise linear regression. Journal of Classification. 1988;5:249–282.
- Johnson RA, Wichern DW. Applied multivariate statistical analysis. 5. Prentice-Hall, Inc.; Upper Saddle River, NJ: 2002.
- Jones PN, McLachlan GJ. Fitting finite mixture models in a regression context. Australian Journal of Statistics. 1992;34 (2):233–240.
- Knable MB, Bacrci BM, Barko JJ, Webster MJ, Torrey EF. Molecular abnormalities in the major psychiatric illnesses: Classification and regression tree (CRT) analysis of post-mortem prefrontal markers. Molecular Psychiatry. 2002;7:392–404. [PubMed]
- Knable MB, Torrey EF, Webster MJ, Bartko JJ. Multivariate analysis of prefrontal cortical data from the Stanley Foundation Neuropathology Consortium. Brain Research Bulletin. 2001;55 (5):651–659. [PubMed]
- Konopaske GT, Sweet RA, Wu Q, Sampson AR, Lewis DA. Regional specificity of chandelier neuron axon terminal alterations in schizophrenia. Neuroscience. 2006;138(1) [PubMed]
- Lange K. A gradient algorithm locally equivalent to the EM algorithm. Journal of the Royal Statistical Society B. 1995a;57 (2):425–437.
- Lange K. A quasi-Newton acceleration of the EM algorithm. Statistica Sinica. 1995b;5:1–18.
- McLachlan GJ, Basford KE. Mixture models: inference and applications to clustering. Dekker; New York: 1988.
- McLachlan GJ, Krishnan T. The EM Algorithm and Extensions. 2. Wiley; New York: 2008.
- Rai SN, Matthews DE. Improving the EM algorithm. Biometrics. 1993;49:587–591.
- Titterington DM. Recursive parameter estimation using incomplete data. Journal of the Royal Statistical Society B. 1984;46:257–267.
- Wu Q, Sampson AR. Structured modeling with applications to integrative analysis of post-mortem tissue studies in schizophrenia, unpublished. Dec, 2008. URL http://personal.ecu.edu/wuq/media/structuredmodel.pdf.