Comput Stat Data Anal. Author manuscript; available in PMC 2010 May 15.
Published in final edited form as:
Comput Stat Data Anal. 2009 May 15; 53(7): 2563–2572.
doi:  10.1016/j.csda.2008.12.005
PMCID: PMC2678729
NIHMSID: NIHMS84585

Mixture modeling with applications in schizophrenia research

Abstract

Finite mixture modeling, together with the EM algorithm, has been widely used in clustering analysis. Under such methods, the unknown group membership is usually treated as missing data. When the “complete data” (log-)likelihood function does not have an explicit solution, the simplicity of the EM algorithm breaks down. Authors, including Rai and Matthews (1993), Lange (1995a) and Titterington (1984), developed modified algorithms for this situation. Motivated by research in a large neurobiological project, we propose in this paper a new variant of such modifications and show that it is self-consistent. Moreover, simulations are conducted to demonstrate that the new variant converges faster than its predecessors.

Keywords: Clustering, EM algorithm, EM1 algorithm, EM-gradient algorithm, finite mixture models, schizophrenia

1 Introduction

Finite mixture modeling is a widely used clustering technique for moderately to highly complicated data structures (McLachlan and Basford, 1988). Such methods are model based and generate probabilities for the unknown group membership. The EM algorithm has been used to find the parameter estimates for finite mixture models. It benefits in terms of stability and simplicity from the fact that the “complete data” (log-)likelihood function has an explicit solution. When the “complete data” (log-)likelihood function has no explicit solution, another iterative algorithm is needed in the M-step in order to maximize the conditional log-likelihood function given the observed data. This has been thought to be inefficient by many authors. Therefore, authors, including Rai and Matthews (1993), Lange (1995a) and Titterington (1984), proposed modified algorithms which replaced the whole M-step of the EM algorithm with just one iteration of a Newton-type algorithm for maximizing the conditional log-likelihood function. Given that Newton-type algorithms converge at a quadratic rate, the modified algorithms were shown to converge at a linear rate.

On the other hand, when there are both outcome variables and covariates, different names, including “regression models for conditional normal mixtures” and “regression clustering”, have been given to the same problem. The basic idea is to cluster the subjects according to differences in the regression parameters or, in addition, in the covariance parameters. DeSarbo and Cron (1988) defined a regression model for finite normal mixtures with a univariate outcome, while Jones and McLachlan (1992) extended the model to a multivariate setting. Arminger et al. (1999) introduced several likelihood-based strategies for parameter estimation, including the EM algorithm and the EM-gradient algorithm (Lange, 1995a).

Motivated by research in a large neurobiological project, in this paper a new variant of the algorithms developed by Rai and Matthews (1993) and Titterington (1984) is proposed for a general finite mixture model. The main goal of our approach is to accelerate the algorithm without sacrificing its numerical simplicity. The rest of this paper is organized as follows. The motivation from the large neurobiological project is discussed in Section 2. Section 3 presents the derivation of the new algorithm. An application to the large neurobiological project is introduced in Section 4. Section 5 establishes the self-consistency of the new algorithm, whereas numerical simulations investigating the speed of convergence are given in Section 6. Section 7 contains the conclusions.

2 Motivation

Schizophrenia, a severe and disabling mental disease, has received a tremendous amount of research focus. A considerable amount of basic neurobiological research concerning schizophrenia has been conducted in the Conte Center for the Neuroscience of Mental Disorders in the Department of Psychiatry at the University of Pittsburgh. Differing neurobiological measurements from post-mortem brain tissue of subjects in the Brain Bank Core of the Center are involved in numerous studies. These studies typically attempt to identify neurobiological markers that are differentially expressed in subjects with schizophrenia as compared to normal controls (Konopaske et al., 2006). Consequently, it is of interest to undertake a large neurobiological project to integrate data from multiple of the Center’s studies. There is limited literature on previous attempts at such a data synthesis for post-mortem tissue studies in schizophrenia research. Two such papers are by Knable et al. (2001, 2002). The ultimate goal of our project is to identify possible heterogeneous groups of subjects with schizophrenia based on the various neurobiological markers, and this requires a series of major methodological steps. Clearly, the purpose of our long-term research is to provide new insights into the understanding of the neurobiology of schizophrenia.

The Conte Center databases we need to use involve numerous studies with varying subject populations and differing types of data. The main data issues include repeated measurements, differing matched controls for the same subject with schizophrenia and scientifically important covariates. Whenever repeated measurements occur, we plan to combine them into a single observation appropriate to that study. Multivariate normal models with structured means and covariance matrices have been developed by Wu and Sampson (2008) to deal with the differing matched controls and covariates.

Although the data are usually assumed to be normally distributed in the context of post-mortem tissue studies in schizophrenia, we derive our new procedure in a general setting with arbitrary component distributions. An application to the data structure in the large neurobiological project is illustrated as an example. This application uses the structured models introduced in Wu and Sampson (2008), which are revisited in Section 4. The actual data we will eventually combine from multiple studies show a considerable degree of missingness. Implementing the clustering algorithm for the Center’s databases will require carefully crafted multiple imputation schemes. Due to the high computational burden of multiple imputation, the time efficiency of the corresponding complete-data clustering algorithms is of concern.

3 Derivation of the new algorithm

For a sample of n independent multivariate observations y1, ···, yn, we consider a setting of finite mixture models with g ≥ 2 subpopulations. Let zi = (zi1, ···, zig)′, i = 1, ···, n, be the unobserved group indicators, where zik = 0 or 1 and $\sum_{k=1}^{g} z_{ik} = 1$; zik = 1 implies that yi is a random sample from the kth subpopulation. Marginally, z1, ···, zn are independent and identically distributed with a multinomial(1; π1, ···, πg) distribution, where 0 ≤ πk ≤ 1 and $\sum_{k=1}^{g} \pi_k = 1$. Conditional on Z = (z1, ···, zn)′, we assume y1, ···, yn have density functions given by

$$ f(y_i \mid z_{ik} = 1) = f_{ik}(y_i; \theta) $$

for i = 1, …, n and k = 1, …, g, where fik(·; ·) depends on i through, e.g., some external covariates of the subject or the design of the study, and the subscript k indicates the kth mixture component as identified by parts of θ. One example is fik(yi; θ) = φ(yi; Xiβk, Σi(σ)), where φ(·; μ, Σ) denotes the multivariate normal density function with mean μ and covariance matrix Σ, Xi is the known covariates matrix, and θ = (β1′, ···, βg′, σ′)′ is the vector of unknown parameters. In this case, fik(yi; θ) depends on i through Xi and the design of Σi(σ). Two possible designs of Σi(σ) include dimensions that change from subject to subject and random-effect covariates. The parameter vector βk identifies the kth mixture component. Here, the covariates Xi form a matrix with each row corresponding to a component of yi. This formulation is different from the conventional one used in multivariate linear regression, where Xi is a vector and βk is a matrix (Section 7.7, Johnson and Wichern, 2002). However, it provides more flexibility in modeling the mean structure of yi in that each component of yi can now have different covariate values, and it simplifies the notation that follows.

In the above settings, the likelihood function for the observed data Y = (y1, ···, yn)′ is given by

$$ L(\pi, \theta \mid Y) = \prod_{i=1}^{n}\left[\sum_{k=1}^{g} \pi_k f_{ik}(y_i; \theta)\right] \tag{1} $$

where π = (π1, ···, πg−1)′ since πg is redundant given $\sum_{k=1}^{g} \pi_k = 1$. However, directly maximizing (1) is intractable, even if fik(yi; θ) has a simple form. Instead, the EM algorithm introduced by Dempster, Laird, and Rubin (1977) focuses on the complete (augmented) data (Y, Z) and its likelihood function

$$ L(\pi, \theta \mid Y, Z) = \prod_{i=1}^{n}\prod_{k=1}^{g}\left[\pi_k f_{ik}(y_i; \theta)\right]^{z_{ik}}. $$

Denote ϑ = (π′, θ′)′ for notational ease. The observed likelihood function (1) is then maximized through iterative maximizations of the conditional expectation

$$ Q(\vartheta \mid \vartheta^{(t)}) = E_{\vartheta^{(t)}}\left[\log L(\vartheta \mid Y, Z) \mid Y\right] = \sum_{i=1}^{n}\sum_{k=1}^{g}\tau_{ik}^{(t)}\left[\log \pi_k + \log f_{ik}(y_i; \theta)\right] \tag{2} $$

for t = 0, 1, 2, ···, where

$$ \tau_{ik}^{(t)} = \frac{\pi_k^{(t)} f_{ik}(y_i; \theta^{(t)})}{\sum_{j=1}^{g}\pi_j^{(t)} f_{ij}(y_i; \theta^{(t)})}. $$

The EM algorithm consists of iterations of an E-step, which computes (2), and an M-step, which maximizes (2) with respect to its first argument. The EM algorithm takes advantage of the fact that

$$ \frac{\partial Q(\vartheta \mid \vartheta^{(t)})}{\partial \vartheta} = 0 \tag{3} $$

has a closed form solution. If this is not the case, then the EM algorithm loses its basic merit of simplicity. In fact, it is easy to see that there is always a closed form solution for π given by

$$ \pi_k^{(t+1)} = \frac{\sum_{i=1}^{n}\tau_{ik}^{(t)}}{n} \tag{4} $$

for k = 1, ···, g.
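To make the E-step and the π-update (4) concrete, the following is a minimal R sketch (our own illustration, not the authors' code). Purely for concreteness it takes the component densities fik to be univariate normal with component means mu[k] and a common standard deviation s, a simplification of the general fik above; the function and variable names are hypothetical.

```r
## E-step: compute tau_ik^(t) and apply the pi-update (4) for a univariate normal mixture.
e_step <- function(y, pk, mu, s) {
  # n x g matrix with entries pi_k^(t) * f_ik(y_i; theta^(t))
  dens <- sapply(seq_along(pk), function(k) pk[k] * dnorm(y, mu[k], s))
  tau  <- dens / rowSums(dens)             # conditional probabilities tau_ik^(t)
  list(tau = tau, pi_new = colMeans(tau))  # update (4): pi_k^(t+1) = sum_i tau_ik^(t) / n
}

## Hypothetical usage on toy data drawn from two components
set.seed(1)
y  <- c(rnorm(100, 0, 1), rnorm(100, 3, 1))
es <- e_step(y, pk = c(0.5, 0.5), mu = c(-1, 4), s = 1)
head(es$tau); es$pi_new
```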

When the closed form solution for θ is not available, the EM algorithm requires another iterative algorithm, e.g., Newton–Raphson, in the M-step to solve (3). Many researchers find this second set of iterations to be inefficient. According to McLachlan and Krishnan (2008, p. 25 and pp. 149–153), Rai and Matthews (1993) proposed an EM1 algorithm in which they replaced the entire M-step of the EM algorithm with one iteration of Newton’s method given by

$$ \theta^{(t+1)} = \theta^{(t)} - \alpha^{(t)}\left[\frac{\partial^2 Q(\vartheta \mid \vartheta^{(t)})}{\partial\theta\,\partial\theta'}\right]^{-1}_{\vartheta^{(t)}}\left[\frac{\partial Q(\vartheta \mid \vartheta^{(t)})}{\partial\theta}\right]_{\vartheta^{(t)}} $$

where 0 < α ≤ 1 was used to adjust the step size in each iteration to ensure the monotonicity of L([theta](t)|Y) with respect to t. A special version of the EM1 algorithm with a(t) [equivalent] 1 was later referred to as the EM-gradient algorithm by Lange (1995a). In addition, Lange (1995a) considered choosing a(t) [equivalent] a to adjust the EM-gradient iterations and speed up convergence. This modification then served as a basis of a quasi-Newton acceleration introduced in Lange (1995b) where he noted that the EM-gradient actually acquired almost identical local convergence properties as the EM algorithm.

As another approach, Titterington (1984) proposed to update θ according to

$$ \theta^{(t+1)} = \theta^{(t)} - \alpha^{(t)}\left[I_c(\theta)\right]^{-1}_{\vartheta^{(t)}}\left[\frac{\partial Q(\vartheta \mid \vartheta^{(t)})}{\partial\theta}\right]_{\vartheta^{(t)}} $$

where

$$ I_c(\theta) = E_{\vartheta}\left[\frac{\partial^2 \log L(\vartheta \mid Y, Z)}{\partial\theta\,\partial\theta'}\right] $$

is the complete data information matrix with respect to θ. For a variety of models, e.g., mixtures with normal densities, Ic(θ) has a simpler form than ∂²Q(ϑ|ϑ(t))/∂θ∂θ′ in Rai and Matthews’ and Lange’s methods. In fact, it is not hard to show that under mild regularity conditions

$$ \left[I_c(\theta)\right]_{\vartheta^{(t)}} = E\left[\frac{\partial^2 Q(\vartheta \mid \vartheta^{(t)})}{\partial\theta\,\partial\theta'}\right]_{\vartheta^{(t)}}. \tag{5} $$

Hence Titterington’s algorithm can be thought of as a scoring version of the EM1 or EM-gradient algorithm. As with the EM1 algorithm, a fractional step achieved by a small enough α(t) will ensure the increase of L(ϑ(t)|Y).

Now let us take a detailed look at the matrices ∂²Q(ϑ|ϑ(t))/∂θ∂θ′ and Ic(θ) given by

$$ \frac{\partial^2 Q(\vartheta \mid \vartheta^{(t)})}{\partial\theta\,\partial\theta'} = \sum_{i=1}^{n}\sum_{k=1}^{g}\tau_{ik}^{(t)}\left[\frac{\partial^2 \log f_{ik}(y_i; \theta)}{\partial\theta\,\partial\theta'}\right] $$

and

$$ I_c(\theta) = \sum_{i=1}^{n}\sum_{k=1}^{g}\pi_k^{(t)}\, E\left[\frac{\partial^2 \log f_{ik}(y_i; \theta)}{\partial\theta\,\partial\theta'}\,\Big|\, z_{ik} = 1\right]. \tag{6} $$

Titterington’s algorithm is possibly superior to the EM1 algorithms in terms of speed of convergence when E[∂²log fik(yi; θ)/∂θ∂θ′ | zik = 1] is much simpler than ∂²log fik(yi; θ)/∂θ∂θ′. But it also suffers from using the unconditional clustering probabilities π1(t), ···, πg(t) in (6) instead of the conditional ones τi1(t), ···, τig(t), since it is clear that τi1(t), ···, τig(t) fit the data better than π1(t), ···, πg(t) do at the current stage. Because {τik(t)} is necessarily computed for (4) in every iteration, we propose, as a variant of the above algorithms, to update θ according to

$$ \theta^{(t+1)} = \theta^{(t)} - \alpha^{(t)}\left[H(\theta)\right]^{-1}_{\vartheta^{(t)}}\left[\frac{\partial Q(\vartheta \mid \vartheta^{(t)})}{\partial\theta}\right]_{\vartheta^{(t)}} \tag{7} $$

where

$$ H(\theta) = \sum_{i=1}^{n}\sum_{k=1}^{g}\tau_{ik}^{(t)}\, E\left[\frac{\partial^2 \log f_{ik}(y_i; \theta)}{\partial\theta\,\partial\theta'}\,\Big|\, z_{ik} = 1\right]. $$

As can be seen, the new modification shares the same numerical simplicity as Titterington’s algorithm. The self-consistency of the new algorithm is shown in Section 5. In the simulations in Section 6, the new algorithm with α(t) ≡ 1 is compared to the EM-gradient algorithm and Titterington’s algorithm in terms of their convergence speed and clustering accuracy.
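To make the structure of update (7) concrete, the following is a minimal R skeleton of one iteration (a sketch with hypothetical names, not the authors' code). Here score_fn and H_fn are user-supplied functions returning ∂Q/∂θ and the negative definite H(θ) of (7), evaluated at the current estimates with the conditional probabilities τik(t) passed in from the E-step.

```r
## One Newton-type step of update (7); "score_fn" and "H_fn" are hypothetical arguments.
update_theta <- function(theta, tau, score_fn, H_fn, alpha = 1) {
  s <- score_fn(theta, tau)            # dQ(theta | theta^(t)) / dtheta
  H <- H_fn(theta, tau)                # tau-weighted expected Hessian H(theta), negative definite
  theta - alpha * drop(solve(H, s))    # theta^(t+1) = theta^(t) - alpha * H^{-1} s
}
```

Supplying the observed ∂²Q/∂θ∂θ′ or Ic(θ) in place of H(θ) would recover, respectively, the EM-gradient step and Titterington's step within the same skeleton.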

4 An example involving structured models

Although the new algorithm developed in Section 3 is applicable in more general settings, in this section we introduce a special example involving structured models because these models are critical in analyzing the motivating databases. In addition, this example is suitable for demonstrating some basic advantages of the new algorithm in analyzing complex databases.

As noted by Wu and Sampson (2008), several special considerations arise when we attempt to combine data from multiple studies in the Conte Center. First, by December 31, 2005, about 35 separate post-mortem tissue studies had been conducted. These studies involve different neurobiological measurements, such as neuron counts, neuron somal volume and certain mRNA expression levels, on differing subject populations. As a result, the combined data would involve multiple responses and a considerable degree of missingness. Carefully crafted multiple imputation schemes will be required to deal with the missing data. In this paper, we develop an efficient complete data clustering algorithm which will be implemented later in conjunction with multiple imputations. Second, in order to control for both experimental and demographic variation, every subject with schizophrenia has been matched with a normal control subject in each study based on their ages at death, gender and post-mortem intervals. Paired differences between measurements on the subjects with schizophrenia and the corresponding controls are typically obtained and treated as primary data in the original analyses. This convention is adopted here. Nevertheless, the matched controls for a subject with schizophrenia might be different in different studies. To be more explicit, let Si1, ···, Sip and Ci1, ···, Cip be p neurobiological measurements on, respectively, the ith subject with schizophrenia and its corresponding controls for i = 1, 2, ···, n. These measurements are most likely obtained from more than one study, and there is a chance that Ci1, ···, Cip are from different subjects. The pairwise differences are defined to be Si1 − Ci1, ···, Sip − Cip for i = 1, 2, ···, n. Observations from the same subjects are assumed to be correlated, while observations from different subjects are taken to be independent. As a result, we have

$$ \mathrm{Cov}(S_{ij} - C_{ij},\, S_{ik} - C_{ik}) = \mathrm{Cov}(S_{ij}, S_{ik}) + \mathrm{Cov}(C_{ij}, C_{ik}) \quad \text{for } j \ne k. $$

When Cij and Cik are from the same subject, Cov(Cij, Cik) ≠ 0; otherwise, Cov(Cij, Cik) = 0. Finally, as is common in studies with human subjects, covariates, such as age and gender, are involved in the Center’s studies. The means of the multivariate responses yi = (Si1 − Ci1, ···, Sip − Cip)′, i = 1, 2, ···, n, are then assumed to be linear in the covariates, that is

$$ E[y_i \mid X_i] = X_i\beta, $$

where Xi is the covariates matrix whose rows correspond to the components of yi, and β is a vector of unknown parameters. This mean structure allows some covariate values to vary from measurement to measurement. Examples of such covariates include tissue quality or solution density, which are study dependent.

Wu and Sampson (2008) developed multivariate normal models with structured means and covariance matrices to deal with the special considerations in the Center’s databases. The structured means resulted from the covariates of the subjects with schizophrenia, whereas the structured covariance matrices were due to the differing control subjects that were matched to the same subjects with schizophrenia when they were involved in different studies. Consider the responses y1, ···, yn representing the pairwise differences between the n subjects with schizophrenia and their corresponding controls. Wu and Sampson (2008) defined fi(yi; θ) = φ(yi; Xiβ, Σi(σ)), i = 1, …, n, as the corresponding density functions, where X1, ···, Xn are the known covariates matrices and θ = (β′, σ′)′. The subscript i of fi(yi; θ) denotes its dependency on i through Xi and Σi(σ). Each Σi(σ) was formulated to be linear in σ = (σ11, ···, σpp, σ12, ···, σ(p−1)p, σ12c, ···, σ(p−1)pc)′, i.e.,

$$ \Sigma_i(\sigma) = \sum_{j=1}^{p}\sigma_{jj} G_{jj} + \sum_{1 \le j < k \le p}\left(\sigma_{jk} + \sigma_{jk}^{c}\, I_{ijk}\right) G_{jk} $$

where Gjk, 1 ≤ j ≤ k ≤ p, are known matrices with all “0” entries except a “1” at both the (j, k)th and the (k, j)th entries, and Iijk = 1 if the controls matched to the ith subject with schizophrenia for the jth and the kth measurements are the same; otherwise, Iijk = 0 (Wu and Sampson, 2008). Explicitly speaking, σjj is the sum of the jth measurement variances for the subject with schizophrenia and the control, whereas σjk and σjkc are the (j, k)th measurement covariances on the subject with schizophrenia and the control, respectively. The term σjkc is added onto the (j, k)th covariance of the pairwise differences when Iijk = 1, i.e., when Cij and Cik are from the same subject. For data with p-dimensional responses, there are a total of 2^p − p possible matching schemes between the subjects with schizophrenia and the controls. But most probably, not all of these matching schemes appear in one data set. In such cases, Iijk ≡ 0 or 1 for i = 1, 2, ···, n for some 1 ≤ j < k ≤ p. If Iijk ≡ 0, then σjk is estimable, but σjkc is not; and if Iijk ≡ 1, then σjk + σjkc is estimable, but the individual summands are not. Thus, the number of parameters in σ needs to be reduced accordingly. However, in the following discussion we assume for simplicity that 0 < $\sum_{i=1}^{n} I_{ijk}$ < n for all 1 ≤ j < k ≤ p so that all parameters in σ are estimable.
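As an illustration of this covariance structure, the following R sketch (our own, with hypothetical names and purely illustrative parameter values) builds Σi(σ) for p = 3 from the basis matrices Gjk and the indicators Iijk, with σ ordered as (σ11, ···, σpp, σ12, ···, σ(p−1)p, σ12c, ···, σ(p−1)pc)′ as in the text.

```r
## Build Sigma_i(sigma) = sum_j sigma_jj G_jj + sum_{j<k} (sigma_jk + sigma^c_jk I_ijk) G_jk
make_G <- function(j, k, p) {
  G <- matrix(0, p, p)
  G[j, k] <- G[k, j] <- 1              # "1" at the (j,k)th and (k,j)th entries
  G
}

Sigma_i <- function(sigma, I_i, p = 3) {
  pairs <- t(combn(p, 2))              # all pairs (j, k) with j < k
  q1 <- p; q2 <- nrow(pairs)
  S <- matrix(0, p, p)
  for (j in 1:p)                       # variance terms sigma_jj * G_jj
    S <- S + sigma[j] * make_G(j, j, p)
  for (m in 1:q2) {                    # covariance terms (sigma_jk + sigma^c_jk * I_ijk) * G_jk
    j <- pairs[m, 1]; k <- pairs[m, 2]
    S <- S + (sigma[q1 + m] + sigma[q1 + q2 + m] * I_i[m]) * make_G(j, k, p)
  }
  S
}

## Hypothetical usage with illustrative values: the controls for measurements 2 and 3
## come from the same subject, so I_i23 = 1
sig <- c(4, 4, 4, 1, 1, 1, 0.5, 0.5, 0.5)
Sigma_i(sig, I_i = c(0, 0, 1))
```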

Following the model specification in Wu and Sampson (2008), in this example we consider a finite mixture model with component density functions fik(yi; θ) = φ(yi; Xiβk, Σi(σ)) for i = 1, …, n and k = 1, …, g, where θ = (β1′, ···, βg′, σ′)′. The subscript k of fik(yi; θ) identifies the kth mixture component. The purpose is to cluster the subjects with schizophrenia into possible subpopulations. In this model, the clusters are defined only in terms of the regression parameters β1, ···, βg; there is no biological reason suggesting that the value of σ should differ over the clusters. For notational ease, we relabel the parameters in σ as σ = (σ1, ···, σq)′. By using some well-known matrix derivative results, we have

$$ \frac{\partial Q(\vartheta \mid \vartheta^{(t)})}{\partial\beta_k} = \sum_{i=1}^{n}\tau_{ik}^{(t)}\, X_i'\Sigma_i^{-1}(y_i - X_i\beta_k), \quad k = 1, \ldots, g, \tag{8} $$

$$ \frac{\partial Q(\vartheta \mid \vartheta^{(t)})}{\partial\sigma_j} = \frac{1}{2}\sum_{i=1}^{n}\operatorname{tr}\!\left(\Sigma_i^{-1}\frac{\partial\Sigma_i}{\partial\sigma_j}\Sigma_i^{-1}\Big(\sum_{k=1}^{g}\tau_{ik}^{(t)} C_{ik} - \Sigma_i\Big)\right), \quad j = 1, \ldots, q, \tag{9} $$

where Cik = (yi − Xiβk)(yi − Xiβk)′. Continuing to take partial derivatives of (8) and (9) yields

$$ \frac{\partial^2 Q(\vartheta \mid \vartheta^{(t)})}{\partial\beta_k\,\partial\beta_k'} = -\sum_{i=1}^{n}\tau_{ik}^{(t)}\, X_i'\Sigma_i^{-1}X_i, \quad 1 \le k \le g, $$

$$ \frac{\partial^2 Q(\vartheta \mid \vartheta^{(t)})}{\partial\sigma_j\,\partial\sigma_l} = \frac{1}{2}\sum_{i=1}^{n}\operatorname{tr}\!\left(\Sigma_i^{-1}\frac{\partial\Sigma_i}{\partial\sigma_j}\Sigma_i^{-1}\frac{\partial\Sigma_i}{\partial\sigma_l}\Sigma_i^{-1}\Big(\Sigma_i - 2\sum_{k=1}^{g}\tau_{ik}^{(t)} C_{ik}\Big)\right) + \frac{1}{2}\sum_{i=1}^{n}\operatorname{tr}\!\left(\Sigma_i^{-1}\frac{\partial^2\Sigma_i}{\partial\sigma_j\,\partial\sigma_l}\Sigma_i^{-1}\Big(\sum_{k=1}^{g}\tau_{ik}^{(t)} C_{ik} - \Sigma_i\Big)\right), \quad 1 \le j, l \le q, $$

$$ \frac{\partial^2 Q(\vartheta \mid \vartheta^{(t)})}{\partial\beta_j\,\partial\beta_k'} = 0, \quad 1 \le k \ne j \le g, $$

$$ \frac{\partial^2 Q(\vartheta \mid \vartheta^{(t)})}{\partial\beta_k\,\partial\sigma_j} = -\sum_{i=1}^{n}\tau_{ik}^{(t)}\, X_i'\Sigma_i^{-1}\frac{\partial\Sigma_i}{\partial\sigma_j}\Sigma_i^{-1}(y_i - X_i\beta_k), \quad 1 \le k \le g,\; 1 \le j \le q. $$

On the other hand,

$$ [I_c(\theta)]_{\beta_k, \beta_k} = -\sum_{i=1}^{n}\pi_k^{(t)}\, X_i'\Sigma_i^{-1}X_i, \quad 1 \le k \le g, $$

$$ [I_c(\theta)]_{\sigma_j, \sigma_l} = -\frac{1}{2}\sum_{i=1}^{n}\operatorname{tr}\!\left(\Sigma_i^{-1}\frac{\partial\Sigma_i}{\partial\sigma_j}\Sigma_i^{-1}\frac{\partial\Sigma_i}{\partial\sigma_l}\right), \quad 1 \le j, l \le q, $$

$$ [I_c(\theta)]_{\beta_j, \beta_k} = [I_c(\theta)]_{\beta_k, \sigma_l} = 0, \quad 1 \le k \ne j \le g,\; 1 \le l \le q. $$

Thus, our newly proposed algorithm (7) with α(t) ≡ 1 yields an update given by

$$ \beta_k^{(t+1)} = \left[\sum_{i=1}^{n}\tau_{ik}^{(t)}\, X_i'\Sigma_i^{-1}X_i\right]^{-1}_{\vartheta^{(t)}}\left[\sum_{i=1}^{n}\tau_{ik}^{(t)}\, X_i'\Sigma_i^{-1}y_i\right]_{\vartheta^{(t)}}, \quad k = 1, \ldots, g, \tag{10} $$

$$ \sigma^{(t+1)} = \left[\sum_{i=1}^{n}\operatorname{tr}\!\left(\Sigma_i^{-1}\frac{\partial\Sigma_i}{\partial\sigma_j}\Sigma_i^{-1}\frac{\partial\Sigma_i}{\partial\sigma_l}\right)\right]^{-1}_{\vartheta^{(t)}}\left[\sum_{i=1}^{n}\operatorname{tr}\!\left(\Sigma_i^{-1}\frac{\partial\Sigma_i}{\partial\sigma_l}\Sigma_i^{-1}\Big(\sum_{k=1}^{g}\tau_{ik}^{(t)} C_{ik}\Big)\right)\right]_{\vartheta^{(t)}} + \sigma^{(t)} - \left[\sum_{i=1}^{n}\operatorname{tr}\!\left(\Sigma_i^{-1}\frac{\partial\Sigma_i}{\partial\sigma_j}\Sigma_i^{-1}\frac{\partial\Sigma_i}{\partial\sigma_l}\right)\right]^{-1}_{\vartheta^{(t)}}\left[\sum_{i=1}^{n}\operatorname{tr}\!\left(\Sigma_i^{-1}\frac{\partial\Sigma_i}{\partial\sigma_l}\right)\right]_{\vartheta^{(t)}}. \tag{11} $$

When Σi is linear in the components of σ as described earlier, (11) becomes

$$ \sigma^{(t+1)} = \left[\sum_{i=1}^{n}\operatorname{tr}\!\left(\Sigma_i^{-1}\frac{\partial\Sigma_i}{\partial\sigma_j}\Sigma_i^{-1}\frac{\partial\Sigma_i}{\partial\sigma_l}\right)\right]^{-1}_{\vartheta^{(t)}}\left[\sum_{i=1}^{n}\operatorname{tr}\!\left(\Sigma_i^{-1}\frac{\partial\Sigma_i}{\partial\sigma_l}\Sigma_i^{-1}\Big(\sum_{k=1}^{g}\tau_{ik}^{(t)} C_{ik}\Big)\right)\right]_{\vartheta^{(t)}}. \tag{12} $$

Clearly, the quantities in (10) are the maximum likelihood estimates of the β’s when the individual clustering probabilities and the covariance matrices are assumed to be known. The calculation of (12) is easier than that of the corresponding update in the EM-gradient algorithm and as easy as that of the corresponding update in Titterington’s algorithm.
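For intuition, a minimal R sketch of update (10) follows (our own illustration, not the authors' code): the βk update is simply a τ-weighted generalized least squares fit. Here X, y and Sigma are hypothetical lists holding Xi, yi and the current Σi(σ(t)), and tau is the n × g matrix of τik(t) from the E-step; the σ update (12) can be coded analogously from its trace expressions.

```r
## Update (10): beta_k^(t+1) = [sum_i tau_ik X_i' S_i^{-1} X_i]^{-1} [sum_i tau_ik X_i' S_i^{-1} y_i]
update_beta_k <- function(X, y, Sigma, tau, k) {
  r <- ncol(X[[1]])
  A <- matrix(0, r, r)
  b <- numeric(r)
  for (i in seq_along(y)) {
    W <- solve(Sigma[[i]])                         # Sigma_i^{-1} at the current sigma^(t)
    A <- A + tau[i, k] * t(X[[i]]) %*% W %*% X[[i]]
    b <- b + tau[i, k] * t(X[[i]]) %*% W %*% y[[i]]
  }
  drop(solve(A, b))                                # beta_k^(t+1)
}
```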

5 Local convergence properties

It was shown by Dempster, Laird, and Rubin (1977) that L(ϑ(t+1)|Y) ≥ L(ϑ(t)|Y) as long as Q(ϑ(t+1)|ϑ(t)) ≥ Q(ϑ(t)|ϑ(t)). So the convergence of the EM algorithm to a local maximum of L(ϑ|Y) is guaranteed under mild regularity conditions. In addition, it can be shown that

[2Q(ϑϑ(t))θθ]ϑ(t)<[2logL(ϑY)θθ]ϑ(t).

So the negative definiteness of [∂²Q(ϑ|ϑ(t))/∂θ∂θ′]ϑ(t) around a local maximum follows from the negative definiteness of [∂²log L(ϑ|Y)/∂θ∂θ′]ϑ(t). The negative definiteness of [Ic(θ)]ϑ(t) around a local maximum is also trivial. As a result, both Titterington’s and the EM1 algorithms are necessarily ascent algorithms in the neighborhood of a local maximum, which means that there always exists at least one 0 < α(t) ≤ 1 which leads to an increase in Q(ϑ|ϑ(t)).

The same ascent property can also be established for our newly proposed algorithm. First, the negative definiteness of [H(θ)]ϑ(t) is guaranteed, since every individual term

$$ E\left[\frac{\partial^2 \log f_{ik}(y_i; \theta)}{\partial\theta\,\partial\theta'}\,\Big|\, z_{ik} = 1\right]_{\vartheta^{(t)}} $$

is negative definite. Here we can focus only on the parameter θ, because (4), as a solution to (3), automatically leads to an increase in Q(ϑ|ϑ(t)) by the GEM theory. As a result, given the values of π1(t), ···, πg(t), we rewrite Q(ϑ|ϑ(t)) as Q(θ|θ(t)) and consider the difference Q(θ(t+1)|θ(t)) − Q(θ(t)|θ(t)). A Taylor expansion around θ(t) gives

$$ Q(\theta^{(t+1)} \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)}) = \left[\frac{\partial Q(\theta \mid \theta^{(t)})}{\partial\theta}\right]'_{\theta^{(t)}}(\theta^{(t+1)} - \theta^{(t)}) + \frac{1}{2}(\theta^{(t+1)} - \theta^{(t)})'\left[\frac{\partial^2 Q(\theta \mid \theta^{(t)})}{\partial\theta\,\partial\theta'}\right]_{\theta^*}(\theta^{(t+1)} - \theta^{(t)}), \tag{13} $$

where θ* is between θ(t+1) and θ(t). Now plugging our new update (7) into (13), we have

$$ \begin{aligned} Q(\theta^{(t+1)} \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)}) &= -\alpha^{(t)}\left[\frac{\partial Q(\theta \mid \theta^{(t)})}{\partial\theta}\right]'_{\theta^{(t)}}[H(\theta)]^{-1}_{\theta^{(t)}}\left[\frac{\partial Q(\theta \mid \theta^{(t)})}{\partial\theta}\right]_{\theta^{(t)}} \\ &\quad + \frac{1}{2}(\alpha^{(t)})^2\left[\frac{\partial Q(\theta \mid \theta^{(t)})}{\partial\theta}\right]'_{\theta^{(t)}}[H(\theta)]^{-1}_{\theta^{(t)}}\left[\frac{\partial^2 Q(\theta \mid \theta^{(t)})}{\partial\theta\,\partial\theta'}\right]_{\theta^*}[H(\theta)]^{-1}_{\theta^{(t)}}\left[\frac{\partial Q(\theta \mid \theta^{(t)})}{\partial\theta}\right]_{\theta^{(t)}} \\ &= \left[\frac{\partial Q(\theta \mid \theta^{(t)})}{\partial\theta}\right]'_{\theta^{(t)}}\left(\frac{1}{2}(\alpha^{(t)})^2[H(\theta)]^{-1}_{\theta^{(t)}}\left[\frac{\partial^2 Q(\theta \mid \theta^{(t)})}{\partial\theta\,\partial\theta'}\right]_{\theta^*}[H(\theta)]^{-1}_{\theta^{(t)}} - \alpha^{(t)}[H(\theta)]^{-1}_{\theta^{(t)}}\right)\left[\frac{\partial Q(\theta \mid \theta^{(t)})}{\partial\theta}\right]_{\theta^{(t)}}. \end{aligned} $$

The above quantity is a quadratic form. In order to have Q(θ(t+1)|θ(t)) − Q(θ(t)|θ(t)) ≥ 0, it is sufficient that

$$ \frac{1}{2}(\alpha^{(t)})^2[H(\theta)]^{-1}_{\theta^{(t)}}\left[\frac{\partial^2 Q(\theta \mid \theta^{(t)})}{\partial\theta\,\partial\theta'}\right]_{\theta^*}[H(\theta)]^{-1}_{\theta^{(t)}} - \alpha^{(t)}[H(\theta)]^{-1}_{\theta^{(t)}} \succeq 0. \tag{14} $$

Since both [H(θ)]θ(t) and [∂²Q(θ|θ(t))/∂θ∂θ′]θ* are negative definite matrices, inequality (14) can be reduced to

$$ \alpha^{(t)} I \preceq \left[-\frac{\partial^2 Q(\theta \mid \theta^{(t)})}{\partial\theta\,\partial\theta'}\right]^{-1/2}_{\theta^*}\left(-2[H(\theta)]_{\theta^{(t)}}\right)\left[-\frac{\partial^2 Q(\theta \mid \theta^{(t)})}{\partial\theta\,\partial\theta'}\right]^{-1/2}_{\theta^*} \tag{15} $$

where I is the identity matrix with the same dimension as H(θ). It is not hard to show that inequality (15) is satisfied when α(t) is chosen to be smaller than the smallest eigenvalue of the matrix on the right-hand side of (15). Consequently, a value of α(t) selected in this way always leads to an increase in Q(θ|θ(t)) and hence in L(ϑ|Y).
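A small R sketch of this step-size rule follows (an illustration under our own naming, not part of the paper): given the two negative definite curvature matrices, it computes the smallest eigenvalue of the right-hand side of (15) and caps α(t) at that value.

```r
## Step-size bound implied by (15): alpha^(t) <= smallest eigenvalue of
## [-d2Q]^{-1/2} (-2 H) [-d2Q]^{-1/2}; "d2Q" and "H" are hypothetical inputs.
alpha_bound <- function(d2Q, H) {
  e <- eigen(-d2Q, symmetric = TRUE)                       # -d2Q is positive definite
  inv_sqrt <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
  M <- inv_sqrt %*% (-2 * H) %*% inv_sqrt                  # right-hand side of (15)
  min(eigen(M, symmetric = TRUE, only.values = TRUE)$values)
}

## Hypothetical usage with 2 x 2 curvature matrices
d2Q <- -matrix(c(4, 1, 1, 3), 2, 2)
H   <- -matrix(c(5, 1, 1, 4), 2, 2)
alpha_t <- min(1, alpha_bound(d2Q, H))                     # keep alpha^(t) in (0, 1]
```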

As another usage, α(t) can be adjusted to ensure that the parameter estimates fall in the parameter space. This is usually called step-halving. For example, one concern is that the estimated covariance matrices should be positive definite. By using a small α(t) in (7), we are actually shrinking ϑ(t+1) toward ϑ(t). Usually the parameter space is an open set, so given that ϑ(t) is in the parameter space, there will always be an α(t) small enough to guarantee that ϑ(t+1) will also be in the parameter space. The direction of the inequality in (15) enables us to apply step-halving while guaranteeing an increase in L(ϑ|Y). However, using a small α(t) reduces the speed of convergence.

6 Simulations

Carefully crafted multiple imputation schemes will be required in the last step of our long-term project, when we attempt clustering in the presence of the substantial amount of missing data in the database from the Center’s studies. The computational speed of the clustering component is critical in the long-term project, and we intend to use our new algorithm for this purpose. In this section, we show using simulations that our algorithm is computationally faster than both Titterington’s and the EM-gradient algorithms. Data are simulated to be complete except for the usual missing clustering indicators. A fixed value of α(t) = 1 is used. All algorithms are coded and run in the R language.

Data with a structure conforming to the one described in Section 4 are simulated for the clustering analysis. The dimension of the outcomes y1, · · ·, yn is assumed to be three. According to the component density functions

$$ f_{ik}(y_i; \theta) = \varphi(y_i; X_i\beta_k, \Sigma_i(\sigma)) $$

for i = 1, …, n and k = 1, 2, data sets containing data from two clusters are simulated. Each data set contains n = 500 subjects with 250 in each cluster. The covariates matrix Xi is of the form

$$ X_i = \begin{bmatrix} 1 & x_1 & x_2 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & x_1 & x_2 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & x_1 & x_2 \end{bmatrix} $$

for i = 1, …, n. The constant ‘1’ in the Xi’s represents the overall difference between the subjects with schizophrenia and the controls. The values of x1 are integers sampled uniformly from 20 to 80 to mimic the covariate age, and the values of x2 are sampled uniformly from {0, 1} to simulate the covariate gender. This design of Xi allows the covariate effects on the three components of yi to be different. No study-dependent covariate is considered in this simulation study. The two clusters differ only in the parameters for the mean structures, and we let

$$ \beta_1 = (\beta^1_{11}, \beta^1_{12}, \beta^1_{13}, \beta^1_{21}, \beta^1_{22}, \beta^1_{23}, \beta^1_{31}, \beta^1_{32}, \beta^1_{33})' = (100, 2, 50, 50, 2, 50, 50, 1, 50)' $$
$$ \beta_2 = (\beta^2_{11}, \beta^2_{12}, \beta^2_{13}, \beta^2_{21}, \beta^2_{22}, \beta^2_{23}, \beta^2_{31}, \beta^2_{32}, \beta^2_{33})' = (100, 2, 50, 50, 2, 50, 50, 1, 50)' $$

be those parameters for the two clusters, respectively. Explicitly speaking, $(\beta^1_{j1}, \beta^1_{j2}, \beta^1_{j3})$ are the regression parameters for the jth component of yi in the first cluster for j = 1, 2, 3, whereas $(\beta^2_{j1}, \beta^2_{j2}, \beta^2_{j3})$ are the regression parameters for the jth component of yi in the second cluster. In addition, let

$$ \sigma = (\sigma_{11}, \sigma_{22}, \sigma_{33}, \sigma_{12}, \sigma_{13}, \sigma_{23}, \sigma^c_{12}, \sigma^c_{13}, \sigma^c_{23})' = (1000, 1500, 1000, 400, 500, 600, 200, -100, -200)' $$

which creates five possible individual covariance matrices for the outcomes as follows:

$$ \Sigma^{(1)} = \begin{bmatrix} 1000 & 400 & 500 \\ 400 & 1500 & 600 \\ 500 & 600 & 1000 \end{bmatrix}, \quad \Sigma^{(2)} = \begin{bmatrix} 1000 & 600 & 500 \\ 600 & 1500 & 600 \\ 500 & 600 & 1000 \end{bmatrix}, \quad \Sigma^{(3)} = \begin{bmatrix} 1000 & 400 & 400 \\ 400 & 1500 & 600 \\ 400 & 600 & 1000 \end{bmatrix}, $$
$$ \Sigma^{(4)} = \begin{bmatrix} 1000 & 400 & 500 \\ 400 & 1500 & 400 \\ 500 & 400 & 1000 \end{bmatrix}, \quad \Sigma^{(5)} = \begin{bmatrix} 1000 & 600 & 400 \\ 600 & 1500 & 400 \\ 400 & 400 & 1000 \end{bmatrix}. $$

In this simulation study, we assume that the five possible matching schemes between subjects with schizophrenia and controls are equally likely. As a result, within each data set and each cluster, each of the above covariance matrices applies to 50 subjects. Although this assumption will most likely be violated in reality, the algorithm can be implemented in the same way.
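For completeness, a minimal R sketch of the data-generating step is given below (our own illustration; the function names, and the use of MASS::mvrnorm, are assumptions rather than the authors' code). It draws one subject with the block-diagonal Xi above, covariates x1 and x2, and a covariance matrix chosen among the five matching schemes, taken as equally likely as stated in the text.

```r
## Simulate one subject for the design of Section 6.
library(MASS)   # for mvrnorm

simulate_subject <- function(beta, Sigma_list) {
  x1 <- sample(20:80, 1)                                     # covariate "age"
  x2 <- sample(0:1, 1)                                       # covariate "gender"
  Xi <- kronecker(diag(3), matrix(c(1, x1, x2), nrow = 1))   # 3 x 9 block-diagonal X_i
  Si <- Sigma_list[[sample(length(Sigma_list), 1)]]          # one of the five matching schemes
  list(y = mvrnorm(1, mu = drop(Xi %*% beta), Sigma = Si), X = Xi)
}

## Hypothetical usage: "beta1", "beta2" and "Sigma_list" hold the values given above;
## 250 subjects would be drawn from each cluster.
# dat <- c(replicate(250, simulate_subject(beta1, Sigma_list), simplify = FALSE),
#          replicate(250, simulate_subject(beta2, Sigma_list), simplify = FALSE))
```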

Direct applications of both the EM-gradient algorithm and Titterington’s algorithm to our simulated data are time consuming. The three algorithms are implemented on a random selection of 30 out of the 500 simulated data sets and shown to provide the same parameter estimates. As a result, only our new algorithm is used for the parameter estimation on the remaining 470 simulated data sets.

First, we examine in detail their computational speed. Each of the three algorithms is implemented from the same starting values to find the parameter estimates. To make the comparison fair, the three algorithms are stopped according to the same criterion, that is, when the change in the parameter estimates does not exceed a pre-defined limit. We observe that when the algorithms are started from near the true parameter values, they converge in almost the same number of steps. However, when the algorithms are started far from the true parameter values, they behave differently. A typical result from one of the thirty selected data sets is shown in Figure 1. The x-axis represents the number of iterations, while the y-axis represents the value of the observed log-likelihood function evaluated at the parameter estimates in each iteration. Some beginning iteration history for the three algorithms with low values (large negative values) of the log-likelihood function is not shown in Figure 1 for ease of comparison. It can be seen that the EM-gradient algorithm converges in more than 80 steps, while Titterington’s algorithm converges in about 65 steps. However, our new algorithm requires only about 25 steps to converge, which is a considerable advantage compared to the other two. In addition to the results on the numbers of iterations before convergence, we also observe that the average time per iteration is 4.67 seconds for the EM-gradient algorithm and about 0.7 seconds for both Titterington’s algorithm and our new algorithm. However, we do recognize that the average time per iteration depends on the coding of the algorithms, the computing software and the hardware configuration. From Figure 1, the main feature of the new algorithm is that it requires significantly fewer iterations to find the region containing a maximum when started far from the optimum, while its number of steps for the subsequent “local refinement” is comparable to the two existing algorithms. In addition, we show in Table 1 the average numbers of steps to convergence and the mean per iteration time, as well as their corresponding standard deviations, for the three algorithms over the thirty selected data sets. The results confirm the above conclusion that our new algorithm is computationally more efficient than both the EM-gradient and Titterington’s algorithms.

Fig. 1
Iteration history of the clustering algorithms: (a) the new algorithm; (b) Titterington’s (1984) algorithm; (c) the EM-gradient algorithm. The mean per iteration time, computed as the average of the computational time required for ...
Table 1
Average numbers of steps to convergence and mean per iteration time for the thirty simulated data sets

For our current simulations, two different approaches to choosing starting points are used for the purpose of demonstration. One approach is to choose parameters close to the true parameter values, and the other is to start the algorithm from randomly generated clustering indices, i.e., a random starting point. For any single simulated data set, we define the final clustering result to be “successful” if the algorithm clusters more than 95% of its subjects correctly. For the 500 simulated data sets, we observe that by starting from near the true parameter values we obtain “successful” clustering results on 100% of the simulated data sets, while by starting from a random point we obtain “successful” clustering results for about 95% of the simulated data sets. For the other 5% of the simulated data sets, the algorithm either does not converge (1.4%) or converges (3.6%) to a solution in which the subjects are clustered completely at random. For those data sets with “successful” clustering results when starting from random clustering indices, we summarize the results of the parameter estimation in Table 2, compared to the true parameter values. It can be seen that the parameter estimation is reasonably accurate when the algorithm finds the correct clusters. In fact, these results are surprisingly good. Typically, no one relies on a single random starting point if one has no information about where to start. In order to increase the chance of identifying the correct clustering, we could always start the algorithm from multiple starting points and pick the solution maximizing the likelihood function as the result.

Table 2
A Summary of the parameter estimates of the clustering simulations when the true clusters are identified

7 Conclusions

In this paper, some special features of finite mixture models, as compared to general missing data problems, are utilized to speed up the parameter estimation algorithms. The EM-gradient algorithm provides updates which fit the data better in each iteration, while Titterington’s algorithm requires less calculation in each iteration. Our new algorithm combines both advantages and is shown to inherit their nice local convergence properties. In addition, we show by simulations that our new algorithm converges in fewer iterations than its two predecessors while providing the same parameter estimates, and the time per iteration of our new algorithm should be comparable to that of Titterington’s algorithm and lower than that of the EM-gradient algorithm. As discussed, the database of the Center’s studies which our ultimate project will use involves a great degree of missing data in addition to the unobserved clustering indicators. The next steps in our long-term goal of clustering subjects with schizophrenia are to develop specially crafted multiple imputation techniques, implement the newly developed clustering algorithm on the multiply imputed data sets, and finally integrate the multiple clustering results into a single clustering result.

Acknowledgments

We thank Dr. David Lewis for his insights and for allowing us to use the data obtained in his lab. We also thank the Center researchers, especially Dr. Takanori Hashimoto, for their generous help. This research was supported by NIMH Grants 5P50MH045156-18 and P50 MH084053-01. We finally thank the referees for their insightful comments.


References

  • Arminger G, Stein P, Wittenberg J. Mixtures of conditional mean-and covariance-structure models. Psychometrika. 1999;64 (4):475–494.
  • Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B. 1977;39 (1):1–38.
  • DeSarbo WS, Cron WL. A maximum likelihood methodology for clusterwise linear regression. Journal of Classification. 1988;5:249–282.
  • Johnson RA, Wichern DW. Applied multivariate statistical analysis. 5. Prentice-Hall, Inc.; Upper Saddle River, NJ: 2002.
  • Jones PN, McLachlan GJ. Fitting finite mixture models in a regression context. Australian Journal of Statistics. 1992;34 (2):233–240.
  • Knable MB, Barci BM, Bartko JJ, Webster MJ, Torrey EF. Molecular abnormalities in the major psychiatric illnesses: Classification and regression tree (CRT) analysis of post-mortem prefrontal markers. Molecular Psychiatry. 2002;7:392–404.
  • Knable MB, Torrey EF, Webster MJ, Bartko JJ. Multivariate analysis of prefrontal cortical data from the Stanley Foundation Neuropathology Consortium. Brain Research Bulletin. 2001;55(5):651–659.
  • Konopaske GT, Sweet RA, Wu Q, Sampson AR, Lewis DA. Regional specificity of chandelier neuron axon terminal alterations in schizophrenia. Neuroscience. 2006;138(1)
  • Lange K. A gradient algorithm locally equivalent to the EM algorithm. Journal of the Royal Statistical Society B. 1995a;57(2):425–437.
  • Lange K. A quasi-Newton acceleration of the EM algorithm. Statistica Sinica. 1995b;5:1–18.
  • McLachlan GJ, Basford KE. Mixture models: inference and applications to clustering. Dekker; New York: 1988.
  • McLachlan GJ, Krishnan T. The EM Algorithm and Extensions. 2. Wiley; New York: 2008.
  • Rai SN, Matthews DE. Improving the EM algorithm. Biometrics. 1993;49:587–591.
  • Titterington DM. Recursive parameter estimation using incomplete data. Journal of the Royal Statistical Society B. 1984;46:257–267.
  • Wu Q, Sampson AR. Structured modeling with applications to integrative analysis of post-mortem tissue studies in schizophrenia, unpublished. Dec, 2008. URL http://personal.ecu.edu/wuq/media/structuredmodel.pdf.