Search tips
Search criteria 


Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
Electron J Stat. Author manuscript; available in PMC 2010 May 11.
Published in final edited form as:
Electron J Stat. 2009 January 1; 3: 1473–1496.
doi:  10.1214/09-EJS487
PMCID: PMC2867492

Penalized model-based clustering with unconstrained covariance matrices

Hui Zhou and Wei Pan*
Division of Biostatistics, School of Public Health, University of Minnesota ; ude.nmu@292xuohz


Clustering is one of the most useful tools for high-dimensional analysis, e.g., for microarray data. It becomes challenging in presence of a large number of noise variables, which may mask underlying clustering structures. Therefore, noise removal through variable selection is necessary. One effective way is regularization for simultaneous parameter estimation and variable selection in model-based clustering. However, existing methods focus on regularizing the mean parameters representing centers of clusters, ignoring dependencies among variables within clusters, leading to incorrect orientations or shapes of the resulting clusters. In this article, we propose a regularized Gaussian mixture model permitting a treatment of general covariance matrices, taking various dependencies into account. At the same time, this approach shrinks the means and covariance matrices, achieving better clustering and variable selection. To overcome one technical challenge in estimating possibly large covariance matrices, we derive an E-M algorithm utilizing the graphical lasso (Friedman et al 2007) for parameter estimation. Numerical examples, including applications to microarray gene expression data, demonstrate the utility of the proposed method.

Keywords: Covariance estimation, EM algorithm, Gaussian graphical models, High-dimension but low-sample size, L1 penalization, Normal mixtures, Penalized likelihood, Semi-supervised learning

1. Introduction

As an important tool of data analysis, clustering has emerged as indispensable to analyzing high-dimensional genomic data. For example, in gene function discovery, various methods of clustering have been applied to group genes with their expressions across multiple conditions (Eisen et al 1998 [(9)]; Tavazoie et al 1999 [(39)]); in disease subtype discovery, clustering is used to cluster patients’ tissue samples with their genomic expressions (Golub et al 1999 [(1)]). In this process, because of unknown identities of many relevant genes and/or experimental conditions it is necessary to select informative genes or conditions to yield meaningful clusters. Such a task of variable selection is critical not only to clustering but also to other modeling strategies such as classification. For classification, Alaiya et al (2002) [(1)] studied borderline ovarian tumor classification, where classification using all 1584 protein spots is unsatisfactory but that focusing on a subset of around 200 selected spots provided more accurate results. For clustering, there have been only a limited number of studies, in contrast to a large body of literature on variable selection for classification and regression. As pointed out in Pan and Shen (2007) [(30)], clustering imposes many unique challenges to variable selection in that some well known model selection procedures, e.g. best subset selection with BIC (Schwarz 1978 [(36)]), may not be applicable to clustering, which is unlike in classification and regression. One main reason is that in general there are many true models in clustering, most of which are not useful. For example, any true noise variable may suggest the existence of only one cluster; however, this discovery, albeit true, is useless because, in clustering one would like to uncover the underlying heterogeneity and structures in the data, such as identifying informative variables that suggest the existence of two or more clusters.

Among many clustering approaches, model-based clustering has become increasingly important due to its interpretability. In addition to good empirical performance relative to its competitors (Thalamuthu et al, 2008 [(40)]), model-based clustering has a solid probabilistic framework of mixture models, which facilitates model building and checking, such as selecting the number of clusters (McLachlan, 1987 [(25)]; Fraley and Raftery, 1998 [(12)]). Although Normal mixture models have been extensively studied by both frequentist and Bayesian approaches (Banfield and Raftery, 1993 [(3)]; Muller et al, 1996 [(28)], and references therein), our focus here is on high-dimensional data, for which variable selection is necessary (e.g. Pan and Shen, 2007 [(30)]). There are basically two categories of approaches to variable selection for high-dimensional model-based clustering: the Bayesian approaches (Liu et al, 2003 [(24)]; Teh et al, 2004 [(41)]; Ho, 2006 [(15)]; Tadesse et al, 2005 [(38)]; Kim et al, 2006 [(19)]; Raftery and Dean, 2006 [(32)]) and penalized likelihood approaches (Pan and Shen, 2007 [(30)]; Xie et al, 2008a [(49)]; Xie et al, 2008b [(50)]; Wang and Zhu, 2008 [(47)]; Guo et al 2009 [(14)]). In general, the Bayesian approaches are more flexible by allowing more general covariance matrices, but computationally are more demanding due to the use of MCMC for stochastic searches. For the penalized likelihood approaches, one common assumption is that each cluster has a diagonal covariance matrix, implying the same orientation for all clusters (Banfield and Raftery, 1993 [(3)]). As to be shown later, this is too stringent and can severely degrade performance in practice. Conceptually a general or unconstrained covariance matrix should be allowed for each cluster; however the challenge is how to treat means and general covariances subject to the constraint that any resulting covariance matrix estimate is positive definite. This challenge is evident in the recent literature on Gaussian graphical modeling that estimates a large covariance matrix based on a Normal sample (Huang et al, 2006 [(16)]; Yuan and Lin, 2007 [(52)]; Levina et al, 2008 [(21)]; Rothman et al, 2009 [(34)]; Fan et al, 2009 [(10)] and references therein). This problem continues to be even more challenging for mixture models, because it is unknown which observations are from which Normal components.

In this article, we propose a general penalized likelihood approach that permits unconstrained covariance matrices in a Normal mixture model. A major innovation here is the recognition of the connection between fitting Normal mixture models and Gaussian graphical modeling. Our approach utilizes the recent development in Gaussian graphical modeling by effectively embedding an existing Gaussian graphical modeling method into the E-M algorithm for Normal mixture models. In particular, we implement our method using the graphical lasso method (Friedman et al, 2007 [(11)]) for covariance estimation. Moreover, we generalize the proposed method to semisupervised learning, permitting partially labeled observations.

The rest of this article is organized as follows. Section 2 reviews the penalized model-based clustering method with diagonal covariance matrices, followed by a description of our proposed method that allows for a common or cluster-specific general covariance matrices. A brief discussion of an extension to semisupervised learning is given to permit known cluster memberships for a subset of observations. Section 3 presents simulation results, and an application to real microarray data is contained in section 4, to demonstrate the feasibility and effectiveness of the method compared against its counterpart with diagonal covariance matrices. Section 5 concludes with a summary and a discussion of future work.

2. Methods

2.1. Mixture model and its penalized likelihood

Denote by X = {x1, …, xn} a random sample of n K-dimensional observations. Assume that the n observations are standardized with sample mean 0 and sample variance 1 for each variable. Assume that the observations are independent and from a mixture distribution with probability density function (pdf)


where fi is the pdf for component or cluster i with unknown parameter vector θi, and πi is the prior probability for component i with i=1gπi=1. For parameter estimation, we adopt the maximum penalized likelihood estimator (MPLE) that maximizes the penalized log-likelihood


where Θ represents all unknown parameters and pλ(Θ) is a penalty function for Θ with a regularization parameter (vector) λ. In what follows, we assume a Gaussian mixture model with fi being a multivariate Normal density, and regularize their mean vectors and possibly covariance matrices.

To obtain the MPLE, we employ the E-M algorithm (Dempster et al, 1977 [(7)]). Let zij denote the indicator of whether xj belongs to component i; namely, zij = 1 if xj comes from component i, and zij = 0 otherwise. Here zij’s are regarded as missing data simply because they are not observed. If zij’s were observed, the complete data penalized log-likelihood becomes


Given a current estimate Θ(r), the E-step of the E-M calculates


where τij is the posterior probability of xj’s belonging to component i. The M-step maximizes QP to update the estimate of Θ.

2.2. Penalized clustering with diagonal covariance matrices

For comparison, we briefly review the method of Pan and Shen (2007) [(30)], which specifies the components fi as multivariate Normal with a common diagonal covariance matrix V=diag(σ12,σ22,,σK2),


with V=det(V)=k=1Kσk2. They proposed a penalty function pλ(Θ) as an L1-norm of the mean parameters


where μik is the mean of kth variable for component i. Note that the observations are standardized to have zero mean and unit variance for each variable k. If μ1k = … = μgk = 0, then variable k cannot differentiate the components, hence deemed as noninformative (i.e. a noise variable) and automatically excluded from clustering. Variable selection is realized when small estimates of μik’s can be shrunken to be exactly 0 by the use of the L1 penalty (Tibshirani, 1996 [(42)]).

For convenience, a generic notation Θ(r) is used to represent the parameter estimate at iteration r. The E-M updating formulas for maximizing the penalized likelihood (2) are as follows: at iteration r + 1, the posterior probability of xj belonging to component i is


the prior probability of an observation from the ith component


the variance of variable k


and the mean of variable k in cluster i


for i = 1, 2, (...) , g and k = 1, 2, (...) , K. For sufficiently large λ1, we have [mu]ik = 0; if [mu]1k = [mu]2k = … = [mu]gk = 0 for variable k, variable k is a noise variable and does not contribute to clustering as can be seen from equation (6).

If the variance parameters are not regularized, it is straightforward to extend (6)–(9) to the case with cluster-specific diagonal covariance matrices Vi=diag(σi,12,,σi,12): all the updating formulas remain to be the same except replacing σk2 by σi,k2:


Note that the treatment here differs from Xie et al (2008b) [(50)], in which the variance parameters are regularized. Throughout this article, we assume that an informative variable is defined to have cluster-specific means, no matter whether it has a common or cluster-specific variances.

2.3. Penalized clustering with a common unconstrained covariance

We now consider general or unconstrained covariance matrix V that further relax the diagonal covariance matrix assumption. Denote W = V−1 the inverse covariance matrix (or precision matrix) with elements Wkl.

To realize variable selection, we require that a noise variable has a common mean across clusters. Since the data have been standardized to have mean 0 for each variable, a common mean implies μ1k = … = μgk = 0. As in Bayesian approaches (Tadesse et al 2005 [(38)]), one can assume that any noise variable is uncorrelated with any informative variable, though this assumption is not necessary in our approach (because this assumption does not influence our estimation procedure). To facilitate estimating large and sparse covariance matrices, we propose the following penalty function:


Note that, the penalty on the mean parameter is mainly for variable selection, while that for the precision matrix is necessary for high-dimensional data. Since the data dimension K is larger than the sample size n, the sample covariance matrix (or the maximum likelihood estimate under the Normality) is necessarily singular. In addition, as discussed in the literature of Gaussian graphical modeling, penalization on a large covariance (or precison) matrix can yield better estimates than non-penalized ones. Although various penalties have been proposed for a covariance (or precision) matrix, some do not yield a positive-definite covariance estimate. For the problem considered here, since we need to calculate the log-likelihood, and thus the determinant of a covariance estimate, the positive-definiteness of a covariance estimate is needed, which imposes a major technical difficulty. In Gaussian graphical modeling, one aims to estimate the covariance or precision matrix of a Normal distribution; since all the observations are known to be iid from the same Normal distribution, the problem is easier than that for mixture models, where we need to cluster the observations into various unknown groups (each corresponding to a Normal distribution) and estimate the covariance matrix (and other parameters) for each group simultaneously. A major contribution of our work is recognition of the connection between Gaussian mixture modeling and Gaussian graphical modeling: in spite of the unknown cluster- or group-memberships of the observations in a Gaussian mixture model, the estimation of a covariance (or precision) matrix for a mixture component can be formulated to be similar to that for Gaussian graphical modeling.

2.3.1. Estimation of the non-covariance parameters

For the EM algorithm, the E-step yields QP as given in (4) and the M-step maximizes QP with respect to the unknown parameters, resulting in the same updating formulas for τij and πi as given in (6) and (7). The updating formula for μij is derived from the following theorem.

Theorem 1

The sufficient and necessary conditions for [mu]ik to be a (global) maximizer of QP (for a fixed i and k) are




where μi = (μi1, …, μiK)t and W.k = (W1k, …, WKk)t.

Hence, we have the below updating formula for the mean parameter:




Simple algebra indicates that the updating formulas (14)-(15) for μik reduces to (9) when the covariance matrix is diagonal. The coordinate-wise updating for μ as above converges to the global maximum in view of the results of Tseng (1988) [(44)] and Tseng (2001) [(45)], because the first term of QP , the conditional expectation of the complete data log-likelihood, is concave, while the second term, the L1 penalty on the mean parameters, is separable (and concave) in μik’s.

It remains to derive an updating formula for the covariance matrix in the E-M algorithm.

2.3.2. Estimation of the inverse covariance matrix

To derive the estimate of the covariance matrix V , we focus on the M-step of (4) with respect to the V . Replacing V with W−1, we only need to find the updating formula for W . To maximize QP with respect to W , it suffices to maximize




is the empirical covariance matrix.

For (16), we shall use the graphical lasso algorithm of Friedman et al (2007) [(11)], to maximize an objective function


over all non-negative definite matrices W for a known covariance matrix S. Hence, we can apply the algorithm to maximize (16) with λ = 2λ2/n and S=S~. Their algorithm is implemented in R package glasso.

2.4. Penalized clustering with cluster-specific covariance matrices

To permit varying cluster volumes and orientations, we now consider component-specific unconstrained covariance matrices Vi, i = 1, …, g. We employ a slightly modified penalty:


In the E-M algorithm, the updating formulas for π and τ remain the same as in (6) and (7), respectively; for μ, we only need to replace W in (14) and (15) with Wi. Thus, we only need to consider the covariance matrix estimation. Note


where C stands for a constant term unrelated to Wi. To maximize Qp, we only need to maximize




Hence, we can separately maximize each of these g terms using the graphical lasso to obtain an updated estimate of Wi.

2.5. Model selection

We propose using the predictive log-likelihood based on an independent tuning dataset or cross-validation (CV) as our model selection criterion. Through this criterion, we use a grid search to estimate the optimal (g, λ1, λ2) as the one with the maximum predictive log-likelihood.

For any given (g, λ1, λ2), because of possibly many local maxima for the mixture model, we run the EM algorithm multiple times with random starts. For our numerical examples, we started with the K-means clustering, and used its result as initial parameter estimates for the E-M algorithm. From the multiple runs, we selected the one giving the maximal penalized log-likelihood as the final result for the given (g, λ1, λ2).

2.6. Extension: semi-supervised learning

We further extend the proposed method to mixture model-based semi-supervised learning, in which some observations have known cluster labels while the others do not (McLachlan and Peel, 2002 [(27)]; Liang et al, 2007 [(22)]). Pan et al (2006) [(31)] developed the penalized mixture approach with diagonal covariance matrices; here we push it to the case with unconstrained covariance matrices.

Without loss of generality, assume that we have partially labeled K-dimensional data x1, …, xn, in which the first n0 observations do not have class labels while the remaining n1 do have. Furthermore, assume the existence of g0 classes among the first n0 observations and g1 known classes among the n1 labeled observations. The log-likelihood is


The penalized log-likelihood can be then constructed with a suitable penalty function pλ(Θ). It is noted that, in the E-M algorithm, because zij’s for the last n1 observations are indeed observed, their corresponding “posterior probabilities” are known as τij = zij, while those for the first n0 observations are the same as (6). With the new updating formula for τ, the updating formulas for μ, π and covariance parameters in the E-M algorithm are the same as before.

Model selection can be performed as before. First, we use an independent tuning dataset or CV, including both labeled and unlabeled observations, to select the number of clusters and penalty parameters. Then we fit the selected model to the whole data set. Note that, after obtaining all parameter estimates, we calculate the posterior probabilities τ for each observation, including the n1 observations with known class labels, and use the posterior probabilities for class assignment.

3. Simulations

3.1. Small K: Why use non-diagonal covariance matrices

To better visualize the advantage of allowing unconstrained covariance matrices, we applied the methods to two-dimensional data. We considered two simple setups: the first with only one true cluster while the second with two clusters. The number of observations in each set-up was 200; for set-up 2, 100 observations were generated from each cluster. We used an independent tuning dataset of an equal size as that of a training dataset to choose the number of clusters and the penalty parameters.

Figure 1 displays the true clusters, estimated clusters with cluster-specific unconstrained and diagonal covariance matrices respectively. Throughout this article, to reflect the sampling variability, for the true clusters, the parameter values were estimated based on the true cluster memberships of the observations. The correctly classified observations were represented by open circles or diamonds, while incorrect ones by filled ones. For set-up 1 (Figure 1), although there was only one cluster, due to the use of the diagonal covariance matrices, to account for the fact that the orientation of the true cluster was not parallel to either axis, two clusters with their orientation parallel to either axis were needed, leading to selecting an incorrect number of clusters. For set-up 2, even though a correct number of clusters was identified with the use of diagonal covariances, again due to the restriction on the orientation of the clusters imposed by the diagonal covariances, there were a large number of mis-classified observations. In contrast, the new method with unconstrained covariance matrices yielded much better results.

Fig 1
Simulated data from only one cluster (top panels) and from two clusters (bottom panels) with K = 2

3.2. Simulated data with diagonal covariance matrices

We next considered three set-ups with K > n and diagonal covariance matrices. The first was a null case with g = 1 cluster; the other two were with two clusters: one with difference only in the mean parameters, and the other in both the mean and variance parameters. For each set-up, 100 simulated datasets were generated. Each dataset contained n = 80 observations with dimension K = 100. For set-up 1, all observations belonged to the same cluster; for set-ups 2 and 3, the first 40 observations formed the first cluster while the remaining 40 observations belonged to the second cluster. Specifically, for set-up 1, all variables came from a standard normal distribution N(0, 1); for set-up 2 and 3, 90 variables were noises generated from N(0, 1), and the remaining 10 variables were informative. For set-up 2, the informative variables for the first cluster came from N(0, 1) and from N(1.5, 1) for the second cluster. For set-up 3, the informative variables were from N(0, 1) and N(1.5, 2) for the two clusters respectively.

We applied three methods: a common unconstrained covariance, cluster-specific unconstrained covariance matrices, and diagonal covariances. For the one with diagonal covariances, we used either a common diagonal covariance or cluster-specific diagonal covariances according to the truth to ideally optimize its performance. For each simulated dataset, we fitted a series of models with the number of components g = 1, 2 and 3, and performed a grid search to choose the penalty parameters. We show the frequencies of selecting various numbers of clusters, the average number of informative variables incorrectly selected to be noninformative (z1), the average number of noninformative variables correctly selected (z2), the number of observations correctly classified to cluster i (Ci), and the number of observations mis-classified from cluster i (ICi). We also report the Rand index (RI) or adjusted Rand index (aRI) to summarize the quality of clustering.

As shown in Table 1, for the null case, we could select the correct number of clusters correctly using the proposed method with a common covariance matrix, while the proposed method with cluster-specific covariance matrices did not perform so well. For set-up 2 with the true model with a common covariance matrix, the proposed method with a common unconstrained covariance matrix correctly selected g = 2 most often, and had a comparable performance to the method with a diagonal covariance in terms of the sample assignments, as summarized by the Rand or adjusted Rand index (Table 2). For set-up 3 with the true model with a cluster-dependent diagonal covariance matrices, the proposed method with cluster-dependent covariances correctly selected g = 2 nearly as often as using cluster-dependent diagonal covariance matrices. In terms of sample assignments, the proposed method also performed comparably. For these three set-ups, a close examination indicated that for the proposed method the highest (predictive) log-likelihood was achieved when we had a sufficiently large penalty on the off-diagonal elements of the inverse covariance matrices, leading to the estimated covariance matrices close to being diagonal as were the truth.

Table 1
Simulated data with diagonal covariances: frequencies of the selected numbers (g) of clusters, and mean numbers of predicted noise variables among the true informative (z1) and noise variables (z2). For set-up 1, the truth was g = 1, z1 = 10 and z2 = ...
Table 2
Simulated data with diagonal covariance matrices: sample assignments for ĝ = 2 and (adjusted) Rand index (RI/aRI) values. # #Ci represents the average number of the samples correctly assigned to cluster i, i = 1,2; # #ICi represents the ...

3.3. Simulated data with non-diagonal covariance matrices

We considered some true models with non-diagonal covariance matrices, while other aspects of simulation remained the same as in section 3.2; in particular, among 100 variables, ten of which were informative. We used two non-diagonal covariance matrices for the 10 informative variables: a compound symmetry (CS) and an AR-1 with ρ = 0.6; the noise variables were independent of each other and of the informative variables. The resulting two covariance matrices are denoted as V0,1 and V0,2.

The following four set-ups were considered: set-up 1 was for a null case with only one cluster; for set-ups 2 and 3, we had two clusters sharing the same covariance matrix V0,1 or V0,2, but with mean parameters differing by 1.5 in each informative variable; for set-up 4, the two clusters differed in both the mean vectors (by 1.5) and covariance matrices as V0,1 and V0,2 respectively. As before, a training dataset contained 80 observations, 40 of which came from the first cluster (if there were two clusters); we used an independent tuning dataset of size 80. We applied the three methods; again according to the truth, we used a common diagonal covariance matrix for set-ups 1-3, but cluster-specific diagonal covariance matrices for set-up 4.

The frequencies of the selected numbers of clusters based on 100 simulated datasets are shown in Table 3. For each case, the proposed methods performed better than the diagonal matrix method; between the two proposed methods, depending on the truth (i.e. whether there was a common covariance matrix), one of them performed better than the other. The same conclusion can be drawn on the performance of the methods for sample classification (Table 4).

Table 3
Simulated data with non-diagonal covariance matrices: frequencies of the selected numbers (g) of clusters, and mean numbers of predicted noise variables among the true informative (z1) and noise variables (z2). For set-up 1, the truth was g = 1, z1 = ...
Table 4
Simulated data with non-diagonal covariance matrices: sample assignments for ĝ = 2 and (adjusted) Rand index (RI/aRI) values. # #Ci represents the average number of the samples correctly assigned to cluster i, i = 1,2; # #ICi represents ...

4. Examples

4.1. Leukemia gene expression data

4.1.1. Data and a clustering analysis

We first applied the methods to a well-known leukemia gene expression dataset of Golub et al (1999) [(13)] to compare their performance. The (training) data contained 38 patient samples, among which 11 were acute myeloid leukemia (AML) while the remaining were acute lymphoblastic leukemia (ALL) samples. The ALL samples consisted of two subtypes: 8 T-cell and 19 B-cell samples. For each sample, the expression levels of 7129 genes were measured by an Affymetrix microarray. As in Dudoit et al (2002) [(8)], we pre-processed the data in the following steps: 1) truncation: any expression level xjk was truncated below at 1 if xjk < 1, and above at 16,000 if xjk > 16, 000; 2) filtering: any gene was excluded if its max/min ≤ 5 and maxmin ≤ 500, where max and min were the maximum and minimum expression levels of the gene across all the samples. Finally, as preliminary gene screening, we selected the top 300 genes with the largest sample variances across the 38 samples. Because there were only a small number of samples, we took stratified sampling by cell types in 3-fold cross validation (CV). We fitted the models with g = 1, 2 ,3 and 4. The first local maximum of the predictive log-likelihood was achieved at g = 3.

The clustering results for g = 3 are shown in Table 5. Although all the 300 genes were selected as informative, there was evidence for a large number of genes with differential expression between the leukemia subtypes (e.g. Pan and Shen 2007 [(30)]). It is clear that the new method with cluster-specific unconstrained covariance matrices gave fewer errors in sample classification. To see why, we examined a few genes in more details. Genes CST3 (cystatin c, M23197) and ZYX (zyxin, X95735) were in the top 50 genes ranked by Golub et at (1999) [(13)], and were two of the 17 genes selected by Antonov et al (2004) [(? )] to distinguish the AML and ALL subtypes. CST3 was also regarded as a suitable marker by Bardi et al (2004) [(4)] . Baker et al (2006) [(2)] and Wang et al (2005) [(46)]further identified ZYX as the most significant gene to discriminate AML/ALL subtypes. These two genes were also identified among the top 20 genes used in the classifier by Liao et al (2007) [(23)] . In addition, we included two genes with gene accession number HG613-HT613 and M38591. The expression levels of gene pairs (HG613-HT613, M23197), and (X95735,M38591) are shown in Figure 2, with the true and estimated clusters. Clearly the true clusters did not necessarily have orientations parallel with either axis, which was captured by the proposed method, leading to more accurate sample classifications with more homogeneous clusters. We also examined element-wise differences between the covariance matrices resulting from the true clusters and estimated clusters: the unconstrained covariance matrices were much closer to the true covariance matrices than the diagonal covariance matrices were (results not shown).

Fig 2
Expression levels of gene pairs (HG613-HT613, M23197) and (X95735, M38591), and the corresponding clusters for the leukemia data
Table 5
Clustering results for the leukemia gene expression data with K = 300 genes

4.1.2. A comparison with a Bayesian method

Kim et al (2006) [(19)] proposed a mixture of multivariate normal distributions via Dirichlet process (DP) mixtures for model-based clustering with the capability of variable selection for high-dimensional data. In particular, their method uses non-diagonal covariance matrices. Among others, a concentration parameter α for the DP prior and some data-dependent priors have to be specified; note that, as pointed out by other authors (Richardson and Green, 1997 [(35)]; Wasserman, 2000 [(48)]), it is not possible to obtain proper posterior distributions with fully noninformative priors in mixture models. Kim et al (2006) [(19)] applied their method to the leukemia gene expression data of Golub et al (1999) [(13)]. The data preprocessing was the same as before except that, rather than using the top 300 genes with the largest sample variances, all the 3571 genes were used. They considered two values of α. For α = 38, the MCMC sampler visited models with 4 to 7 components. The sample allocations based on the maximum a posteriori probability (MAP) and on the least-squares clustering (LSC) algorithm were respectively,



On the other hand, with α = 1, the sampler visited models with 3 to 6 components, and the clustering results were



In each case, about 120 genes were selected to be informative.

We applied our proposed method with cluster-specific unconstrained covariance matrices. For the number of cluster g = 1 to 7, the predictive log-likelihood values based on a 3-fold CV were −28160, −24836, −23117, −23551, −23896, −24432 and −25218, respectively. Hence, we would select g = 3. For comparison, we showed the clustering results for both g = 3 and g = 7 in Table 6.

Table 6
Clustering results with cluster-specific unconstrained covariance matrices for the leukemia gene expression data with K = 3571 genes

Comparing our results with that of Kim et al (2006) [(19)], we see some significant differences. First, our method could distinguish the two subtypes of ALL, while that of Kim et al (2006) [(19)] failed. Second, in contrast to that of Kim et al (2006) [(19)], our results did not show the heterogeneous subgroups of AML even when we forced to have 7 clusters. Third, our method selected 3 clusters, corresponding to the three subtypes of the leukemia, while the Bayesian method chose a larger number of clusters. Finally, the method of Kim et al (2006) [(19)] selected about 120 informative genes, in contrast to 1372 genes selected by our method for g = 3. We also note that the results of the Bayesian method depended on the choices of the prior and sample allocation method.

Currently we implemented our method in R, in which the glasso function is called to estimate a covariance matrix. For analysis of Golub’s data as considered here, it took about two days to complete running our program on a laptop.

4.2. BOEC gene expression data

4.2.1. Data and a clustering analysis

One biologically interesting issue is whether human blood outgrowth endothelial cells (BOECs) belong to or are closer to either large vessel endothelial cells (LVECs) or microvascular endothelial cells (MVECs) based on global expression profiling. To address this question, as in Pan et al (2006) [(31)], we combined the data from two separate studies: 28 LVEC and 25 MVEC samples in Chi et al (2003) [(6)], and 27 BOEC samples in Jiang (2007) [(17)], and normalized the data as in Jiang (2007) [(17)]. Jiang (2007) [(17)] identified 37 genes that would discriminate the LVEC and MVEC samples, and we applied the proposed method using these 37 genes to cluster the 80 samples. Here, we used four-fold CV for tuning. We fitted the models with g = 1, 2, 3 and g = 4; the first local maximum of the predictive log-likelihood was reached at g = 3. At g = 3, 30 genes out of 37 were selected as informative.

The clustering results for three clusters are shown in Table 7. It is confirmed that each type of the samples clustered closely together, as observed by Jiang (2007) [(17)]. Nevertheless, it is noted that using unconstrained covariance matrices led to more accurate (i.e. more homogeneous) clustering than using diagonal covariance matrices. For illustration, two pairs of genes were chosen to have their expression levels and the corresponding true and estimated clusters plotted in Figure 3. Again, based on the true clusters, it is evident that non-diagonal covariance structures were captured better by our proposed method, leading to better (i.e. more homogeneous) clustering results; a direct examination of the estimated covariance matrices also confirmed this point (results not shown).

Fig 3
Expression levels of gene pairs (A2BP1, FMR1) and (APRT, SSBP2), and the corresponding clusters for the BOEC data
Table 7
Clustering results for the BOEC data

4.2.2. Semi-supervised learning

To address the biological question of whether the BOEC samples belong to either or neither of the LVEC and MVEC classes, we fully utilize the known memberships of the LVEC and MVEC samples while allowing the memberships of the BOEC samples to be determined. This is a semi-supervised learning problem with some observations with known cluster memberships (McLachlan and Peel 2002 [(27)]; Pan et al 2006 [(31)]). Below we illustrate the application of the methods to the BOEC data.

For the BOEC data, treating the LVEC and MVEC samples with known class labels (i.e. g1 = 2), we obtained the following results using the 37 genes as used before. First, if we did not allow the BOEC samples to form their own class with g0 = 0, the BOEC samples (largely) fell to the class of MVEC, as shown before (Pan et al 2006). Second, if we allowed the possibility of the BOEC samples to form their own class with g0 = 1, they indeed formed a separate class. As shown in Table 8, in either case, the proposed method with unconstrained covariance matrices performed better than using diagonal covariance matrices with fewer mixed sample assignments. The proposed method selected 26 and 29 informative genes for the two cases respectively.

Table 8
Semi-supervised learning results for the BOEC data

5. Discussion

We have proposed a new approach to penalized model-based clustering with unconstrained covariance matrices. We have shown its better empirical performance than that using diagonal covariance matrices when some informative variables are correlated, which is often the case for high-dimensional genomic data, as supported by co-expressions of functionally-related genes for microarray gene expression data. One key technical challenge is in estimating a possibly large covariance matrix. By taking advantage of the recent development in Gaussian graphical models, we have implemented our approach with the use of the graphical lasso algorithm (Friedman et al 2007 [(11)]), largely due to its fast speed. Nevertheless, in principle, other covariance estimation methods, either frequentist (Huang et al 2006 [(16)]; Yuan and Lin 2007 [(52)]; Levina et al 2008 [(21)]; Rothman et al 2009 [(34)]; Fan et al 2009 [(10)], and references therein), or Bayesian (Jones et al 2005 [(18)]; Scott and Carvalho 2009 [(37)]; Carvalho and Scott 2009 [(5)], and references therein), could be used, though computational speed is an important factor. Alternatively, some nondiagonal structural assumptions may be imposed on covariance matrices. Xie et al (2009) [(51)] proposed such an approach based on the mixture of factor analyzers: some latent variables are used to model the covariance structure in each cluster, which however is computationally demanding if the number of the latent variables needs to be chosen data-adaptively. We comment on that, although we have focused on its application in variable selection, penalized model-based clustering may be useful in its own right, such as in providing better parameter estimates (due to regularization) and thus better clustering results for small samples. For future improvement, we may follow Xie et al (2008b) [(50)] to impose a penalty on variance parameters to account for clustering structures in varying variances across clusters, in addition to that in locations or means. It may be of interest to extend the proposed method to deal with other issues in high-dimensional genomic data, such as prior biological knowledge and outliers (Pan 2006 [(29); Tseng 2007 [(43)]). More investigations are necessary, especially in further evaluating the proposed method with real-world applications.

Free software will be posted on our web site.


This research was partially supported by NIH grant RO1-GM081535; in addition, HZ and WP by NIH grant RO1-HL65462, and XS by NSF grants DMS-0604394 and DMS-0906616.

Appendix A: Appendix: Proof of Theorem 1

For simplicity of notation, we drop the “hat” from any parameter estimate.

Since QP is differentiable with respect to μik when μik ≠ 0, while nondifferentiable at μik = 0, we consider the following two cases:

  1. If μik ≠ 0 is a maximum, given that QP is concave and differentiable, then the sufficient and necessary condition for μik to be the global maximum of QP is
    from which (12) can be easily derived if we separate μik from other components of μi.
  2. If μik = 0 is a maximum, we compare QP (0, .) with QPμik, .), the values of QP at μik = 0 and μik = Δμik respectively (while other components of μi are fixed at its maximum). By definition, we have
    This yields (12) and (13).


AMS 2000 subject classifications: Primary 62H30.



[1] Alaiya AA, et al. Molecular classification of borderline ovarian tumors using hierarchical cluster analysis of protein expression profiles. Int. J. Cancer. 2002;98:895–899. [PubMed]
[2] Baker Stuart G., Kramer Barnett S. Identifying genes that contribute most to good classification in microarray. BMC Bioinformatics. 2006 Sep 7;7:407. [PMC free article] [PubMed]
[3] Banfield JD, Raftery AE. Model-Based Gaussian and Non-Gaussian Clustering. Biometrics. 1993;49:803–821.
[4] Bardi E, Bobok I, Olah AV, Olah E, Kappelmayer J, Kiss C. Cystatin C is a suitable marker of glomerular function in children with cancer. Pediatric Nephrology. 2004;19:1145–1147. [PubMed]
[5] Carvalho CM, Scott JG. Objective Bayesian model selection in Gaussian graphical models. Biometrika. 2009;96:497–512.
[6] Chi J-T, et al. Endothelial cell diversity revealed by global expression profiling. PNAS. 2003;100:10623–10628. [PubMed]
[7] Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm (with discussion) JRSS-B. 1977;39:1–38.
[8] Dudoit S, Fridlyand J, Speed T. Comparison of discrimination methods for the classification of tumors using expression data. J. Am. Stat. Assoc. 2002;97:77–87.
[9] Eisen M, Spellman P, Brown P, Botstein D. Cluster analysis and display of genome-wide expression patterns. PNAS. 1998;95:14863–14868. [PubMed]
[10] Fan J, Feng Y, Wu Y. Network exploration via the adaptive LASSO and SCAD penalties. Ann. Appl. Stat. 2009;3:521–541. [PMC free article] [PubMed]
[11] Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2007;0:1–10. [PMC free article] [PubMed]
[12] Fraley C, Raftery AE. How many clusters? Which clustering methods? Answers via model-based cluster analysis. Computer J. 1998;41:578–588.
[13] Golub T, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286:531–537. [PubMed]
[14] Guo FJ, Levina E, Michailidis G, Zhu J. Pairwise Variable Selection for High-dimensional Model-based Clustering. To appear in Biometrics. 2009 [PMC free article] [PubMed]
[15] Hoff PD. Model-based subspace clustering. Bayesian Analysis. 2006;1:321–344.
[16] Huang JZ, Liu N, Pourahmadi M, Liu L. Covariance selection and estimation via penalised normal likelihood. Biometrika. 2006;93:85–98.
[17] Jiang A, Pan W, Yu S, Robert PH. A practical question based on cross-platform microarray data normalization: are BOEC more like large vessel or microvascular endothelial cells or neither of them? Journal of Bioinformatics and Computational Biology. 2007;5:875–893. [PubMed]
[18] Jones B, Carvalho C, Dobra A, Hans C, Carter C, West M. Experiments in stochastic computation for high-dimensional graphical models. Statist. Sci. 2005;20:388–400.
[19] Kim S, Tadesse MG, Vannucci M. Variable selection in clustering via Dirichlet process mixture models. Biometrika. 2006;93:877–893.
[20] Lau JW, Green PJ. Bayesian model based clustering procedure. Journal of Computational and Graphical Statistics. 2007;16:526–558.
[21] Levina L, Rothman A, Zhu J. Sparse estimation of large covariance matrices via a nested lasso penalty. Annals of Applied Statistics. 2008;2:245–263.
[22] Liang F, Mkherjee S, West M. The use of unlabeled data in predictive modeling. Statistical Science. 2007;22:189–205.
[23] Liao JG, Chin KV. Logistic regression for disease classification using microarray data: model selection in a large p and small n case. Bioinformatics. 2007;23:1945–1951. [PubMed]
[24] Liu JS, Zhang JL, Palumbo MJ, Lawrence CE. Bayesian clustering with variable and transformation selection (with discussion) Bayesian Statistics. 2003;7:249–275.
[25] McLachlan G. On bootstrapping likelihood ratio test statistics for the number of components in a normal mixture. Applied Statistics. 1987;36:318–324.
[26] McLachlan GJ, Bean RW, Peel D. A mixture model-based approach to the clustering of microarray expression data. Bioinformatics. 2002;18:413–422. [PubMed]
[27] McLachlan GJ, Peel D. Finite Mixture Model. John Wiley & Sons, Inc.; New York: 2002.
[28] Muller P, Erkanli A, West M. Bayesian curve fitting using multivariate normal mixtures. Biometrika. 1996;83:67–79.
[29] Pan W. Incorporating gene functions as priors in model-based clustering of microarray gene expression data. Bioinformatics. 2006;22:795–801. [PubMed]
[30] Pan W, Shen X. Penalized model-based clustering with application to variable selection. Journal of Machine Learning Research. 2007;8:1145–1164.
[31] Pan W, Shen X, Jiang A, Hebbel RP. Semi-supervised learning via penalized mixture model with application to microarray sample classification. Bioinformatics. 2006;22:2388–2395. [PubMed]
[32] Raftery AE, Dean N. Variable selection for model-based clustering. Journal of the American Statistical Association. 2006;101:168–178.
[33] Rand WM. Objective criteria for the evaluation of clustering methods. JASA. 1971;66:846–850.
[34] Rothman A, Levina L, Zhu J. Generalized thresholding of large covariance matrices. JASA. 2009;104(485):177–186. 2009.
[35] Richardson S, Green PJ. On Bayesian analysis of mixture models (with Discussion) J R Statist Soc B. 1997;59:731–792.
[36] Schwarz G. Estimating the dimension of a model. Annals of Statistics. 1978;6:461–464.
[37] Scott JG, Carvalho CM. Feature-inclusion stochastic search for Gaussian graphical models. J. Comp. Graph. Stat. 2009;17:790–808.
[38] Tadesse MG, Sha N, Vannucci M. Bayesian variable selection in clustering high-dimensional data. Journal of the American Statistical Association. 2005;100:602–617.
[39] Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM. Systematic determination of genetic network architecture. Nat. Genet. 1999;22:281–285. [PubMed]
[40] Thalamuthu A, Mukhopadhyay I, Zheng X, Tseng GC. Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics. 2006;22:2405–2412. [PubMed]
[41] Teh YW, Jordan MI, Beal MJ, Beal MJ. Sharing clusters among related groups: Hierarchical Dirichlet processes. NIPS. 2004
[42] Tibshirani R. Regression shrinkage and selection via the lasso. JRSS-B. 1996;58:267–288.
[43] Tseng GC. Penalized and weighted K-means for clustering with scattered objects and prior information in high-throughput biological data. Bioinformatics. 2007;23:2247–2255. [PubMed]
[44] Tseng P. Coordinate ascent for maximizing nondifferentiable concave functions. Massachusetts Institute of Technology. Laboratory for Information and Decision Systems; 1988. Technical report LIDS-P; 1840.
[45] Tseng P. Convergence of block coordinate descent method for nondifferentiable maximization. J. Opt. Theory and Applications. 2001;109:474–494.
[46] Wang Y, Tetko IV, Hall MA, Frank E, Facius A, Mayer KFX, Mewes HW. Gene selection from microarray data for cancer classification - a machine learning approach. Comput Biol Chem. 2005;29:37–46. [PubMed]
[47] Wang S, Zhu J. Variable Selection for Model-Based High-Dimensional Clustering and Its Application to Microarray Data. Biometrics. 2008;64:440–448. [PubMed]
[48] Wasserman L. Asymptotic inference for mixture models using data-dependent priors. J R Statist Soc B. 2000;62:159–180.
[49] Xie B, Pan W, Shen X. Variable selection in penalized model-based clustering via regularization on grouped parameters. Biometrics. 2008a;64:921–930. [PubMed]
[50] Xie B, Pan W, Shen X. Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables. Electron. J. Statist. 2008b;2:168–212. [PMC free article] [PubMed]
[51] Xie B, Pan W, Shen X. To appear Bioinformatics. Division of Biostatistics, University of Minnesota; 2009. Penalized mixtures of factor analyzers with application to clustering high dimensional microarray data. Available at as Research Report 2009-019. [PMC free article] [PubMed]
[52] Yuan M, Lin Y. Model selection and estimation in the Gaussian graphical model. Biometrika. 2007;94:19–35.