Search tips
Search criteria 


Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
Stat Sin. Author manuscript; available in PMC 2017 March 15.
Published in final edited form as:
Stat Sin. 2016 April; 26(2): 445–464.
doi:  10.5705/ss.2014.256
PMCID: PMC5351783

Joint Estimation of Multiple High-dimensional Precision Matrices


Motivated by analysis of gene expression data measured in different tissues or disease states, we consider joint estimation of multiple precision matrices to effectively utilize the partially shared graphical structures of the corresponding graphs. The procedure is based on a weighted constrained [ell]/[ell]1 minimization, which can be effectively implemented by a second-order cone programming. Compared to separate estimation methods, the proposed joint estimation method leads to estimators converging to the true precision matrices faster. Under certain regularity conditions, the proposed procedure leads to an exact graph structure recovery with a probability tending to 1. Simulation studies show that the proposed joint estimation methods outperform other methods in graph structure recovery. The method is illustrated through an analysis of an ovarian cancer gene expression data. The results indicate that the patients with poor prognostic subtype lack some important links among the genes in the apoptosis pathway.

Keywords: Constrained optimization, Convergence rate, Graph recovery, Precision matrices, Second-order cone programming, Sparsity


Gaussian graphical models provide a natural tool for modeling the conditional independence relationships among a set of random variables (Lauritzen (1996); Whittaker (1990)). They have been successfully applied to infer relationships between genes at transcriptional level (Schäfer and Strimmer (2005); Li and Gui (2006); Li, Hsu, Peng et al. (2013)). Gaussian graphical models are tightly linked to precision matrices. Suppose X = (X1,…, Xp)′ follows a multivariate Gaussian distribution Np(μ, Σ). The precision matrix = Σ−1 describes the graphical structure of its corresponding Gaussian graph. If the (i, j)-th entry of the precision matrix ωij is equal to zero, then Xi and Xj are independent conditioning on all other variables Xk, k ≠ i, j. Correspondingly, no edge exists between Xi and variable Xj in the graphical structure of Gaussian graphical model. If ωij 0, then Xi and Xj are conditionally dependent and they are therefore connected in the graphical structure. Define the support of by S={(i,j):ωij0}, which is also the set of edges in the Gaussian graphical model. If the maximum degree of , maxij=1pI(ωij0) is relatively small, we call sparse. Since the expression variation of a gene can usually be explained by a small subset of other genes, the precision matrix for gene expression data of a set of genes is expected to be sparse.

Many methods for estimating high-dimensional precision matrix or its Gaussian graphical model have been developed in the past decade. Meinshausen and Bühlmann (2006) introduced a neighborhood selection approach by regressing all other variables on each variable with an [ell]1 penalty. The method consistently estimates the set of non-zero elements of the precision matrix. Efficient algorithms for exact maximization of the [ell]1-penalized log-likelihood have also been proposed. Yuan and Lin (2007), Banerjee, Ghaoui and d’Aspremont (2008) and Dahl, Vandenberghe and Roychowdhury (2008) adopted an interior point optimization method to solve this problem. Based on the work of Banerjee, Ghaoui and d’Aspremont (2008) and a block-wise coordinate descent algorithm, Friedman, Hastie and Tibshirani (2008) developed the graphical Lasso (GLASSO) for sparse precision matrix estimation; it is computationally efficient even when the dimension is greater than the sample size. Yuan (2010) developed a linear programming procedure and obtained oracle inequalities for the estimation error in term of matrix operator norm. Cai, Liu and Luo (2011) developed a constrained [ell]1 minimization approach (CLIME) to estimate sparse precision matrix. All of these methods addressed the problem of estimating a single precision matrix or a single Gaussian graphical model.

In many applications, the problem of estimating multiple precision matrices arises when data are collected among multiple groups. For example, gene expression levels are often measured over multiple groups (tissues, environments, or subpopulations). Their precision matrices and the corresponding graphical structures imply gene regulatory mechanisms and are of great biological interest. Since the gene regulatory networks in different groups are often similar to each other, the graphical structures share many common edges. Estimating a single precision matrix group by group ignores the partial homogeneity in their graphical structures, which often leads to low power. To effectively utilize the shared graphical structures and to increase the estimation precision, it is important to estimate multiple precision matrices jointly.

Previous attempts to jointly estimate multiple precision matrices include Guo, Levina, Michailidis et al. (2011) and Danaher, Wang and Witten (2014). Guo, Levina, Michailidis et al. (2011) proposed a hierarchical penalized model to perserve the common graphical structure while allowing differences across groups. Their method achieves Frobenius norm convergence when p log(p)/n goes to zero, where p is the number of variables, and n is the total sample size. Unfortunately, for genomic applications, the number of genes often exceeds the total sample size and, as a result, invalidates the theoretical justification in Guo, Levina, Michailidis et al. (2011). Danaher, Wang and Witten (2014) proposed two algorithms of joint graphical lasso (FGL and GGL) to estimate precision matrices that share common edges. Their approach is based upon maximizing a penalized log likelihood with a fused Lasso or group Lasso penalty. The paper did not provide any theoretical justification on the statistical convergence rate of their estimators.

In this paper, we propose a weighted constrained [ell]/[ell]1 minimization method to estimate K sparse precision matrices (MPE) jointly. Different from Guo, Levina, Michailidis et al. (2011) and Danaher, Wang and Witten (2014), the proposed estimators converge to the true precision matrices even when p = O{exp(na)}, for some 0 < a < 1. In addition, when K is sufficiently large, compared to the estimators from separate estimation methods, our proposed estimators converge to the true precision matrices (under the entry-wise [ell] norm loss) faster. An additional thresholding step on the estimators with a carefully chosen threshold yields thresholded estimators with additional theoretical properties. The thresholded estimators from our method converge to the true precision matrices under the matrix [ell]1 norm. Finally, when the graphical structures across groups are the same, our method leads to the exact recovery of the graph structures with a probability tending to 1.

The rest of the paper is organized as follows. Section 2 presents the estimation method and the optimization algorithm. Theoretical properties of the proposed method and accuracy of the graph structure recovery are studied in Section 3. Section 4 investigates the numerical performance of the method through a simulation study. The proposed method is compared with other competing methods. The method is also illustrated by an analysis of an epithelial ovarian cancer gene expression study in Section 5. A brief discussion is given in Section 6 and proofs are presented in the Appendix.


The following notations are used in the paper. For a vector a = (a1,…, ap)T [set membership] Rp, define |a|1=j=1p|aj| and |a|2=(j=1paj2)1/2. The vector ai is the vector of a without the i-th entry. The support of a is defined as supp(a) = {i : ai 0}. For a matrix A = (aij) [set membership] R p×q, its entrywise [ell] norm is denoted by |A| = maxi,j |aij|. Its matrix [ell]1-norm is denoted by A1=max1jqi=1p|aij|, and its spectral norm is denoted by ║A2. The sub-matrix Ai,−i is the matrix of A without the i-th row and the i-th column. Denote by λmax(A) and λmin(A) the largest and smallest eigenvalues of A, respectively. For two sequences of real numbers {an} and {bn}, write an = O(bn) if there exists a constant C such that |an| ≤ C|bn| holds for all sufficiently large n, write an = o(bn) if limn→∞ an/bn = 0. If an = O(bn) and bn = O(an), then an [union above intersection] bn.

2.1 The Joint Estimation Method

We introduce an estimation method to jointly estimate K precision matrices with partial homogeneity in their graphical structures. The method uses a constrained [ell]1 minimization approach, that has been successfully applied to high dimensional regression problems (Donoho, Elad and Temlyakov (2006); Candés and Tao (2007)) and signal precision matrix estimation problem (Cai, Liu and Luo (2011)) to recover the sparse vector or matrix.

For 1 ≤ k ≤ K, let X(k) ~ N(μ(k), Σ(k)) be a p-dimensional random vector. The precision matrix of X(k), denoted by Ω(k)=(ωij(k)), is the inverse of the covariance matrix Σ(k). Suppose there are nk identically and independently distributed random samples of X(k):{Xj(k),1jnk}. The sample covariance matrix for the k-th group is


where X¯(k)=j=1nkXj(k)/nk. Denote by n = Σk nk the total sample size and wk = nk/n the weight of the k-th group. We estimate Ω(k)=(ωij(k)) for k = 1, …, K by the constrained optimization


where λn = C(log p/n)1/2 is a tuning parameter. The [ell]/[ell]1 objective function is used to encourage the sparsity of all K precision matrices. The constraint is imposed on the maximum of the element-wise group [ell]2 norm to encourage the groups to share a common graphical structure.

Denote by Ω^1(k)(1kK) the solution to (1). They are not symmetric in general. To make the solution symmetric, the estimator Ω^(k)=(ω^ij(k)) is constructed by comparing ω^1ij(k) and ω^1ji(k) and assigning the one with a smaller magnitude at both entries,


This symmetrizing procedure is not ad-hoc. The procedure assures that the final estimator Ω^(k) achieves the same entry-wise [ell] estimation error as Ω^1(k). The details are discussed in Section 3.

2.2 Computational algorithm

The convex optimization problem (1) involves estimating K p × p precision matrices. To reduce the computation complexity, it can be further decomposed into p sub-problems that involve estimating K p × 1 sparse vectors:

minβj(k)p,1kK(maxk|β1(k)|1),subject tomaxi{k=1Kwk|(^(k)βj(k)ej)i|2}1/2λn

for 1 ≤ j ≤ p, where ej [set membership] Rp is the unit vector with j-th element 1 and other elements 0. A lemma shows that solving (2) is equivalent to solving (1).

Lemma 1

Suppose Ω^1(k) is the solution to (1) and B^(k):=(β^1(k),,β^p(k)), where β^j(k) is the solution to (2). Then Ω^1(k)=B^(k) for 1 ≤ k ≤ K.

Problem (2) can be solve by a second-order cone programming. The existing packages to solve (2) include the SDTP3 and the SeDuMi package in Matlab, and the CLSOCP package in R. CLSOCP uses a one-step smoothing Newton method of Liang, He and Hu (2009). This algorithm has good precision but works relatively slowly for high dimensional problems. SeDuMi and SDTP3 adopt the primal-dual infeasible-interior point algorithm (Newsterov and Todd (1998)). The most time-consuming part of the algorithm is to solve the Schur complement equation, which involves Cholesky factorization. The sparsity and the size of the Schur complement matrix are two factors that affect efficiency. SDTP3 is able to divide a high dimensional optimization problem into sparse blocks and uses the sparse solver for Cholesky factorizations. It is therefore faster than SeDuMi in solving (2). In this paper, we used the SDTP3 package for all the computations. For a problem with p = 200, nk = 150 and K = 3, it takes a dual-core 2.7 GHz Intel Core i7 laptop approximately 11 minutes to solve (1).

2.3 Tuning Parameter Selection

Choosing the tuning parameters in regularized estimation is in general a difficult problem. For linear regression models, Chen and Chen (2008); Wang, Li and Leng (2009); Wang and Zhu (2011) studied how to consistently choose the tuning parameters when p = O(na) for some a > 0. Recently, Fan and Tang (2013) proposed a general information criterion (GIC) for choosing the tuning parameter for estimating the generalized linear model in ultra-high dimensional settings, p = O(exp(na)) for some a > 0. The GIC criterion adopts a novel penalty on the degree freedom of the model so that it consistently chooses the proper tuning parameter under mild conditions. Unfortunately, the Gaussian graphical model is different from the generalized linear model, and therefore the justification of GIC does not apply to our problem. We propose a tuning parameter selection method based on BIC and a re-fitted precision matrix on the restricted model.

Since X1(k)(1lnk) follows a multivariate Gaussian distribution Np(μ(k), Σ(k)), we have


where εl,i(k)~N(0,1/ωij(k)). The regression coefficients satisfy (Anderson (2003))


Based on these results, we propose a tuning parameter selection procedure:

  1. For a given λ, calculate the estimator Ω^(k). Based on the support of Ω^(k), use least squares and neighborhood selection to re-fit the precision matrix estimator Ω^2(k).
  2. Define Si(k)={j:ω^ij(k)0,ji}, the set of non-zero off-diagonal elements of the i-th row of Ω^(k).
  3. If card(Si(k))nk, let the i-th column of Ω^2(k) equal to the i-th column of Ω^j(k),Ω^2,i(k)=Ω^i(k). If card(Si(k))<nk, fit the regression model
    If Si(k) equals the true support S0,i(k)={j:ω0,ij(k)0,ji}, βij(k)=ω0,ij(k)/ω0,ii(k) and Var(εl,i(k))=1/ω0,ii(k). Suppose the solution to (5) is β^ij(k). Define
    After fitting Model (5), let ω^2,ii(k)=nk/l=1nk(εl,i(k))2, and ω^2,ij(k)=β^ij(k)ω^2,ii(k).
  4. Repeat Step 3 for i = 1,…, p and k = 1,…, K. The resulting matrices Ω^2(k), k = 1,…, K are not symmetric. We symmetrize Ω^2(k) by the same procedure:

We use Ω^3(k)=(ω^3,ij(k)), k = 1,…, K as the estimators corresponding to the tuning parameter λ. The optimal tuning parameter can be selected by Bayesian information criterion (BIC),


where sk=Card{(i,j):ω^i,j0,1i<jp}. We obtain the solution to our method over a wide range of tuning parameters and choose λ^n that minimizes BIC(λ).

Since the refitted estimator can potentially reduce some bias introduced in the optimization due to the penalty term, using it in BIC (6) improves the tuning parameter selection in the numerical studies. However, using Ω^3(k) as an estimator of (l) is not recommended because when Card(Si(k)) is large, the re-fitted estimator might lead to overfitting. Overfitting does not severely affect the tuning parameter selection, because BIC puts penalties on complicated models that are less likely to be chosen.


3.1 Estimation Error Bound

We investigate the properties of the proposed estimator by considering the convergence rates of Ω^(k)Ω(k), including estimation error bounds and graph structure recovery. We assume the following conditions:

  • (C1)
    There exists some constant a > 0, such that
  • (C2)
    sup1≤kK {λmax ((k))/λmin((k))} ≤ M0 for some bounded constant M0 > 0.
  • (C3)
    If eij = I(i = j), Zij(k)=(X(k)X(k)Ω(k))ijeij, Zij=(Zij(1),,Zij(K)) and the largest eigenvalue of Cov(Zij) is λmax,ij, supij λmax,ijM1.
  • (C4)
    n1 [union above intersection] n2 [union above intersection] (...) [union above intersection]nK [union above intersection] n/K.

Condition (C1) allows p to grow exponentially fast as n. It also allows the number of groups K to grow slowly with p and n. For example, when log p = O(nr) and K = O(nb) for r + 2b < 1 and 3b < r, (C1) holds. Condition (C3) allows X(k) to be dependent across groups. When X(k) are independent, Cov(Yij) = IK, and thus maxij λmax,ij = 1.

Let Mn=sup1kKmaxji=1p|ωij(k)|=sup1kKΩ(k)1 be the maximum matrix [ell]1 norms of the K matrices. A theorem establishes the convergence rate of the precision matrix estimates under the element-wise [ell] norm.

Theorem 1

Let λn = C0 (log p/n)1/2 for some constant C0>2M0+2. If (C1)–(C4) hold,


with a high probability converging to 1 and C1 = 2C0.

  • Remark 1: The value of C0 depends on M0. In practice, M0 is often unknown. However, we can use the tuning parameter selection method, such as BIC in (6), to choose λn. The details are discussed in Section 2.3.
  • Remark 2: Theorem 1 (and Theorems 2 and 3) does not require the true precision matrices (k) to have identical graphical structures. Both the values and locations of non-zero entries can differ across (k), k = 1,…, K.



Proposition 1

Let U be the set of estimators (Ω(1),,Ω(K)), where Ω(k) only depends on the k-th sample {Xj(k);1jnk}. Assume the samples are independent across K groups. Under (C4), there exists a constant α > 0, such that


for sufficiently large n.

Theorem 1 shows that the convergence rate of {k=1Kwk|(Ω(k)Ω(k))i,j|2}1/2 from our joint estimation method is less than or equal to C1 Mn (log K log p/n)1/2 with a probability tending to 1. When K is bounded, Proposition 1 shows that with a non-vanishing probability αK, the minimax convergence rate of any separate estimation method is at least αMn(K log p/n)1/2. For bounded but sufficiently large K, C1 (log K)1/2 ≤ αK1/2. Therefore, the convergence upper bound C1Mn(logKlogpn)1/2 in (7) for the joint estimation method is less than the convergence lower bound αMn(Klogpn)1/2 in Proposition 1 for the separate estimation method. In other words, compared to separate estimation methods, our joint estimation method leads to estimators with a faster convergence.

An additional thresholding step on the estimators with a careful chosen threshold leads to new estimators, that converge to the true precision matrices under the matrix operator norm. Define the thresholded estimator Ω(k)=(ωij(k)) as follows,


with C1 the constant defined in (7). Let Sj(k)={(i,j):ωij(k)0,i<j} and Sj=k=1kSj(k). Define s0(p)=max1jpCard(Sj) as the union sparsity.

Theorem 2

Suppose that (C1)–(C4) hold. Then with a high probability converging to 1,


The convergence rates of Ω(k) depend on the union sparsity level s0(p). When the precision matrices share the same graphical structure, s0(p)=max1jpCard(Sj(k)) for all k = 1,…, K. When the number of shared edges in the graphical structures increases, the union sparsity s0(p) decreases, and consequently Ω(k) converges to (k) faster.

Let a^ij=k=1Kwk(Ω(k)Ω(k))ij2. Matrix  = (âij)p×p measures the overall estimation errors among the entries of Ω(k) and (k) for k = 1,…, K.

Corollary 1

Suppose that the conditions in Theorem 2 hold. Then with a high probability converging to 1,


3.2 Graphical Structure Recovery

Theoretical analysis for graphical structure recovery is very complicated when the graphical structures of the precision matrices are different across the K groups since the results depend on the structures of the shared edges. Here we focus on the case in which the K precision matrices have a common support. Let Sk={(i,j):ωij(k)0} be the support for the k-th precision matrix, and the common support be S=kSk. When S1==SK, S=Sk, k = 1,…, K, by Theorem 1 we estimate S by


where C1 is a constant given in Theorem 1. Let


We have a result on support recovery.

Theorem 3

Suppose that the conditions in Theorem 1 hold. Assume that


Then S^=S with a high probability converging to 1.

When the graphical structures are the same across all K groups, the lower bound condition (9) is weaker than the lower bound condition needed for graphical structure recovery by separate estimation methods. Based on Proposition 1 and its proof, to fully recover the shared graphical structure by separate estimation methods, a necessary condition is


When K is sufficiently large, this condition is stronger than (9).


4.1 Data generation

We evaluated the numerical performance of the proposed method and other competitive methods, including the separate precision matrix estimation procedures proposed by Friedman, Hastie and Tibshirani (2008) and Cai, Liu and Luo (2011) and the joint estimation method proposed by Guo, Levina, Michailidis et al. (2011) and Danaher, Wang and Witten (2014). The separate precision matrix estimation methods were applied to each group, and therefore ignored the partial homogeneity in graphical structures among groups. In all numerical studies, we set p = 200, K = 3 and (n1, n2, n3) = (80, 120, 150). The simulated observations were generated in each group independently from a multivariate Gaussian distribution N{0, ((k))−1}, where (k) is the precision matrix in the k-th group. For each model, 100 replications were performed.

We present results for two different types of graphical models: the Erdös and Rényi (ER) model (Erdös and Rényi (1960)) and the Watts-Strogatz (WS) model (Watts and Strogatz (1998)). For the ER model, the graph contains p vertices and each pair of vertices are connected with a probability 0.05. For the WS model, first a ring lattice of p vertex is created; one vertex is connected with its neighbors within order distance of 15, and then the edges of the lattice are rewired uniformly and randomly with a probability 0.01. These graph models have several topological properties such as sparcity and a “small world” property often observed in true biological gene regulatory networks. See Fell and Wagner (2000); Jeong, Tombor, Albert et al. (2000); Vendrascolo, Dokholyan, Paci et al. (2002); Greene and Higman (2003).

Based on the ER model or the WS model, a common graph structure was generated. Let M be the number of edges in the common graph structure. Then, ⎿ρM⏌ random edges were added to the common graph structure to generate graph structures for each group, where the parameter ρ was the ratio between the number of group-specific edges and common edges. We considered ρ = 0, 1/4, and 1. The first setting represents the scenario where the precision matrices in all groups share the same graph structure. After the graph structure of each group was determined, the values of non-zero off-diagonal entries were generated independently as uniform in [−1, −0.5] [union or logical sum] [0.5, 1]. The diagonal values were assigned to a constant so that the condition number of each precision matrix was equal to p.

4.2 Simulation results

Each method was evaluated for a range of tuning parameters under each model. The optimal tuning parameter was chosen by BIC (6). Several measures are used to compare the performances of these estimators. The estimation error was evaluated in terms of average matrix [ell]1 norm, [ell]2 norm (spectral norm), and Frobenius norm:


The graph structure recovery results were evaluated by average sensitivity (SEN), specificity (SPE), and Matthews correlation coefficient (MCC). For a true precision matrix 0 = (ω0,ij) with support set S0={(i,j):ω0,ij0andij}, suppose its estimator Ω^ has the support set S^. Then the measures with respect to 0 and Ω^ are defined as follows:


Here, TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives and false negatives:


We compared Ω^(k) and Ω0(k) and report the average sensitivities (SEN), specificities (SPE), and Matthews correlation coefficient (MCC) among K groups.

The comparisons of the results for the four graphical models are shown in Tables 1 and and2.2. It shows that when ρ = 0, the true graph structures are the same across all three groups, joint estimation methods perform much better than the separate estimation methods. As ρ increases, the structures across different groups become more different, and the joint estimation methods gradually lose their advantages. Our method has the best performance in graph structure recovery among all the methods. Even when ρ = 1, it still performs significantly better than the separate estimation methods. Our method has the best performance in graph structure recovery and it achieves the highest Matthews correlation coefficient. Its estimation error measured under the matrix [ell]1, [ell]2, and the Frobenius norm are comparable to other joint estimation methods.

Table 1
Simulation results for data generated based on the Erdös and Rényi graph with different ratios of the number of individual-specific edges to the number of shared edges. Results are based on 100 replications. The numbers in the brackets ...
Table 2
Simulation results for data generated based on the Watt-Strogatz graph with different ratios of the number of individual-specific edges to the number of shared edges. Results are based on 100 replications. The numbers in the brackets are standard errors. ...

Since the tuning parameter selection may affect the performance of the methods, we plot in Figure 1 the receiver operating characteristic (ROC) curves averaged over 100 repetitions with the false positive rate controlled under 10%. The methods proposed by Danaher, Wang and Witten (2014) have two tuning parameters. For each sparsity tuning parameter, we first chose an optimal similarity tuning parameter from a grid of candidates by BIC and then plotted the ROC curves based on a sequence of sparsity tuning parameters and their corresponding optimal similarity tuning parameters. Because these methods involve choosing two tuning parameters, they are slower than our method in computation (The computation of FGL and GGL is based on the R package “JGL” contributed by Danaher (2013)). Figure 1 shows that our method consistently outperforms the other methods in graph structure recovery.

Figure 1
Receiver operator characteristic curves for graph structure recovery for the simulated Erdős and Rényi graphs (the first row), and the Watts-Strogatz graphs (the second row). The x-axis and y-axis of each panel are average false positive ...


Epithelial ovarian cancer is a molecularly diverse cancer that lacks effective personalized therapy. Tothill, Tinker, George et al. (2008) identified six molecular subtypes of ovarian cancer, labeled as C1–C6, where the C1 subtype was characterized by significant differential expressions of genes associated with stromal and immune cell types. The patients in C1 subtype group have shown to have a lower survival rate compared to the patients from other subtypes. The data set includes RNA expression data collected from n = 78 patients of C1 subtype and n = 113 patients from the other subtypes. We are interested to see how the wiring (conditional dependency) of the genes at the transcription levels differs among molecular subgroups of ovarian cancer. We focus on the apoptosis pathway from the KEGG database (Orgata, Goto, Sato et al. (1999); Kanehisa, Goto, Sato et al. (2012)) to see whether the genes related to this pathway (p = 87) are differentially wired between the C1 and other subtypes.

To stabilize the graph structure selection, we bootstraped the samples 100 times within each group. At each time, Iik was sampled uniformly taking values in i = {1,…, nk}, with k = 1, 2. Let xi(k)=xIik(k), where xIik(k) is the p-dimensional gene expression data for the Iik-th patient in the kth subtype group. The bootstrap sample is X(k)=(x1(k),,xnk(k)), with k = 1, 2. We then applied our proposed method and its competitors to each of the bootstrapped samples to obtain the estimators of the two precision matrices Ω^(k), k = 1, 2. The supports of the estimators were recorded so that Ω(k)=(I(ω^ij(k)0)). We then added Ω^(k) up for all bootstrap samples and got the final frequency of each edge being selected. Those edges that were selected in more than 50 of the 100 bootstrap samples were finally selected as important edges. This type of bootstrap aggregation methods has been commonly used in recovering the sparse graphical structures (Meinshausen and Bühlmann (2010); Li, Hsu, Peng et al. (2013)), which often leads to better selection stability for sparse precision matrix.

Table 3 lists the number of edges selected by the bootstrap aggregation of our proposed method and its competitors. The separate estimation methods (CLIME and GLASSO) resulted in graphs that share fewer edges in the precision matrices of the two cancer subtype groups. JEMGM resulted in most shared edges, followed by GGL and our method (MPE). Overall, FGL and GGL selected a lot more linked genes than other methods. Figure 2 shows the Gaussian graphs estimated by these six methods. FGL, GGL and MPE selected more unique edges among the gene expression levels for the C2–C6 subtype cancer than those for the C1 subtype. This suggests that the patients with poor prognostic subtype (C1) lack some important links among the Apoptosis genes.

Figure 2
Estimated Gaussian Graphs by the proposed method and its competitors. The dashed edges are links unique to the precision matrix estimator of the C1 subtype, the dotted edges are unique to that of other subtypes, and the solid edges are shared by both ...
Table 3
Number of edges selected by the proposed method and its competitors. “C1 unique” counts the number of edges that only appear in the precision matrix of the gene expression levels in C1 cancer subtype; “Other unique” counts ...

We further defined the nodes with degrees equal or larger than five based on the union of the estimated graphs of two subtypes as the hub nodes. FGL and GGL yielded estimators with most of the hub nodes completely unlinked in the estimated graph for C1 cancer subtype. The estimators by MPE had several edges between the hub nodes shared by both subtype groups, while also displaying some links unique to each group. The hub nodes identified by MPE were FASLG, CASP10, CSF2RB, IL1B, MYD88, NFKB1, NFKBIA, PIK3CA, IKBKG, and PIK3R5. Among these, CASP10, PIK3CA, IL1B, and NFKb1 have been implicated in ovarian cancer risk or progression. In particular, PIK3CA has been implicated as an oncogene in ovarian cancer (Shayesteh, Lu, Kuo et al. (1999)), indicating the importance of these hub genes in ovarian cancer progression.


It is of interest to discuss the connection and difference between the problem considered in this paper and the problem of estimating matrix graphical models (Leng and Tang (2012); Yin and Li (2012); Zhou (2014)). Matrix graphical models consider the random matrix variate X following the distribution MNp×q(M; U, V) with the probability density function (pdf)


Here M [set membership] R p×q is the mean matrix, U [set membership] R p×p is the row covariance matrix, and V [set membership] R q×q is the column covariance matrix. Thus, each row and each column of X share the same covariance matrices U and V. This distribution implies that vec(X) follows a vector multivariate Gaussian distribution Npq(vec(M), U [multiply sign in circle] V), where “vec” is the vectorization operator and “[multiply sign in circle]” is the Kronecker product.

Our model assumes X(k) follows the vector multivariate Gaussian distribution Np(μ, Σ(k)). If n = n1 =… = nK, X = (X(1),…, X(K)) [set membership] R p×K is a random matrix variate. However, each column of X has its own covariance matrix Σ(k), and each row of X may also have its own. Therefore, the covariance matrix of vec(X) cannot be expressed as the Kronecker product of two positive definite matrices. In general, the degree of freedom of Cov(vec(X)) is larger than that of U [multiply sign in circle] V mentioned above.

Supplementary Material



T. Tony Cai’s research was supported in part by NSF Grants DMS-1208982 and DMS-1403708, and NIH Grant R01 CA127334. Hongzhe Li’s research was supported in part by NHI Grants R01 CA127334 and R01 GM097505. Weidong Liu’s research was supported in part by NSFC, Grants No. 11201298, No. 11322107, the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning Shanghai Pujiang Program, Shanghai Shuguang Program and 973 Program (2015CBB56004). Jichun Xie’s research was supported in part by NIH Grant UL1 TR001117.

We thank the Editor, an associate editor and the reviewers for their insightful comments and suggestions that have helped us substantially improve the quality of the paper.

The research was supported in part by NSF FRG Grant DMS-0854973, NSF MRI Grant No. CNS-09-58854, NIH grants CA127334, GM097505. Weidong Liu’s research was also supported by NSFC Grant No.11201298 and No.11322107, the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning, Shanghai Pujiang Program, Foundation for the Author of National Excellent Doctoral Dissertation of PR China and Program for New Century Excellent Talents in University.



In the supplementary material, we provide additional simulation studies to compare the proposed methods and other competitive methods. We also include the proofs of the theorems.

Contributor Information

T. Tony Cai, Professor of Statistics, Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104.

Hongzhe Li, Professor of Biostatistics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104.

Weidong Liu, Professor, Department of Mathematics, Institute of Natural Sciences and MOE-LSC, Shanghai Jiao Tong University, Shanghai, China.

Jichun Xie, Assistant Professor, Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC 27707.


  • Anderson TW. An introduction to multivariate statistical analysis. Wiley-Interscience; 2003.
  • Banerjee O, Ghaoui LE, d’Aspremont A. Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. J Machine Learning Research. 2008;9:485–516.
  • Cai T, Liu W, Luo X. A constrained [ell]1 minimization approach to sparse precision matrix estimation. Journal of American Statistical Association. 2011;106:594–607.
  • Candés E, Tao T. The dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics. 2007;35:2313–2351.
  • Chen J, Chen Z. Extended bayesian information criterion for model selection with large model space. Biometrika. 2008;95:232–253.
  • Dahl J, Vandenberghe L, Roychowdhury V. Covariance selection for non-chordal graphs via chordal embedding. Optimization Methods and Software. 2008;23:501–420.
  • Danaher P. JGL: Performs the Joint Graphical Lasso for sparse inverse covariance estimation on multiple classes. R package version 2.3 2013
  • Danaher P, Wang P, Witten D. The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society, Series B. 2014:76–2. [PMC free article] [PubMed]
  • Donoho D, Elad M, Temlyakov V. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory. 2006;52:6–18.
  • Erdös P, Rényi A. On the evolution of random graphs. Publications of the Mathematical Institute of the Hungarian Academy of Sciences. 1960;5:17–61.
  • Fan Y, Tang C. Tuning parameter selection in high dimensional penalized likelihood. Journal of the Royal Statistical Society. 2013;75:531–552.
  • Fell D, Wagner A. The small world of metabolism. Nature Biotechnology. 2000:18–11. [PubMed]
  • Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9:432–441. [PMC free article] [PubMed]
  • Greene L, Higman V. Uncovering network systems within protein structures. Journal of Molecular Biology. 2003;334:781–791. [PubMed]
  • Guo J, Levina E, Michailidis G, Zhu J. Joint estimation of mulitple graphical models. Biometrika. 2011;98:1–15. [PMC free article] [PubMed]
  • Jeong H, Tombor B, Albert R, Oltvai Z, Barabási A. The large-scale organization of metabolic networks. Nature. 2000:407–6804. [PubMed]
  • Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Research. 2012;40:D109–D114. [PMC free article] [PubMed]
  • Lauritzen SL. Graphical Models. Clarendon Press; Oxford: 1996.
  • Leng C, Tang C. Sparse matrix graphical models. Journal of American Statistical Association. 2012:107–499.
  • Li H, Gui J. Gradient directed regularization for sparse gaussian concentration graphs, with applications to inference of genetic networks. Biostatistics. 2006;7:302–317. [PubMed]
  • Li S, Hsu L, Peng J, Wang P. Bootstrap inference for network construction with an application to a breast cancer microarray study. The Annals of Applied Statistics. 2013:7–1. [PMC free article] [PubMed]
  • Liang F, He G, Hu Y. A new smoothing Newton-type method for second-order cone programming problems. Applied Mathematics and Compuation. 2009;215:1020–1029.
  • Meinshausen N, Bühlmann P. High-dimensional graphs and variable selection with the lasso. Annals of Statistics. 2006;34:1436–1462.
  • Meinshausen N, Bühlmann P. Stability selection. Journal of Royal Statistics Society, Series B. 2010;72:417–473.
  • Newsterov Y, Todd M. Primal-dual interior-point methods for self-scaled cones. SIAM Journal on Optimization. 1998;8:324–364.
  • Orgata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research. 1999;27:29–34. [PMC free article] [PubMed]
  • Schäfer J, Strimmer K. An empirical bayes approach to inferring large-scale gene association networks. Bioinformatics. 2005;21:754–764. [PubMed]
  • Shayesteh L, Lu Y, Kuo W, Baldocchi R, Godfrey T, Collins C, Pinkel D, Powell B, Mills G, Gray J. Pik3ca is implicated as an oncogene in ovarian cancer. Nature Genetics. 1999;21:99–102. [PubMed]
  • Tothill R, Tinker A, George J, Brown R, Fox S, Lade S, Johnson D, Trivett J, Etemadmoghadam D, Locandro B, Traficante N, Fereday S, Hung J, Chiew Y, Haviv I, Group A. O. C. S. Gertig D, deFazio A, Bowtell D. Novel molecular subtypes of serous and endometrioid ovarian cancer linked to clinical outcome. Clinical Cancer Research. 2008;14:5198–5208. [PubMed]
  • Vendrascolo M, Dokholyan N, Paci E, Karplus M. Small-world view of the amino acids that play a key role in protein folding. Physical Review E. 2002;65 [PubMed]
  • Wang H, Li B, Leng C. Shrinkage tuning parameter selection with a diverging number of parameters. Journal of Royal Statistical Society, Series B. 2009;71:671–683.
  • Wang T, Zhu L. Consistent tuning parameter selection in high dimensional sparse linear regression. Journal of Multivariate Analysis. 2011;102:1141–1151.
  • Watts D, Strogatz S. Collective dynamics of “small world” networks. Nature. 1998;393:440–442. [PubMed]
  • Whittaker J. Graphical Models in Applied Multivariate Analysis. Wiley; 1990.
  • Yin J, Li H. Model selection and estimation in the matrix normal graphical model. Journal of Multivariate Analysis. 2012;107:119–140. [PMC free article] [PubMed]
  • Yuan M. Sparse inverse covariance matrix estimation via linear programming. Journal of Machine Learning Research. 2010;11:2261–2286.
  • Yuan M, Lin Y. Model selection and estimation in the gaussian graphical model. Biometrika. 2007;94:19–35.
  • Zhou S. Gemini: Graph estimation with matrix variate normal instances. Annals of Statistics. 2014;42:532–562.