Stat Sin. Author manuscript; available in PMC 2017 March 15.

Published in final edited form as:

Stat Sin. 2016 April; 26(2): 445–464.

doi: 10.5705/ss.2014.256

PMCID: PMC5351783

NIHMSID: NIHMS852667

T. Tony Cai, Professor of Statistics, Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104;

Motivated by the analysis of gene expression data measured in different tissues or disease states, we consider joint estimation of multiple precision matrices to effectively utilize the partially shared graphical structures of the corresponding graphs. The procedure is based on weighted constrained ℓ_{∞}/ℓ_{1} minimization, which can be effectively implemented by second-order cone programming. Compared to separate estimation methods, the proposed joint estimation method leads to estimators that converge to the true precision matrices faster. Under certain regularity conditions, the proposed procedure achieves exact graph structure recovery with probability tending to 1. Simulation studies show that the proposed joint estimation methods outperform other methods in graph structure recovery. The method is illustrated through an analysis of ovarian cancer gene expression data. The results indicate that patients with the poor prognostic subtype lack some important links among the genes in the apoptosis pathway.

Gaussian graphical models provide a natural tool for modeling the conditional independence relationships among a set of random variables (Lauritzen (1996); Whittaker (1990)). They have been successfully applied to infer relationships between genes at the transcriptional level (Schäfer and Strimmer (2005); Li and Gui (2006); Li, Hsu, Peng et al. (2013)). Gaussian graphical models are tightly linked to precision matrices. Suppose **X** = (

Many methods for estimating a high-dimensional precision matrix or its Gaussian graphical model have been developed in the past decade. Meinshausen and Bühlmann (2006) introduced a neighborhood selection approach that regresses each variable on all the others with an ℓ_{1} penalty. The method consistently estimates the set of non-zero elements of the precision matrix. Efficient algorithms for exact maximization of the ℓ_{1}-penalized log-likelihood have also been proposed. Yuan and Lin (2007), Banerjee, Ghaoui and d’Aspremont (2008) and Dahl, Vandenberghe and Roychowdhury (2008) adopted interior point optimization methods to solve this problem. Based on the work of Banerjee, Ghaoui and d’Aspremont (2008) and a block-wise coordinate descent algorithm, Friedman, Hastie and Tibshirani (2008) developed the graphical Lasso (GLASSO) for sparse precision matrix estimation; it is computationally efficient even when the dimension is greater than the sample size. Yuan (2010) developed a linear programming procedure and obtained oracle inequalities for the estimation error in terms of the matrix operator norm. Cai, Liu and Luo (2011) developed a constrained ℓ_{1} minimization approach (CLIME) to estimate a sparse precision matrix. All of these methods address the problem of estimating a single precision matrix or a single Gaussian graphical model.

In many applications, the problem of estimating multiple precision matrices arises when data are collected among multiple groups. For example, gene expression levels are often measured over multiple groups (tissues, environments, or subpopulations). Their precision matrices and the corresponding graphical structures imply gene regulatory mechanisms and are of great biological interest. Since the gene regulatory networks in different groups are often similar to each other, the graphical structures share many common edges. Estimating a single precision matrix group by group ignores the partial homogeneity in their graphical structures, which often leads to low power. To effectively utilize the shared graphical structures and to increase the estimation precision, it is important to estimate multiple precision matrices jointly.

Previous attempts to jointly estimate multiple precision matrices include Guo, Levina, Michailidis et al. (2011) and Danaher, Wang and Witten (2014). Guo, Levina, Michailidis et al. (2011) proposed a hierarchical penalized model to preserve the common graphical structure while allowing differences across groups. Their method achieves Frobenius norm convergence when *p* log(*p*)*/n* goes to zero, where *p* is the number of variables and *n* is the total sample size. Unfortunately, in genomic applications the number of genes often exceeds the total sample size, which invalidates the theoretical justification in Guo, Levina, Michailidis et al. (2011). Danaher, Wang and Witten (2014) proposed two joint graphical lasso algorithms (FGL and GGL) to estimate precision matrices that share common edges. Their approach maximizes a penalized log-likelihood with a fused Lasso or group Lasso penalty. The paper did not provide theoretical justification for the statistical convergence rates of their estimators.

In this paper, we propose a weighted constrained ℓ_{∞}/ℓ_{1} minimization method to jointly estimate *K* sparse precision matrices (MPE). Different from Guo, Levina, Michailidis et al. (2011) and Danaher, Wang and Witten (2014), the proposed estimators converge to the true precision matrices even when *p* = *O*{exp(*n ^{a}*)}, for some 0 <

The rest of the paper is organized as follows. Section 2 presents the estimation method and the optimization algorithm. Theoretical properties of the proposed method and accuracy of the graph structure recovery are studied in Section 3. Section 4 investigates the numerical performance of the method through a simulation study. The proposed method is compared with other competing methods. The method is also illustrated by an analysis of an epithelial ovarian cancer gene expression study in Section 5. A brief discussion is given in Section 6 and proofs are presented in the Appendix.

The following notation is used throughout the paper. For a vector **a** = (*a*_{1}*,…, a_{p}*)

We introduce a method to jointly estimate *K* precision matrices with partial homogeneity in their graphical structures. The method uses a constrained ℓ_{1} minimization approach, which has been successfully applied to high-dimensional regression problems (Donoho, Elad and Temlyakov (2006); Candès and Tao (2007)) and to single precision matrix estimation (Cai, Liu and Luo (2011)) to recover a sparse vector or matrix.

For 1 *≤ k ≤ K*, let *X*^{(}^{k}^{)}
*~ N*(*μ*^{(}^{k}^{)}, **Σ**^{(}^{k}^{)}) be a *p*-dimensional random vector. The precision matrix of *X*^{(}^{k}^{)}, denoted by
${\mathbf{\Omega}}^{(k)}=({\omega}_{ij}^{(k)})$, is the inverse of the covariance matrix **Σ**^{(}^{k}^{)}. Suppose there are *n _{k}* independently and identically distributed random samples of
${\mathit{X}}^{(k)}:\{{\mathit{X}}_{j}^{(k)},1\le j\le {n}_{k}\}$. The sample covariance matrix for the

$${\widehat{\sum}}^{(k)}=\frac{1}{{n}_{k}}{\displaystyle \sum _{j=1}^{{n}_{k}}({\mathit{X}}_{j}^{(k)}-{\overline{\mathit{X}}}^{(k)})({\mathit{X}}_{j}^{(k)}-{\overline{\mathit{X}}}^{(k)})\prime},$$

where
${\overline{\mathit{X}}}^{(k)}={\displaystyle {\sum}_{j=1}^{{n}_{k}}}{\mathit{X}}_{j}^{(k)}/{n}_{k}$. Denote by *n* = Σ* _{k} n_{k}* the total sample size and
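The per-group sample covariance matrix above (with the 1/*n _{k}* normalization) can be computed directly in NumPy. This is a sketch; the function name and data layout are ours, not the paper's:

```python
import numpy as np

def sample_cov(X):
    """Sample covariance with 1/n_k normalization for an (n_k, p) data matrix."""
    Xc = X - X.mean(axis=0)          # center each variable
    return Xc.T @ Xc / X.shape[0]    # divide by n_k, not n_k - 1

rng = np.random.default_rng(0)
X = rng.standard_normal((150, 4))    # n_k = 150 observations of p = 4 variables
S = sample_cov(X)
```

Note the biased (1/*n _{k}*) normalization, matching the display above rather than the unbiased 1/(*n _{k}* − 1) convention.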

$$\begin{array}{l}\underset{{\mathbf{\Omega}}_{1}^{(k)}\in {\mathrm{\mathbb{R}}}^{p\times p},1\le k\le K}{min}\left(\underset{1\le k\le K}{max}{|{\mathbf{\Omega}}_{1}^{(k)}|}_{1}\right),\\ \text{subject to}\phantom{\rule{0.5em}{0ex}}\underset{i,j}{max}{\left\{{\displaystyle \sum _{k=1}^{K}{w}_{k}}{|{({\widehat{\sum}}^{(k)}{\mathbf{\Omega}}_{1}^{(k)}-\mathbf{I})}_{ij}|}^{2}\right\}}^{1/2}\le {\mathrm{\lambda}}_{n},\end{array}$$

(1)

where *λ _{n}* =
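The side constraint in (1) is an entry-wise weighted group norm of the residual matrices Σ̂^{(}^{k}^{)}**Ω**^{(}^{k}^{)} − **I**. A small sketch that evaluates this quantity for candidate matrices (the function name and interface are ours):

```python
import numpy as np

def constraint_value(Sig_hats, Omegas, w):
    """Left-hand side of the constraint in (1):
    max_{i,j} { sum_k w_k |(Sig_hat^(k) Omega^(k) - I)_{ij}|^2 }^{1/2}."""
    p = Sig_hats[0].shape[0]
    R = np.zeros((p, p))
    for wk, S, O in zip(w, Sig_hats, Omegas):
        R += wk * (S @ O - np.eye(p)) ** 2   # element-wise squared residuals
    return np.sqrt(R).max()
```

Feasibility of a candidate (**Ω**_{1}^{(1)}, …, **Ω**_{1}^{(}^{K}^{)}) then amounts to checking `constraint_value(...) <= lam`.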

Denote by ${\widehat{\mathbf{\Omega}}}_{1}^{(k)}(1\le k\le K)$ the solutions to (1). They are not symmetric in general. To obtain a symmetric solution, the estimator ${\widehat{\mathbf{\Omega}}}^{(k)}=({\widehat{\omega}}_{ij}^{(k)})$ is constructed by comparing ${\widehat{\omega}}_{1ij}^{(k)}$ and ${\widehat{\omega}}_{1ji}^{(k)}$ and assigning the one with the smaller magnitude to both entries,

$${\widehat{\omega}}_{ij}^{(k)}={\widehat{\omega}}_{ji}^{(k)}:={\widehat{\omega}}_{1ij}^{(k)}I(|{\widehat{\omega}}_{1ij}^{(k)}|\le |{\widehat{\omega}}_{1ji}^{(k)}|)+{\widehat{\omega}}_{1ji}^{(k)}I(|{\widehat{\omega}}_{1ij}^{(k)}|>|{\widehat{\omega}}_{1ji}^{(k)}|).$$

This symmetrizing procedure is not ad hoc. The procedure ensures that the final estimator
${\widehat{\mathbf{\Omega}}}^{(k)}$ achieves the same entry-wise ℓ_{∞} estimation error as
${\widehat{\mathbf{\Omega}}}_{1}^{(k)}$. The details are discussed in Section 3.
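The symmetrization rule above picks, for each pair of entries (*i, j*) and (*j, i*), the value of smaller magnitude. A minimal NumPy sketch (the helper name is ours):

```python
import numpy as np

def symmetrize_min(A):
    """At entries (i,j) and (j,i), keep the value with the smaller magnitude."""
    return np.where(np.abs(A) <= np.abs(A.T), A, A.T)

A = np.array([[ 1.0,  0.3],
              [-2.0,  1.0]])
# symmetrize_min(A) keeps 0.3 at both off-diagonal entries, since |0.3| < |-2.0|
```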

The convex optimization problem (1) involves estimating *K* *p* × *p* precision matrices. To reduce the computational complexity, it can be decomposed into *p* sub-problems, each estimating *K* sparse *p* × 1 vectors:

$$\begin{array}{l}\underset{{\mathit{\beta}}_{j}^{(k)}\in {\mathrm{\mathbb{R}}}^{p},1\le k\le K}{min}\left(\underset{k}{max}{|{\mathit{\beta}}_{j}^{(k)}|}_{1}\right),\\ \text{subject to}\phantom{\rule{0.5em}{0ex}}\underset{i}{max}{\left\{{\displaystyle \sum _{k=1}^{K}{w}_{k}}{|{({\widehat{\sum}}^{(k)}{\mathit{\beta}}_{j}^{(k)}-{e}_{j})}_{i}|}^{2}\right\}}^{1/2}\le {\mathrm{\lambda}}_{n}\end{array}$$

(2)

for 1 *≤ j ≤ p*, where *e _{j}*

Suppose
${\widehat{\mathbf{\Omega}}}_{1}^{(k)}$ is the solution to (1) and
${\widehat{B}}^{(k)}:=({\widehat{\beta}}_{1}^{(k)},\dots ,{\widehat{\beta}}_{p}^{(k)})$, where
${\widehat{\beta}}_{j}^{(k)}$ is the solution to (2). Then
${\widehat{\mathbf{\Omega}}}_{1}^{(k)}={\widehat{\mathbf{B}}}^{(k)}$
*for* 1 *≤ k ≤ K.*

Problem (2) can be solved by second-order cone programming. Existing packages that solve (2) include SDPT3 and SeDuMi in Matlab, and the CLSOCP package in R. CLSOCP uses the one-step smoothing Newton method of Liang, He and Hu (2009). This algorithm has good precision but is relatively slow for high-dimensional problems. SeDuMi and SDPT3 adopt the primal-dual infeasible-interior-point algorithm (Nesterov and Todd (1998)). The most time-consuming part of the algorithm is solving the Schur complement equation, which involves a Cholesky factorization. The sparsity and the size of the Schur complement matrix are the two factors that affect efficiency. SDPT3 is able to divide a high-dimensional optimization problem into sparse blocks and uses a sparse solver for the Cholesky factorizations. It is therefore faster than SeDuMi in solving (2). In this paper, we used the SDPT3 package for all the computations. For a problem with *p* = 200, *n _{k}* = 150 and

Choosing the tuning parameters in regularized estimation is in general a difficult problem. For linear regression models, Chen and Chen (2008), Wang, Li and Leng (2009), and Wang and Zhu (2011) studied how to consistently choose the tuning parameters when *p* = *O*(*n ^{a}*) for some

Since
${\mathit{X}}_{l}^{(k)}(1\le l\le {n}_{k})$ follows a multivariate Gaussian distribution *N _{p}*(

$${X}_{l,i}^{(k)}={\alpha}_{i}^{(k)}+{\mathit{X}}_{l,-i}^{{(k)}^{\prime}}{\mathit{\beta}}_{i}^{(k)}+{\epsilon}_{l,i}^{(k)},$$

(3)

where ${\epsilon}_{l,i}^{(k)}\sim N(0,1/{\omega}_{ii}^{(k)})$. The regression coefficients satisfy (Anderson (2003))

$${\alpha}_{i}^{(k)}={\mu}_{i}^{(k)}-{\sum}_{i,-i}^{(k)}{({\sum}_{-i,-i}^{(k)})}^{-1}{\mathit{\mu}}_{-i}^{(k)},\phantom{\rule{0.5em}{0ex}}{\mathit{\beta}}_{i}^{(k)}=-{({\omega}_{ii}^{(k)})}^{-1}{\mathbf{\Omega}}_{-i,i}^{(k)}.$$

(4)
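Relation (4) can be verified numerically: inverting a precision matrix and computing the population regression coefficients of *X _{i}* on *X*_{−i} recovers −**Ω**_{−i,i}/*ω _{ii}*. A self-contained check (all variable names are ours):

```python
import numpy as np

# Numerical check of (4): in a Gaussian graphical model, the coefficients from
# regressing X_i on the remaining variables equal -Omega_{-i,i} / omega_ii.
rng = np.random.default_rng(1)
p = 5
A = rng.standard_normal((p, p))
Omega = A @ A.T + p * np.eye(p)      # a well-conditioned precision matrix
Sigma = np.linalg.inv(Omega)

i = 2
mask = np.arange(p) != i
# population least-squares coefficients: Sigma_{-i,-i}^{-1} Sigma_{-i,i}
beta_reg = np.linalg.solve(Sigma[np.ix_(mask, mask)], Sigma[mask, i])
# coefficients implied by the precision matrix, as in (4)
beta_prec = -Omega[mask, i] / Omega[i, i]
```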

Based on these results, we propose a tuning parameter selection procedure:

- Step 1: For a given *λ*, calculate the estimator ${\widehat{\mathbf{\Omega}}}^{(k)}$. Based on the support of ${\widehat{\mathbf{\Omega}}}^{(k)}$, use least squares and neighborhood selection to re-fit the precision matrix estimator ${\widehat{\mathbf{\Omega}}}_{2}^{(k)}$.
- Step 2: Define ${\mathcal{S}}_{i}^{(k)}=\{j:{\widehat{\omega}}_{ij}^{(k)}\ne 0,j\ne i\}$, the set of non-zero off-diagonal elements of the *i*-th row of ${\widehat{\mathbf{\Omega}}}^{(k)}$.
- Step 3: If $\mathsf{card}({\mathcal{S}}_{i}^{(k)})\ge {n}_{k}$, let the *i*-th column of ${\widehat{\mathbf{\Omega}}}_{2}^{(k)}$ equal the *i*-th column of ${\widehat{\mathbf{\Omega}}}^{(k)}$, that is, ${\widehat{\mathbf{\Omega}}}_{2,\cdot i}^{(k)}={\widehat{\mathbf{\Omega}}}_{\cdot i}^{(k)}$. If $\mathsf{card}({\mathcal{S}}_{i}^{(k)})<{n}_{k}$, fit the regression model

$${X}_{l,i}^{(k)}={\alpha}_{i}^{(k)}+{\displaystyle \sum _{j\in {\mathcal{S}}_{i}^{(k)}}{X}_{l,j}^{(k)}{\beta}_{ij}^{(k)}+}{\epsilon}_{l,i}^{(k)}.$$

(5)

If ${\mathcal{S}}_{i}^{(k)}$ equals the true support ${\mathcal{S}}_{0,i}^{(k)}=\{j:{\omega}_{0,ij}^{(k)}\ne 0,j\ne i\}$, then ${\beta}_{ij}^{(k)}=-{\omega}_{0,ij}^{(k)}/{\omega}_{0,ii}^{(k)}$ and $\mathsf{Var}({\epsilon}_{l,i}^{(k)})=1/{\omega}_{0,ii}^{(k)}$. Suppose the solution to (5) is ${\widehat{\beta}}_{ij}^{(k)}$. Define the residuals

$${\widehat{\epsilon}}_{l,i}^{(k)}=({X}_{l,i}^{(k)}-{\overline{X}}_{i}^{(k)})-{\displaystyle \sum _{j\in {\mathcal{S}}_{i}^{(k)}}({X}_{l,j}^{(k)}-{\overline{X}}_{j}^{(k)})}{\widehat{\beta}}_{ij}^{(k)}.$$

After fitting model (5), let ${\widehat{\omega}}_{2,ii}^{(k)}={n}_{k}/{\sum}_{l=1}^{{n}_{k}}{({\widehat{\epsilon}}_{l,i}^{(k)})}^{2}$ and ${\widehat{\omega}}_{2,ij}^{(k)}=-{\widehat{\beta}}_{ij}^{(k)}{\widehat{\omega}}_{2,ii}^{(k)}$.
- Step 4: Repeat Step 3 for *i* = 1*,…, p* and *k* = 1*,…, K*. The resulting matrices ${\widehat{\mathbf{\Omega}}}_{2}^{(k)}$, *k* = 1*,…, K*, are not symmetric. We symmetrize ${\widehat{\mathbf{\Omega}}}_{2}^{(k)}$ by the same procedure:

$${\widehat{\omega}}_{3,ij}^{(k)}={\widehat{\omega}}_{3,ji}^{(k)}:={\widehat{\omega}}_{2,ij}^{(k)}I(|{\widehat{\omega}}_{2,ij}^{(k)}|\le |{\widehat{\omega}}_{2,ji}^{(k)}|)+{\widehat{\omega}}_{2,ji}^{(k)}I(|{\widehat{\omega}}_{2,ij}^{(k)}|>|{\widehat{\omega}}_{2,ji}^{(k)}|).$$

We use
${\widehat{\mathbf{\Omega}}}_{3}^{(k)}=({\widehat{\omega}}_{3,ij}^{(k)})$, *k* = 1*,…, K* as the estimators corresponding to the tuning parameter *λ*. The optimal tuning parameter can be selected by Bayesian information criterion (BIC),

$$\text{BIC}(\mathrm{\lambda})={\displaystyle \sum _{k=1}^{K}\left\{{n}_{k}\text{tr}\left({\widehat{\sum}}^{(k)}{\widehat{\mathbf{\Omega}}}_{3}^{(k)}\right)-{n}_{k}\mathrm{log}(\mathrm{det}{\widehat{\mathbf{\Omega}}}_{3}^{(k)})+\mathrm{log}({n}_{k}){s}_{k}\right\},}$$

(6)

where ${s}_{k}=\mathsf{Card}\{(i,j):{\widehat{\omega}}_{3,ij}^{(k)}\ne 0,1\le i<j\le p\}$. We compute the solution to our method over a wide range of tuning parameters and choose the ${\widehat{\mathrm{\lambda}}}_{n}$ that minimizes BIC(*λ*).
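Criterion (6) is straightforward to compute once the refitted estimators are available. A sketch, assuming the per-group sample covariances and estimators are supplied as lists (the function name and interface are ours):

```python
import numpy as np

def bic(Sig_hats, Omegas, n_ks):
    """BIC(lambda) from (6):
    sum_k { n_k tr(Sig_hat^(k) Omega^(k)) - n_k log det Omega^(k) + log(n_k) s_k }."""
    total = 0.0
    for S, O, nk in zip(Sig_hats, Omegas, n_ks):
        _, logdet = np.linalg.slogdet(O)              # stable log-determinant
        s_k = np.count_nonzero(np.triu(O, k=1))       # non-zero upper off-diagonal entries
        total += nk * np.trace(S @ O) - nk * logdet + np.log(nk) * s_k
    return total
```

In practice one evaluates `bic` on the grid of candidate *λ* values and keeps the minimizer.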

Since the refitted estimator can reduce some of the bias introduced by the penalty term in the optimization, using it in BIC (6) improves the tuning parameter selection in our numerical studies. However, using
${\widehat{\mathbf{\Omega}}}_{3}^{(k)}$ as an estimator of **Ω**^{(}^{k}^{)} is not recommended, because when
$\mathsf{Card}({\mathcal{S}}_{i}^{(k)})$ is large, the re-fitted estimator might overfit. Overfitting does not severely affect the tuning parameter selection, because BIC penalizes complicated models, making them less likely to be chosen.

We investigate the properties of the proposed estimator by considering the convergence rates of ${\widehat{\mathbf{\Omega}}}^{(k)}-{\mathbf{\Omega}}^{(k)}$, including estimation error bounds and graph structure recovery. We assume the following conditions:

- (C1) There exists some constant *a >* 0 such that

$$\mathrm{log}p=o\left(\frac{n}{{K}^{2a}{(\mathrm{log}n)}^{2}}\right),\phantom{\rule{0.5em}{0ex}}\text{and}\phantom{\rule{0.5em}{0ex}}max(K,{K}^{4-a}\mathrm{log}K)=o(\mathrm{log}p).$$

- (C2) sup_{1≤k≤K}{*λ*_{max}(**Ω**^{(}^{k}^{)})/*λ*_{min}(**Ω**^{(}^{k}^{)})} *≤ M*_{0} for some bounded constant *M*_{0} *>* 0.
- (C3) Let *e _{ij}* = *I*(*i* = *j*), ${Z}_{ij}^{(k)}={({\mathit{X}}^{(k)}{\mathit{X}}^{(k)\prime}{\mathbf{\Omega}}^{(k)})}_{ij}-{e}_{ij}$ and ${Z}_{ij}={({Z}_{ij}^{(1)},\dots ,{Z}_{ij}^{(K)})}^{\prime}$. If the largest eigenvalue of `Cov`(*Z _{ij}*) is *λ*_{max,ij}, then sup_{ij} *λ*_{max,ij} ≤ *M*_{1}.
- (C4) *n*_{1} ≍ *n*_{2} ≍ ⋯ ≍ *n _{K}* ≍ *n/K*.

Condition (C1) allows *p* to grow exponentially fast with *n*. It also allows the number of groups *K* to grow slowly with *p* and *n*. For example, when log *p* = *O*(*n ^{r}*) and

Let
${M}_{n}={\mathrm{sup}}_{1\le k\le K}{max}_{j}{\sum}_{i=1}^{p}|{\omega}_{ij}^{(k)}|={\text{sup}}_{1\le k\le K}{\Vert {\mathbf{\Omega}}^{(k)}\Vert}_{1}$ be the maximum matrix ℓ_{1} norm of the *K* matrices. The following theorem establishes the convergence rate of the precision matrix estimators under the element-wise ℓ_{∞} norm.

Let λ* _{n}* =

$$\underset{i,j}{\mathrm{sup}}{\left\{{\displaystyle \sum _{k=1}^{K}{w}_{k}}{|{({\widehat{\mathbf{\Omega}}}^{(k)}-{\mathbf{\Omega}}^{(k)})}_{i,j}|}^{2}\right\}}^{1/2}\le {C}_{1}{M}_{n}{\left(\frac{\mathrm{log}K\cdot \mathrm{log}p}{n}\right)}^{1/2}$$

(7)

with probability converging to 1, where *C*_{1} = 2*C*_{0}.

- Remark 1: The value of
*C*_{0}depends on*M*_{0}. In practice,*M*_{0}is often unknown. However, we can use the tuning parameter selection method, such as BIC in (6), to choose*λ*. The details are discussed in Section 2.3._{n} - Remark 2: Theorem 1 (and Theorems 2 and 3) does not require the true precision matrices
**Ω**^{(}^{k}^{)}to have identical graphical structures. Both the values and locations of non-zero entries can differ across**Ω**^{(}^{k}^{)},*k*= 1*,…, K*.

Define

$$\begin{array}{l}\mathcal{U}(M)=\{({\mathbf{\Omega}}^{(1)},\dots ,{\mathbf{\Omega}}^{(K)}):{\mathrm{\lambda}}_{max}({\mathbf{\Omega}}^{(k)})/{\mathrm{\lambda}}_{min}({\mathbf{\Omega}}^{(k)})\le M,\\ \phantom{\rule{10em}{0ex}}{\Vert {\mathbf{\Omega}}^{(k)}\Vert}_{1}\le {M}_{n}=o\{{(n/\mathrm{log}p)}^{1/2}\},k=1,\dots ,K\}.\end{array}$$

Let
$\stackrel{\sim}{\mathcal{U}}$ be the set of estimators
$({\stackrel{\sim}{\mathbf{\Omega}}}^{(1)},\dots ,{\stackrel{\sim}{\mathbf{\Omega}}}^{(K)})$, where
${\stackrel{\sim}{\mathbf{\Omega}}}^{(k)}$ only depends on the k-th sample
$\left\{{\mathit{X}}_{j}^{(k)};1\le j\le {n}_{k}\right\}$. Assume the samples are independent across the *K* groups. Under (*C4*), there exists a constant *α* > 0 such that

$$\underset{i,j}{\mathrm{inf}}\underset{({\stackrel{\sim}{\mathbf{\Omega}}}^{(1)},\dots ,{\stackrel{\sim}{\mathbf{\Omega}}}^{(K)})\in \stackrel{\sim}{\mathcal{U}}}{\mathrm{inf}}\underset{({\mathbf{\Omega}}^{(1)},\dots ,{\mathbf{\Omega}}^{(K)})\in \mathcal{U}(M)}{\mathrm{sup}}P\left[{\left\{{\displaystyle \sum _{k=1}^{K}{w}_{k}{|{({\stackrel{\sim}{\mathbf{\Omega}}}^{(k)}-{\mathbf{\Omega}}^{(k)})}_{i,j}|}^{2}}\right\}}^{1/2}\ge \alpha {M}_{n}{\left(\frac{K\cdot \mathrm{log}p}{n}\right)}^{1/2}\right]\ge {\alpha}^{K},$$

for sufficiently large n.

Theorem 1 shows that the convergence rate of
${\left\{{\sum}_{k=1}^{K}{w}_{k}{|{({\stackrel{\sim}{\mathbf{\Omega}}}^{(k)}-{\mathbf{\Omega}}^{(k)})}_{i,j}|}^{2}\right\}}^{1/2}$ from our joint estimation method is less than or equal to *C*_{1}
*M _{n}* (log

An additional thresholding step with a carefully chosen threshold leads to new estimators that converge to the true precision matrices under the matrix operator norm. Define the thresholded estimator ${\stackrel{\mathbf{\u2323}}{\mathbf{\Omega}}}^{(k)}=({\stackrel{\u2323}{\omega}}_{ij}^{(k)})$ as follows,

$${\stackrel{\u2323}{\omega}}_{ij}^{(k)}={\widehat{\omega}}_{ij}^{(k)}I\left\{{\left({\displaystyle \sum _{k=1}^{K}{w}_{k}{({\widehat{\omega}}_{ij}^{(k)})}^{2}}\right)}^{1/2}>{C}_{1}{M}_{n}{\left(\frac{\mathrm{log}K\cdot \mathrm{log}p}{n}\right)}^{1/2}\right\}$$

with *C*_{1} the constant defined in (7). Let
${\mathcal{S}}_{j}^{(k)}=\{(i,j):{\omega}_{ij}^{(k)}\ne 0,i<j\}$ and
${\mathcal{S}}_{j}={\cup}_{k=1}^{K}{\mathcal{S}}_{j}^{(k)}$. Define
${s}_{0}(p)={max}_{1\le j\le p}\mathsf{Card}({\mathcal{S}}_{j})$ as the union sparsity.
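The group thresholding rule above zeroes an entry in every group when its weighted ℓ_{2} norm across groups falls at or below the threshold. A sketch (the function name and interface are ours):

```python
import numpy as np

def group_threshold(Omegas, w, thr):
    """Zero out entry (i,j) in every group when the weighted l2 norm
    { sum_k w_k (omega_ij^(k))^2 }^{1/2} falls at or below thr."""
    W = np.sqrt(sum(wk * O**2 for wk, O in zip(w, Omegas)))  # entry-wise group norm
    keep = W > thr                                            # boolean mask shared by all groups
    return [O * keep for O in Omegas]
```

Because the mask is shared, an edge is either kept in all *K* estimated graphs or removed from all of them, which is what drives the union-sparsity rates in (8).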

Suppose that (*C1*)–(*C4*) hold. Then, with probability converging to 1,

$$\underset{j}{max}{\displaystyle \sum _{i=1}^{p}{\left\{{\displaystyle \sum _{k=1}^{K}{w}_{k}{({\stackrel{\mathbf{\u2323}}{\mathbf{\Omega}}}^{(k)}-{\mathbf{\Omega}}^{(k)})}_{ij}^{2}}\right\}}^{1/2}\le \phantom{\rule{0.2em}{0ex}}}{C}_{1}{M}_{n}{s}_{0}(p){\left(\frac{\mathrm{log}K\cdot \mathrm{log}p}{n}\right)}^{1/2}.$$

(8)

The convergence rates of
${\stackrel{\mathbf{\u2323}}{\mathbf{\Omega}}}^{(k)}$ depend on the union sparsity level *s*_{0}(*p*). When the precision matrices share the same graphical structure,
${s}_{0}(p)={max}_{1\le j\le p}\mathsf{Card}({\mathcal{S}}_{j}^{(k)})$ for all *k* = 1*,…, K*. When the number of shared edges in the graphical structures increases, the union sparsity *s*_{0}(*p*) decreases, and consequently
${\stackrel{\mathbf{\u2323}}{\mathbf{\Omega}}}^{(k)}$ converges to **Ω**^{(}^{k}^{)} faster.

Let
${\widehat{a}}_{ij}=\sqrt{{\sum}_{k=1}^{K}{w}_{k}{({\stackrel{\mathbf{\u2323}}{\mathbf{\Omega}}}^{(k)}-{\mathbf{\Omega}}^{(k)})}_{ij}^{2}}$. Matrix Â = (*â _{ij}*)

Suppose that the conditions in Theorem 2 hold. Then, with probability converging to 1,

$${\Vert \widehat{\mathbf{A}}\Vert}_{2}\le {\Vert \widehat{\mathbf{A}}\Vert}_{1}\le {C}_{1}{M}_{n}{s}_{0}(p){\left(\frac{\mathrm{log}K\cdot \mathrm{log}p}{n}\right)}^{1/2}.$$

Theoretical analysis of graphical structure recovery is considerably more complicated when the graphical structures of the precision matrices differ across the *K* groups, since the results depend on the structures of the shared edges. Here we focus on the case in which the *K* precision matrices have a common support. Let
${\mathcal{S}}_{k}=\{(i,j):{\omega}_{ij}^{(k)}\ne 0\}$ be the support for the *k*-th precision matrix, and the common support be
$\mathcal{S}={\cap}_{k}{\mathcal{S}}_{k}$. When
${\mathcal{S}}_{1}=\dots ={\mathcal{S}}_{K}$,
$\mathcal{S}={\mathcal{S}}_{k}$, *k* = 1*,…, K*, by Theorem 1 we estimate
$\mathcal{S}$ by

$$\widehat{\mathcal{S}}=\left[(i,j):{\left\{{\displaystyle \sum _{k=1}^{K}{w}_{k}}{\left({\widehat{\omega}}_{ij}^{(k)}\right)}^{2}\right\}}^{1/2}>{C}_{1}{M}_{n}{\left(\frac{\mathrm{log}K\cdot \mathrm{log}p}{n}\right)}^{1/2}\right],$$

where *C*_{1} is a constant given in Theorem 1. Let

$${\theta}_{n}=\underset{(i,j)\in \mathcal{S}}{min}{\left\{{\displaystyle \sum _{k=1}^{K}{w}_{k}}{\left({\omega}_{ij}^{(k)}\right)}^{2}\right\}}^{1/2}.$$

We have a result on support recovery.

Suppose that the conditions in Theorem 1 hold. Assume that

$${\theta}_{n}>2{C}_{1}{M}_{n}{\left(\frac{\mathrm{log}K\cdot \mathrm{log}p}{n}\right)}^{1/2}.$$

(9)

Then $\widehat{\mathcal{S}}=\mathcal{S}$ with probability converging to 1.

When the graphical structures are the same across all *K* groups, the lower bound condition (9) is weaker than the lower bound condition needed for graphical structure recovery by separate estimation methods. Based on Proposition 1 and its proof, to fully recover the shared graphical structure by separate estimation methods, a necessary condition is

$$\underset{(i,j)\in \mathcal{S}}{\mathrm{inf}}\underset{k}{\mathrm{inf}}|{\omega}_{ij}^{(k)}|>2\alpha {M}_{n}{(K\mathrm{log}p/n)}^{1/2}.$$

When *K* is sufficiently large, this condition is stronger than (9).

We evaluated the numerical performance of the proposed method and other competing methods, including the separate precision matrix estimation procedures of Friedman, Hastie and Tibshirani (2008) and Cai, Liu and Luo (2011), and the joint estimation methods of Guo, Levina, Michailidis et al. (2011) and Danaher, Wang and Witten (2014). The separate precision matrix estimation methods were applied to each group individually, and therefore ignored the partial homogeneity in graphical structures among groups. In all numerical studies, we set *p* = 200, *K* = 3 and (*n*_{1}*, n*_{2}*, n*_{3}) = (80, 120, 150). The observations in each group were generated independently from a multivariate Gaussian distribution N{0, (**Ω**^{(}^{k}^{)})^{−1}}, where **Ω**^{(}^{k}^{)} is the precision matrix of the *k*-th group. For each model, 100 replications were performed.

We present results for two different types of graphical models: the Erdös and Rényi (ER) model (Erdös and Rényi (1960)) and the Watts-Strogatz (WS) model (Watts and Strogatz (1998)). For the ER model, the graph contains *p* vertices and each pair of vertices is connected with probability 0.05. For the WS model, a ring lattice of *p* vertices is first created, with each vertex connected to its neighbors within an order distance of 15; the edges of the lattice are then rewired uniformly at random with probability 0.01. These graph models have several topological properties, such as sparsity and the “small world” property, that are often observed in real biological gene regulatory networks. See Fell and Wagner (2000); Jeong, Tombor, Albert et al. (2000); Vendruscolo, Dokholyan, Paci et al. (2002); Greene and Higman (2003).

Based on the ER model or the WS model, a common graph structure was generated. Let *M* be the number of edges in the common graph structure. Then ⌊*ρM*⌋ random edges were added to the common graph structure to generate the graph structure of each group, where the parameter *ρ* is the ratio between the number of group-specific edges and the number of common edges. We considered *ρ* = 0, 1/4, and 1. The first setting represents the scenario where the precision matrices in all groups share the same graph structure. After the graph structure of each group was determined, the values of the non-zero off-diagonal entries were generated independently, uniform on [−1, −0.5] ∪ [0.5, 1]. The diagonal entries were set to a constant so that the condition number of each precision matrix equals *p*.
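The data-generating step above can be sketched in NumPy for the ER case. The uniform diagonal shift used below to reach the target condition number is our implementation choice: solving (*λ*_{max} + *δ*)/(*λ*_{min} + *δ*) = *p* gives *δ* = (*λ*_{max} − *pλ*_{min})/(*p* − 1):

```python
import numpy as np

def er_precision(p, prob, rng):
    """ER graph with edge probability `prob`; non-zero values uniform on
    [-1,-0.5] U [0.5,1]; diagonal set to a constant making the condition number p."""
    A = np.zeros((p, p))
    iu = np.triu_indices(p, k=1)
    edges = rng.random(len(iu[0])) < prob                      # ER edge indicators
    vals = rng.uniform(0.5, 1.0, edges.sum()) * rng.choice([-1.0, 1.0], edges.sum())
    A[iu[0][edges], iu[1][edges]] = vals
    A = A + A.T                                                # symmetric off-diagonal part
    lam = np.linalg.eigvalsh(A)                                # ascending eigenvalues
    delta = (lam[-1] - p * lam[0]) / (p - 1)                   # shift so cond = p
    return A + delta * np.eye(p)

rng = np.random.default_rng(2)
Om = er_precision(50, 0.05, rng)
```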

Each method was evaluated over a range of tuning parameters under each model, with the optimal tuning parameter chosen by BIC (6). Several measures were used to compare the performance of the estimators. The estimation error was evaluated in terms of the average matrix ℓ_{1} norm, ℓ_{2} norm (spectral norm), and Frobenius norm:

$$\begin{array}{l}{L}_{1}=\frac{1}{K}{\displaystyle \sum _{k=1}^{K}{\Vert {\widehat{\mathbf{\Omega}}}^{(k)}-{\mathbf{\Omega}}_{0}^{(k)}\Vert}_{1},}\\ {L}_{2}=\frac{1}{K}{\displaystyle \sum _{k=1}^{K}{\Vert {\widehat{\mathbf{\Omega}}}^{(k)}-{\mathbf{\Omega}}_{0}^{(k)}\Vert}_{2},}\\ {L}_{\mathrm{F}}=\frac{1}{K}{\displaystyle \sum _{k=1}^{K}{\Vert {\widehat{\mathbf{\Omega}}}^{(k)}-{\mathbf{\Omega}}_{0}^{(k)}\Vert}_{\mathrm{F}}.}\end{array}$$

The graph structure recovery results were evaluated by average sensitivity (SEN), specificity (SPE), and Matthews correlation coefficient (MCC). For a true precision matrix **Ω**_{0} = (*ω*_{0,ij}) with support set
${\mathcal{S}}_{0}=\{(i,j):{\omega}_{0,ij}\ne 0\phantom{\rule{0.2em}{0ex}}\text{and}\phantom{\rule{0.2em}{0ex}}i\ne j\}$, suppose its estimator
$\widehat{\mathbf{\Omega}}$ has the support set
$\widehat{\mathcal{S}}$. Then the measures with respect to **Ω**_{0} and
$\widehat{\mathbf{\Omega}}$ are defined as follows:

$$\begin{array}{l}\text{SPE}=\frac{\text{TN}}{\text{TN}+\text{FP}},\phantom{\rule{0.5em}{0ex}}\text{SEN}=\frac{\text{TP}}{\text{TP}+\text{FN}},\\ \text{MCC}=\frac{\text{TP}\times \text{TN}-\text{FP}\times \text{FN}}{{\{(\text{TP}+\text{FP})(\text{TP}+\text{FN})(\text{TN}+\text{FP})(\text{TN}+\text{FN})\}}^{1/2}}.\end{array}$$

Here, TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives and false negatives:

$$\begin{array}{l}\text{TP}=\#\{(i,j):(i,j)\in {\mathcal{S}}_{0}\cap \widehat{\mathcal{S}}\},\phantom{\rule{0.5em}{0ex}}\text{TN}=\#\{(i,j):(i,j)\in {\mathcal{S}}_{0}^{C}\cap {\widehat{\mathcal{S}}}^{C}\},\\ \text{FP}=\#\{(i,j):(i,j)\in {\mathcal{S}}_{0}^{C}\cap \widehat{\mathcal{S}}\},\phantom{\rule{0.5em}{0ex}}\text{FN}=\#\{(i,j):(i,j)\in {\mathcal{S}}_{0}\cap {\widehat{\mathcal{S}}}^{C}\}.\end{array}$$
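The recovery measures can be computed from boolean support masks, using the standard MCC denominator {(TP + FP)(TP + FN)(TN + FP)(TN + FN)}^{1/2}. A sketch (the function name and interface are ours):

```python
import numpy as np

def recovery_metrics(S0, Shat):
    """SEN, SPE, MCC for support recovery; S0 and Shat are boolean masks
    over the off-diagonal entries (true and estimated supports)."""
    tp = np.sum(S0 & Shat)
    tn = np.sum(~S0 & ~Shat)
    fp = np.sum(~S0 & Shat)
    fn = np.sum(S0 & ~Shat)
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return sen, spe, mcc
```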

We compared
${\widehat{\mathbf{\Omega}}}^{(k)}$ with
${\mathbf{\Omega}}_{0}^{(k)}$ and report the average sensitivity (SEN), specificity (SPE), and Matthews correlation coefficient (MCC) across the *K* groups.

The comparisons of the results for the four graphical models are shown in Tables 1 and 2. When *ρ* = 0, so that the true graph structures are identical across all three groups, the joint estimation methods perform much better than the separate estimation methods. As *ρ* increases, the structures become more different across groups, and the joint estimation methods gradually lose their advantage. Our method has the best graph structure recovery performance among all the methods, achieving the highest Matthews correlation coefficient; even when *ρ* = 1, it still performs significantly better than the separate estimation methods. Its estimation errors under the matrix ℓ_{1}, ℓ_{2}, and Frobenius norms are comparable to those of the other joint estimation methods.

Simulation results for data generated based on the Erdös and Rényi graph with different ratios of the number of individual-specific edges to the number of shared edges. Results are based on 100 replications. The numbers in the brackets are standard errors.

Simulation results for data generated based on the Watts-Strogatz graph with different ratios of the number of individual-specific edges to the number of shared edges. Results are based on 100 replications. The numbers in the brackets are standard errors.

Since the tuning parameter selection may affect the performance of the methods, we plot in Figure 1 the receiver operating characteristic (ROC) curves averaged over 100 replications, with the false positive rate controlled under 10%. The methods proposed by Danaher, Wang and Witten (2014) have two tuning parameters. For each sparsity tuning parameter, we first chose an optimal similarity tuning parameter from a grid of candidates by BIC, and then plotted the ROC curves based on the sequence of sparsity tuning parameters and their corresponding optimal similarity tuning parameters. Because these methods involve choosing two tuning parameters, they are slower than our method in computation (the computation for FGL and GGL is based on the R package “JGL” contributed by Danaher (2013)). Figure 1 shows that our method consistently outperforms the other methods in graph structure recovery.

Epithelial ovarian cancer is a molecularly diverse cancer that lacks effective personalized therapy. Tothill, Tinker, George et al. (2008) identified six molecular subtypes of ovarian cancer, labeled C1–C6, where the C1 subtype was characterized by significant differential expression of genes associated with stromal and immune cell types. Patients in the C1 subtype group have been shown to have a lower survival rate than patients with the other subtypes. The data set includes RNA expression data collected from *n* = 78 patients of the C1 subtype and *n* = 113 patients with the other subtypes. We are interested in how the wiring (conditional dependency) of the genes at the transcriptional level differs among molecular subgroups of ovarian cancer. We focus on the apoptosis pathway from the KEGG database (Ogata, Goto, Sato et al. (1999); Kanehisa, Goto, Sato et al. (2012)) to see whether the genes in this pathway (*p* = 87) are differentially wired between the C1 and the other subtypes.

To stabilize the graph structure selection, we bootstrapped the samples 100 times within each group. Each time, *I _{ik}* was sampled uniformly taking values in
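The excerpt does not fully specify the bootstrap aggregation rule, so the sketch below only illustrates within-group resampling and edge-frequency counting; `fit_support` is a hypothetical interface standing in for any graph selection procedure:

```python
import numpy as np

def bootstrap_edge_freq(X, fit_support, B=100, rng=None):
    """Within-group bootstrap: resample the n_k rows with replacement B times
    and record how often each edge is selected. `fit_support` is a hypothetical
    interface mapping an (n, p) data matrix to a boolean (p, p) adjacency matrix."""
    if rng is None:
        rng = np.random.default_rng()
    n, p = X.shape
    freq = np.zeros((p, p))
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # sample row indices uniformly with replacement
        freq += fit_support(X[idx])
    return freq / B
```

A stabilized graph can then be formed by keeping edges whose selection frequency exceeds a chosen cutoff; the cutoff used in the paper is not stated in this excerpt.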

Table 3 lists the numbers of edges selected by the bootstrap aggregation of our proposed method and its competitors. The separate estimation methods (CLIME and GLASSO) resulted in graphs that share fewer edges between the precision matrices of the two cancer subtype groups. JEMGM resulted in the most shared edges, followed by GGL and our method (MPE). Overall, FGL and GGL selected many more linked genes than the other methods. Figure 2 shows the Gaussian graphs estimated by the six methods. FGL, GGL and MPE selected more edges unique to the gene expression levels of the C2–C6 subtypes than to those of the C1 subtype. This suggests that patients with the poor prognostic subtype (C1) lack some important links among the apoptosis genes.

Estimated Gaussian Graphs by the proposed method and its competitors. The dashed edges are links unique to the precision matrix estimator of the C1 subtype, the dotted edges are unique to that of other subtypes, and the solid edges are shared by both **...**

Number of edges selected by the proposed method and its competitors. “C1 unique” counts the number of edges that only appear in the precision matrix of the gene expression levels in C1 cancer subtype; “Other unique” counts **...**

We further defined as hub nodes those with degree five or larger in the union of the estimated graphs of the two subtypes. FGL and GGL yielded estimators in which most of the hub nodes were completely unlinked in the estimated graph for the C1 subtype. The MPE estimators had several edges between the hub nodes shared by both subtype groups, while also displaying some links unique to each group. The hub nodes identified by MPE were FASLG, CASP10, CSF2RB, IL1B, MYD88, NFKB1, NFKBIA, PIK3CA, IKBKG, and PIK3R5. Among these, CASP10, PIK3CA, IL1B, and NFKB1 have been implicated in ovarian cancer risk or progression. In particular, PIK3CA has been implicated as an oncogene in ovarian cancer (Shayesteh, Lu, Kuo et al. (1999)), highlighting the importance of these hub genes in ovarian cancer progression.
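The hub-node rule (degree at least five in the union of the two estimated graphs) is simple to state in code; this sketch assumes boolean adjacency matrices with no self-loops:

```python
import numpy as np

def hub_nodes(adj_c1, adj_other, min_degree=5):
    """Indices of hub nodes: degree >= min_degree in the union graph."""
    union = adj_c1 | adj_other            # edge present in either subtype graph
    degree = union.sum(axis=1)            # node degrees in the union
    return np.flatnonzero(degree >= min_degree)
```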

It is of interest to discuss the connection and difference between the problem considered in this paper and the problem of estimating matrix graphical models (Leng and Tang (2012); Yin and Li (2012); Zhou (2014)). Matrix graphical models consider a random matrix variate **X** following the distribution *MN _{p×q}*(**M**, **U**, **V**), with density

$$p(\mathbf{X}|\mathbf{M},\mathbf{U},\mathbf{V})={(2\pi )}^{-pq/2}{|{\mathbf{U}}^{-1}|}^{q/2}{|{\mathbf{V}}^{-1}|}^{p/2}\mathrm{exp}\{-\mathsf{tr}[(\mathbf{X}-\mathbf{M}){\mathbf{V}}^{-1}{(\mathbf{X}-\mathbf{M})}^{\prime}{\mathbf{U}}^{-1}]/2\}.$$

Here **M** ∈ ℝ^{p×q} is the mean matrix, **U** ∈ ℝ^{p×p} is the covariance matrix among the rows of **X**, and **V** ∈ ℝ^{q×q} is the covariance matrix among the columns of **X**.
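As a numerical sanity check on the density above, the matrix normal *MN _{p×q}*(**M**, **U**, **V**) is equivalent to a vector Gaussian on the column-stacked vec(**X**) with covariance **V** ⊗ **U**. The log-density below (a sketch, not code from the paper) can be verified against that vectorized form:

```python
import numpy as np

def matrix_normal_logpdf(X, M, U, V):
    """Log-density of MN_{p x q}(M, U, V):
    -pq/2 log(2 pi) - q/2 log|U| - p/2 log|V|
    - tr[V^{-1} (X - M)' U^{-1} (X - M)] / 2
    """
    p, q = X.shape
    R = X - M
    quad = np.trace(np.linalg.solve(V, R.T) @ np.linalg.solve(U, R))
    return (-0.5 * p * q * np.log(2 * np.pi)
            - 0.5 * q * np.linalg.slogdet(U)[1]
            - 0.5 * p * np.linalg.slogdet(V)[1]
            - 0.5 * quad)
```

Equality with the *N _{pq}*(vec(**M**), **V** ⊗ **U**) log-density follows from |**V** ⊗ **U**| = |**V**|^{p}|**U**|^{q} and vec(**R**)′(**V** ⊗ **U**)^{−1}vec(**R**) = tr[**V**^{−1}**R**′**U**^{−1}**R**].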

Our model assumes *X*^{(}^{k}^{)} follows the vector multivariate Gaussian distribution *N _{p}*(*μ*^{(}^{k}^{)}, Σ^{(}^{k}^{)}), and estimates the precision matrices Ω^{(}^{k}^{)} = (Σ^{(}^{k}^{)})^{−1} jointly across groups by exploiting their partially shared sparsity patterns, rather than imposing a Kronecker product structure on a single covariance matrix.

T. Tony Cai’s research was supported in part by NSF Grants DMS-1208982 and DMS-1403708, and NIH Grant R01 CA127334. Hongzhe Li’s research was supported in part by NIH Grants R01 CA127334 and R01 GM097505. Weidong Liu’s research was supported in part by NSFC Grants No. 11201298 and No. 11322107, the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning, Shanghai Pujiang Program, Shanghai Shuguang Program, and 973 Program (2015CBB56004). Jichun Xie’s research was supported in part by NIH Grant UL1 TR001117.

We thank the Editor, an associate editor and the reviewers for their insightful comments and suggestions that have helped us substantially improve the quality of the paper.

The research was supported in part by NSF FRG Grant DMS-0854973, NSF MRI Grant No. CNS-09-58854, and NIH Grants CA127334 and GM097505. Weidong Liu’s research was also supported by NSFC Grants No. 11201298 and No. 11322107, the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning, Shanghai Pujiang Program, Foundation for the Author of National Excellent Doctoral Dissertation of PR China, and Program for New Century Excellent Talents in University.

SUPPLEMENTARY MATERIAL

In the supplementary material, we provide additional simulation studies to compare the proposed methods and other competitive methods. We also include the proofs of the theorems.

T. Tony Cai, Professor of Statistics, Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104.

Hongzhe Li, Professor of Biostatistics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104.

Weidong Liu, Professor, Department of Mathematics, Institute of Natural Sciences and MOE-LSC, Shanghai Jiao Tong University, Shanghai, China.

Jichun Xie, Assistant Professor, Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC 27707.

- Anderson TW. An introduction to multivariate statistical analysis. Wiley-Interscience; 2003.
- Banerjee O, Ghaoui LE, d’Aspremont A. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research. 2008;9:485–516.
- Cai T, Liu W, Luo X. A constrained ℓ1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association. 2011;106:594–607.
- Candès E, Tao T. The Dantzig selector: statistical estimation when *p* is much larger than *n*. The Annals of Statistics. 2007;35:2313–2351.
- Chen J, Chen Z. Extended Bayesian information criterion for model selection with large model space. Biometrika. 2008;95:232–253.
- Dahl J, Vandenberghe L, Roychowdhury V. Covariance selection for non-chordal graphs via chordal embedding. Optimization Methods and Software. 2008;23:501–520.
- Danaher P. JGL: Performs the Joint Graphical Lasso for sparse inverse covariance estimation on multiple classes. R package version 2.3 2013
- Danaher P, Wang P, Witten D. The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society, Series B. 2014;76:373–397. [PMC free article] [PubMed]
- Donoho D, Elad M, Temlyakov V. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory. 2006;52:6–18.
- Erdős P, Rényi A. On the evolution of random graphs. Publications of the Mathematical Institute of the Hungarian Academy of Sciences. 1960;5:17–61.
- Fan Y, Tang C. Tuning parameter selection in high dimensional penalized likelihood. Journal of the Royal Statistical Society, Series B. 2013;75:531–552.
- Fell D, Wagner A. The small world of metabolism. Nature Biotechnology. 2000;18:1121–1122. [PubMed]
- Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9:432–441. [PMC free article] [PubMed]
- Greene L, Higman V. Uncovering network systems within protein structures. Journal of Molecular Biology. 2003;334:781–791. [PubMed]
- Guo J, Levina E, Michailidis G, Zhu J. Joint estimation of multiple graphical models. Biometrika. 2011;98:1–15. [PMC free article] [PubMed]
- Jeong H, Tombor B, Albert R, Oltvai Z, Barabási A. The large-scale organization of metabolic networks. Nature. 2000;407:651–654. [PubMed]
- Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Research. 2012;40:D109–D114. [PMC free article] [PubMed]
- Lauritzen SL. Graphical Models. Clarendon Press; Oxford: 1996.
- Leng C, Tang C. Sparse matrix graphical models. Journal of the American Statistical Association. 2012;107:1187–1200.
- Li H, Gui J. Gradient directed regularization for sparse Gaussian concentration graphs, with applications to inference of genetic networks. Biostatistics. 2006;7:302–317. [PubMed]
- Li S, Hsu L, Peng J, Wang P. Bootstrap inference for network construction with an application to a breast cancer microarray study. The Annals of Applied Statistics. 2013;7:391–417. [PMC free article] [PubMed]
- Liang F, He G, Hu Y. A new smoothing Newton-type method for second-order cone programming problems. Applied Mathematics and Computation. 2009;215:1020–1029.
- Meinshausen N, Bühlmann P. High-dimensional graphs and variable selection with the lasso. Annals of Statistics. 2006;34:1436–1462.
- Meinshausen N, Bühlmann P. Stability selection. Journal of the Royal Statistical Society, Series B. 2010;72:417–473.
- Nesterov Y, Todd M. Primal-dual interior-point methods for self-scaled cones. SIAM Journal on Optimization. 1998;8:324–364.
- Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research. 1999;27:29–34. [PMC free article] [PubMed]
- Schäfer J, Strimmer K. An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics. 2005;21:754–764. [PubMed]
- Shayesteh L, Lu Y, Kuo W, Baldocchi R, Godfrey T, Collins C, Pinkel D, Powell B, Mills G, Gray J. PIK3CA is implicated as an oncogene in ovarian cancer. Nature Genetics. 1999;21:99–102. [PubMed]
- Tothill R, Tinker A, George J, Brown R, Fox S, Lade S, Johnson D, Trivett J, Etemadmoghadam D, Locandro B, Traficante N, Fereday S, Hung J, Chiew Y, Haviv I, Group A. O. C. S. Gertig D, deFazio A, Bowtell D. Novel molecular subtypes of serous and endometrioid ovarian cancer linked to clinical outcome. Clinical Cancer Research. 2008;14:5198–5208. [PubMed]
- Vendruscolo M, Dokholyan N, Paci E, Karplus M. Small-world view of the amino acids that play a key role in protein folding. Physical Review E. 2002;65 [PubMed]
- Wang H, Li B, Leng C. Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society, Series B. 2009;71:671–683.
- Wang T, Zhu L. Consistent tuning parameter selection in high dimensional sparse linear regression. Journal of Multivariate Analysis. 2011;102:1141–1151.
- Watts D, Strogatz S. Collective dynamics of “small world” networks. Nature. 1998;393:440–442. [PubMed]
- Whittaker J. Graphical Models in Applied Multivariate Analysis. Wiley; 1990.
- Yin J, Li H. Model selection and estimation in the matrix normal graphical model. Journal of Multivariate Analysis. 2012;107:119–140. [PMC free article] [PubMed]
- Yuan M. Sparse inverse covariance matrix estimation via linear programming. Journal of Machine Learning Research. 2010;11:2261–2286.
- Yuan M, Lin Y. Model selection and estimation in the Gaussian graphical model. Biometrika. 2007;94:19–35.
- Zhou S. Gemini: Graph estimation with matrix variate normal instances. Annals of Statistics. 2014;42:532–562.
