Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC2995610

Formats

Article sections

- Abstract
- 1 Introduction
- 2 Estimation of sparse precision matrix
- 3 Estimation of sparse covariance matrix
- 4 Extension to sparse Cholesky decomposition
- 5 Proofs
- References

Authors

Related links

Ann Stat. Author manuscript; available in PMC 2010 December 1.

Published in final edited form as:

Ann Stat. 2009; 37(6B): 4254–4278.

doi: 10.1214/09-AOS720PMCID: PMC2995610

NIHMSID: NIHMS248826

Clifford Lam, Department of Statistics, London School of Economics and Political Science, London, WC2A 2AE (Email: ku.ca.esl@2maL.C)

See other articles in PMC that cite the published article.

This paper studies the sparsistency and rates of convergence for estimating sparse covariance and precision matrices based on penalized likelihood with nonconvex penalty functions. Here, sparsistency refers to the property that all parameters that are zero are actually estimated as zero with probability tending to one. Depending on the case of applications, sparsity *priori* may occur on the covariance matrix, its inverse or its Cholesky decomposition. We study these three sparsity exploration problems under a unified framework with a general penalty function. We show that the rates of convergence for these problems under the Frobenius norm are of order (*s _{n}* log

Covariance matrix estimation is a common statistical problem in many scientific applications. For example, in financial risk assessment or longitudinal study, an input of covariance matrix **Σ** is needed, whereas an inverse of the covariance matrix, the precision matrix **Σ**^{−1}, is required for optimal portfolio selection, linear discriminant analysis or graphical network models. Yet, the number of parameters in the covariance matrix grows quickly with dimensionality. Depending on the applications, the sparsity of the covariance matrix or precision matrix is frequently imposed to strike a balance between biases and variances. For example, in longitudinal data analysis [see e.g., Diggle and Verbyla (1998), or Bickel and Levina (2008b)], it is reasonable to assume that remote data in time are weakly correlated, whereas in Gaussian graphical models, the sparsity of the precision matrix is a reasonable assumption (Dempster (1972)).

This initiates a series of researches focusing on the parsimony of a covariance matrix. Smith and Kohn (2002) used priors which admit zeros on the off-diagonal elements of the Cholesky factor of the precision matrix **Ω** = **Σ**^{−1}, while Wong, Carter and Kohn (2003) used zero-admitting prior directly on the off-diagonal elements of **Ω** to achieve parsimony. Wu and Pourahmadi (2003) used the Modified Cholesky Decomposition (MCD) to find a banded structure for **Ω** nonparametrically for longitudinal data. Bickel and Levina (2008b) developed consistency theories on banding methods for longitudinal data, for both **Σ** and **Ω**.

Various authors have used penalized likelihood methods to achieve parsimony on covariance selection. Fan and Peng (2004) has laid down a general framework for penalized likelihood with diverging dimensionality, with general conditions for the oracle property stated and proved. However, it is not clear whether it is applicable to the specific case of covariance matrix estimation. In particular, they did not link the dimensionality *p _{n}* with the number of nonzero elements

Recently, there is a surge of interest on the estimation of sparse covariance matrix or precision matrix using penalized likelihood method. Huang, Liu, Pourahmadi and Liu (2006) used the LASSO on the off-diagonal elements of the Cholesky factor from MCD, while Meinshausen and Bühlmann (2006), d’Aspremont, Banerjee, and El Ghaoui (2008) and Yuan and Lin (2007) used different LASSO algorithms to select zero elements in the precision matrix. A novel penalty called the nested Lasso was constructed in Levina, Rothman and Zhu (2008) to penalize off-diagonal elements. Thresholding the sample covariance matrix in high-dimensional setting was thoroughly studied by El Karoui (2008) and Bickel and Levina (2008a) and Cai, Zhang and Zhou (2009) with remarkable results for high dimensional applications. However, it is not directly applicable to estimating sparse precision matrix when the dimensionality *p _{n}* is greater than the sample size

In this paper, we investigate the aforementioned problems using the penalized pseudo-likelihood method. Assume a random sample {**y**_{i}}_{1≤i≤n} with mean zero and covariance matrix **Σ**_{0}, satisfying some sub-Gaussian tails conditions as specified in Lemma 2 (see Section 5). The sparsity of the true precision matrix **Ω**_{0} can be explored by maximizing the Gaussian quasi-likelihood or equivalently minimizing

$${q}_{1}\left(\Omega \right)=\mathrm{tr}\left(\mathrm{S}\Omega \right)-\mathrm{log}\mid \Omega \mid +\sum _{i\ne j}{p}_{{\lambda}_{n1}}\left(\mid {\omega}_{ij}\mid \right),$$

(1.1)

which is the penalized negative log-likelihood if the data is Gaussian. The matrix $\mathrm{S}={n}^{-1}{\sum}_{i=1}^{n}{\mathrm{y}}_{i}{\mathrm{y}}_{i}^{T}$ is the sample covariance matrix, **Ω** = (*ω _{ij}*), and

$${p}_{\lambda}^{\prime}\left(\theta \right)=\lambda {1}_{\{\theta \le \lambda \}}+{(a\lambda -\theta )}_{+}{1}_{\{\theta >\lambda \}}\u2215(a-1),\phantom{\rule{thickmathspace}{0ex}}\text{for some}\phantom{\rule{thickmathspace}{0ex}}a>2,$$

(1.2)

are folded-concave. Nonconvex penalty is introduced to reduce bias when the true parameter has a relatively large magnitude. For example, the SCAD penalty remains constant when *θ* is large, while the *L*_{1}-penalty grows linearly with *θ*. See Fan and Li (2001) for a detailed account of this and other advantages of such a penalty function. The computation can be done via the local linear approximation (Zhou and Li, 2008, Fan *et al.* 2009); see Section 2.1 for additional details.

Similarly, the sparsity of the true covariance matrix **Σ**_{0} can be explored by minimizing

$${q}_{2}\left(\Sigma \right)=\mathrm{tr}\left(\mathrm{S}{\Sigma}^{-1}\right)+\mathrm{log}\mid \Sigma \mid +\sum _{i\ne j}{p}_{{\lambda}_{n2}}\left(\mid {\sigma}_{ij}\mid \right),$$

(1.3)

where **Σ** = (*σ _{ij}*). Note that we only penalize the off-diagonal elements of

In studying a sparse covariance or precision matrix, it is important to distinguish between the diagonal and off-diagonal elements, since the diagonal elements are always positive and they contribute to the overall mean-squares errors. For example, the true correlation matrix, denoted by **Γ**_{0}, has the same sparsity structure as **Σ**_{0} without the need to estimating its diagonal elements. In view of this fact, we introduce a revised method (3.2) to take this advantage. It turns out that the correlation matrix can be estimated with a faster rate of convergence, at (*s*_{n1} log *p _{n}*/

The bias issues of the commonly used *L*_{1}-penalty, or LASSO, can be seen from our theoretical results. In fact, due to the bias of LASSO, an upper bounded of *λ _{ni}* is needed in order to achieve fast rate of convergence. On the other hand, a lower bound is required in order to achieve sparsity of estimated precision or covariance matrices. This is in fact one of the motivations for introducing nonconvex penalty functions in Fan and Li (2001) and Fan and Peng (2004), but we state and prove the explicit rates in the current context. In particular, we demonstrate that the

We also compare two different formulations of penalized likelihood using the modified Cholesky decomposition, exploring their respective rates of convergence and sparsity properties.

Throughout this paper, we use *λ*_{min}(*A*), *λ*_{max}(*A*) and tr(*A*) to denote the minimum eigenvalue, maximum eigenvalue, and trace of a symmetric matrix *A* respectively. For a matrix *B*, we define the operator norm and the Frobenius norm, respectively, as $\parallel B\parallel ={\lambda}_{\mathrm{max}}^{1\u22152}\left({B}^{T}B\right)$ and *B*_{F} = tr^{1/2}(*B ^{T}B*).

In this section, we present the analysis of (1.1) for estimating a sparse precision matrix. Before this, let us first present an algorithm for computing the nonconcave maximum (pseudo)-likelihood estimator and then state the conditions needed for our technical results.

The computation of the nonconcave maximum likelihood problems can be solved by a sequence of *L*_{1}-penalized likelihood problems via local linear approximation (Zou and Li 2008, Fan *et al.* 2009). For example, given the current estimate Ω_{k} = (*ω _{ij,k}*), by the local linear approximation to the penalty function,

$${q}_{1}\left(\Omega \right)\approx \mathrm{tr}\left(\mathrm{S}\Omega \right)-\mathrm{log}\mid \Omega \mid +\sum _{i\ne j}[{p}_{{\lambda}_{n1}}\left(\mid {\omega}_{ij,k}\mid \right)+{p}_{{\lambda}_{n1}}^{\prime}\left(\mid {\omega}_{ij,k}\mid \right)(\mid {\omega}_{ij}\mid -\mid {\omega}_{ij,k}\mid )].$$

(2.1)

Hence, **Ω**_{k+1} should be taken to maximize the right-hand side of (2.1):

$${\Omega}_{k+1}={\mathrm{argmax}}_{\Omega}\left[\mathrm{tr}\left(\mathrm{S}\Omega \right)-\mathrm{log}\mid \Omega \mid +\sum _{i\ne j}{p}_{{\lambda}_{n1}}^{\prime}\left(\mid {\omega}_{ij,k}\mid \right)\mid {\omega}_{ij}\mid \right],$$

(2.2)

after ignoring the two constant terms. Problem (2.2) is the weighted penalized *L*_{1}-likelihood. In particular, if we take the most primitive initial value **Ω**_{0} = **0**, then

$${\Omega}_{1}={\mathrm{argmax}}_{\Omega}\left[\mathrm{tr}\left(\mathrm{S}\Omega \right)-\mathrm{log}\mid \Omega \mid +{\lambda}_{n1}\sum _{i\ne j}\mid {\omega}_{ij}\mid \right],$$

is already a good estimator. Iterations of (2.2) reduces the biases of the estimator, as larger estimated coefficients in the previous iterations receive less penalty. In fact, in a different setup, Zou and Li (2008) showed that one iteration of such a procedure is sufficient as long as the initial values are good enough.

Fan *et al.* (2009) has implemented the above algorithm for optimizing (1.1). They have also demonstrated in Section 2.2 in their paper how to utilize the graphical lasso algorithm of Friedman, Hastie and Tibshirani (2008), which is essentially a group coordinate descent procedure, to solve problem (2.2) quickly, even when *p _{n}* >

We give a brief summary of a breast cancer data analysis with *p _{n}* >

Normalized gene expression data from 130 patients with stage I-III breast cancers are analyzed, with 33 of them belong to class 1 and 97 belong to class 2. The aim is to assess prediction accuracy in predicting which class a patient will belong to, using a set of pre-selected genes (*p _{n}* = 110, chosen by t-tests) as gene expression profile data. The data is randomly divided into training (

On average, the estimated precision matrix $\widehat{\Omega}$ using LASSO has many more nonzeros than that using SCAD (3923 versus 674). This is not surprising when we look at equation (2.3) in our paper, where the *L*_{1} penalty imposes an upper bound on the tuning parameter *λ*_{n1} for consistency, which links to reducing the bias in the estimation. This makes the *λ*_{n1} in practice too small to set many of the elements in $\widehat{\Omega}$ to zero. While we do not know which elements in the true **Ω** are zero, the large number of nonzero elements in the *L*_{1} penalized estimator seems spurious, and the resulting gene network is not easy to interpret.

On the other hand, SCAD-penalized estimator has a much smaller number of nonzero elements, since the tuning parameter *λ*_{n1} is not bounded above under consistency of the resulting estimator. This makes the resulting gene network easier to interpret, with some clusters of genes identified.

Also, classification results on the testing set using the SCAD penalty for precision matrix estimation is better than that using the *L*_{1} penalty, in the sense that the specificity (#True Negative/#class 2) is higher (0.794 to 0.768) while the sensitivity (#True Positive/#class 1) is similar to that using *L*_{1}-penalized precision matrix estimator.

We now introduce some notations and present regularity conditions for the rate of convergence and sparsistency.

Let ${S}_{1}=\{(i,j):{\omega}_{ij}^{0}\ne 0\}$, where ${\Omega}_{0}=\left({\omega}_{ij}^{0}\right)$ is the true precision matrix. Denote by *s*_{n1} = |*S*_{1}| − *p _{n}*, which is the number of nonzero elements in the off-diagonal entries of

$${a}_{n1}=\underset{(i,j)\in {S}_{1}}{\mathrm{max}}{p}_{{\lambda}_{n1}}^{\prime}\left(\mid {\omega}_{ij}^{0}\mid \right),\phantom{\rule{1em}{0ex}}{b}_{n1}=\underset{(i,j)\in {S}_{1}}{\mathrm{max}}{p}_{{\lambda}_{n1}}^{\u2033}\left(\mid {\omega}_{ij}^{0}\mid \right).$$

The term *a*_{n1} is related to the asymptotic bias of the penalized likelihood estimate due to penalization. Note that for *L*_{1}-penalty, *a*_{n1} = *λ*_{n1} and *b*_{n1} = 0, whereas for SCAD, *a*_{n1} = *b*_{n1} = 0 for sufficiently large *n* under the last assumption of condition (B) below.

We assume the following regularity conditions:

- There are constants
*τ*_{1}and*τ*_{2}such that$$0<{\tau}_{1}<{\lambda}_{\mathrm{min}}\left({\Sigma}_{0}\right)\le {\lambda}_{\mathrm{max}}\left({\Sigma}_{0}\right)<{\tau}_{2}<\infty \phantom{\rule{thickmathspace}{0ex}}\text{for all}\phantom{\rule{thickmathspace}{0ex}}n.$$ *a*_{n1}=*O*({(1 +*p*/(_{n}*s*_{n1}+ 1)) log*p*_{n}/*n*}^{1/2}),*b*_{n1}=*o*(1), and ${\mathrm{min}}_{(i,j)\in {S}_{1}}\mid {\omega}_{ij}^{0}\mid \u2215{\lambda}_{n1}\to \infty $ as*n*→ ∞.- The penalty
*p*(·) is singular at the origin, with lim_{λ}_{t↓0}*p*(_{λ}*t*)/(*λt*) =*k*> 0. - There are constants
*C*and*D*such that, when ${\theta}_{1},{\theta}_{2}<C{\lambda}_{n1},\mid {p}_{{\lambda}_{n1}}^{\u2033}\left({\theta}_{1}\right)-{p}_{{\lambda}_{n1}}^{\u2033}\left({\theta}_{2}\right)\mid \le D\mid {\theta}_{1}-{\theta}_{2}\mid $.

Condition (A) bounds uniformly the eigenvalues of **Σ**_{0}, which facilitates the proof of consistency. It also includes a wide class of covariance matrices as noted in Bickel and Levina (2008b). The rates *a*_{n1} and *b*_{n1} in condition (B) are also needed for proving consistency. If they are too large, the bias due to penalty can dominate the variance from the likelihood, resulting in poor estimates.

The last requirement in condition (B) states the rate at which the nonzero parameters should be distinguished from zero asymptotically. It is not explicitly needed in the proofs, but for asymptotically unbiased penalty functions, this is a necessary condition so that *a*_{n1} and *b*_{n1} are converging to zero fast enough as needed in the first part of condition (B). In particular, for the SCAD and hard-thresholding penalty functions, this condition implies that *a*_{n1} = *b*_{n1} = 0 exactly for sufficiently large *n*, thus allowing a flexible choice of *λ*_{n1}. For the SCAD penalty (1.2), the condition can be relaxed as ${\mathrm{min}}_{(i,j)\in {S}_{1}}\mid {\omega}_{ij}^{0}\mid \u2215{\lambda}_{n1}<a$.

The singularity in condition (C) gives sparsity in the estimates [Fan and Li (2001)]. Finally, condition (D) is a smoothing condition for the penalty function, and is needed in proving asymptotic normality. The SCAD penalty, for instance, satisfies this condition by choosing the constant *D*, independent of *n*, to be large enough.

Minimizing (1.1) involves nonconvex minimization, and we need to prove that there exists a local minimizer $\widehat{\Omega}$ for the minimization problem with a certain rate of convergence, which is given under the Frobenius norm. The proof is given in Section 5. It is similar to the one given in Rothman *et al.* (2008), but now the penalty function is nonconvex.

**Theorem 1** (Rate of convergence). Under regularity conditions (A)-(D), if $({p}_{n}+{s}_{n1})\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n=O\left({\lambda}_{n1}^{2}\right)$ *and* (*p _{n}* +

The proofs of this theorem and others are relegated to Section 5 so that readers can get more quickly what the results are. As in Fan and Li (2001), the asymptotic bias due to the penalty for each nonzero parameter is *a*_{n1}. Since we penalized only on the off-diagonal elements, the total bias induced by the penalty is asymptotically of order *s*_{n1}*a*_{n1}. The square of this total bias over all nonzero elements is of order *O _{P}*{(

Theorem 1 states explicitly how the number of nonzero elements and dimensionality affect the rate of convergence. Since there are (*p _{n}* +

Theorem 1 is also applicable to the *L*_{1}-penalty function, where the local minimizer becomes the global minimizer. The asymptotic bias of the *L*_{1}-penalized estimate is given in the term *s*_{n1}*a*_{n1} = *s*_{n1}*λ*_{n1} as shown in the technical proof. In order to control the bias, we impose condition (*B*), which entails an upper bound on *λ*_{n1} = *O*({(1+*p _{n}*/(

Next we show the sparsistency of the penalized estimator from (1.1). We use *S ^{c}* to denote the complement of a set

**Theorem 2** (Sparsistency). Under the conditions given in Theorem 1, for any local minimizer of (1.1) satisfying ${\parallel \widehat{\Omega}-{\Omega}_{0}\parallel}_{F}^{2}={O}_{P}\{({p}_{n}+{s}_{n1})\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n\}$ *and* ${\parallel \widehat{\Omega}-{\Omega}_{0}\parallel}^{2}={O}_{P}\left({\eta}_{n}\right)$ *for a sequence of* ${\eta}_{n}\to 0$, *if* $\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n+{\eta}_{n}=O\left({\lambda}_{n1}^{2}\right)$, *then with probability tending to 1,* ${\widehat{\omega}}_{ij}=0$ *for all* $(i,j)\in {S}_{1}^{c}$.

First, since ${\parallel M\parallel}^{2}\le {\parallel M\parallel}_{F}^{2}$ for any matrix *M*, we can always take *η _{n}* = (

$$\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n+{\eta}_{n}=O\left({\lambda}_{n1}^{2}\right)=(1+{p}_{n}\u2215({s}_{n1}+1))\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n$$

(2.3)

for both consistency and sparsistency to be satisfied. We present two scenarios here for the two bounds to be compatible, making use of the inequalities ${\parallel M\parallel}_{F}^{2}\u2215{p}_{n}\le {\parallel M\parallel}^{2}\le {\parallel M\parallel}_{F}^{2}$ for a matrix *M* of size *p _{n}*.

- We always have $\parallel \widehat{\Omega}-{\Omega}_{0}\parallel \le {\parallel \widehat{\Omega}-{\Omega}_{0}\parallel}_{F}$. In the worst case scenario where they have the same order, ${\parallel \widehat{\Omega}-{\Omega}_{0}\parallel}^{2}={O}_{P}(({p}_{n}+{s}_{n1})\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n)$, so that
*η*= (_{n}*p*+_{n}*s*_{n1}) log*p*. It is then easy to see from (2.3) that the two bounds are compatible only when_{n}/n*s*_{n1}=*O*(1). - We also have ${\parallel \widehat{\Omega}-{\Omega}_{0}\parallel}_{F}^{2}\u2215{p}_{n}\le {\parallel \widehat{\Omega}-{\Omega}_{0}\parallel}^{2}$. In the optimistic scenario where they have the same order,Hence,$${\parallel \widehat{\Omega}-{\Omega}_{0}\parallel}^{2}={O}_{P}((1+{s}_{n1}\u2215{p}_{n})\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n).$$
*η*= (1 +_{n}*s*_{n1}/*p*) log_{n}*p*/_{n}*n*, and compatibility of the bounds requires*s*_{n1}=*O*(*p*)._{n}

Hence, even in the optimistic scenario, consistency and sparsistency are guaranteed only when *s*_{n1} = *O*(*p _{n}*) if the

However, if the penalty function used is unbiased, like the SCAD or the hard-thresholding penalty, we do not impose an extra upper bound for *λ*_{n1} since its first derivative ${p}_{{\lambda}_{n1}}^{\prime}\left(\mid \theta \mid \right)$ goes to zero fast enough as |*θ*| increases (exactly equals zero for the SCAD and hard-thresholding penalty functions, when *n* is sufficiently large; see condition (B) and the explanation thereof). Thus, *λ*_{n1} is allowed to decay to zero slowly, allowing even the largest order ${s}_{n1}=O\left({p}_{n}^{2}\right)$.

We remark that asymptotic normality for the estimators of the elements in *S*_{1} have been established in a previous version of this paper. We omit it here for brevity.

The inverse correlation matrix **Ψ**_{0} retains the same sparsity structure of **Ω**_{0}. Consistency and sparsistency results can be achieved with *p _{n}* as large as log

$$\mathrm{tr}\left(\Psi {\widehat{\Gamma}}_{\mathrm{S}}\right)-\mathrm{log}\mid \Psi \mid +\sum _{i\ne j}{p}_{{\nu}_{n1}}\left(\mid {\psi}_{ij}\mid \right),$$

(2.4)

where ${\widehat{\Gamma}}_{\mathbf{S}}={\widehat{\mathbf{W}}}^{-1}\mathbf{S}{\widehat{\mathbf{W}}}^{-1}$ is the sample correlation matrix, with ${\widehat{\mathbf{W}}}^{2}={\mathbf{D}}_{\mathbf{S}}$ being the diagonal matrix with diagonal elements of **S**, and *υ*_{n1} is a regularization parameter. After obtaining $\widehat{\Psi}$, **Ω**_{0} can also be estimated by $\stackrel{~}{\Omega}={\widehat{\mathbf{W}}}^{-1}\widehat{\Psi}{\widehat{\mathbf{W}}}^{-1}$.

To present the rates of convergence for $\widehat{\Psi}$ and $\stackrel{~}{\Omega}$, we define

$${c}_{n1}=\underset{(i,j)\in {S}_{1}}{\mathrm{max}}{p}_{{\nu}_{n1}}^{\prime}\left(\mid {\psi}_{ij}^{0}\mid \right),\phantom{\rule{1em}{0ex}}{d}_{n1}=\underset{(i,j)\in {S}_{1}}{\mathrm{max}}{p}_{{\nu}_{n1}}^{\u2033}\left(\mid {\psi}_{ij}^{0}\mid \right),$$

where ${\Psi}_{0}=\left({\psi}_{ij}^{0}\right)$ and modify condition (D) to (D’) with *λ*_{n1} there replaced by *υ*_{n1}, and impose (B’) *c*_{n1} = *O*({log *p _{n}*/

**Theorem 3** Under regularity conditions (A),(B’),(C) and (D’), if (*s*_{n1}+1)(log *p _{n}*)

Note that we can allow *p _{n}*

For the *L*_{1}-penalty, our result reduces to that given in Rothman *et al.* (2008). We offer the sparsistency result as follows.

**Theorem 4** (Sparsistency) Under the conditions given in Theorem 3, for any local minimizer of (2.4) satisfying ${\parallel \widehat{\Psi}-{\Psi}_{0}\parallel}_{F}^{2}={O}_{P}({s}_{n1}\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n)$ *and* ${\parallel \widehat{\Psi}-{\Psi}_{0}\parallel}^{2}={O}_{P}\left({n}_{n}\right)$ *for some η _{n}* → 0,

The proof follows exactly the same as that for Theorem 2 in Section 2.4, and is thus omitted.

For the *L*_{1}-penalty, control of bias and sparsistency require *υ*_{n1} to satisfy bounds like (2.3):

$$\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n+{\eta}_{n}=O\left({\nu}_{n1}^{2}\right)=\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n.$$

(2.5)

This leads to two scenarios:

- The worst case scenario hasmeaning$${\parallel \widehat{\Psi}-{\Psi}_{0}\parallel}^{2}={\parallel \widehat{\Psi}-{\Psi}_{0}\parallel}_{F}^{2}={O}_{P}({s}_{n1}\phantom{\rule{thinmathspace}{0ex}}\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n),$$
*η*=_{n}*s*_{n1}log*p*/_{n}*n*. Then compatibility of the bounds in (2.5) requires*s*_{n1}=*O*(1). - The optimistic scenario hasmeaning$${\parallel \widehat{\Psi}-{\Psi}_{0}\parallel}^{2}={\parallel \widehat{\Psi}-{\Psi}_{0}\parallel}_{F}^{2}\u2215{p}_{n}={O}_{P}({s}_{n1}\u2215{p}_{n}\cdot \mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n),$$
*η*=_{n}*s*_{n1}/*p*· log_{n}*p*/_{n}*n*. Then compatibility of the bounds in (2.5) requires*s*_{n1}=*O*(*p*)._{n}

On the other hand, for penalties like the SCAD or the hard-thresholding penalty, we do not need an upper bound for *s*_{n1}. Hence, we only need (*s _{n}* + 1)(log

In this section, we analyze the sparse covariance matrix estimation using the penalized likelihood (1.3). Then it is modified to estimating the correlation matrix, which improves the rate of convergence. We assume that the **y**_{i}’s are i.i.d. *N*(**0**, **Σ**_{0}) throughout this section.

Let ${S}_{2}=\{(i,j):{\sigma}_{ij}^{0}\ne 0\}$, where ${\Sigma}_{0}=\left({\sigma}_{ij}^{0}\right)$. Denote *s*_{n2} = |*S*_{2}| − *p _{n}*, so that

$${a}_{n2}=\underset{(i,j)\in {S}_{2}}{\mathrm{max}}{p}_{{\lambda}_{n2}}^{\prime}\left(\mid {\sigma}_{ij}^{0}\mid \right),\phantom{\rule{1em}{0ex}}{b}_{n2}=\underset{(i,j)\in {S}_{2}}{\mathrm{max}}{p}_{{\lambda}_{n2}}^{\u2033}\left(\mid {\sigma}_{ij}^{0}\mid \right).$$

Technical conditions in Section 2 need some revision. In particular, condition (D) now becomes condition (D2) with *λ*_{n1} there replaced by *λ*_{n2}. Condition (B) should now be (B2) *a*_{n2} = O({(1 + *p _{n}*/(

**Theorem 5** (Rate of convergence). Under regularity conditions (A), (B2), (C) and (D2), if (*p _{n}* +

Like the case for precision matrix estimation, the asymptotic bias due to the *L*_{1}-penalty is of order *s*_{n2}*a*_{n2} = *s*_{n2}*λ*_{n2}. To control this term, for the *L*_{1}-penalty, we require *λ*_{n2} = *O*({(1 + *p _{n}*/(

**Theorem 6** (Sparsistency). Under the conditions given in Theorem 5, for any local minimizer $\widehat{\Sigma}$ *of* (1.3) *satisfying* ${\parallel \widehat{\Sigma}-{\Sigma}_{0}\parallel}_{F}^{2}={O}_{P}(({p}_{n}+{s}_{n2})\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n)$ *and* ${\parallel \widehat{\Sigma}-{\Sigma}_{0}\parallel}^{2}={O}_{P}\left({n}_{n}\right)$ *for some η _{n}* → 0,

For the *L*_{1}-penalized likelihood, controlling of bias for consistency together with sparsistency requires

$$\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n+{\eta}_{n}=O\left({\lambda}_{n2}^{2}\right)=(1+{p}_{n}\u2215({s}_{n2}+1))\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n.$$

(3.1)

This is the same condition as (2.3), and hence in the worst case scenario where

$${\parallel \widehat{\Sigma}-{\Sigma}_{0}\parallel}^{2}={\parallel \widehat{\Sigma}-{\Sigma}_{0}\parallel}_{F}^{2}={O}_{P}(({p}_{n}+{s}_{n2})\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n),$$

we need *s*_{n2} = *O*(1). In the optimistic scenario where

$${\parallel \widehat{\Sigma}-{\Sigma}_{0}\parallel}^{2}={\parallel \widehat{\Sigma}-{\Sigma}_{0}\parallel}_{F}^{2}\u2215{p}_{n},$$

we need *s*_{n2} = *O*(*p _{n}*). In both cases, the matrix

On the other hand, if unbiased penalty functions like the SCAD or hard-thresholding penalty are used, we do not need an upper bound on *λ*_{n2} since the bias *a*_{n2} = 0 for sufficiently large *n*. This gives more flexibility on the order of *s*_{n2}.

Similar to Section 2, asymptotic normality for the estimators of the elements in *S*_{2} can be proved under certain assumptions.

The correlation matrix **Γ**_{0} retains the same sparsity structure of **Σ**_{0} with known diagonal elements. This special structure allows us to estimate **Γ**_{0} more accurately. To take advantage of the known diagonal elements, the sparse correlation matrix **Γ**_{0} is estimated by minimizing w.r.t. **Γ** = (*γ _{ij}*),

$$\mathrm{tr}\left({\Gamma}^{-1}{\widehat{\Gamma}}_{\mathrm{S}}\right)+\mathrm{log}\mid \Gamma \mid +\sum _{i\ne j}{p}_{{\nu}_{n2}}\left(\mid {\gamma}_{ij}\mid \right),$$

(3.2)

where *υ*_{n2} is a regularization parameter. After obtaining $\widehat{\Gamma}$ **Σ**_{0} can be estimated by $\stackrel{~}{\Sigma}=\widehat{\mathrm{W}}\widehat{\Gamma}\widehat{\mathrm{W}}$.

To present the rates of convergence for $\widehat{\Gamma}$ and $\stackrel{~}{\Sigma}$, we define

$${c}_{n2}=\underset{(i,j)\in {S}_{2}}{\mathrm{max}}{p}_{{\nu}_{n2}}^{\prime}\left(\mid {\gamma}_{ij}^{0}\mid \right),\phantom{\rule{1em}{0ex}}{d}_{n2}=\underset{(i,j)\in {S}_{2}}{\mathrm{max}}{p}_{{\nu}_{n2}}^{\u2033}\left(\mid {\gamma}_{ij}^{0}\mid \right),$$

where ${\Gamma}_{0}=\left({\gamma}_{ij}^{0}\right)$. We modify condition (D) to (D2′) with *λ*_{n2} there replaced by *υ*_{n2}, and (B) to (B2′) as follows: (B2′) *c*_{n2} = *O*({log *p _{n}*/

**Theorem 7** Under regularity conditions (A),(B2′),(C) and (D2′), if (*p _{n}*+

$${\parallel \widehat{\Gamma}-{\Gamma}_{0}\parallel}_{F}^{2}={O}_{P}({s}_{n2}\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n).$$

*In addition, for the operator norm, we have*

$${\parallel \stackrel{~}{\Sigma}-{\Sigma}_{0}\parallel}^{2}={O}_{P}\{({s}_{n2}+1)\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n\}.$$

*For the L*_{1}-*penalty, we only need* $\mathrm{log}\phantom{\rule{thickmathspace}{0ex}}{p}_{n}\u2215n=O\left({\nu}_{n2}^{2}\right)$.

The proof is sketched in Section 5. This theorem shows that the correlation matrix, like the inverse correlation matrix, can be estimated more accurately, since diagonal elements are known to be one.

**Theorem 8** (Sparsistency). Under the conditions given in Theorem 7, for any local minimizer $\widehat{\Gamma}$ *of* (3.2) *satisfying* ${\parallel \widehat{\Gamma}-{\Gamma}_{0}\parallel}_{F}^{2}={O}_{P}({s}_{n2}\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n)$ *and* ${\parallel \widehat{\Gamma}-{\Gamma}_{0}\parallel}^{2}={O}_{P}\left({n}_{n}\right)$ *for some η _{n}* → 0 ,

The proof follows exactly the same as that of Theorem 6 in Section 5, and is omitted. For the *L*_{1}-penalized likelihood, controlling of bias and sparsistency requires

$$\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n+{\eta}_{n}=O\left({\nu}_{n2}^{2}\right)=\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n.$$

(3.3)

This is the same condition as (2.5), hence in the worst scenario where

$${\parallel \widehat{\Gamma}-{\Gamma}_{0}\parallel}^{2}={\parallel \widehat{\Gamma}-{\Gamma}_{0}\parallel}_{F}^{2}={O}_{P}({s}_{n2}\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n),$$

we need *s*_{n2} = *O*(1). In the optimistic scenario where

$${\parallel \widehat{\Gamma}-{\Gamma}_{0}\parallel}^{2}={\parallel \widehat{\Gamma}-{\Gamma}_{0}\parallel}_{F}^{2}\u2215{p}_{n}={O}_{P}({s}_{n2}\u2215{p}_{n}\cdot \mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n),$$

we need *s*_{n2} = *O*(*p _{n}*).

The use of unbiased penalty functions like the SCAD or the hard-thresholding penalty, similar to results in the previous sections, does not impose an upper bound on the regularization parameter since bias *c*_{n2} = 0 for sufficiently large *n*. This gives more flexibility to the order of *s*_{n2} allowed.

Pourahmadi (1999) proposed the modified Cholesky decomposition (MCD) which facilitates the sparse estimation of **Ω** through penalization. The idea is to represent zero-mean data **y** = (*y*_{1}, , *y _{pn}*)

$${y}_{i}=\sum _{j=1}^{i-1}{\varphi}_{ij}{y}_{j}+{\u220a}_{i},\phantom{\rule{thickmathspace}{0ex}}\text{and}\phantom{\rule{thickmathspace}{0ex}}\mathrm{T}\Sigma {\mathrm{T}}^{T}=\mathrm{D},$$

(4.1)

where **T** is the unique unit lower triangular matrix with ones on its diagonal and (*i, j*)^{th} element being −*ϕ _{ij}* for

Huang *et al.* (2006) and Levina *et al.* (2008) both used the MCD for estimating **Ω**_{0}. The former maximized the log-likelihood (ML) over **T** and **D** simultaneously, while the latter suggested also a least square version (LS), with **D** being first set to the identity matrix and then minimizing over **T** to obtain $\widehat{\mathbf{T}}$. The latter corresponds to the original Cholesky decomposition. The sparse Cholesky factor can be estimated through minimizing

$$\left(ML\right):{q}_{3}(\mathbf{T},\mathbf{D})=\mathrm{tr}\left({\mathbf{T}}^{T}{\mathbf{D}}^{-1}\mathbf{TS}\right)+\mathrm{log}\mid \mathbf{D}\mid +2\sum _{i<j}{p}_{{\lambda}_{n3}}\left(\mid {t}_{ij}\mid \right).$$

(4.2)

This is indeed the same as (1.1) with the substitution of **Ω** = **T**^{T}**D**^{−1}**T** and penalization parameter *λ*_{n3}. Noticing that (4.1) can be written as $\mathbf{Ty}=\epsilon $, the least square version is to minimize $\mathrm{tr}\left(\epsilon {\epsilon}^{T}\right)=\mathrm{tr}\left({\mathbf{T}}^{T}{\mathbf{Tyy}}^{T}\right)$ in the matrix notation. Aggregating the *n* observations and adding penalty functions, the least-square criterion is to minimize

$$\left(LS\right):\phantom{\rule{1em}{0ex}}{q}_{4}\left(\mathbf{T}\right)=\mathrm{tr}\left({\mathbf{T}}^{T}\mathbf{TS}\right)+2\sum _{i<j}{p}_{{\lambda}_{n4}}\left(\mid {t}_{ij}\mid \right).$$

(4.3)

In view of the results in Sections 2.5 and 3.2, we can also write the sample covariance matrix in (4.2) as $\mathbf{S}=\widehat{\mathbf{W}}{\widehat{\Gamma}}_{\mathrm{S}}\widehat{\mathbf{W}}$ and then replace ${\mathbf{D}}^{-1\u22152}\mathbf{T}\widehat{\mathbf{W}}$ by **T**, resulting in the normalized (NL) version as follows:

$$\left(NL\right):\phantom{\rule{1em}{0ex}}{q}_{5}\left(\mathbf{T}\right)=\mathrm{tr}\left({\mathbf{T}}^{T}\mathbf{T}{\widehat{\Gamma}}_{\mathrm{S}}\right)-2\phantom{\rule{thinmathspace}{0ex}}\mathrm{log}\mid \mathbf{T}\mid +2\sum _{i<j}{p}_{{\lambda}_{n5}}\left(\mid {t}_{ij}\mid \right).$$

(4.4)

We will also assume the **y**_{i}’s are i.i.d. *N*(**0**, **Σ**_{0}) as in the last section.

Since all the **T**’s introduced in the three models above have the same sparsity structure, let *S* and *s*_{n3} be the nonzero set and number of nonzeros associated with each **T** above. Define

$${a}_{n3}=\underset{(i,j)\in S}{\mathrm{max}}{p}_{{\lambda}_{n3}}^{\prime}\left(\mid {t}_{ij}^{0}\mid \right),\phantom{\rule{1em}{0ex}}{b}_{n3}=\underset{(i,j)\in S}{\mathrm{max}}{p}_{{\lambda}_{n3}}^{\u2033}\left(\mid {t}_{ij}^{0}\mid \right).$$

For (ML), condition (D) is adapted to (D3) with *λ*_{n1} there replaced by *λ*_{n3}. Condition (B) is modified as (B3) *a*_{n3} = *O*({(1 + *p _{n}*/(

After obtaining $\widehat{\mathbf{T}}$ and $\widehat{\mathbf{D}}$ from minimizing (ML), we set $\widehat{\Omega}={\widehat{\mathbf{T}}}^{T}{\widehat{\mathbf{D}}}^{-1}\widehat{\mathbf{T}}$.

**Theorem 9** *Under regularity conditions (A),(B3),(C),(D3), if* (*p _{n}* +

The proof is similar to those of Theorems 5 and 7 and is omitted. The Cholesky factor **T** has ones on its main diagonal without the need for estimation. Hence, the rate of convergence is faster than $\widehat{\Omega}$.

**Theorem 10** (Sparsistency). Under the conditions in Theorem 9, for any local minimizer $\widehat{\mathbf{T}}$, $\widehat{\mathbf{D}}$ *of* (4.2) *satisfying* ${\parallel \widehat{\mathbf{T}}-{\mathbf{T}}_{0}\parallel}_{F}^{2}={O}_{P}({s}_{n3}\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n)$ *and* ${\parallel \widehat{\mathbf{D}}-{\mathbf{D}}_{0}\parallel}_{F}^{2}={O}_{P}({p}_{n}\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n)$, *if* $\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n+{\eta}_{n}+{\zeta}_{n}=O\left({\lambda}_{n3}^{2}\right)$, *then sparsistency holds for* $\widehat{\mathbf{T}}$, *provided that* ${\parallel \widehat{\mathbf{T}}-{\mathbf{T}}_{0}\parallel}^{2}={O}_{P}\left({\eta}_{n}\right)$ *and* ${\parallel \widehat{\mathbf{D}}-{\mathbf{D}}_{0}\parallel}^{2}={O}_{P}\left({\zeta}_{n}\right)$, *for some η _{n}, ζ_{n}* → 0.

The proof is in Section 5. For the *L*_{1}-penalized likelihood, control of bias and sparsistency impose the following:

$$\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n+{\eta}_{n}+{\zeta}_{n}=O\left({\lambda}_{n3}^{2}\right)=(1+{p}_{n}\u2215({s}_{n3}+1))\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n.$$

(4.5)

The worst scenario corresponds to *η _{n}* =

On the other hand, such a restriction is not needed for unbiased penalties like the SCAD or hard-thresholding penalty, giving more flexibility on the order of *s*_{n3}.

We now turn to analyzing the normalized penalized likelihood (4.4). With **T** = (*t _{ij}*) in (NL) which is lower triangular, define

$${a}_{n5}=\underset{(i,j)\in S}{\mathrm{max}}{p}_{{\lambda}_{n5}}^{\prime}\left(\mid {t}_{ij}^{0}\mid \right),\phantom{\rule{1em}{0ex}}{b}_{n5}=\underset{(i,j)\in S}{\mathrm{max}}{p}_{{\lambda}_{n5}}^{\u2033}\left(\mid {t}_{ij}^{0}\mid \right).$$

Condition (D) is now changed to (D5) with *λ*_{n1} there replaced by *λ*_{n5}. Condition (B) is now substituted by (B5) ${a}_{n5}^{2}=O(\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n)$, *b*_{n5} = *o*(1), ${\mathrm{min}}_{(i,j)\in S}\mid {t}_{ij}^{0}\mid \u2215{\lambda}_{n5}\to \infty $ as *n* → ∞.

**Theorem 11** (Rate of convergence) Under regularity conditions (A),(B5),(C) and (D5), if s_{n3}(log *p _{n}*)

$${\parallel \widehat{\Omega}-{\Omega}_{0}\parallel}_{F}^{2}={O}_{P}\{({p}_{n}+{s}_{n3})\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n\},$$

*and in the operator norm, it is improved to*

$${\parallel \widehat{\Omega}-{\Omega}_{0}\parallel}^{2}={O}_{P}\left\{({s}_{n3}+1)\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n)\right\}.$$

*For the L*_{1}-*penalty, we only need* $\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n=O\left({\lambda}_{n5}^{2}\right)$.

The proof is similar to that of Theorems 5 and 7 and is omitted. In this theorem, like Lemma 3, we can have *p _{n}* so that

**Theorem 12** (Sparsistency). Under the conditions given in Theorem 11, for any local minimizer $\widehat{\mathbf{T}}$ *of* (4.4) *satisfying* ${\parallel \widehat{\mathbf{T}}-{\mathbf{T}}_{0}\parallel}_{F}^{2}={O}_{P}({s}_{n3}\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n)$ *if* $\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n+{\eta}_{n}=O\left({\lambda}_{n5}^{2}\right)$, *then sparsistency holds for* $\widehat{\mathbf{T}}$, *provided that* ${\parallel \widehat{\mathbf{T}}-{\mathbf{T}}_{0}\parallel}^{2}=O\left({\eta}_{n}\right)$ *for some η _{n}* → 0.

Proof is omitted since it goes exactly the same as that of Theorem 10. The above results apply also to the *L*_{1}-penalized estimator. For simultaneous persistency and optimal rate of convergence using the *L*_{1}-penalty, the biases inherent in it induce the restriction *s*_{n3} = *O*(1) in the worst scenario where ${\eta}_{n}^{2}={s}_{n3}\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n$, and *s*_{n3} = *O*(*p _{n}*) in the optimistic scenario where ${\eta}_{n}^{2}={s}_{n3}\u2215{p}_{n}\cdot \mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n$. This restriction does not apply to the SCAD and other asymptotically unbiased penalty functions.

We first prove three lemmas. The first one concerns with inequalities involving the operator and the Frobenius norms. The other two concern with order estimation for elements in a matrix of the form **A**(**S** − **Σ**_{0})**B**, which are useful in proving results concerning sparsistency.

**Lemma 1** *Let* **A** *and* **B** *be real matrices such that the product* **AB** *is defined. Then, defining* ${\parallel \mathbf{A}\parallel}_{\mathrm{min}}^{2}={\lambda}_{\mathrm{min}}\left({\mathbf{A}}^{T}\mathbf{A}\right)$, *we have*

$${\parallel \mathrm{A}\parallel}_{\mathrm{min}}{\parallel \mathrm{B}\parallel}_{F}\le {\parallel \mathrm{AB}\parallel}_{F}\le \parallel \mathrm{A}\parallel {\parallel \mathrm{B}\parallel}_{F}.$$

(5.1)

*In particular, if* **A** = (*a _{ij}*),

**Proof of Lemma 1**. Write **B** = (**b**_{1}, , **b**_{q}), where **b**_{i} is the *i*-th column vector in **B**. Then

$$\begin{array}{cc}\hfill {\parallel \mathrm{AB}\parallel}_{F}^{2}=\mathrm{tr}\left({\mathrm{B}}^{T}{\mathrm{A}}^{T}\mathrm{AB}\right)=\sum _{i=1}^{q}{\mathrm{b}}_{i}^{T}{\mathrm{A}}^{T}{\mathrm{Ab}}_{i}\le & {\lambda}_{\mathrm{max}}\left({\mathrm{A}}^{T}\mathrm{A}\right)\sum _{i=1}^{q}{\parallel {\mathrm{b}}_{i}\parallel}^{2}\hfill \\ \hfill =& {\parallel \mathrm{A}\parallel}^{2}{\parallel \mathrm{B}\parallel}_{F}^{2}.\hfill \end{array}$$

Similarly,

$$\begin{array}{cc}\hfill {\parallel \mathrm{AB}\parallel}_{F}^{2}=\sum _{i=1}^{q}{\mathrm{b}}_{i}^{T}{\mathrm{A}}^{T}{\mathrm{Ab}}_{i}\ge & {\lambda}_{\mathrm{min}}\left({\mathrm{A}}^{T}\mathrm{A}\right)\sum _{i=1}^{q}{\parallel {\mathrm{b}}_{i}\parallel}^{2}\hfill \\ \hfill =& {\parallel \mathrm{A}\parallel}_{\mathrm{min}}^{2}{\parallel \mathrm{B}\parallel}_{F}^{2},\hfill \end{array}$$

which completes the proof of (5.1). To prove |*a _{ij}*| ≤

$$\mid {a}_{ij}\mid =\mid {\mathrm{e}}_{i}^{T}{\mathrm{Ae}}_{j}\mid \le {\parallel {\mathrm{Ae}}_{j}\parallel}_{F}\le \parallel \mathrm{A}\parallel \cdot {\parallel {\mathrm{e}}_{j}\parallel}_{F}=\parallel \mathrm{A}\parallel ,$$

and this completes the proof of the lemma. □

**Lemma 2** *Let* **S** *be a sample covariance matrix of a random sample* {**y**_{i}}_{1≤i≤n}, *with E*(**y**_{i}) = 0 *and var*(**y**_{i}) = **Σ**_{0}. *Let* **y**_{i} = (*y*_{i1}, , **y***ip _{n}*)

$$\underset{1\le i\le {p}_{n}}{\mathrm{max}}{\int}_{0}^{\infty}\mathrm{exp}\left(\lambda t\right)d{G}_{j}\left(t\right)<\infty ,\phantom{\rule{thickmathspace}{0ex}}0<\mid \lambda \mid <{\lambda}_{0}$$

(5.2)

*for some λ*_{0} > 0. *Assume* log *p _{n}*/

*Remark:* The conditions on the *y _{ij}*’s above are the same as those used in Bickel and Levina (2008b) for relaxing the normality assumption.

**Proof of Lemma 2**. Let **x**_{i} = **Ay**_{i} and ${\mathrm{w}}_{i}={\mathbf{B}}^{T}{\mathrm{y}}_{i}$. Define ${\mathrm{u}}_{i}={({\mathrm{x}}_{i}^{T},{\mathrm{w}}_{i}^{T})}^{T}$ , with covariance matrix

$${\Sigma}_{\mathrm{u}}=\mathrm{var}\left({\mathrm{u}}_{i}\right)=\left(\begin{array}{cc}\hfill \mathrm{A}{\Sigma}_{0}{\mathrm{A}}^{T}\hfill & \hfill \mathrm{A}{\Sigma}_{0}\mathrm{B}\hfill \\ \hfill {\mathrm{B}}^{T}{\Sigma}_{0}{\mathrm{A}}^{T}\hfill & \hfill {\mathrm{B}}^{T}{\Sigma}_{0}\mathrm{B}\hfill \end{array}\right).$$

Since (**A**^{T} **B**)^{T} ≤ (**A**^{2} + **B**^{2})^{1/2} = *O*(1) and **Σ**_{0} = *O*(1) uniformly, we have **Σ _{u}** =

$$\underset{i,j}{\mathrm{max}}\mid {({\mathrm{S}}_{\mathrm{u}}-{\Sigma}_{\mathrm{u}})}_{ij}\mid ={O}_{P}\left({\{\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n\}}^{1\u22152}\right).$$

In particular, it means that

$$\underset{i,j}{\mathrm{max}}\mid {\left(\mathrm{A}(\mathrm{S}-{\Sigma}_{0})\mathrm{B}\right)}_{ij}\mid ={\left({n}^{-1}\sum _{r=1}^{n}{\mathrm{x}}_{r}{\mathrm{w}}_{r}^{T}-\mathrm{A}{\Sigma}_{0}\mathrm{B}\right)}_{ij}={O}_{P}\left({\{\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n\}}^{1\u22152}\right),$$

which completes the proof of the lemma. □

**Lemma 3** *Let* **S** *be a sample covariance matrix of a random sample* **y**_{i1≤i≤n} *with* **y**_{i} ~ *N*(**0**, **Σ**_{0}). *Assume p _{n}*/

**Proof of Lemma 3**. Consider

$$\mathrm{A}(\mathrm{S}-{\Sigma}_{0})\mathrm{B}={K}_{1}+{K}_{2}+{K}_{3}+{K}_{4},$$

(5.3)

where *K*_{1} = **A**_{0}(**S** − **Σ**_{0})**B**_{0}, *K*_{2} = Δ_{1}(**S** − **Σ**_{0})**B**_{0}, *K*_{3} = **A**_{0}(**S** − **Σ**_{0})Δ_{2} and *K*_{4} = Δ_{1}(**S** − **Σ**_{0})Δ_{2}. Now max_{i,j} |(*K*_{1})_{ij} | = *O _{P}*({log

Hence, we can find *c _{n}* =

Then suitable choice of *c _{n}* leads to

$$\underset{i,j}{\mathrm{max}}\mid {\left({\Delta}_{1}(\mathrm{S}-{\Sigma}_{0}){\mathrm{B}}_{0}\right)}_{ij}\mid \le {c}_{n}\underset{k}{\mathrm{max}}\mid {\left({\mathrm{B}}_{0}^{T}{(\mathrm{S}-{\Sigma}_{0})}^{2}{\mathrm{B}}_{0}\right)}_{kk}\mid .$$

(5.4)

At the same time, Theorem 5.10 in Bai and Silverstein (2006) implies that, for **y*** _{i}* ~

$$\begin{array}{cc}\hfill -2\sqrt{y}-y\le & \underset{n\to \infty}{\mathrm{lim}\phantom{\rule{thinmathspace}{0ex}}\mathrm{inf}}{\lambda}_{\mathrm{min}}({\Sigma}_{0}^{-1\u22152}\mathrm{S}{\Sigma}_{0}^{-1\u22152}-\mathrm{I})\hfill \\ \hfill \le & \underset{n\to \infty}{\mathrm{lim}\phantom{\rule{thinmathspace}{0ex}}\mathrm{sup}}{\lambda}_{\mathrm{max}}({\Sigma}_{0}^{-1\u22152}\mathrm{S}{\Sigma}_{0}^{-1\u22152}-\mathrm{I})\le 2\sqrt{y}+y.\hfill \end{array}$$

Hence, if we have *p _{n}*/

Since **S** − **Σ**_{0} is symmetric, we can find a rotation matrix **Q** (i.e. **Q**^{T}**Q** = **QQ**^{T} = *I*) so that

$$\mathrm{S}-{\Sigma}_{0}=\mathrm{Q}\Lambda {\mathrm{Q}}^{T},$$

where Λ is a diagonal matrix with real entries. Then we are free to control *c _{n}* again so as to satisfy further that

$$\begin{array}{cc}\hfill {c}_{n}\underset{k}{\mathrm{max}}\mid {\left({\mathrm{B}}_{0}^{T}{(\mathrm{S}-{\Sigma}_{0})}^{2}{\mathrm{B}}_{0}\right)}_{kk}\mid =& \underset{k}{\mathrm{max}}\mid {\left({\mathrm{B}}_{0}^{T}\mathrm{Q}{c}_{n}{\Lambda}^{2}{\mathrm{Q}}^{T}{\mathrm{B}}_{0}\right)}_{kk}\mid \hfill \\ \hfill \le & \underset{k}{\mathrm{max}}\mid {\left({\mathrm{B}}_{0}^{T}\mathrm{Q}\Lambda {\mathrm{Q}}^{T}{\mathrm{B}}_{0}\right)}_{kk}\mid \hfill \\ \hfill =& \underset{k}{\mathrm{max}}\mid {\left({\mathrm{B}}_{0}^{T}(\mathrm{S}-{\Sigma}_{0}){\mathrm{B}}_{0}\right)}_{kk}\mid ={O}_{P}\left({\{\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n\}}^{1\u22152}\right),\hfill \end{array}$$

where the last line used the previous proof for constant matrix **B**_{0}. Hence, combining this with (5.4), we have max_{i,j} |(*K*_{2})_{ij}| = *O _{P}*({log

**Proof of Theorem 1**. The main idea of the proof is inspired by Fan and Li (2001) and Rothman *et al.* (2008). Let *U* be a symmetric matrix of size *p _{n}*,

$$P\left(\underset{U\in \mathcal{A}}{\mathrm{inf}}{q}_{1}({\Omega}_{0}+{\Delta}_{U})>{q}_{1}\left({\Omega}_{0}\right)\right)\to 1,$$

for sufficiently large constants *C*_{1} and *C*_{2}. This implies that there is a local minimizer in $\{{\Omega}_{0}+{\Delta}_{U}:{\parallel {\Delta}_{U}\parallel}_{F}^{2}\le {C}_{1}^{2}{\alpha}_{n}^{2}+{C}_{2}^{2}{\beta}_{n}^{2}\}$ such that ${\parallel \widehat{\Omega}-{\Omega}_{0}\parallel}_{F}={O}_{P}({\alpha}_{n}+{\beta}_{n})$ for sufficiently large *n*, since Ω_{0} + Δ_{U} is positive definite. This is shown by noting that

$${\lambda}_{\mathrm{min}}({\Omega}_{0}+{\Delta}_{U})\ge {\lambda}_{\mathrm{min}}\left({\Omega}_{0}\right)+{\lambda}_{\mathrm{min}}\left({\Delta}_{U}\right)\ge {\lambda}_{\mathrm{min}}\left({\Omega}_{0}\right)-{\parallel {\Delta}_{U}\parallel}_{F}>0,$$

since Ω_{0} has eigenvalues uniformly bounded away from 0 and ∞ by condition (A), and Δ_{U}_{F} = *O*(*α _{n}* +

Consider, for **Σ** = **Σ**_{0} + Δ_{U}, the difference

$${q}_{1}\left(\Omega \right)-{q}_{1}\left({\Omega}_{0}\right)={I}_{1}+{I}_{2}+{I}_{3},$$

where

$${I}_{1}=\mathrm{tr}\left(\mathrm{S}\Omega \right)-\mathrm{log}\mid \Omega \mid -(\mathrm{tr}\left(\mathrm{S}{\Omega}_{0}\right)-\mathrm{log}\mid {\Omega}_{0}\mid ),$$

$${I}_{2}=\sum _{(i,j)\in {S}_{1}^{c}}({p}_{{\lambda}_{n1}}\left(\mid {\omega}_{ij}\mid \right)-{p}_{{\lambda}_{n1}}\left(\mid {\omega}_{ij}^{0}\mid \right)),$$

$${I}_{3}=\sum _{(i,j)\in {S}_{1},i\ne j}({p}_{{\lambda}_{n1}}\left(\mid {\omega}_{ij}\mid \right)-{p}_{{\lambda}_{n1}}\left(\mid {\omega}_{ij}^{0}\mid \right)).$$

It is sufficient to show that the difference is positive asymptotically with probability tending to 1. Using Taylor’s expansion with the integral remainder, we have *I*_{1} = *K*_{1}+*K*_{2}, where

$$\begin{array}{c}{K}_{1}=\mathrm{tr}\left((\mathrm{S}-{\Sigma}_{0}){\Delta}_{U}\right),\hfill \\ {K}_{2}=\mathrm{vec}{\left({\Delta}_{U}\right)}^{T}\left\{{\int}_{0}^{1}g(v,{\Omega}_{v})(1-v)dv\right\}\mathrm{vec}\left({\Delta}_{U}\right),\hfill \end{array}$$

(5.5)

with the definitions **Ω**_{v} = **Ω**_{0} + *v*Δ_{U}, and $g(v,{\Omega}_{v})={\Omega}_{v}^{-1}\otimes {\Omega}_{v}^{-1}$. Now,

$$\begin{array}{cc}\hfill {K}_{2}\ge & {\int}_{0}^{1}(1-v)\underset{0\le v\le 1}{\mathrm{min}}{\lambda}_{\mathrm{min}}({\Omega}_{v}^{-1}\otimes {\Omega}_{v}^{-1})dv\cdot {\parallel \mathrm{vec}\left({\Delta}_{U}\right)\parallel}^{2}\hfill \\ \hfill =& {\parallel \mathrm{vec}\left({\Delta}_{U}\right)\parallel}^{2}\u22152\cdot \underset{0\le v\le 1}{\mathrm{min}}{\lambda}_{\mathrm{max}}^{-2}\left({\Omega}_{v}\right)\hfill \\ \hfill \ge & {\parallel \mathrm{vec}\left({\Delta}_{U}\right)\parallel}^{2}\u22152\cdot {(\parallel {\Omega}_{0}\parallel +\parallel {\Delta}_{U}\parallel )}^{-2}\hfill \\ \hfill \ge & ({C}_{1}^{2}{\alpha}_{n}^{2}+{C}_{2}^{2}{\beta}_{n}^{2})\u22152\cdot ({\tau}_{1}^{-1}+o\left(1\right))-2,\hfill \end{array}$$

where we used Δ_{U} ≤ *C*_{1}*α _{n}* +

Consider *K*_{1}. It is clear that |*K*_{1}| ≤ *L*_{1} + *L*_{2}, where

$${L}_{1}=\mid \sum _{(i,j)\in {S}_{1}}{(\mathrm{S}-{\Sigma}_{0})}_{ij}{\left({\Delta}_{U}\right)}_{ij}\mid ,$$

$${L}_{2}=\mid \sum _{(i,j)\in {S}_{1}^{c}}{(\mathrm{S}-{\Sigma}_{0})}_{ij}{\left({\Delta}_{U}\right)}_{ij}\mid .$$

Using Lemmas 1 and 2, we have

$$\begin{array}{cc}\hfill {L}_{1}\le & {({s}_{n1}+{p}_{n})}^{1\u22152}\underset{i,j}{\mathrm{max}}\mid {(\mathrm{S}-{\Sigma}_{0})}_{ij}\mid \cdot {\parallel {\Delta}_{U}\parallel}_{F}\hfill \\ \hfill \le & {O}_{P}({\alpha}_{n}+{\beta}_{n})\cdot {\parallel {\Delta}_{U}\parallel}_{F}\hfill \\ \hfill =& {O}_{P}({C}_{1}{\alpha}_{n}^{2}+{C}_{2}{\beta}_{n}^{2}),\hfill \end{array}$$

This is dominated by *K*_{2} when *C*_{1} and *C*_{2} are sufficiently large.

Now, consider *I*_{2} − *L*_{2} for penalties other than *L*_{1}. Since ${\parallel {\Delta}_{U}\parallel}_{F}^{2}={C}_{1}^{2}{\alpha}_{n}^{2}+{C}_{2}^{2}{\beta}_{n}^{2}$ on $\mathcal{A}$, we have that |*ω _{ij}*| =

$${p}_{{\lambda}_{n1}}\left(\mid {\omega}_{ij}\mid \right)\ge {\lambda}_{n1}{k}_{1}\mid {\omega}_{ij}\mid .$$

This implies that

$${I}_{2}=\sum _{(i,j)\in {S}_{1}^{c}}{p}_{{\lambda}_{n1}}\left(\mid {\omega}_{ij}\mid \right)\ge {\lambda}_{n1}{k}_{1}\sum _{(i,j)\in {S}_{1}^{c}}\mid {\omega}_{ij}\mid .$$

Hence,

$$\begin{array}{cc}\hfill {I}_{2}-{L}_{2}\ge & \sum _{(i,j)\in {S}_{1}^{c}}\{{\lambda}_{n1}{k}_{1}\mid {\omega}_{ij}\mid -\mid {(\mathrm{S}-{\Sigma}_{0})}_{ij}\mid \cdot \mid {\omega}_{ij}\mid \}\hfill \\ \hfill \ge & \sum _{(i,j)\in {S}_{1}^{c}}[{\lambda}_{n1}{k}_{1}-{O}_{P}\left({\{\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n\}}^{1\u22152}\right)]\cdot \mid {\omega}_{ij}\mid \hfill \\ \hfill =& {\lambda}_{n1}\sum _{(i,j)\in {S}_{1}^{c}}[{k}_{1}-{O}_{P}\left({\lambda}_{n1}^{-1}{\{\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n\}}^{1\u22152}\right)]\cdot \mid {\omega}_{ij}\mid .\hfill \end{array}$$

With the assumption that $({p}_{n}+{s}_{n1})\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n=O\left({\lambda}_{n1}^{2}\right)$, we see from the above that *I*_{2}−*L*_{2} ≥ 0 since ${O}_{P}=\left({\lambda}_{n1}^{-1}{\{\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n\}}^{1\u22152}\right)={o}_{P}\left(1\right)$, using $\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n=o(({p}_{n}+{s}_{n1})\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n)=o\left({\lambda}_{n1}^{2}\right)$.

For the *L*_{1}-penalty, since we have max_{i≠j} |**S** − **Σ**_{0}| = *O _{P}*((log

$$\underset{i\ne j}{\mathrm{max}}\mid \mathrm{S}-{\Sigma}_{0}\mid =W{(\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n)}^{1\u22152}.$$

Then we can set *λ*_{n1} = 2*W*(log *p _{n}*/

Now, with *L*_{1} dominated by *K*_{2} and *I*_{2} − *L*_{2} ≥ 0, the proof completes if we can show that *I*_{3} is also dominated by *K*_{2}, since we have proved that *K*_{2} > 0. Using Taylor’s expansion, we can arrive at

$$\mid {I}_{3}\mid \le \mathrm{min}{({C}_{1},{C}_{2})}^{-1}\cdot O\left(1\right)\cdot ({C}_{1}^{2}{\alpha}_{n}^{2}+{C}_{2}^{2}{\beta}_{n}^{2})+o\left(1\right)\cdot ({C}_{1}^{2}{\alpha}_{n}^{2}+{C}_{2}^{2}{\beta}_{n}^{2}),$$

where *o*(1) and *O*(1) are the terms independent of *C*_{1} and *C*_{2}. By condition (B), we have

$$\mid {I}_{3}\mid =C\cdot O({\alpha}_{n}^{2}+{\beta}_{n}^{2})+{C}^{2}\cdot o({\alpha}_{n}^{2}+{\beta}_{n}^{2}),$$

which is dominated by *K*_{2} with large enough constants *C*_{1} and *C*_{2}. This completes the proof of the theorem. □

*Proof of Theorem 2*. For **Ω** a minimizer of (1.1), the derivative for *q*_{1}(**Ω**) w.r.t. *ω _{ij}* for $(i,j)\in {S}_{2}^{c}$ is

$$\frac{\partial {q}_{1}\left(\Omega \right)}{\partial {\omega}_{ij}}=2({s}_{ij}-{\sigma}_{ij}+{p}_{{\lambda}_{n1}}^{\prime}\left(\mid {\omega}_{ij}\mid \right)\mathrm{sgn}\left({\omega}_{ij}\right)),$$

where sgn(*a*) denotes the sign of *a*. If we can show that the sign of *q*_{1}(**Ω**)/*ω _{ij}* depends on sgn(

Decompose *s _{ij}* −

$${I}_{1}={s}_{ij}-{\sigma}_{ij}^{0},\phantom{\rule{1em}{0ex}}{I}_{2}={\sigma}_{ij}^{0}-{\sigma}_{ij}.$$

By Lemma 2 or Lemma A.3 of Bickel and Levina (2008b), it follows that max_{i,j} |*I*_{1}| = *O _{P}*({log

By Lemma 1, $\mid {\sigma}_{ij}-{\sigma}_{ij}^{0}\mid \le \parallel \Sigma -{\Sigma}_{0}\parallel $, which has order

$$\begin{array}{cc}\hfill \parallel \Sigma -{\Sigma}_{0}\parallel =& \parallel \Sigma (\Omega -{\Omega}_{0}){\Sigma}_{0}\parallel \hfill \\ \hfill \le & \parallel \Sigma \parallel \cdot \parallel \Omega -{\Omega}_{0}\parallel \cdot \parallel {\Sigma}_{0}\parallel \hfill \\ \hfill =& O\left(\parallel \Omega -{\Omega}_{0}\parallel \right),\hfill \end{array}$$

where we used condition (A) to get **Σ**_{0} = *O*(1), and using *η _{n}* → 0 so that ${\lambda}_{\mathrm{min}}(\Omega -{\Omega}_{0})=o\left(1\right)\phantom{\rule{thickmathspace}{0ex}}\mathrm{for}\phantom{\rule{thickmathspace}{0ex}}\parallel \Omega -{\Omega}_{0}\parallel =O\left({\eta}_{n}^{1\u22152}\right)$,

$$\begin{array}{cc}\hfill \parallel \Sigma \parallel ={\lambda}_{\mathrm{min}}^{-1}\left(\Omega \right)\le & {({\lambda}_{\mathrm{min}}\left({\Omega}_{0}\right)+{\lambda}_{\mathrm{min}}(\Omega -{\Omega}_{0}))}^{-1}\hfill \\ \hfill =& {(O\left(1\right)+o\left(1\right))}^{-1}=O\left(1\right).\hfill \end{array}$$

Hence, $\parallel \Omega -{\Omega}_{0}\parallel =O\left({\eta}_{n}^{1\u22152}\right)$ implies $\mid {I}_{2}\mid =O\left({\eta}_{n}^{1\u22152}\right)$.

Combining the last two results yields that

$$\begin{array}{cc}\hfill \underset{i,j}{\mathrm{max}}\mid {s}_{ij}-{\sigma}_{ij}\mid =& {O}_{P}(\mid {s}_{ij}-{\sigma}_{ij}^{0}\mid +{\eta}_{n}^{1\u22152})\hfill \\ \hfill =& {O}_{P}({\{\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n\}}^{1\u22152}+{\eta}_{n}^{1\u22152}).\hfill \end{array}$$

By conditions (C) and (D), we have

$${p}_{{\lambda}_{n1}}^{\prime}\left(\mid {\omega}_{ij}\mid \right)={C}_{3}{\lambda}_{n1}$$

for *ω _{ij}* in a small neighborhood of 0 (excluding 0 itself) and some positive constant

**Proof of Theorem 3**. Because of the similarity between equations (2.4) and (1.1), the Frobenius norm result has nearly identical proof as Theorem 1, except that we now set Δ_{U} = *α _{n}U*. For the operator norm result, we refer readers to the proof of Theorem 2 of Rothman

**Proof of Theorem 5**. The proof is similar to that of Theorem 1. We only sketch briefly the proof, pointing out the important differences.

Let *α _{n}* = (

$$P\left(\underset{U\in \mathcal{A}}{\mathrm{inf}}{q}_{2}({\Sigma}_{0}+{\Delta}_{U})>{q}_{2}\left({\Sigma}_{0}\right)\right)\to 1,$$

for sufficiently large constants *C*_{1}and *C*_{2}.

For **Σ** = **Σ**_{0} + Δ_{U}, the difference

$${q}_{2}\left(\Sigma \right)-{q}_{2}\left({\Sigma}_{0}\right)={I}_{1}+{I}_{2}+{I}_{3},$$

where

$${I}_{1}=\mathrm{tr}\left(S\Omega \right)+\mathrm{log}\mid \Sigma \mid -(\mathrm{tr}\left(S{\Omega}_{0}\right)+\mathrm{log}\mid {\Sigma}_{0}\mid ),$$

$${I}_{2}=\sum _{(i,j)\in {S}_{2}^{c}}({p}_{{\lambda}_{n2}}\left(\mid {\sigma}_{ij}\mid \right)-{p}_{{\lambda}_{n2}}\left(\mid {\sigma}_{ij}^{0}\mid \right)),$$

$${I}_{3}=\sum _{(i,j)\in {S}_{2},i\ne j}({p}_{{\lambda}_{n2}}\left(\mid {\sigma}_{ij}\mid \right)-{p}_{{\lambda}_{n2}}\left(\mid {\sigma}_{ij}^{0}\mid \right)),$$

with *I*_{1} = *K*_{1} + *K*_{2}, where

$$\begin{array}{c}{K}_{1}=-\mathrm{tr}\left((\mathrm{S}-{\Sigma}_{0}){\Omega}_{0}{\Delta}_{U}{\Omega}_{0}\right)=-\mathrm{tr}\left(({\mathrm{S}}_{{\Omega}_{0}}-{\Omega}_{0}){\Delta}_{U}\right),\hfill \\ {K}_{2}=\mathrm{vec}{\left({\Delta}_{U}\right)}^{T}\left\{{\int}_{0}^{1}g(v,{\Sigma}_{v})(1-v)dv,\right\}\mathrm{vec}\left({\Delta}_{U}\right),\hfill \end{array}$$

(5.6)

and **Σ**_{v} = **Σ**_{0} + *v*Δ_{U}, **S**_{Ω0} is the sample covariance matrix of a random sample {**x**_{i}}_{1≤i≤n} having **x**_{i} ~ *N*(**0**, **Ω**_{0}). Also,

$$g(v,{\Sigma}_{v})={\Sigma}_{v}^{-1}\otimes {\Sigma}_{v}^{-1}\mathrm{S}{\Sigma}_{v}^{-1}+{\Sigma}_{v}^{-1}\mathrm{S}{\Sigma}_{v}^{-1}\otimes {\Sigma}_{v}^{-1}-{\Sigma}_{v}^{-1}\otimes {\Sigma}_{v}^{-1}.$$

(5.7)

The treatment of *K*_{2}is different from that in Theorem 1. By condition (A), and (*p _{n}* +

$$\parallel v{\Delta}_{U}{\Omega}_{0}\parallel \le \parallel {\Delta}_{U}\parallel \parallel {\Omega}_{0}\parallel \le {\tau}_{1}^{-1}({C}_{1}{\alpha}_{n}+{C}_{2}{\beta}_{n})=O\left({\left(\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\right)}^{1-k}\right)=o\left(1\right).$$

Thus, we can use the Neumann series expansion to arrive at

$${\Sigma}_{v}^{-1}={\Omega}_{0}{(I+v{\Delta}_{U}{\Omega}_{0})}^{-1}={\Omega}_{0}(I-v{\Delta}_{U}{\Omega}_{0}+o\left(1\right)),$$

where the little *o* (or *o _{P}*,

$$\parallel \mathrm{S}-{\Sigma}_{0}\parallel ={O}_{P}\left(\parallel {\mathrm{S}}_{I}-I\parallel \right)={o}_{P}\left(1\right)$$

(see arguments in Lemma 3). These entail

$$\begin{array}{cc}\hfill \mathrm{S}{\Sigma}_{v}^{-1}=& (\mathrm{S}-{\Sigma}_{0}){\Sigma}_{v}^{-1}+{\Sigma}_{0}{\Sigma}_{v}^{-1}\hfill \\ \hfill =& {o}_{P}\left(1\right)+I+{O}_{P}({\alpha}_{n}+{\beta}_{n})\hfill \\ \hfill =& I+{o}_{P}\left(1\right).\hfill \end{array}$$

Combining these results, we have

$$g(v,{\Sigma}_{v})={\Omega}_{0}\otimes {\Omega}_{0}+{O}_{P}({\alpha}_{n}+{\beta}_{n}).$$

Consequently,

$$\begin{array}{cc}\hfill {K}_{2}=& \mathrm{vec}{\left({\Delta}_{U}\right)}^{T}\left\{{\int}_{0}^{1}{\Omega}_{0}\otimes {\Omega}_{0}(1+{o}_{P}\left(1\right))(1-v)dv\right\}\mathrm{vec}\left({\Delta}_{U}\right)\hfill \\ \hfill \ge & {\lambda}_{\mathrm{min}}({\Omega}_{0}\otimes {\Omega}_{0}){\parallel \mathrm{vec}\left({\Delta}_{U}\right)\parallel}^{2}\u22152\cdot (1+{o}_{P}\left(1\right))\hfill \\ \hfill =& {\tau}_{1}^{-2}({C}_{1}^{2}{\alpha}_{n}^{2}+{C}_{2}^{2}{\beta}_{n}^{2})\u22152\cdot (1+{o}_{P}\left(1\right)).\hfill \end{array}$$

All other terms are dealt with similarly as in the proof of Theorem 1, and hence we omit them. □

**Proof of Theorem 6**. The proof is similar to that of Theorem 2. We only show the main differences.

It is easy to show

$$\frac{\partial {q}_{2}\left(\Sigma \right)}{\partial {\sigma}_{ij}}=2(-{\left(\Omega \mathrm{S}\Omega \right)}_{ij}+{\omega}_{ij}+{p}_{{\lambda}_{n}}^{\prime}\left(\mid {\sigma}_{ij}\mid \right)\mathrm{sgn}\left({\sigma}_{ij}\right)).$$

Our aim is to estimate the order of |(−**ΩSΩ** + **Ω**)_{ij}|, finding an upper bound which is independent of both *i* and *j*.

Write

$$-\Omega \mathrm{S}\Omega +\Omega ={I}_{1}+{I}_{2},$$

where *I*_{1} = −**Ω**(**S** − **Σ**_{0})**Ω** and *I*_{2} = **Ω**(**Σ** − **Σ**_{0})**Ω**. Since

$$\begin{array}{cc}\hfill \parallel \Omega \parallel =& {\lambda}_{\mathrm{min}}^{-1}\left(\Sigma \right)\le {({\lambda}_{\mathrm{min}}\left({\Sigma}_{0}\right)+{\lambda}_{\mathrm{min}}(\Sigma -{\Sigma}_{0}))}^{-1}\hfill \\ \hfill =& {\tau}_{1}^{-1}+o\left(1\right),\hfill \end{array}$$

we have

$$\Omega ={\Omega}_{0}+(\Omega -{\Omega}_{0})={\Omega}_{0}-\Omega (\Sigma -{\Sigma}_{0}){\Omega}_{0}={\Omega}_{0}+\Delta ,$$

where $\parallel \Delta \parallel \le \parallel \Omega \parallel \cdot \parallel \Sigma -{\Sigma}_{0}\parallel \cdot \parallel {\Omega}_{0}\parallel =O\left({\eta}_{n}^{1\u22152}\right)=o\left(1\right)$ by Lemma 1, with **Σ**−**Σ**_{0}^{2} = *O*(*η _{n}*). Hence, we can apply Lemma 3 and conclude that max

For *I*_{2}, we have

$$\underset{i,j}{\mathrm{max}}\mid {\left({I}_{2}\right)}_{ij}\mid \le \parallel \Omega \parallel \cdot \parallel \Sigma -{\Sigma}_{0}\parallel \cdot \parallel \Omega \parallel =O\left(\parallel \Sigma -{\Sigma}_{0}\parallel \right)=O\left({\eta}_{n}^{1\u22152}\right).$$

Hence, we have

$$\underset{i,j}{\mathrm{max}}\mid {(-\Omega \mathrm{S}\Omega +\Omega )}_{ij}\mid =O({\{\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}{p}_{n}\u2215n\}}^{1\u22152}+{\eta}_{n}^{1\u22152}).$$

The rest goes similar to the proof of Theorem 2, and is omitted. □

**Proof of Theorem 7**. The proof is nearly identical to that of Theorem 5, except that we now set Δ_{U} = *α _{n}U*. The fact that ${\left({\widehat{\Gamma}}_{\mathrm{S}}\right)}_{ii}=1={\gamma}_{ii}^{0}$ has no estimation error eliminates an order (

For the operator norm result, we refer readers to the proof of Theorem 2 of Rothman *et al.* (2008). □

**Proof of Theorem 10**. For (**T**,**D**) a minimizer of (4.2), the derivative for *q*_{3}(**T**,**D**) w.r.t. *t _{ij}* for $(i,j)\in {S}_{3}^{c}$ is

$$\frac{\partial {q}_{3}(\mathrm{T},\mathrm{D})}{\partial {t}_{ij}}=2({\left({\mathrm{ST}}^{T}{\mathrm{D}}^{-1}\right)}_{ji}+{p}_{{\lambda}_{n3}}^{\prime}\left(\mid {t}_{ij}\mid \right)\mathrm{sgn}\left({t}_{ij}\right)).$$

Now **ST**^{T}**D**^{−1} = *I*_{1} + *I*_{2} + *I*_{3} + *I*_{4}, where

$${I}_{1}=(\mathrm{S}-{\Sigma}_{0}){\mathrm{T}}^{T}{\mathrm{D}}^{-1}\phantom{\rule{1em}{0ex}}{I}_{2}={\Sigma}_{0}{(\mathrm{T}-{\mathrm{T}}_{0})}^{T}{\mathrm{D}}^{-1},$$

$${I}_{3}={\Sigma}_{0}{\mathrm{T}}_{0}^{T}({\mathrm{D}}^{-1}-{\mathrm{D}}_{0}^{-1}),\phantom{\rule{1em}{0ex}}{I}_{4}={\Sigma}_{0}{\mathrm{T}}_{0}^{T}{\mathrm{D}}_{0}^{-1}.$$

By the MCD (4.1), ${I}_{4}={T}_{0}^{-1}$. Since *i* > *j* for $(i,j)\in {S}_{3}^{c}$, we must have ${\left({T}_{0}^{-1}\right)}_{ji}=0$. Hence, we can ignore *I*_{4}.

Since **T**−**T**_{0}^{2} = *O*(*η _{n}*) and

For *I*_{2}, we have ${\text{max}}_{ij}\mid {\left({I}_{2}\right)}_{ij}\mid \le \parallel {\Sigma}_{0}\parallel \cdot \parallel \mathrm{T}-{\mathrm{T}}_{0}\parallel \cdot \parallel {\mathrm{D}}^{-1}\parallel =O\left({\eta}_{n}^{1\u22152}\right)$. And finally. ${\text{max}}_{ij}\mid {\left({I}_{3}\right)}_{ij}\mid \le \parallel {\Sigma}_{0}\parallel \cdot \parallel {\mathrm{T}}_{0}\parallel \cdot \parallel {\mathrm{D}}^{-1}-{\mathrm{D}}_{0}^{-1}\parallel =O\left({\zeta}_{n}^{1\u22152}\right)$.

With all these, we have ${\text{max}}_{\left(ij\right)\in {S}_{3}^{c}}{\mid {\left({\mathrm{ST}}^{T}{\mathrm{D}}^{-1}\right)}_{ij}\mid}^{2}=\mathrm{log}{p}_{n}\u2215n+{\eta}_{n}+{\zeta}_{n}$. The rest of the proof goes like that of Theorem 2 or 6. □

^{*}Financial support from the NSF grant DMS-0354223, DMS-0704337 and NIH grant R01-GM072611 is gratefully acknowledged.

*AMS 2000 subject classifications*. Primary 62F12; secondary 62J07.

Clifford Lam, Department of Statistics, London School of Economics and Political Science, London, WC2A 2AE (Email: ku.ca.esl@2maL.C)

Jianqing Fan, Department of Operation Research and Financial Engineering, Princeton University, Princeton, NJ 08544 (Email: ude.notecnirp@nafqj)

[1] Bai Z, Silverstein JW. Spectral Analysis of Large Dimensional Random Matrices. Science Press; Beijing: 2006.

[2] Bickel PJ, Levina E. Covariance Regularization by Thresholding. Ann. Statist. 2008a;36(6):2577–2604.

[3] Bickel PJ, Levina E. Regularized Estimation of Large Covariance Matrices. Ann. Statist. 2008b;36(1):199–227.

[4] Cai T, Zhang C-H, Zhou H. Optimal rates of convergence for co-variance matrix estimaiton. the Wharton School, University of Pennsylvania; 2009. Technical report.

[5] d’Aspremont A, Banerjee O, El Ghaoui L. First-order Methods For Sparse Covariance Selection. SIAM. J. Matrix Anal. and Appl. 2008;30(1):56–66.

[6] Dempster AP. Covariance Selection. Biometrics. 1972;28:157–175.

[7] Diggle P, Verbyla A. Nonparametric Estimation of Covariance Structure in Longitudinal Data. Biometrics. 1998;54(2):401–415. [PubMed]

[8] El Karoui N. Operator Norm Consistent Estimation of a Large Dimensional Sparse Covariance Matrices. Ann. Statist. 2008;36(6):2717–2756.

[9] Fan J, Feng Y, Wu Y. Network Exploration via the Adaptive LASSO and SCAD Penalties. Annals of Applied Statistics. 2009;3(2):521–541. [PMC free article] [PubMed]

[10] Fan J, Li R. Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties. J. Amer. Statist. Assoc. 2001;96:1348–1360.

[11] Fan J, Peng H. Nonconcave Penalized Likelihood With a Diverging Number of Parameters. Ann. Statist. 2004;32:928–961.

[12] Friedman J, Hastie T, Tibshirani R. Sparse Inverse Covariance Estimation with the Graphical LASSO. Biostatistics. 2008;9(3):432–441. [PMC free article] [PubMed]

[13] Huang J, Horowitz J, Ma S. Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann. Statist. 2008;36:587–613.

[14] Huang J, Liu N, Pourahmadi M, Liu L. Covariance Matrix Selection and Estimation via Penalised Normal Likelihood. Biometrika. 2006;93(1):85–98.

[15] Levina E, Rothman AJ, Zhu J. Sparse Estimation of Large Covariance Matrices via a Nested Lasso Penalty. Ann. Applied Statist. 2008;2(1):245–263.

[16] Meier L, van de Geer S, Bühlmann P. The group Lasso for logistic regression. Journal of the Royal Statistical Society, B. 2008;70:53–71.

[17] Meinshausen N, Bühlmann P. High dimensional graphs and variable selection with the Lasso. Ann. Statist. 2006;34:1436–1462.

[18] Pourahmadi M. Joint Mean-Covariance Models with Applications to Longitudinal Data: Unconstrained Parameterisation. Biometrika. 1999;86:677–690.

[19] Ravikumar P, Lafferty J, Liu H, Wasserman L. Advances in Neural Information Processing Systems. MIT Press; 2008. Sparse additive models; p. 20.

[20] Rothman AJ, Bickel PJ, Levina E, Zhu J. Sparse Permutation Invariant Covariance Estimation. Electron. J. Statist. 2008;2:494–515.

[21] Smith M, Kohn R. Parsimonious Covariance Matrix Estimation for Longitudinal Data. J. Amer. Statist. Assoc. 2002;97(460):1141–1153.

[22] Wagaman AS, Levina E. Discovering sparse covariance structures with the Isomap. Journal of Computational and Graphical Statistics. 2008;18 to appear.

[23] Wong F, Carter C, Kohn R. Efficient Estimation of Covariance Selection Models. Biometrika. 2003;90:809–830.

[24] Wu WB, Pourahmadi M. Nonparametric Estimation of Large Covariance Matrices of Longitudinal Data. Biometrika. 2003;94:1–17.

[25] Yuan M, Lin Y. Model Selection and Estimation in the Gaussian Graphical Model. Biometrika. 2007;90:831–844.

[26] Zhang CH. Penalized Linear Unbiased Selection. the statistics dept., Rutgers University; 2007. Technical report 2007-003.

[27] Zhao P, Yu B. On Model Selection Consistency of Lasso. Journal of Machine Learning Research. 2006;7:2541–2563.

[28] Zou H. The adaptive Lasso and its oracle properties. J. Amer. Statist. Assoc. 2006;101:1418–1429.

[29] Zou H, Li R. One-step Sparse Estimates in Nonconcave Penalized Likelihood Models (With Discussion) Ann. Statist. 2008;36(4):1509–1533. [PMC free article] [PubMed]

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |