PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
 
Ann Stat. Author manuscript; available in PMC 2010 December 1.
Published in final edited form as:
Ann Stat. 2009; 37(6B): 4254–4278.
doi:  10.1214/09-AOS720
PMCID: PMC2995610
NIHMSID: NIHMS248826

Sparsistency and Rates of Convergence in Large Covariance Matrix Estimation*

Clifford Lam, Lecturer and Jianqing Fan, Professor

Abstract

This paper studies the sparsistency and rates of convergence for estimating sparse covariance and precision matrices based on penalized likelihood with nonconvex penalty functions. Here, sparsistency refers to the property that all parameters that are zero are actually estimated as zero with probability tending to one. Depending on the case of applications, sparsity priori may occur on the covariance matrix, its inverse or its Cholesky decomposition. We study these three sparsity exploration problems under a unified framework with a general penalty function. We show that the rates of convergence for these problems under the Frobenius norm are of order (sn log pn/n)1/2, where sn is the number of nonzero elements, pn is the size of the covariance matrix and n is the sample size. This explicitly spells out the contribution of high-dimensionality is merely of a logarithmic factor. The conditions on the rate with which the tuning parameter λn goes to 0 have been made explicit and compared under different penalties. As a result, for the L1-penalty, to guarantee the sparsistency and optimal rate of convergence, the number of nonzero elements should be small: sn=O(pn) at most, among O(pn2) parameters, for estimating sparse covariance or correlation matrix, sparse precision or inverse correlation matrix or sparse Cholesky factor, where sn is the number of the nonzero elements on the off-diagonal entries. On the other hand, using the SCAD or hard-thresholding penalty functions, there is no such a restriction.

Keywords: Covariance matrix, high dimensionality, consistency, nonconcave penalized likelihood, sparsistency, asymptotic normality

1 Introduction

Covariance matrix estimation is a common statistical problem in many scientific applications. For example, in financial risk assessment or longitudinal study, an input of covariance matrix Σ is needed, whereas an inverse of the covariance matrix, the precision matrix Σ−1, is required for optimal portfolio selection, linear discriminant analysis or graphical network models. Yet, the number of parameters in the covariance matrix grows quickly with dimensionality. Depending on the applications, the sparsity of the covariance matrix or precision matrix is frequently imposed to strike a balance between biases and variances. For example, in longitudinal data analysis [see e.g., Diggle and Verbyla (1998), or Bickel and Levina (2008b)], it is reasonable to assume that remote data in time are weakly correlated, whereas in Gaussian graphical models, the sparsity of the precision matrix is a reasonable assumption (Dempster (1972)).

This initiates a series of researches focusing on the parsimony of a covariance matrix. Smith and Kohn (2002) used priors which admit zeros on the off-diagonal elements of the Cholesky factor of the precision matrix Ω = Σ−1, while Wong, Carter and Kohn (2003) used zero-admitting prior directly on the off-diagonal elements of Ω to achieve parsimony. Wu and Pourahmadi (2003) used the Modified Cholesky Decomposition (MCD) to find a banded structure for Ω nonparametrically for longitudinal data. Bickel and Levina (2008b) developed consistency theories on banding methods for longitudinal data, for both Σ and Ω.

Various authors have used penalized likelihood methods to achieve parsimony on covariance selection. Fan and Peng (2004) has laid down a general framework for penalized likelihood with diverging dimensionality, with general conditions for the oracle property stated and proved. However, it is not clear whether it is applicable to the specific case of covariance matrix estimation. In particular, they did not link the dimensionality pn with the number of nonzero elements sn in the true covariance matrix Σ0, or the precision matrix Ω0. A direct application of their results to our setting can only handle a relatively small covariance matrix of size pn = o(n1/10).

Recently, there is a surge of interest on the estimation of sparse covariance matrix or precision matrix using penalized likelihood method. Huang, Liu, Pourahmadi and Liu (2006) used the LASSO on the off-diagonal elements of the Cholesky factor from MCD, while Meinshausen and Bühlmann (2006), d’Aspremont, Banerjee, and El Ghaoui (2008) and Yuan and Lin (2007) used different LASSO algorithms to select zero elements in the precision matrix. A novel penalty called the nested Lasso was constructed in Levina, Rothman and Zhu (2008) to penalize off-diagonal elements. Thresholding the sample covariance matrix in high-dimensional setting was thoroughly studied by El Karoui (2008) and Bickel and Levina (2008a) and Cai, Zhang and Zhou (2009) with remarkable results for high dimensional applications. However, it is not directly applicable to estimating sparse precision matrix when the dimensionality pn is greater than the sample size n. Wagaman and Levina (2008) proposed an Isomap method for discovering meaningful orderings of variables based on their correlations that result in block-diagonal or banded correlation structure, resulting in an Isoband estimator. A permutation invariant estimator, called SPICE, was proposed in Rothman, Bickel, Levina and Zhu (2008) based on penalized likelihood with L1-penalty on the off-diagonal elements for the precision matrix. They obtained remarkable results on the rates of convergence. The rate for estimating Ω under the Frobenius norm is of order (sn log pn/n)1/2, with dimensionality cost only a logarithmic factor in the overall mean-square error, where sn = pn + sn1, pn is the number of the diagonal elements and sn1 is the number of the nonzero off-diagonal entries. However, such a rate of convergence neither addresses explicitly the issues of sparsistency such as those in Fan and Li (2001) and Zhao and Yu (2006), nor the bias issues due to the L1-penalty and the sampling distribution of the estimated nonzero elements. These are the core issues of the study. By sparsistency, we mean the property that all parameters that are zero are actually estimated as zero with probability tending to one, a weaker requirement than that of Ravikumar, Lafferty, Liu and Wasserman (2008).

In this paper, we investigate the aforementioned problems using the penalized pseudo-likelihood method. Assume a random sample {yi}1≤in with mean zero and covariance matrix Σ0, satisfying some sub-Gaussian tails conditions as specified in Lemma 2 (see Section 5). The sparsity of the true precision matrix Ω0 can be explored by maximizing the Gaussian quasi-likelihood or equivalently minimizing

q1(Ω)=tr(SΩ)logΩ+ijpλn1(ωij),
(1.1)

which is the penalized negative log-likelihood if the data is Gaussian. The matrix S=n1i=1nyiyiT is the sample covariance matrix, Ω = (ωij), and pλn1(·) is a penalty function, depending on a regularization parameter λn1, which can be nonconvex. For instance, the L1-penalty pλ(θ) = λ|θ| is convex, while the hard-thresholding penalty defined by pλ(θ) = λ2 − (|θ| − λ)21{|θ|<λ}, and the SCAD penalty defined by

pλ(θ)=λ1{θλ}+(aλθ)+1{θ>λ}(a1),for somea>2,
(1.2)

are folded-concave. Nonconvex penalty is introduced to reduce bias when the true parameter has a relatively large magnitude. For example, the SCAD penalty remains constant when θ is large, while the L1-penalty grows linearly with θ. See Fan and Li (2001) for a detailed account of this and other advantages of such a penalty function. The computation can be done via the local linear approximation (Zhou and Li, 2008, Fan et al. 2009); see Section 2.1 for additional details.

Similarly, the sparsity of the true covariance matrix Σ0 can be explored by minimizing

q2(Σ)=tr(SΣ1)+logΣ+ijpλn2(σij),
(1.3)

where Σ = (σij). Note that we only penalize the off-diagonal elements of Σ or Ω in the aforementioned two methods, since the diagonal elements of Σ0 and Ω0 do not vanish.

In studying a sparse covariance or precision matrix, it is important to distinguish between the diagonal and off-diagonal elements, since the diagonal elements are always positive and they contribute to the overall mean-squares errors. For example, the true correlation matrix, denoted by Γ0, has the same sparsity structure as Σ0 without the need to estimating its diagonal elements. In view of this fact, we introduce a revised method (3.2) to take this advantage. It turns out that the correlation matrix can be estimated with a faster rate of convergence, at (sn1 log pn/n)1/2 instead of ((pn + sn1) log pn/n)1/2, where sn1 is the number of nonzero correlation coefficients. We can take similar advantages over the estimation of the true inverse correlation matrix, denoted by Ψ0. See Section 2.5. This is an extension of the work of Rothman et al. (2008) using the L1-penalty. Such an extension is important since the nonconcave penalized likelihood ameliorates the bias problem for the L1-penalized likelihood.

The bias issues of the commonly used L1-penalty, or LASSO, can be seen from our theoretical results. In fact, due to the bias of LASSO, an upper bounded of λni is needed in order to achieve fast rate of convergence. On the other hand, a lower bound is required in order to achieve sparsity of estimated precision or covariance matrices. This is in fact one of the motivations for introducing nonconvex penalty functions in Fan and Li (2001) and Fan and Peng (2004), but we state and prove the explicit rates in the current context. In particular, we demonstrate that the L1-penalized estimator can achieve simultaneously the optimal rate of convergence and sparsistency for estimation of Σ0 or Ω0 when the number of nonzero elements in the off-diagonal entries are no larger than O(pn), but not guaranteed so otherwise. On the other hand, using the nonconvex penalties like the SCAD or hard-thresholding penalty, such an extra restriction is not needed.

We also compare two different formulations of penalized likelihood using the modified Cholesky decomposition, exploring their respective rates of convergence and sparsity properties.

Throughout this paper, we use λmin(A), λmax(A) and tr(A) to denote the minimum eigenvalue, maximum eigenvalue, and trace of a symmetric matrix A respectively. For a matrix B, we define the operator norm and the Frobenius norm, respectively, as B=λmax12(BTB) and ||B||F = tr1/2(BTB).

2 Estimation of sparse precision matrix

In this section, we present the analysis of (1.1) for estimating a sparse precision matrix. Before this, let us first present an algorithm for computing the nonconcave maximum (pseudo)-likelihood estimator and then state the conditions needed for our technical results.

2.1 Algorithm based on iterated reweighted L1-penalty

The computation of the nonconcave maximum likelihood problems can be solved by a sequence of L1-penalized likelihood problems via local linear approximation (Zou and Li 2008, Fan et al. 2009). For example, given the current estimate Ωk = (ωij,k), by the local linear approximation to the penalty function,

q1(Ω)tr(SΩ)logΩ+ij[pλn1(ωij,k)+pλn1(ωij,k)(ωijωij,k)].
(2.1)

Hence, Ωk+1 should be taken to maximize the right-hand side of (2.1):

Ωk+1=argmaxΩ[tr(SΩ)logΩ+ijpλn1(ωij,k)ωij],
(2.2)

after ignoring the two constant terms. Problem (2.2) is the weighted penalized L1-likelihood. In particular, if we take the most primitive initial value Ω0 = 0, then

Ω1=argmaxΩ[tr(SΩ)logΩ+λn1ijωij],

is already a good estimator. Iterations of (2.2) reduces the biases of the estimator, as larger estimated coefficients in the previous iterations receive less penalty. In fact, in a different setup, Zou and Li (2008) showed that one iteration of such a procedure is sufficient as long as the initial values are good enough.

Fan et al. (2009) has implemented the above algorithm for optimizing (1.1). They have also demonstrated in Section 2.2 in their paper how to utilize the graphical lasso algorithm of Friedman, Hastie and Tibshirani (2008), which is essentially a group coordinate descent procedure, to solve problem (2.2) quickly, even when pn > n. Such a group coordinate decent algorithm was also used by Meier et al. (2008) to solve the group LASSO problem. Thus iteratively, (2.2), and hence (1.1), can be solved quickly with the graphical lasso algorithm. See also Zhang (2007) for a general solution to the folded-concave penalized least-squares problem. The following is a brief summary of the numerical results in Fan et al. (2009).

2.2 Some numerical results

We give a brief summary of a breast cancer data analysis with pn > n considered in Fan et al. (2009). For full details, please refer to Section 3.2 of Fan et al. (2009). Other simulation results are also in Section 4 in their paper.

Breast cancer data

Normalized gene expression data from 130 patients with stage I-III breast cancers are analyzed, with 33 of them belong to class 1 and 97 belong to class 2. The aim is to assess prediction accuracy in predicting which class a patient will belong to, using a set of pre-selected genes (pn = 110, chosen by t-tests) as gene expression profile data. The data is randomly divided into training (n = 109) and testing sets. The mean vector for the genes expression levels is obtained from the training data, as well as the associated inverse covariance matrix estimated using LASSO, adaptive LASSO and SCAD penalties as three different regularization methods. A linear discriminant score is then calculated for each regularization method and applied to the testing set to predict if a patient belongs to class 1 or 2. This is repeated 100 times.

On average, the estimated precision matrix Ω^ using LASSO has many more nonzeros than that using SCAD (3923 versus 674). This is not surprising when we look at equation (2.3) in our paper, where the L1 penalty imposes an upper bound on the tuning parameter λn1 for consistency, which links to reducing the bias in the estimation. This makes the λn1 in practice too small to set many of the elements in Ω^ to zero. While we do not know which elements in the true Ω are zero, the large number of nonzero elements in the L1 penalized estimator seems spurious, and the resulting gene network is not easy to interpret.

On the other hand, SCAD-penalized estimator has a much smaller number of nonzero elements, since the tuning parameter λn1 is not bounded above under consistency of the resulting estimator. This makes the resulting gene network easier to interpret, with some clusters of genes identified.

Also, classification results on the testing set using the SCAD penalty for precision matrix estimation is better than that using the L1 penalty, in the sense that the specificity (#True Negative/#class 2) is higher (0.794 to 0.768) while the sensitivity (#True Positive/#class 1) is similar to that using L1-penalized precision matrix estimator.

2.3 Technical conditions

We now introduce some notations and present regularity conditions for the rate of convergence and sparsistency.

Let S1={(i,j):ωij00}, where Ω0=(ωij0) is the true precision matrix. Denote by sn1 = |S1| − pn, which is the number of nonzero elements in the off-diagonal entries of Ω0. Define

an1=max(i,j)S1pλn1(ωij0),bn1=max(i,j)S1pλn1(ωij0).

The term an1 is related to the asymptotic bias of the penalized likelihood estimate due to penalization. Note that for L1-penalty, an1 = λn1 and bn1 = 0, whereas for SCAD, an1 = bn1 = 0 for sufficiently large n under the last assumption of condition (B) below.

We assume the following regularity conditions:

  1. There are constants τ1 and τ2 such that
    0<τ1<λmin(Σ0)λmax(Σ0)<τ2<for alln.
  2. an1 = O({(1 + pn/(sn1 + 1)) log pn/n}1/2), bn1 = o(1), and min(i,j)S1ωij0λn1 as n → ∞.
  3. The penalty pλ(·) is singular at the origin, with limt↓0 pλ(t)/(λt) = k > 0.
  4. There are constants C and D such that, when θ1,θ2<Cλn1,pλn1(θ1)pλn1(θ2)Dθ1θ2.

Condition (A) bounds uniformly the eigenvalues of Σ0, which facilitates the proof of consistency. It also includes a wide class of covariance matrices as noted in Bickel and Levina (2008b). The rates an1 and bn1 in condition (B) are also needed for proving consistency. If they are too large, the bias due to penalty can dominate the variance from the likelihood, resulting in poor estimates.

The last requirement in condition (B) states the rate at which the nonzero parameters should be distinguished from zero asymptotically. It is not explicitly needed in the proofs, but for asymptotically unbiased penalty functions, this is a necessary condition so that an1 and bn1 are converging to zero fast enough as needed in the first part of condition (B). In particular, for the SCAD and hard-thresholding penalty functions, this condition implies that an1 = bn1 = 0 exactly for sufficiently large n, thus allowing a flexible choice of λn1. For the SCAD penalty (1.2), the condition can be relaxed as min(i,j)S1ωij0λn1<a.

The singularity in condition (C) gives sparsity in the estimates [Fan and Li (2001)]. Finally, condition (D) is a smoothing condition for the penalty function, and is needed in proving asymptotic normality. The SCAD penalty, for instance, satisfies this condition by choosing the constant D, independent of n, to be large enough.

2.4 Properties of sparse precision matrix estimation

Minimizing (1.1) involves nonconvex minimization, and we need to prove that there exists a local minimizer Ω^ for the minimization problem with a certain rate of convergence, which is given under the Frobenius norm. The proof is given in Section 5. It is similar to the one given in Rothman et al. (2008), but now the penalty function is nonconvex.

Theorem 1 (Rate of convergence). Under regularity conditions (A)-(D), if (pn+sn1)logpnn=O(λn12) and (pn + sn1)(log pn)k/n = O(1) for some k > 1, then there exists a local minimizer Ω^ such that Ω^Ω0F2=OP{(pn+sn1)logpnn}. For the L1-penalty, we only need logpnn=O(λn12).

The proofs of this theorem and others are relegated to Section 5 so that readers can get more quickly what the results are. As in Fan and Li (2001), the asymptotic bias due to the penalty for each nonzero parameter is an1. Since we penalized only on the off-diagonal elements, the total bias induced by the penalty is asymptotically of order sn1an1. The square of this total bias over all nonzero elements is of order OP{(pn + sn1) log pn/n} under condition (B).

Theorem 1 states explicitly how the number of nonzero elements and dimensionality affect the rate of convergence. Since there are (pn + sn1) nonzero elements and each of them can be estimated at best with rate n−1/2, the total square errors are at least of rate (pn + sn1)/n. The price that we pay for high-dimensionality is merely a logarithmic factor log pn. The results holds as long as (pn+sn1)/n is at a rate O((log pn)k) with some k > 1, which decays to zero slowly. This means that in practice pn can be comparable to n without violating the results. The condition here is not minimum possible; we expect it holds for p [dbl greater-than sign] n. Here, we refer the local minimizer as an interior point within a given close set such that it minimizes the target function. Following a similar argument to Huang et al. (2008), the local minimizer in Theorem 1 can be taken as the global minimizer with additional conditions on the tail of the penalty function.

Theorem 1 is also applicable to the L1-penalty function, where the local minimizer becomes the global minimizer. The asymptotic bias of the L1-penalized estimate is given in the term sn1an1 = sn1λn1 as shown in the technical proof. In order to control the bias, we impose condition (B), which entails an upper bound on λn1 = O({(1+pn/(sn1+1)) log pn/n}1/2). The bias problem due to the L1-penalty for finite parameter has already been unveiled by Fan and Li (2001) and Zou (2006).

Next we show the sparsistency of the penalized estimator from (1.1). We use Sc to denote the complement of a set S.

Theorem 2 (Sparsistency). Under the conditions given in Theorem 1, for any local minimizer of (1.1) satisfying Ω^Ω0F2=OP{(pn+sn1)logpnn} and Ω^Ω02=OP(ηn) for a sequence of ηn0, if logpnn+ηn=O(λn12), then with probability tending to 1, ω^ij=0 for all (i,j)S1c.

First, since M2MF2 for any matrix M, we can always take ηn = (pn + sn1) log pn/n in Theorem 2, but this will result in more stringent requirement on the number of zero elements when L1-penalty is used, as we now explain. The sparsistency requires a lower bound on the rate of the regularization parameter λn1. On the other hand, condition (B) imposes an upper bound on λn1 when L1-penalty is used in order to control the biases. Explicitly, we need, for L1-penalized likelihood,

logpnn+ηn=O(λn12)=(1+pn(sn1+1))logpnn
(2.3)

for both consistency and sparsistency to be satisfied. We present two scenarios here for the two bounds to be compatible, making use of the inequalities MF2pnM2MF2 for a matrix M of size pn.

  1. We always have Ω^Ω0Ω^Ω0F. In the worst case scenario where they have the same order, Ω^Ω02=OP((pn+sn1)logpnn), so that ηn = (pn+sn1) log pn/n. It is then easy to see from (2.3) that the two bounds are compatible only when sn1 = O(1).
  2. We also have Ω^Ω0F2pnΩ^Ω02. In the optimistic scenario where they have the same order,
    Ω^Ω02=OP((1+sn1pn)logpnn).
    Hence, ηn = (1 + sn1/pn) log pn/n, and compatibility of the bounds requires sn1 = O(pn).

Hence, even in the optimistic scenario, consistency and sparsistency are guaranteed only when sn1 = O(pn) if the L1-penalty is used, i.e., the precision matrix has to be sparse enough.

However, if the penalty function used is unbiased, like the SCAD or the hard-thresholding penalty, we do not impose an extra upper bound for λn1 since its first derivative pλn1(θ) goes to zero fast enough as |θ| increases (exactly equals zero for the SCAD and hard-thresholding penalty functions, when n is sufficiently large; see condition (B) and the explanation thereof). Thus, λn1 is allowed to decay to zero slowly, allowing even the largest order sn1=O(pn2).

We remark that asymptotic normality for the estimators of the elements in S1 have been established in a previous version of this paper. We omit it here for brevity.

2.5 Properties of sparse inverse correlation matrix estimation

The inverse correlation matrix Ψ0 retains the same sparsity structure of Ω0. Consistency and sparsistency results can be achieved with pn as large as log pn = o(n), as long as (sn1 + 1)(log pn)k/n = O(1) for some k > 1 as n → ∞. We minimize, w.r.t. Ψ = (ψij),

tr(ΨΓ^S)logΨ+ijpνn1(ψij),
(2.4)

where Γ^S=W^1SW^1 is the sample correlation matrix, with W^2=DS being the diagonal matrix with diagonal elements of S, and υn1 is a regularization parameter. After obtaining Ψ^, Ω0 can also be estimated by Ω~=W^1Ψ^W^1.

To present the rates of convergence for Ψ^ and Ω~, we define

cn1=max(i,j)S1pνn1(ψij0),dn1=max(i,j)S1pνn1(ψij0),

where Ψ0=(ψij0) and modify condition (D) to (D’) with λn1 there replaced by υn1, and impose (B’) cn1 = O({log pn/n}1/2), dn1 = o(1). Also, min(i,j)S1ψij0νn1 as n → ∞.

Theorem 3 Under regularity conditions (A),(B’),(C) and (D’), if (sn1+1)(log pn)k/n = O(1) for some k > 1 and (sn1+1)logpnn=o(νn12), then there exists a local minimizer Ψ^ for (2.4) such that Ψ^Ψ0F2=OP(sn1logpnn) and Ω~Ω02=OP((sn1+1)logpnn) under the operator norm. For the L1-penalty, we only need logpnn=O(νn12)

Note that we can allow pn [dbl greater-than sign] n without violating the result as long as log pn/n = o(1). Note also that an order of {pn log pn/n}1/2 is removed by estimating the inverse correlation rather than the precision matrix, which is somewhat surprising since the inverse correlation matrix, unlike the correlation matrix, does not have known diagonal elements that contribute no errors to the estimation. This can be explained and proved as follows. If sn1 = O(pn), the result is obvious. When sn1 = o(pn), most of the off-diagonal elements are zero. Indeed, there are at most O(sn1) columns of the inverse correlation matrix which contain at least one nonzero element. The rest of the columns that have all zero off-diagonal elements must have diagonal entries 1. These columns represent variables that are actually uncorrelated from the rest. Now, it is easy to see from (2.4) that these diagonal elements, which are one, are all estimated exactly as one with no estimation error. Hence, an order of (pn log pn/n)1/2 is not present even in the case of estimating the inverse correlation matrix.

For the L1-penalty, our result reduces to that given in Rothman et al. (2008). We offer the sparsistency result as follows.

Theorem 4 (Sparsistency) Under the conditions given in Theorem 3, for any local minimizer of (2.4) satisfying Ψ^Ψ0F2=OP(sn1logpnn) and Ψ^Ψ02=OP(nn) for some ηn → 0, if logpnn+nn=O(νn12), then with probability tending to 1, ψ^ij=0 for all (i,j)S1c.

The proof follows exactly the same as that for Theorem 2 in Section 2.4, and is thus omitted.

For the L1-penalty, control of bias and sparsistency require υn1 to satisfy bounds like (2.3):

logpnn+ηn=O(νn12)=logpnn.
(2.5)

This leads to two scenarios:

  1. The worst case scenario has
    Ψ^Ψ02=Ψ^Ψ0F2=OP(sn1logpnn),
    meaning ηn = sn1 log pn/n. Then compatibility of the bounds in (2.5) requires sn1 = O(1).
  2. The optimistic scenario has
    Ψ^Ψ02=Ψ^Ψ0F2pn=OP(sn1pnlogpnn),
    meaning ηn = sn1/pn · log pn/n. Then compatibility of the bounds in (2.5) requires sn1 = O(pn).

On the other hand, for penalties like the SCAD or the hard-thresholding penalty, we do not need an upper bound for sn1. Hence, we only need (sn + 1)(log pn)k/n = O(1) as n → ∞ for some k > 1. It is clear that SCAD results in better sampling properties than the L1-penalized estimator in precision or inverse correlation matrix estimation.

3 Estimation of sparse covariance matrix

In this section, we analyze the sparse covariance matrix estimation using the penalized likelihood (1.3). Then it is modified to estimating the correlation matrix, which improves the rate of convergence. We assume that the yi’s are i.i.d. N(0, Σ0) throughout this section.

3.1 Properties of sparse covariance matrix estimation

Let S2={(i,j):σij00}, where Σ0=(σij0). Denote sn2 = |S2| − pn, so that sn2 is the number of nonzero elements in Σ0 on the off-diagonal entries. Put

an2=max(i,j)S2pλn2(σij0),bn2=max(i,j)S2pλn2(σij0).

Technical conditions in Section 2 need some revision. In particular, condition (D) now becomes condition (D2) with λn1 there replaced by λn2. Condition (B) should now be (B2) an2 = O({(1 + pn/(sn2 + 1)) log pn/n}1/2), bn2 = o(1), and min(i,j)S2σij0λn2 as n → ∞.

Theorem 5 (Rate of convergence). Under regularity conditions (A), (B2), (C) and (D2), if (pn + sn2)(log pn)k/n = O(1) for some k > 1 and (pn + sn2) log pnn=O(λn22), then there exists a local minimizer Σ^ such that Σ^Σ0F2=Op((pn+sn2)logpnn). For the L1-penalty, we only need logpnn=O(λn22).

Like the case for precision matrix estimation, the asymptotic bias due to the L1-penalty is of order sn2an2 = sn2λn2. To control this term, for the L1-penalty, we require λn2 = O({(1 + pn/(sn2 + 1)) log pn/n}1/2).

Theorem 6 (Sparsistency). Under the conditions given in Theorem 5, for any local minimizer Σ^ of (1.3) satisfying Σ^Σ0F2=OP((pn+sn2)logpnn) and Σ^Σ02=OP(nn) for some ηn → 0, if logpnn+nn=O(λn22), then with probability tending to 1, σ^ij=0 for all (i,j)S2c.

For the L1-penalized likelihood, controlling of bias for consistency together with sparsistency requires

logpnn+ηn=O(λn22)=(1+pn(sn2+1))logpnn.
(3.1)

This is the same condition as (2.3), and hence in the worst case scenario where

Σ^Σ02=Σ^Σ0F2=OP((pn+sn2)logpnn),

we need sn2 = O(1). In the optimistic scenario where

Σ^Σ02=Σ^Σ0F2pn,

we need sn2 = O(pn). In both cases, the matrix Σ0 has to be very sparse, but the former is much sparser.

On the other hand, if unbiased penalty functions like the SCAD or hard-thresholding penalty are used, we do not need an upper bound on λn2 since the bias an2 = 0 for sufficiently large n. This gives more flexibility on the order of sn2.

Similar to Section 2, asymptotic normality for the estimators of the elements in S2 can be proved under certain assumptions.

3.2 Properties of sparse correlation matrix estimation

The correlation matrix Γ0 retains the same sparsity structure of Σ0 with known diagonal elements. This special structure allows us to estimate Γ0 more accurately. To take advantage of the known diagonal elements, the sparse correlation matrix Γ0 is estimated by minimizing w.r.t. Γ = (γij),

tr(Γ1Γ^S)+logΓ+ijpνn2(γij),
(3.2)

where υn2 is a regularization parameter. After obtaining Γ^ Σ0 can be estimated by Σ~=W^Γ^W^.

To present the rates of convergence for Γ^ and Σ~, we define

cn2=max(i,j)S2pνn2(γij0),dn2=max(i,j)S2pνn2(γij0),

where Γ0=(γij0). We modify condition (D) to (D2′) with λn2 there replaced by υn2, and (B) to (B2′) as follows: (B2′) cn2 = O({log pn/n}1/2), dn2 = o(1), and min(i,j)S2γij0νn2 as n → ∞.

Theorem 7 Under regularity conditions (A),(B2′),(C) and (D2′), if (pn+sn2)(log pn)k/n = O(1) for some k > 1 and (sn2+1)logpnn=o(νn22), then there exists a local minimizer Γ^ for (3.2) such that

Γ^Γ0F2=OP(sn2logpnn).

In addition, for the operator norm, we have

Σ~Σ02=OP{(sn2+1)logpnn}.

For the L1-penalty, we only need logpnn=O(νn22).

The proof is sketched in Section 5. This theorem shows that the correlation matrix, like the inverse correlation matrix, can be estimated more accurately, since diagonal elements are known to be one.

Theorem 8 (Sparsistency). Under the conditions given in Theorem 7, for any local minimizer Γ^ of (3.2) satisfying Γ^Γ0F2=OP(sn2logpnn) and Γ^Γ02=OP(nn) for some ηn → 0 , if logpnn+nn=O(νn22), then with probability tending to 1, γ^ij=0 for all (i,j)S2c.

The proof follows exactly the same as that of Theorem 6 in Section 5, and is omitted. For the L1-penalized likelihood, controlling of bias and sparsistency requires

logpnn+ηn=O(νn22)=logpnn.
(3.3)

This is the same condition as (2.5), hence in the worst scenario where

Γ^Γ02=Γ^Γ0F2=OP(sn2logpnn),

we need sn2 = O(1). In the optimistic scenario where

Γ^Γ02=Γ^Γ0F2pn=OP(sn2pnlogpnn),

we need sn2 = O(pn).

The use of unbiased penalty functions like the SCAD or the hard-thresholding penalty, similar to results in the previous sections, does not impose an upper bound on the regularization parameter since bias cn2 = 0 for sufficiently large n. This gives more flexibility to the order of sn2 allowed.

4 Extension to sparse Cholesky decomposition

Pourahmadi (1999) proposed the modified Cholesky decomposition (MCD) which facilitates the sparse estimation of Ω through penalization. The idea is to represent zero-mean data y = (y1, (...) , ypn)T using the autoregressive model:

yi=j=1i1ϕijyj+i,andTΣTT=D,
(4.1)

where T is the unique unit lower triangular matrix with ones on its diagonal and (i, j)th element being −ϕij for j < i, and D is diagonal with ith element being σi2=var(i). The optimization problem is unconstrained (since the ϕij’s are free variables), and the estimate for Ω is always positive-definite.

Huang et al. (2006) and Levina et al. (2008) both used the MCD for estimating Ω0. The former maximized the log-likelihood (ML) over T and D simultaneously, while the latter suggested also a least square version (LS), with D being first set to the identity matrix and then minimizing over T to obtain T^. The latter corresponds to the original Cholesky decomposition. The sparse Cholesky factor can be estimated through minimizing

(ML):q3(T,D)=tr(TTD1TS)+logD+2i<jpλn3(tij).
(4.2)

This is indeed the same as (1.1) with the substitution of Ω = TTD−1T and penalization parameter λn3. Noticing that (4.1) can be written as Ty=ε, the least square version is to minimize tr(εεT)=tr(TTTyyT) in the matrix notation. Aggregating the n observations and adding penalty functions, the least-square criterion is to minimize

(LS):q4(T)=tr(TTTS)+2i<jpλn4(tij).
(4.3)

In view of the results in Sections 2.5 and 3.2, we can also write the sample covariance matrix in (4.2) as S=W^Γ^SW^ and then replace D12TW^ by T, resulting in the normalized (NL) version as follows:

(NL):q5(T)=tr(TTTΓ^S)2logT+2i<jpλn5(tij).
(4.4)

We will also assume the yi’s are i.i.d. N(0, Σ0) as in the last section.

4.1 Properties of sparse Cholesky factor estimation

Since all the T’s introduced in the three models above have the same sparsity structure, let S and sn3 be the nonzero set and number of nonzeros associated with each T above. Define

an3=max(i,j)Spλn3(tij0),bn3=max(i,j)Spλn3(tij0).

For (ML), condition (D) is adapted to (D3) with λn1 there replaced by λn3. Condition (B) is modified as (B3) an3 = O({(1 + pn/(sn3 + 1)) log pn/n}1/2), bn3 = o(1) and min(i,j)Sϕij0λn3 as n → ∞.

After obtaining T^ and D^ from minimizing (ML), we set Ω^=T^TD^1T^.

Theorem 9 Under regularity conditions (A),(B3),(C),(D3), if (pn + sn3)(log pn)k/n = O(1) for some k > 1 and (pn+sn3)logpnn=O(λn32), then there exists a local minimizer T^ and D^ for (ML) such that T^T0F2=OP(sn3logpnn), D^D0F2=OP(pnlogpnn) and Ω^Ω0F2=OP{(pn+sn3)logpnn}. For the L1-penalty, we only need logpnn=O(λn32).

The proof is similar to those of Theorems 5 and 7 and is omitted. The Cholesky factor T has ones on its main diagonal without the need for estimation. Hence, the rate of convergence is faster than Ω^.

Theorem 10 (Sparsistency). Under the conditions in Theorem 9, for any local minimizer T^, D^ of (4.2) satisfying T^T0F2=OP(sn3logpnn) and D^D0F2=OP(pnlogpnn), if logpnn+ηn+ζn=O(λn32), then sparsistency holds for T^, provided that T^T02=OP(ηn) and D^D02=OP(ζn), for some ηn, ζn → 0.

The proof is in Section 5. For the L1-penalized likelihood, control of bias and sparsistency impose the following:

logpnn+ηn+ζn=O(λn32)=(1+pn(sn3+1))logpnn.
(4.5)

The worst scenario corresponds to ηn = sn3 log pn/n and ζn = pn log pn/n, so that we need sn3 = O(1). The optimistic scenario corresponds to ηn = sn3/pn · log pn/n and ζn = log pn/n, so that we need sn3 = O(pn).

On the other hand, such a restriction is not needed for unbiased penalties like the SCAD or hard-thresholding penalty, giving more flexibility on the order of sn3.

4.2 Properties of sparse normalized Cholesky factor estimation

We now turn to analyzing the normalized penalized likelihood (4.4). With T = (tij) in (NL) which is lower triangular, define

an5=max(i,j)Spλn5(tij0),bn5=max(i,j)Spλn5(tij0).

Condition (D) is now changed to (D5) with λn1 there replaced by λn5. Condition (B) is now substituted by (B5) an52=O(logpnn), bn5 = o(1), min(i,j)Stij0λn5 as n → ∞.

Theorem 11 (Rate of convergence) Under regularity conditions (A),(B5),(C) and (D5), if sn3(log pn)k/n = O(1) for some k > 1 and (sn3+1)logpnn=o(λn52), then there exists a local minimizer T^ for (NL) such that T^T0F2=OP(sn3logpnn) and rate of convergence in the Frobenius norm

Ω^Ω0F2=OP{(pn+sn3)logpnn},

and in the operator norm, it is improved to

Ω^Ω02=OP{(sn3+1)logpnn)}.

For the L1-penalty, we only need logpnn=O(λn52).

The proof is similar to that of Theorems 5 and 7 and is omitted. In this theorem, like Lemma 3, we can have pn so that pn/n goes to a constant less than 1. It is evident that normalizing with W^ results in an improvement in the rate of convergence in operator norm.

Theorem 12 (Sparsistency). Under the conditions given in Theorem 11, for any local minimizer T^ of (4.4) satisfying T^T0F2=OP(sn3logpnn) if logpnn+ηn=O(λn52), then sparsistency holds for T^, provided that T^T02=O(ηn) for some ηn → 0.

Proof is omitted since it goes exactly the same as that of Theorem 10. The above results apply also to the L1-penalized estimator. For simultaneous persistency and optimal rate of convergence using the L1-penalty, the biases inherent in it induce the restriction sn3 = O(1) in the worst scenario where ηn2=sn3logpnn, and sn3 = O(pn) in the optimistic scenario where ηn2=sn3pnlogpnn. This restriction does not apply to the SCAD and other asymptotically unbiased penalty functions.

5 Proofs

We first prove three lemmas. The first one concerns with inequalities involving the operator and the Frobenius norms. The other two concern with order estimation for elements in a matrix of the form A(SΣ0)B, which are useful in proving results concerning sparsistency.

Lemma 1 Let A and B be real matrices such that the product AB is defined. Then, defining Amin2=λmin(ATA), we have

AminBFABFABF.
(5.1)

In particular, if A = (aij), then |aij| ≤ ||A|| for each i, j.

Proof of Lemma 1. Write B = (b1, (...) , bq), where bi is the i-th column vector in B. Then

ABF2=tr(BTATAB)=i=1qbiTATAbiλmax(ATA)i=1qbi2=A2BF2.

Similarly,

ABF2=i=1qbiTATAbiλmin(ATA)i=1qbi2=Amin2BF2,

which completes the proof of (5.1). To prove |aij| ≤ ||A||, note that aij=eiTAej, where ei is the unit column vector with one at the i-th position, and zero elsewhere. Hence, using (5.1),

aij=eiTAejAejFAejF=A,

and this completes the proof of the lemma. □

Lemma 2 Let S be a sample covariance matrix of a random sample {yi}1≤in, with E(yi) = 0 and var(yi) = Σ0. Let yi = (yi1, (...) , yipn) with yij ~ Fj, where Fj is the c.d.f. of yij , and let Gj be the c.d.f. of yij2, with

max1ipn0exp(λt)dGj(t)<,0<λ<λ0
(5.2)

for some λ0 > 0. Assume log pn/n = o(1), and that Σ0 has eigenvalues uniformly bounded above as n → ∞. Then for constant matrices A and B with ||A||, ||B|| = O(1), we have maxi,j |(A(SΣ0)B)ij | = OP({log pn/n}1/2).

Remark: The conditions on the yij’s above are the same as those used in Bickel and Levina (2008b) for relaxing the normality assumption.

Proof of Lemma 2. Let xi = Ayi and wi=BTyi. Define ui=(xiT,wiT)T , with covariance matrix

Σu=var(ui)=(AΣ0ATAΣ0BBTΣ0ATBTΣ0B).

Since ||(AT B)T|| ≤ (||A||2 + ||B||2)1/2 = O(1) and ||Σ0|| = O(1) uniformly, we have ||Σu|| = O(1) uniformly, Then, with Su=n1i=1nuiuiT, which is the sample covariance matrix for the random sample {ui}1≤in, by Lemma A.3 of Bickel and Levina (2008b) which holds under the assumption for the yij’s and log pn/n = o(1), we have

maxi,j(SuΣu)ij=OP({logpnn}12).

In particular, it means that

maxi,j(A(SΣ0)B)ij=(n1r=1nxrwrTAΣ0B)ij=OP({logpnn}12),

which completes the proof of the lemma. □

Lemma 3 Let S be a sample covariance matrix of a random sample yi1≤in with yi ~ N(0, Σ0). Assume pn/ny [set membership] [0, 1), Σ0 has eigenvalues uniformly bounded as n → ∞, and A = A0 + Δ1, B = B0 + Δ2 are such that the constant matrices ||A0||, ||B0|| = O(1), with ||Δ1||, ||Δ2|| = oP(1). Then we still have maxi,j|(A(SΣ0)B)ij| = OP({log pn/n}1/2).

Proof of Lemma 3. Consider

A(SΣ0)B=K1+K2+K3+K4,
(5.3)

where K1 = A0(SΣ0)B0, K2 = Δ1(SΣ0)B0, K3 = A0(SΣ02 and K4 = Δ1(SΣ02. Now maxi,j |(K1)ij | = OP({log pn/n}1/2) by Lemma 2. Consider K2. Suppose the maximum element of the matrix is at the (i, j)-th position. Consider ((SΣ0)B0)ij, the (i, j)-th element of (SΣ0)B0. Since each element in SΣ0 has a rate OP(n−1/2), the i-th row of SΣ0 has a norm of OP({pn/n}1/2). Also, the j-th column of B0 has ||B0ej||||B0|| = O(1). Hence, ((SΣ0)B0)ij = OP({pn/n}1/2).

Hence, we can find cn = o({n/pn}1/2) such that each element in cnB0T(SΣ0) has an order larger than that in Δ1, since ||Δ1|| = oP(1) implies that each element in Δ1 is also oP(1) by Lemma 1.

Then suitable choice of cn leads to

maxi,j(Δ1(SΣ0)B0)ijcnmaxk(B0T(SΣ0)2B0)kk.
(5.4)

At the same time, Theorem 5.10 in Bai and Silverstein (2006) implies that, for yi ~ N(0, Σ0) and pn/ny [set membership] (0, 1), with probability one,

2yyliminfnλmin(Σ012SΣ012I)limsupnλmax(Σ012SΣ012I)2y+y.

Hence, if we have pn/n = o(1), we must have Σ012SΣ012I=oP(1), or it will contradict the above. It means that ||SΣ0|| = oP(1) since Σ0 has eigenvalues uniformly bounded. Or, if pn/ny [set membership] (0, 1), then we have ||SΣ0|| = OP(1) by the above.

Since SΣ0 is symmetric, we can find a rotation matrix Q (i.e. QTQ = QQT = I) so that

SΣ0=QΛQT,

where Λ is a diagonal matrix with real entries. Then we are free to control cn again so as to satisfy further that cn||Λ||2 = oP(||Λ||), since ||Λ|| = ||SΣ0|| = OP(1) at most. Hence,

cnmaxk(B0T(SΣ0)2B0)kk=maxk(B0TQcnΛ2QTB0)kkmaxk(B0TQΛQTB0)kk=maxk(B0T(SΣ0)B0)kk=OP({logpnn}12),

where the last line used the previous proof for constant matrix B0. Hence, combining this with (5.4), we have maxi,j |(K2)ij| = OP({log pn/n}1/2). Similar arguments go for K3 and K4. □

Proof of Theorem 1. The main idea of the proof is inspired by Fan and Li (2001) and Rothman et al. (2008). Let U be a symmetric matrix of size pn, DU be its diagonal matrix and RU = UDU be its off-diagonal matrix. Set ΔU = αnRU + βnDU. We would like to show that, for αn = (sn1 log pn/n)1/2 and βn = (pn log pn/n)1/2, and for a set A defined as A={U:ΔUF2=C12αn2+C22βn2},

P(infUAq1(Ω0+ΔU)>q1(Ω0))1,

for sufficiently large constants C1 and C2. This implies that there is a local minimizer in {Ω0+ΔU:ΔUF2C12αn2+C22βn2} such that Ω^Ω0F=OP(αn+βn) for sufficiently large n, since Ω0 + ΔU is positive definite. This is shown by noting that

λmin(Ω0+ΔU)λmin(Ω0)+λmin(ΔU)λmin(Ω0)ΔUF>0,

since Ω0 has eigenvalues uniformly bounded away from 0 and ∞ by condition (A), and ||ΔU||F = O(αn + βn) = o(1).

Consider, for Σ = Σ0 + ΔU, the difference

q1(Ω)q1(Ω0)=I1+I2+I3,

where

I1=tr(SΩ)logΩ(tr(SΩ0)logΩ0),

I2=(i,j)S1c(pλn1(ωij)pλn1(ωij0)),

I3=(i,j)S1,ij(pλn1(ωij)pλn1(ωij0)).

It is sufficient to show that the difference is positive asymptotically with probability tending to 1. Using Taylor’s expansion with the integral remainder, we have I1 = K1+K2, where

K1=tr((SΣ0)ΔU),K2=vec(ΔU)T{01g(v,Ωv)(1v)dv}vec(ΔU),
(5.5)

with the definitions Ωv = Ω0 + vΔU, and g(v,Ωv)=Ωv1Ωv1. Now,

K201(1v)min0v1λmin(Ωv1Ωv1)dvvec(ΔU)2=vec(ΔU)22min0v1λmax2(Ωv)vec(ΔU)22(Ω0+ΔU)2(C12αn2+C22βn2)2(τ11+o(1))2,

where we used ||ΔU||C1αn + C2βn = O((log pn)(1−k)/2) = o(1) by our assumption.

Consider K1. It is clear that |K1| ≤ L1 + L2, where

L1=(i,j)S1(SΣ0)ij(ΔU)ij,

L2=(i,j)S1c(SΣ0)ij(ΔU)ij.

Using Lemmas 1 and 2, we have

L1(sn1+pn)12maxi,j(SΣ0)ijΔUFOP(αn+βn)ΔUF=OP(C1αn2+C2βn2),

This is dominated by K2 when C1 and C2 are sufficiently large.

Now, consider I2L2 for penalties other than L1. Since ΔUF2=C12αn2+C22βn2 on A, we have that |ωij| = O(C1αn + C2βn) = o(1) for all (i,j)S1c. Also, note that the condition on λn1 ensures that, for (i,j)S1c, |ωij| = O(αn + βn) = o(λn1). Hence, by condition (C), for all (i,j)S1c, we can find a constant k1 > 0 such that

pλn1(ωij)λn1k1ωij.

This implies that

I2=(i,j)S1cpλn1(ωij)λn1k1(i,j)S1cωij.

Hence,

I2L2(i,j)S1c{λn1k1ωij(SΣ0)ijωij}(i,j)S1c[λn1k1OP({logpnn}12)]ωij=λn1(i,j)S1c[k1OP(λn11{logpnn}12)]ωij.

With the assumption that (pn+sn1)logpnn=O(λn12), we see from the above that I2L2 ≥ 0 since OP=(λn11{logpnn}12)=oP(1), using logpnn=o((pn+sn1)logpnn)=o(λn12).

For the L1-penalty, since we have maxij |SΣ0| = OP((log pn/n)1/2) by Lemma 2, we can find a positive W = OP(1) such that

maxijSΣ0=W(logpnn)12.

Then we can set λn1 = 2W(log pn/n)1/2 or one with order greater than (log pn/n)1/2, and the above arguments are still valid, so that I2L2 > 0.

Now, with L1 dominated by K2 and I2L2 ≥ 0, the proof completes if we can show that I3 is also dominated by K2, since we have proved that K2 > 0. Using Taylor’s expansion, we can arrive at

I3min(C1,C2)1O(1)(C12αn2+C22βn2)+o(1)(C12αn2+C22βn2),

where o(1) and O(1) are the terms independent of C1 and C2. By condition (B), we have

I3=CO(αn2+βn2)+C2o(αn2+βn2),

which is dominated by K2 with large enough constants C1 and C2. This completes the proof of the theorem. □

Proof of Theorem 2. For Ω a minimizer of (1.1), the derivative for q1(Ω) w.r.t. ωij for (i,j)S2c is

q1(Ω)ωij=2(sijσij+pλn1(ωij)sgn(ωij)),

where sgn(a) denotes the sign of a. If we can show that the sign of [partial differential]q1(Ω)/[partial differential]ωij depends on sgn(ωij) only with probability tending to 1, the optimum will be at 0, so that ω^ij=0 for all (i,j)S2c with probability tending to 1. We need to estimate the order of sijσij independent of i and j.

Decompose sijσij = I1 + I2, where

I1=sijσij0,I2=σij0σij.

By Lemma 2 or Lemma A.3 of Bickel and Levina (2008b), it follows that maxi,j |I1| = OP({log pn/n}1/2). It remains to estimate the order of I2.

By Lemma 1, σijσij0ΣΣ0, which has order

ΣΣ0=Σ(ΩΩ0)Σ0ΣΩΩ0Σ0=O(ΩΩ0),

where we used condition (A) to get ||Σ0|| = O(1), and using ηn → 0 so that λmin(ΩΩ0)=o(1)forΩΩ0=O(ηn12),

Σ=λmin1(Ω)(λmin(Ω0)+λmin(ΩΩ0))1=(O(1)+o(1))1=O(1).

Hence, ΩΩ0=O(ηn12) implies I2=O(ηn12).

Combining the last two results yields that

maxi,jsijσij=OP(sijσij0+ηn12)=OP({logpnn}12+ηn12).

By conditions (C) and (D), we have

pλn1(ωij)=C3λn1

for ωij in a small neighborhood of 0 (excluding 0 itself) and some positive constant C3. Hence, if ωij lies in a small neighborhood of 0, we need to have logpnn+nn=O(λn12) in order to have the sign of [partial differential]q1(Ω)/[partial differential]ωij depends on sgn(ωij) only with probability tending to 1. The proof of the theorem is completed. □

Proof of Theorem 3. Because of the similarity between equations (2.4) and (1.1), the Frobenius norm result has nearly identical proof as Theorem 1, except that we now set ΔU = αnU. For the operator norm result, we refer readers to the proof of Theorem 2 of Rothman et al. (2008). □

Proof of Theorem 5. The proof is similar to that of Theorem 1. We only sketch briefly the proof, pointing out the important differences.

Let αn = (sn2 log pn/n)1/2 and βn = (pn log pn/n)1/2, and define A={U:ΔUF2=C12αn2+C22βn2}. Want to show

P(infUAq2(Σ0+ΔU)>q2(Σ0))1,

for sufficiently large constants C1and C2.

For Σ = Σ0 + ΔU, the difference

q2(Σ)q2(Σ0)=I1+I2+I3,

where

I1=tr(SΩ)+logΣ(tr(SΩ0)+logΣ0),

I2=(i,j)S2c(pλn2(σij)pλn2(σij0)),

I3=(i,j)S2,ij(pλn2(σij)pλn2(σij0)),

with I1 = K1 + K2, where

K1=tr((SΣ0)Ω0ΔUΩ0)=tr((SΩ0Ω0)ΔU),K2=vec(ΔU)T{01g(v,Σv)(1v)dv,}vec(ΔU),
(5.6)

and Σv = Σ0 + vΔU, SΩ0 is the sample covariance matrix of a random sample {xi}1≤in having xi ~ N(0, Ω0). Also,

g(v,Σv)=Σv1Σv1SΣv1+Σv1SΣv1Σv1Σv1Σv1.
(5.7)

The treatment of K2is different from that in Theorem 1. By condition (A), and (pn + sn2)(log pn)k/n = O(1) for some k > 1, we have

vΔUΩ0ΔUΩ0τ11(C1αn+C2βn)=O((logpn)1k)=o(1).

Thus, we can use the Neumann series expansion to arrive at

Σv1=Ω0(I+vΔUΩ0)1=Ω0(IvΔUΩ0+o(1)),

where the little o (or oP, O or OP in any matrix expansions in the remainder of this proof) represents a function of the L2 norm of the residual matrix in the expansion. That is, ν1=Ω0+Op(αn+βn), and ν1=τ11+Op(αn+βn). With SI difined as the sample covariance matrix formed from a random sample {xi}1≤in having xi ~ N(0, I),

SΣ0=OP(SII)=oP(1)

(see arguments in Lemma 3). These entail

SΣv1=(SΣ0)Σv1+Σ0Σv1=oP(1)+I+OP(αn+βn)=I+oP(1).

Combining these results, we have

g(v,Σv)=Ω0Ω0+OP(αn+βn).

Consequently,

K2=vec(ΔU)T{01Ω0Ω0(1+oP(1))(1v)dv}vec(ΔU)λmin(Ω0Ω0)vec(ΔU)22(1+oP(1))=τ12(C12αn2+C22βn2)2(1+oP(1)).

All other terms are dealt with similarly as in the proof of Theorem 1, and hence we omit them. □

Proof of Theorem 6. The proof is similar to that of Theorem 2. We only show the main differences.

It is easy to show

q2(Σ)σij=2((ΩSΩ)ij+ωij+pλn(σij)sgn(σij)).

Our aim is to estimate the order of |(−ΩSΩ + Ω)ij|, finding an upper bound which is independent of both i and j.

Write

ΩSΩ+Ω=I1+I2,

where I1 = −Ω(SΣ0)Ω and I2 = Ω(ΣΣ0)Ω. Since

Ω=λmin1(Σ)(λmin(Σ0)+λmin(ΣΣ0))1=τ11+o(1),

we have

Ω=Ω0+(ΩΩ0)=Ω0Ω(ΣΣ0)Ω0=Ω0+Δ,

where ΔΩΣΣ0Ω0=O(ηn12)=o(1) by Lemma 1, with ||ΣΣ0||2 = O(ηn). Hence, we can apply Lemma 3 and conclude that maxi,j |(I1)ij| = OP({log pn/n}1/2).

For I2, we have

maxi,j(I2)ijΩΣΣ0Ω=O(ΣΣ0)=O(ηn12).

Hence, we have

maxi,j(ΩSΩ+Ω)ij=O({logpnn}12+ηn12).

The rest goes similar to the proof of Theorem 2, and is omitted. □

Proof of Theorem 7. The proof is nearly identical to that of Theorem 5, except that we now set ΔU = αnU. The fact that (Γ^S)ii=1=γii0 has no estimation error eliminates an order (pn log pn/n)1/2 that contributes from estimating tr((Γ^SΓ0)Ψ0ΔUΨ0) for (3.2). This is why we can estimate a sparse correlation matrix more accurately.

For the operator norm result, we refer readers to the proof of Theorem 2 of Rothman et al. (2008). □

Proof of Theorem 10. For (T,D) a minimizer of (4.2), the derivative for q3(T,D) w.r.t. tij for (i,j)S3c is

q3(T,D)tij=2((STTD1)ji+pλn3(tij)sgn(tij)).

Now STTD−1 = I1 + I2 + I3 + I4, where

I1=(SΣ0)TTD1I2=Σ0(TT0)TD1,

I3=Σ0T0T(D1D01),I4=Σ0T0TD01.

By the MCD (4.1), I4=T01. Since i > j for (i,j)S3c, we must have (T01)ji=0. Hence, we can ignore I4.

Since ||TT0||2 = O(ηn) and ||DD0||2 = O(ζn) with ηn, ζn = o(1), and by condition (A) we can easily show D1D01=O(DD0)=O(ζn12). Then we can apply Lemma 3 to show that maxij |(I1)ij| = (log pn/n)1/2.

For I2, we have maxij(I2)ijΣ0TT0D1=O(ηn12). And finally. maxij(I3)ijΣ0T0D1D01=O(ζn12).

With all these, we have max(ij)S3c(STTD1)ij2=logpnn+ηn+ζn. The rest of the proof goes like that of Theorem 2 or 6. □

Footnotes

*Financial support from the NSF grant DMS-0354223, DMS-0704337 and NIH grant R01-GM072611 is gratefully acknowledged.

AMS 2000 subject classifications. Primary 62F12; secondary 62J07.

Contributor Information

Clifford Lam, Department of Statistics, London School of Economics and Political Science, London, WC2A 2AE (ku.ca.esl@2maL.C)

Jianqing Fan, Department of Operation Research and Financial Engineering, Princeton University, Princeton, NJ 08544 (ude.notecnirp@nafqj)

References

[1] Bai Z, Silverstein JW. Spectral Analysis of Large Dimensional Random Matrices. Science Press; Beijing: 2006.
[2] Bickel PJ, Levina E. Covariance Regularization by Thresholding. Ann. Statist. 2008a;36(6):2577–2604.
[3] Bickel PJ, Levina E. Regularized Estimation of Large Covariance Matrices. Ann. Statist. 2008b;36(1):199–227.
[4] Cai T, Zhang C-H, Zhou H. Optimal rates of convergence for co-variance matrix estimaiton. the Wharton School, University of Pennsylvania; 2009. Technical report.
[5] d’Aspremont A, Banerjee O, El Ghaoui L. First-order Methods For Sparse Covariance Selection. SIAM. J. Matrix Anal. and Appl. 2008;30(1):56–66.
[6] Dempster AP. Covariance Selection. Biometrics. 1972;28:157–175.
[7] Diggle P, Verbyla A. Nonparametric Estimation of Covariance Structure in Longitudinal Data. Biometrics. 1998;54(2):401–415. [PubMed]
[8] El Karoui N. Operator Norm Consistent Estimation of a Large Dimensional Sparse Covariance Matrices. Ann. Statist. 2008;36(6):2717–2756.
[9] Fan J, Feng Y, Wu Y. Network Exploration via the Adaptive LASSO and SCAD Penalties. Annals of Applied Statistics. 2009;3(2):521–541. [PMC free article] [PubMed]
[10] Fan J, Li R. Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties. J. Amer. Statist. Assoc. 2001;96:1348–1360.
[11] Fan J, Peng H. Nonconcave Penalized Likelihood With a Diverging Number of Parameters. Ann. Statist. 2004;32:928–961.
[12] Friedman J, Hastie T, Tibshirani R. Sparse Inverse Covariance Estimation with the Graphical LASSO. Biostatistics. 2008;9(3):432–441. [PMC free article] [PubMed]
[13] Huang J, Horowitz J, Ma S. Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann. Statist. 2008;36:587–613.
[14] Huang J, Liu N, Pourahmadi M, Liu L. Covariance Matrix Selection and Estimation via Penalised Normal Likelihood. Biometrika. 2006;93(1):85–98.
[15] Levina E, Rothman AJ, Zhu J. Sparse Estimation of Large Covariance Matrices via a Nested Lasso Penalty. Ann. Applied Statist. 2008;2(1):245–263.
[16] Meier L, van de Geer S, Bühlmann P. The group Lasso for logistic regression. Journal of the Royal Statistical Society, B. 2008;70:53–71.
[17] Meinshausen N, Bühlmann P. High dimensional graphs and variable selection with the Lasso. Ann. Statist. 2006;34:1436–1462.
[18] Pourahmadi M. Joint Mean-Covariance Models with Applications to Longitudinal Data: Unconstrained Parameterisation. Biometrika. 1999;86:677–690.
[19] Ravikumar P, Lafferty J, Liu H, Wasserman L. Advances in Neural Information Processing Systems. MIT Press; 2008. Sparse additive models; p. 20.
[20] Rothman AJ, Bickel PJ, Levina E, Zhu J. Sparse Permutation Invariant Covariance Estimation. Electron. J. Statist. 2008;2:494–515.
[21] Smith M, Kohn R. Parsimonious Covariance Matrix Estimation for Longitudinal Data. J. Amer. Statist. Assoc. 2002;97(460):1141–1153.
[22] Wagaman AS, Levina E. Discovering sparse covariance structures with the Isomap. Journal of Computational and Graphical Statistics. 2008;18 to appear.
[23] Wong F, Carter C, Kohn R. Efficient Estimation of Covariance Selection Models. Biometrika. 2003;90:809–830.
[24] Wu WB, Pourahmadi M. Nonparametric Estimation of Large Covariance Matrices of Longitudinal Data. Biometrika. 2003;94:1–17.
[25] Yuan M, Lin Y. Model Selection and Estimation in the Gaussian Graphical Model. Biometrika. 2007;90:831–844.
[26] Zhang CH. Penalized Linear Unbiased Selection. the statistics dept., Rutgers University; 2007. Technical report 2007-003.
[27] Zhao P, Yu B. On Model Selection Consistency of Lasso. Journal of Machine Learning Research. 2006;7:2541–2563.
[28] Zou H. The adaptive Lasso and its oracle properties. J. Amer. Statist. Assoc. 2006;101:1418–1429.
[29] Zou H, Li R. One-step Sparse Estimates in Nonconcave Penalized Likelihood Models (With Discussion) Ann. Statist. 2008;36(4):1509–1533. [PMC free article] [PubMed]