We introduce a new sparse estimator of the covariance matrix for high-dimensional models in which the variables have a known ordering. Our estimator, which is the solution to a convex optimization problem, is equivalently expressed as an estimator which tapers the sample covariance matrix by a Toeplitz, sparsely banded, data-adaptive matrix. As a result of this adaptivity, the convex banding estimator enjoys theoretical optimality properties not attained by previous banding or tapered estimators. In particular, our convex banding estimator is minimax rate adaptive in Frobenius and operator norms, up to log factors, over commonly studied classes of covariance matrices, and over more general classes. Furthermore, it correctly recovers the bandwidth when the true covariance is exactly banded. Our convex formulation admits a simple and efficient algorithm. Empirical studies demonstrate its practical effectiveness and illustrate that our exactly banded estimator works well even when the true covariance matrix is only close to a banded matrix, confirming our theoretical results. Our method compares favorably with all existing methods, in terms of accuracy and speed. We illustrate the practical merits of the convex banding estimator by showing that it can be used to improve the performance of discriminant analysis for classifying sound recordings.
The covariance matrix is one of the most fundamental objects in statistics, and yet its reliable estimation is notoriously challenging in high dimensions. The sample covariance matrix is known to degrade rapidly as an estimator as the number of variables increases, as noted more than four decades ago (see, e.g., Dempster 1972), unless additional assumptions are placed on the underlying covariance structure (see Bunea & Xiao 2014 for an overview). Indeed, the key to tractable estimation of a high-dimensional covariance matrix lies in exploiting any knowledge of the covariance structure. In this vein, this paper develops an estimator geared toward a covariance structure that naturally arises in many applications in which the variables are ordered. Variables collected over time or representing spectra constitute two subclasses of examples.
We observe an independent sample, $\mathbf{X}_1, \ldots, \mathbf{X}_n \in \mathbb{R}^p$, of mean-zero random vectors with true population covariance matrix $\mathbb{E}[\mathbf{X}_i\mathbf{X}_i^T] = \Sigma^*$. When the variables have a known ordering, it is often natural to assume that the dependence between variables is a function of the distance between variables' indices. Common examples are in stationary time series modeling, where $\Sigma^*_{j,(j+h)}$ depends only on h, the lag of the series, with $-K \le h \le K$, for some K > 0. One can weaken this assumption, allowing $\{\Sigma^*_{j,(j+h)} : -K \le h \le K\}$ to depend on j, while retaining the assumption that the bandwidth K does not depend on j. Another typical assumption is that $\Sigma^*_{jk} = \text{Cov}(X_j, X_k)$ decreases with $|j - k|$. A simple example is the moving-average process, where it is assumed that $\Sigma^*_{jk} = 0$ if $|j - k| > K$ for some bandwidth parameter K, and a covariance matrix $\Sigma^*$ with this structure is called banded. Likewise, in a first-order autoregressive model, the elements decay exponentially with distance from the main diagonal, $\Sigma^*_{jk} \propto \beta^{|j-k|}$, where $|\beta| < 1$, justifying the term "approximately banded" used to describe such a matrix.
More generally, banded or approximately banded covariance matrices can be used to model any generic vector (X_{1}, …, X_{p})^{T} whose entries are ordered such that any entries that are more than K apart are uncorrelated (or at most very weakly correlated). This situation does not specify any particular decay or ordering of correlation strengths. It is noteworthy that K itself is unknown, and may depend on n or p or both.
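To make these two structures concrete, the following short R sketch (R being the language of our software; the specific dimensions and parameter values are our own illustrative choices, not taken from the paper) constructs an exactly banded moving-average-type covariance and an approximately banded AR(1)-type covariance.

```r
# Illustrative sketch: exactly banded vs. approximately banded covariances.
p <- 10
K <- 3
D <- abs(outer(1:p, 1:p, "-"))  # matrix of index distances |j - k|

# Exactly K-banded (moving-average-type): zero beyond lag K.
Sigma_banded <- (1 - D / (K + 1)) * (D <= K)

# Approximately banded (first-order autoregressive): beta^|j - k| decay.
beta <- 0.5
Sigma_ar1 <- beta^D
```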
A number of estimators have been proposed for this setting that outperform the sample covariance matrix, $\mathbf{S} = n^{-1}\sum_{i=1}^n (\mathbf{X}_i - \overline{\mathbf{X}})(\mathbf{X}_i - \overline{\mathbf{X}})^T$, where $\overline{\mathbf{X}} = n^{-1}\sum_{i=1}^n \mathbf{X}_i$. In this paper we will focus on the squared Frobenius $\|\cdot\|_F^2$ and operator $\|\cdot\|_{op}$ norms of the difference between estimators and the population matrix, as measures of performance. Assuming that the random vectors are sampled from a multivariate normal distribution and that both the largest and smallest eigenvalues of $\Sigma^*$ are bounded from above and below by some constants, it has been shown that $\frac{1}{p}\|\mathbf{S} - \Sigma^*\|_F^2 = O_P\left(\frac{p}{n}\right)$ and that $\|\mathbf{S} - \Sigma^*\|_{op} = O_P\left(p\sqrt{\frac{\log p}{n}}\right)$, neither of which can be close to zero for p > n because the bounds are tight; see, for example, Bunea & Xiao (2014). This can be rectified when one uses, instead, estimators that take into account the structure of $\Sigma^*$. For instance, Bickel & Levina (2008) introduced a class of banding and tapering estimators of the form

$$\mathbf{T} \ast \mathbf{S}, \quad (1.1)$$
where $\mathbf{T}$ is a Toeplitz matrix and $\ast$ denotes Schur (elementwise) multiplication. When the matrix $\mathbf{T}$ is of the form $T_{jk} = 1\{|j - k| \le K\}$, for a prespecified K, one obtains what is referred to as the banded estimator. More general forms of $\mathbf{T}$ are allowed (Bickel & Levina 2008). The Frobenius and operator norm optimality of such estimators has been studied relative to classes of approximately banded population matrices discussed in detail in Section 4.2 below. Members of these classes are matrices with entries decaying with distance from the main diagonal at a rate depending on the sample size n and a parameter α > 0. The minimax lower bounds for estimation over these classes have been established, for both norms, in Cai et al. (2010). The banding estimator achieves them in both Frobenius norm (Bickel & Levina 2008) and operator norm (Xiao & Bunea 2014), and the same is true for a more general tapering estimator proposed in Cai et al. (2010). The common element of these estimators is that, while being minimax rate optimal, they are not minimax adaptive, in that their construction utilizes knowledge of α, which is typically not known in practice. Moreover, there is no guarantee that the banded estimators are positive definite, while this can be guaranteed via appropriate tapering.
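As a point of reference for what follows, here is a minimal R sketch (our own illustration) of the fixed banding estimator (1.1) with the 0/1 Toeplitz matrix above; the bandwidth K must be supplied by the user, which is precisely the non-adaptivity at issue in this paper.

```r
# Sketch of the fixed banding estimator T * S (Bickel & Levina 2008),
# with T_{jk} = 1{|j - k| <= K} and a user-specified bandwidth K.
band_estimator <- function(S, K) {
  p <- nrow(S)
  D <- abs(outer(1:p, 1:p, "-"))
  S * (D <= K)  # Schur (elementwise) product with the 0/1 banding matrix
}
```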
Motivated by the desire to propose a rate-optimal estimator that does not depend on α, Cai & Yuan (2012) propose an adaptive estimator that partitions S into blocks of varying sizes, and then zeros out some of these blocks. They show that this estimator is minimax adaptive in operator norm, over a certain class of population matrices. The block-structured form of their estimator is an artifact of an interesting proof technique, which relates the operator norm of a matrix to the operator norm of a smaller matrix whose elements are the operator norms of blocks of the original matrix. The construction is tailored to obtaining optimality in operator norm, and the estimator leans heavily on the assumption of decaying covariance away from the diagonal. In particular, it automatically zeros out all elements far from the diagonal without regard to the data (so a covariance far from the diagonal could be arbitrarily large and still be set to zero). In this sense, the method is less data-adaptive than may be desirable and may suffer in situations in which the long-range dependence does not decay fast enough. In addition, as in the case of the banded estimators, this block-thresholded estimator cannot be guaranteed to be positive definite. If positive definiteness is desired, the authors note that one may project the estimator onto the positive semidefinite cone (however, this projection step would lose the sparsity of the original solution).
Other estimators with good practical performance, in both operator and Frobenius norm, have been proposed, notably the banded Cholesky and nested lasso methods of Rothman et al. (2010). These approaches involve regularizing the Cholesky factor via solving a series of least squares or weighted lasso problems. The resulting estimators are sparse and positive definite; however, to our knowledge a theoretical study of these estimators has not been conducted.
In this work we introduce a new covariance estimator with the goal of bridging several gaps in the existing literature. Our contributions are as follows:
An unusual (and favorable) aspect of our estimator is that it is simultaneously sparse and positive definite with high probability—a property not shared by any other method with comparable theoretical guarantees.
The precise definition of our convex banding estimator and a discussion of the algorithm are given in Sections 2 and 3 below. Our target $\Sigma^*$ either has all elements beyond a certain bandwidth K equal to zero, or is close (in a sense defined precisely in Section 4) to a K-banded matrix. We therefore aim to construct an estimator that will zero out all covariances that are beyond a certain data-dependent bandwidth. If we regard the elements we would like to set to zero as a group, it is natural to consider a penalized estimator, with a penalty that zeros out groups. The most basic penalty that sets groups of parameters to zero simultaneously, without any other restrictions on these parameters, is known as the group lasso (Yuan & Lin 2006), a generalization of the lasso (Tibshirani 1996). Zhao et al. (2009) show that by taking a hierarchically nested group structure, one can develop penalties that set one group of parameters to zero whenever another group is zero. Penalties that induce various hierarchical sparsity patterns have been proposed and studied in, for instance, Jenatton et al. (2010), Radchenko & James (2010), Bach et al. (2012), and Bien et al. (2013). The most common applications considered in these works are to regression models.
The convex banding estimator employs a new hierarchical group structure that is tailored to covariance matrices with a banded or semi-banded structure; its optimality properties cannot be obtained from simple extensions of any of the existing related penalties. We discuss this in Sections 2.1 and 4. We also provide a connection between our convex banding estimator and tapering estimators. Section 3.1 shows that our estimator can also be written in the form (1.1), but where $\mathbf{T}$ is a data-driven, sparse tapering matrix, with entries given by a data-dependent recursion formula, rather than a prespecified, nonrandom tapering function. This representation has both practical and theoretical implications: it allows us to compute our estimator efficiently, relate our estimator to previous banded or tapered estimators, and establish that it is banded and positive definite with high probability. These issues are treated in Sections 3.1 and 4.3, respectively.
In Section 4.1, we prove that, when the population covariance matrix is itself a banded matrix, our estimator recovers the sparsity pattern, under minimal signal strength conditions, which are made possible by the fact that we employ a hierarchical penalty. In Section 4.2 we show that our convex banding estimator is minimax rate adaptive, in both Frobenius and operator norms, up to multiplicative logarithmic factors, over appropriately defined classes of population matrices. In Section 5, we perform a thorough simulation study that investigates the sharpness of our bounds and demonstrates that our estimator compares favorably to previous methods. We conclude with a real data example in which we show that using our estimate of the covariance matrix leads to improved classification accuracy in both quadratic and linear discriminant analysis.
To describe our estimator, we must begin by defining a set of groups that will induce the desired banded-sparsity pattern. We define

$$g_\ell = \{jk \in [p]^2 : |j - k| \ge p - \ell\}, \quad 1 \le \ell \le p - 1,$$

to be the two right triangles of indices formed by taking the $\ell(\ell + 1)$ indices that are farthest from the main diagonal of a p × p matrix. See the left panel of Figure 1 for a depiction of $g_3$ when p = 5. For notational ease we will denote, as above, $jk = (j, k)$, and $[p]^2 = \{1, \ldots, p\} \times \{1, \ldots, p\}$.
We will also find it useful to express these groups as a union of subdiagonals $s_m$:

$$g_\ell = \bigcup_{m=1}^{\ell} s_m, \quad \text{where} \quad s_m = \{jk \in [p]^2 : |j - k| = p - m\}.$$

For example, $g_1 = s_1 = \{(1, p), (p, 1)\}$ and, at the opposite extreme, $g_{p-1}^c = s_p$ is the diagonal. While the indexing of $g_\ell$ and $s_m$ may at first seem "backwards" (in the sense that we count them from the outside in rather than from the diagonal out), our indexing is natural here because $|s_m| = 2m$ and $g_\ell$ consists of two equilateral triangles with side lengths of $\ell$ elements. The right panel of Figure 1 depicts the nested group structure: $g_1 \subset \cdots \subset g_{p-1}$, where this largest group contains all off-diagonal elements.
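The following small R sketch (our own illustration, not from the paper) builds these index sets explicitly and checks the counts quoted above.

```r
# Sketch: subdiagonals s_m = {jk : |j - k| = p - m} and triangle pairs
# g_l = s_1 u ... u s_l = {jk : |j - k| >= p - l}.
subdiag <- function(p, m) which(abs(outer(1:p, 1:p, "-")) == p - m, arr.ind = TRUE)
group   <- function(p, l) which(abs(outer(1:p, 1:p, "-")) >= p - l, arr.ind = TRUE)

subdiag(5, 1)        # the two corner entries (1,5) and (5,1)
nrow(subdiag(5, 2))  # |s_2| = 4
nrow(group(5, 3))    # |g_3| = 3 * 4 = 12 entries, as in Figure 1
```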
The following notation is used in the definition of our estimator below. Given a subset of matrix indices $g \subseteq [p]^2$ of a p × p matrix $\Sigma$, let $\Sigma_g \in \mathbb{R}^{|g|}$ be the vector with elements $\{\Sigma_{jk} : jk \in g\}$. For a given nonnegative sequence of weights $w_{\ell m}$ with $1 \le m \le \ell \le p - 1$, discussed in the following subsection, and for a given λ ≥ 0, let $\widehat{\Sigma}$ be defined as the solution to the following convex optimization problem:

$$\widehat{\Sigma} = \operatorname*{arg\,min}_{\Sigma \in \mathbb{R}^{p \times p}} \left\{ \frac{1}{2}\|\mathbf{S} - \Sigma\|_F^2 + \lambda\,\Omega(\Sigma) \right\}, \quad (2.1)$$
where $\|\cdot\|_F$ is the Frobenius norm and

$$\Omega(\Sigma) = \sum_{\ell=1}^{p-1} \left( \sum_{m=1}^{\ell} w_{\ell m}^2\, \|\Sigma_{s_m}\|_2^2 \right)^{1/2} = \sum_{\ell=1}^{p-1} \|\mathbf{W}^{(\ell)} \ast \Sigma\|_F. \quad (2.2)$$
Our penalty term is a weighted group lasso, using $\{g_\ell : 1 \le \ell \le p - 1\}$ as the group structure. For the second equality, we express the penalty as the elementwise product, denoted by $\ast$, of $\Sigma$ with a sequence of weight matrices, $\mathbf{W}^{(\ell)}$, that are Toeplitz with $\mathbf{W}^{(\ell)}_{s_m} = w_{\ell m} 1_{\{m \le \ell\}} \cdot \mathbf{1}_{2m}$.
Problem (2.1) is strictly convex, so $\widehat{\Sigma}$ is the unique solution.
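As a sanity check on the definition, the penalty in (2.2) is straightforward to evaluate directly; the following R sketch (our own illustration; `weights` is any user-supplied function returning $w_{\ell m}$ for $m \le \ell$) does so.

```r
# Sketch: evaluating the penalty (2.2) for an arbitrary weight scheme.
penalty <- function(Sigma, weights) {
  p <- nrow(Sigma)
  D <- abs(outer(1:p, 1:p, "-"))
  total <- 0
  for (l in 1:(p - 1)) {
    sq <- 0
    for (m in 1:l) {
      sq <- sq + weights(l, m)^2 * sum(Sigma[D == p - m]^2)  # w_{lm}^2 ||Sigma_{s_m}||^2
    }
    total <- total + sqrt(sq)
  }
  total
}
# Example with the simple weights of Section 2.1: w_{lm} = sqrt(2 l) for m <= l.
# penalty(Sigma, function(l, m) sqrt(2 * l))
```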
As the tuning parameter λ is increased, subdiagonals of $\widehat{\Sigma}$ become exactly zero. As long as $w_{\ell m} > 0$ for all $1 \le m \le \ell \le p - 1$, the hierarchical group structure ensures that a subdiagonal will only be set to zero if all elements farther from the main diagonal are already zero. Thus, we get an estimated bandwidth, $\widehat{K}$, which satisfies $\widehat{\Sigma}_{g_{p-\widehat{K}-1}} = \mathbf{0}$ and $\widehat{\Sigma}_{s_{p-\widehat{K}}} \ne \mathbf{0}$ (see Theorem 2 and Corollary 1 for details).
We refer to $\widehat{\Sigma}$ as the convex banding of the matrix S. We show in Section 4.3 that $\widehat{\Sigma}$ is positive definite with high probability. Empirically, we find that positive definiteness holds except in very extreme cases. Moreover, in Section 4.3 we propose a different version of convex banding that guarantees positive definiteness:

$$\widetilde{\Sigma} = \operatorname*{arg\,min}_{\Sigma:\ \lambda_{\min}(\Sigma) \ge \delta} \left\{ \frac{1}{2}\|\mathbf{S} - \Sigma\|_F^2 + \lambda\,\Omega(\Sigma) \right\}, \quad (2.3)$$

for a fixed δ > 0. Of course, when $\lambda_{\min}(\widehat{\Sigma}) \ge \delta$, the two estimators coincide.
Since, for each $\ell$, the weight $w_{\ell m}$ penalizes $s_m$, $1 \le m \le \ell$, and the subdiagonals $s_m$ increase in size with m, we want to give the largest subdiagonals the largest weight. We will therefore choose weights with $w_{\ell\ell} = \max_{1 \le m \le \ell} w_{\ell m}$ and will sometimes write $w_\ell := w_{\ell\ell}$. We consider three choices of weights that satisfy this property.
Letting $w_{\ell m} = 1_{\{\ell = m\}}\sqrt{2\ell}$, for $1 \le m \le \ell$, $1 \le \ell \le p - 1$, yields

$$\Omega(\Sigma) = \sum_{\ell=1}^{p-1} \sqrt{2\ell}\, \|\Sigma_{s_\ell}\|_2. \quad (2.4)$$
This is a traditional group lasso penalty that acts on subdiagonals, with weights $w_\ell = \sqrt{|s_\ell|}$. This penalty is not hierarchical, as each subdiagonal may be set to zero independently of the others. In Section 4.1 we show how failing to use a hierarchical penalty requires more stringent minimum signal conditions in order to correctly recover the true bandwidth of $\Sigma^*$. However, if the interest is in accurate estimation, in Frobenius or operator norm, of population matrices that are close to banded, we show in Section 4.2.1 that this estimator is minimax rate optimal, up to logarithmic factors, despite the fact that in finite samples it may fail to have the correct sparsity pattern.
If $w_{\ell m} = \sqrt{2\ell}$, for all m and $\ell$, the penalty term becomes

$$\Omega(\Sigma) = \sum_{\ell=1}^{p-1} \sqrt{2\ell}\, \|\Sigma_{g_\ell}\|_2, \quad (2.5)$$
which employs the same weight, $\sqrt{2\ell}$, as above, but now for a triangle pair $g_\ell$. Recalling that $|g_\ell| = \ell(\ell + 1)$, we note that this does not follow the common principle guiding weight choices in the (non-overlapping) group lasso literature of using $\sqrt{|g_\ell|}$. In Section 4.1, we show this choice of weights leads to consistent bandwidth selection under minimal conditions on the strength of the signal; however, we shall show that a more refined choice ensures rate-optimal estimation of $\Sigma^*$ with respect to either Frobenius or operator norm. The problem with this simple choice of weights stems from the fact that subdiagonals far from the main diagonal (small m) are excessively penalized (since they appear in p − m terms). To balance this over-aggressive enforcement of hierarchy, one desires weights that decay with decreasing m within a fixed group $g_\ell$ and yet still exhibit the $\sqrt{2\ell}$ growth on $w_{\ell\ell}$.
Based on the considerations above, we take the following choice of weights:
We will show in Sections 4.2.1 and 4.2.2 that the corresponding estimator is minimax rate adaptive, in Frobenius and operator norm, up to logarithmic factors, over appropriately defined classes of population covariance matrices.
We reemphasize the difference between the weighting schemes considered above. The estimators corresponding to (2.5) and (2.6) will always impose the sparsity structure of a banded matrix, a fact that is apparent from Theorem 2 below. In contrast, the group lasso estimator (2.4), which is performed on each subdiagonal separately, may fail to recover this pattern.
The most common approach to solving the standard group lasso is blockwise coordinate descent (BCD). Tseng (2001) proves that when the nondifferentiable part of a convex objective is separable, BCD is guaranteed to solve the problem. While this separable structure is not present in (2.1), Jenatton et al. (2010) show that the dual of this problem does have it, meaning that BCD on the dual can be used.
A dual of (2.1) is given by
In particular, given a solution to the dual, $(\widehat{\mathbf{A}}^{(1)}, \ldots, \widehat{\mathbf{A}}^{(p-1)})$, the solution to (2.1) is given by the primal–dual relation, $\widehat{\Sigma} = \mathbf{S} - \lambda \sum_{\ell=1}^{p-1} \mathbf{W}^{(\ell)} \ast \widehat{\mathbf{A}}^{(\ell)}$.
See Section A.1 of the supplementary materials.
Algorithm 1 gives a BCD algorithm for solving (3.1), which by the primal–dual relation in turn gives a solution to (2.1). The blocks correspond to each dual variable matrix. The update over each $\mathbf{A}^{(\ell)}$ involves projection of a matrix $\widehat{\mathbf{R}}^{(\ell)}$ (defined in Algorithm 1) onto an ellipsoid, which amounts to finding $\hat{\nu}_\ell$, a root of the univariate function $h_\ell(\nu) - \lambda^2$, where

$$h_\ell(\nu) = \sum_{m=1}^{\ell} \frac{w_{\ell m}^2}{(w_{\ell m}^2 + \nu)^2} \|\widehat{\mathbf{R}}^{(\ell)}_{s_m}\|_2^2. \quad (3.2)$$
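Since $h_\ell$ is strictly decreasing on $[0, \infty)$, the root can be found by standard one-dimensional root finding. The following R sketch (our own illustration; the bracketing bound uses $h_\ell(\nu) \le \sum_m w_{\ell m}^2 \|\widehat{\mathbf{R}}^{(\ell)}_{s_m}\|_2^2 / \nu^2$) shows one way to compute $\hat{\nu}_\ell$.

```r
# Sketch: solve h_l(nu) = lambda^2 for nu >= 0; w2 and r2 are the vectors
# of w_{lm}^2 and ||R_{s_m}||_2^2 for m = 1, ..., l.
solve_nu <- function(w2, r2, lambda) {
  h <- function(nu) sum(w2 * r2 / (w2 + nu)^2)
  if (h(0) <= lambda^2) return(0)          # root is <= 0, so max{nu, 0} = 0
  upper <- sqrt(sum(w2 * r2)) / lambda     # guarantees h(upper) <= lambda^2
  uniroot(function(nu) h(nu) - lambda^2, c(0, upper))$root
}
```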
We explain in Section A.2 of the supplementary materials the details of ellipsoid projection and observe that we can get $\hat{\nu}_\ell$ in closed form for all but the last $\widehat{K}$ values of $\ell$ (where $\widehat{K}$ is the bandwidth of $\widehat{\Sigma}$). A remarkable feature of our algorithm is that only a single pass of BCD is required to reach convergence. This property is proved in Jenatton et al. (2011) and is a direct consequence of the nested group structure. When we use the simple weights (2.5), in which $w_{\ell m} = w_\ell$, Algorithm 1 becomes extraordinarily simple and transparent: starting from $\widehat{\Sigma} = \mathbf{S}$, for $\ell = 1, \ldots, p - 1$ we group-soft-threshold the triangle pair $g_\ell$,

$$\widehat{\Sigma}_{g_\ell} \leftarrow \left(1 - \frac{\lambda\sqrt{2\ell}}{\|\widehat{\Sigma}_{g_\ell}\|_2}\right)_+ \widehat{\Sigma}_{g_\ell}.$$
In words, we start with the sample covariance matrix, and then work from the corners of the matrix inward toward the main diagonal. At each step, the next largest triangle pair is group-soft-thresholded. If a triangle is ever set to zero, it will remain zero for all future steps. We will show that this simple weighting scheme admits exact bandwidth and pattern recovery in Section 4.1.
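In R, this one-pass scheme is only a few lines; the sketch below is our own illustrative implementation of the simple-weights case (for the general weights and the reference implementation, see the hierband package mentioned in the discussion).

```r
# Sketch: one-pass convex banding with the simple weights w_{lm} = sqrt(2 l).
hier_band_simple <- function(S, lambda) {
  p <- nrow(S)
  D <- abs(outer(1:p, 1:p, "-"))
  Sigma <- S
  for (l in 1:(p - 1)) {
    idx <- D >= p - l                 # entries of the triangle pair g_l
    nrm <- sqrt(sum(Sigma[idx]^2))    # ||Sigma_{g_l}||_2
    shrink <- if (nrm > 0) max(0, 1 - lambda * sqrt(2 * l) / nrm) else 0
    Sigma[idx] <- shrink * Sigma[idx] # group soft-threshold; zeros stay zero
  }
  Sigma
}
```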

Algorithm 1 BCD on dual of Problem (2.1). 

Inputs: S, λ, and weight matrices, $\mathbf{W}^{(\ell)}$. Initialize $\widehat{\mathbf{A}}^{(\ell)} = 0$ for all $\ell$.
For $\ell = 1, \ldots, p - 1$:
• Compute $\widehat{\mathbf{R}}^{(\ell)} \leftarrow \mathbf{S} - \lambda \sum_{\ell'=1}^{p-1} \mathbf{W}^{(\ell')} \ast \widehat{\mathbf{A}}^{(\ell')}$.
• For $m \le \ell$, set $\widehat{\mathbf{A}}^{(\ell)}_{s_m} \leftarrow \frac{w_{\ell m}}{\lambda (w_{\ell m}^2 + \max\{\hat{\nu}_\ell, 0\})} \widehat{\mathbf{R}}^{(\ell)}_{s_m}$, where $\hat{\nu}_\ell$ satisfies $\lambda^2 = h_\ell(\hat{\nu}_\ell)$, as in (3.2).
$\{\widehat{\mathbf{A}}^{(\ell)}\}$ is a solution to (3.1) and $\widehat{\mathbf{R}}^{(p)} = \mathbf{S} - \lambda \sum_{\ell=1}^{p-1} \mathbf{W}^{(\ell)} \ast \widehat{\mathbf{A}}^{(\ell)}$ is the solution to (2.1).

The next result shows that $\widehat{\Sigma}$ can be regarded as a tapering estimator with a data-dependent tapering matrix. In Section 4, we will see that, in contrast to estimators that use a fixed tapering matrix, our estimator adapts to the unknown bandwidth of the true matrix $\Sigma^*$.
The convex banding estimator, $\widehat{\Sigma}$, can be written as a tapering estimator with a Toeplitz, data-dependent tapering matrix, $\widehat{\Sigma} = \widehat{\mathbf{T}} \ast \mathbf{S}$:
where $\hat{\nu}_\ell$ satisfies $\lambda^2 = \sum_{m=1}^{\ell} \frac{w_{\ell m}^2}{(w_{\ell m}^2 + \hat{\nu}_\ell)^2} \|\widehat{\mathbf{R}}^{(\ell)}_{s_m}\|^2$ and $\mathbf{1}_m \in \mathbb{R}^m$ is the vector of ones.
See Section A.3 of the supplementary materials.□
The following result shows that, as desired, our estimator is banded.
$\widehat{\Sigma}$ is a banded matrix with bandwidth $\widehat{K} = p - 1 - \max\{\ell : \hat{\nu}_\ell \le 0\}$.
By definition, $\hat{\nu}_{p-1-\widehat{K}} \le 0$ and $\hat{\nu}_{p-\widehat{K}}, \ldots, \hat{\nu}_{p-1} > 0$. It follows from the theorem that $\widehat{\mathbf{T}}_{s_m} = \mathbf{0}$ for all $m \le p - 1 - \widehat{K}$ and $[\widehat{\mathbf{T}}_{s_m}]_i > 0$ for $p - \widehat{K} \le m \le p - 1$.□
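Given any estimate, the bandwidth in the sense of this corollary is easy to read off numerically; the short R sketch below (our own helper, not from the paper) does so.

```r
# Sketch: the bandwidth of an estimate, i.e. the largest |j - k| at which
# Sigma_hat has a nonzero (up to tolerance) entry.
estimated_bandwidth <- function(Sigma_hat, tol = 1e-12) {
  p <- nrow(Sigma_hat)
  D <- abs(outer(1:p, 1:p, "-"))
  off <- D > 0 & abs(Sigma_hat) > tol
  if (any(off)) max(D[off]) else 0
}
```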
In this section, we study the statistical properties of the convex banding estimator. We begin by stating two assumptions that specify the general setting of our theoretical analysis.
Let $\mathbf{X} = (X_1, \ldots, X_p)^T$. Assume $\mathbb{E}\mathbf{X} = \mathbf{0}$ and denote $\mathbb{E}\mathbf{X}\mathbf{X}^T = \Sigma^*$. We assume that each $X_j$ is marginally sub-Gaussian: $\mathbb{E}\exp(tX_j/\sqrt{\Sigma^*_{jj}}) \le \exp(Ct^2)$ for all t ≥ 0 and for some constant C > 0 that is independent of j. Moreover, $\max_j \Sigma^*_{jj} \le M$, for some constant M > 0.
The dimension p can grow with n at most exponentially in n: γ_{0} log n ≤ log p ≤ γn, for some constants γ_{0} > 0, γ > 0.
The requirement that p grow at least polynomially fast as a function of n is not necessary, but it simplifies the results, allowing us to write log p instead of log[max(p, n)]. In Section 4.1, we prove that our estimator recovers the true bandwidth of $\Sigma^*$ with high probability assuming the nonzero values of $\Sigma^*$ are large enough. We will demonstrate that our estimator can detect weaker signals than could be recovered by an estimator that does not enforce hierarchy, re-emphasizing the need for a hierarchical penalty. In Section 4.2, we show that our estimator is minimax adaptive, up to multiplicative logarithmic terms, with respect to both the operator and Frobenius norms. Our results hold either with high probability or in expectation, and are established over classes of population matrices defined in Sections 4.2.1 and 4.2.2, respectively.
We begin by introducing a random set, on which all our results hold. Fixing x > 0, let
The following lemma shows that this set has high probability.
Under Assumptions 1 and 2, there exists a constant c > 0 such that, for sufficiently large x > 0, $\mathbb{P}(\mathcal{A}_x) \ge 1 - c/p$.
See Section A.4 of the supplementary materials, proof of (A.4.1) of Theorem A.1.□
Suppose $\Sigma^*$ has bandwidth K; that is, for L = p − K − 1, we have $\Sigma^*_{g_L} = \mathbf{0}$ and $\Sigma^*_{s_{L+1}} \ne \mathbf{0}$ (see the right panel of Figure 1). We prove in this section that under mild conditions our estimator $\widehat{\Sigma}$ correctly recovers K with high probability. The next theorem expresses the intuitive result that if λ is chosen sufficiently large, we will not overestimate the true bandwidth.
If $\lambda \ge x\sqrt{\mathrm{log}p/n}$ and ${w}_{\ell}=\sqrt{2\ell}$, then $\widehat{K}\le K$, on the set ${\mathcal{A}}_{x}$.
See Section A.5 of the supplementary materials.□
For our estimator to be able to detect the full band, we must require that the "signal" be sufficiently large relative to the noise. In the next theorem we measure the size of the signal by the $\ell_2$ norm of each subdiagonal (scaled in proportion to the square root of its size, $w_\ell = \sqrt{2\ell}$) and the size of the noise by λ.
If $\min_{\ell \ge L+1} \|\Sigma^*_{s_\ell}\|_2 / w_\ell > 2\lambda$, where $\lambda \ge x\sqrt{\log p/n}$ and $w_\ell = \sqrt{2\ell}$, then $K \le \widehat{K}$, on the set $\mathcal{A}_x$.
See Section A.5 of the supplementary materials.□
In typical high-dimensional statistical problems, the support set can generically be any subset, and therefore one must require that each element in the support set be sufficiently large on its own to be detected. By contrast, in our current context we know that the support set is of the specific form {L + 1, …, p − 1}, for some unknown L. Thus, as long as the signal is sufficiently large at L + 1, one might expect that the signal could be weaker at subsequent elements of the support. In the next theorem we demonstrate this phenomenon by showing that when $\Sigma^*_{s_{L+1}}$ exceeds the threshold given in the previous theorem, it may "share the wealth" of this excess, relaxing the requirement on the size of $\Sigma^*_{s_{L+2}}$.
Suppose (for some γ > 0)
where $\lambda \ge x\sqrt{\log p/n}$ and $w_\ell = \sqrt{2\ell}$ and $w_{\ell,\ell} \ge w_{\ell+1,\ell} > 0$, then $K \le \widehat{K}$ on the set $\mathcal{A}_x$.
See Section A.5 of the supplementary materials.□
When γ = 0, Theorem 5 reduces to Theorem 4. However, as γ increases, the required size of $\|\Sigma^*_{s_{L+2}}\|_2$ decreases without the bandwidth being misestimated. In fact, for γ ≥ 1, there is no requirement on $\Sigma^*_{s_{L+2}}$ for bandwidth recovery. This robustness to individual subdiagonals being small is a direct result of our method being hierarchical.
In this section we show that the estimator $\widehat{\Sigma}$, with weights given by either (2.4) or (2.6), is minimax adaptive, up to multiplicative logarithmic factors, with respect to the Frobenius norm, over a class of population covariance matrices that generalizes both the class of K-banded matrices and previously studied classes of approximately banded matrices. We begin by stating Theorem 6, which is a general oracle inequality from which adaptivity to the optimal minimax rate will follow as immediate corollaries.
For any $\mathbf{B} \in \mathbb{R}^{p \times p}$, let $L(\mathbf{B})$ be an integer such that $\mathbf{B}_{g_{L(\mathbf{B})}} = \mathbf{0}$ and $\mathbf{B}_{s_{L(\mathbf{B})+1}} \ne \mathbf{0}$; that is, $\mathbf{B}$ has bandwidth $K(\mathbf{B}) = p - 1 - L(\mathbf{B})$. Let $\mathcal{S}_p$ be the class of all p × p positive definite covariance matrices. In Theorem A.2 of Section A.6 of the supplementary materials, we provide a deterministic upper bound on $\|\widehat{\Sigma} - \Sigma^*\|_F^2$ that indicates what terms need to be controlled in order to obtain optimal stochastic performance of our estimator. The following theorem, which is the main result of this section, is a consequence of this theorem, and shows that for appropriate choices of the weights and tuning parameter, the proposed estimator achieves the best bias–variance tradeoff both in probability and in expectation, up to small additive terms, which are the price paid for adaptivity.
Suppose $\Sigma^* \in \mathcal{S}_p$ and $\max_j \Sigma^*_{jj} \le M$ for some constant M. If the weights are given by either (2.4) or (2.6) and $\lambda = x\sqrt{\log p/n}$, then on the set $\mathcal{A}_x$ the convex banding estimator of (2.1) satisfies
Moreover, for x sufficiently large, $\mathbb{E}\|\widehat{\Sigma} - \Sigma^*\|_F^2 \lesssim \inf_{\mathbf{B} \in \mathbb{R}^{p \times p}} \left\{ \|\Sigma^* - \mathbf{B}\|_F^2 + \frac{K(\mathbf{B})\,p \log p}{n} \right\} + p/n$, where the symbol $\lesssim$ denotes an inequality that holds up to positive multiplicative constants that are independent of n and p.
See Section A.6 of the supplementary materials.□
For any given K ≥ 1, the class of K-banded matrices is

$$\mathcal{B}(K) = \left\{ \Sigma \in \mathcal{S}_p : \Sigma_{jk} = 0 \ \text{whenever} \ |j - k| > K \right\}.$$
Motivated by Theorem 6 above, we define the following general class of covariance matrices, henceforth referred to as K-semi-banded:

$$\mathcal{G}(K) = \left\{ \Sigma \in \mathcal{S}_p : \|\Sigma - \mathbf{B}\|_F^2 \lesssim \frac{Kp\log p}{n} \ \text{for some} \ \mathbf{B} \in \mathcal{B}(K) \right\}, \quad (4.2)$$
for any K ≥ 1. Trivially, $\mathcal{B}(K) \subset \mathcal{G}(K)$, for any K ≥ 1. Notice that a semi-banded covariance matrix does not have any explicit restrictions on the order of magnitude of $\|\Sigma\|_{op}$; nor does it require exact zero entries.
Theorem 7 below gives the minimax lower bound, in probability, for estimating a covariance matrix $\Sigma^* \in \mathcal{B}(K)$, in Frobenius norm, from a sample $\mathbf{X}_1, \ldots, \mathbf{X}_n$. Let $\mathcal{P}_K$ be the class of probability distributions of p-dimensional vectors satisfying Assumption 1 and with covariance matrix in $\mathcal{B}(K)$.
If $K < \sqrt{n}$, there exist absolute constants c > 0 and β ∈ (0, 1) such that
To the best of our knowledge, the minimax lower bound for Frobenius norm estimation over $\mathcal{B}(K)$ is new. The proof of the existing related result in Cai et al. (2010), established for approximately banded matrices, or that in Cai & Zhou (2012), for approximately sparse matrices, could be adapted to this situation, but the resulting proof would be involved, as it would require redoing all their calculations. We provide in Section A.7 of the supplementary materials a different and much shorter proof, tailored to our class of covariance matrices.
From Theorems 6 and 7 above we conclude that convex banding with either the (non-hierarchical) group lasso penalty, (2.4), or the general hierarchical penalty, (2.6), achieves, adaptively, the minimax rate, up to log terms, not only over $\mathcal{B}(K)$, but over the larger class $\mathcal{G}(K)$ of semi-banded matrices. We summarize this below.
Suppose $\Sigma^* \in \mathcal{G}(K)$. The convex banding estimator of (2.1), with weights given by either (2.4) or (2.6), and with $\lambda = x\sqrt{\log p/n}$, satisfies
on the set $\mathcal{A}_x$, and also $\mathbb{E}\|\widehat{\Sigma} - \Sigma^*\|_F^2 \lesssim pK\log p/n$.
To the best of our knowledge, this is the first estimator which, in particular, has been proven to be minimax adaptive over the class of exactly banded matrices, up to logarithmic factors.
It should be emphasized that, despite certain similarities, the problem we are studying is different from that of estimating the autocovariance matrix of a stationary time series based on the observation of a single realization of the series. For a discussion of this distinction and a treatment of this latter problem, see Xiao et al. (2012).
An immediate consequence of Theorem A.2 of the supplementary materials is that for the basic hierarchical penalty, (2.5), we have, on the set $\mathcal{A}_x$, $\|\widehat{\Sigma} - \Sigma^*\|_F^2 \lesssim pK^2\log p/n$, for $\Sigma^* \in \mathcal{B}(K)$, which is a suboptimal rate. Therefore, if Frobenius norm rate optimality is of interest and one desires a banded estimator, then the simple weights (2.5) do not suffice, and one must resort to the more general ones given by (2.6).
The class of semi-banded matrices is more general than the previously studied class of approximately banded matrices considered in Cai et al. (2010): $\mathcal{C}_\alpha = \left\{ \Sigma \in \mathcal{S}_p : |\Sigma_{ij}| \le M_1 |i - j|^{-(\alpha+1)} \ \text{for all} \ i \ne j \ \text{and} \ \lambda_{\max}(\Sigma) \le M_0 \right\}$. Bickel & Levina (2008) and Cai & Yuan (2012) consider a closely related class, $\mathcal{D}_\alpha = \left\{ \Sigma \in \mathcal{S}_p : \max_j \sum_{i : |i-j| > k} |\Sigma_{ij}| \le M_1 k^{-\alpha} \ \text{for all} \ k > 0, \ 0 < M_0^{-1} \le \lambda_{\min}(\Sigma) \le \lambda_{\max}(\Sigma) \le M_0 \right\}$.
Interestingly, the class of approximately banded matrices $\mathcal{C}_\alpha$ does not include, in general, the class of exactly banded matrices $\mathcal{B}(K)$, as membership in $\mathcal{C}_\alpha$ forces neighboring entries to decay rapidly, a restriction that is not imposed on elements of $\mathcal{B}(K)$. Moreover, the largest eigenvalue of a K-banded matrix can be of order K (e.g., consider a block-diagonal matrix with p/K blocks of size K, with fixed within-block correlations), and hence cannot be bounded by a constant when K is allowed to grow with n and p. Therefore, $\mathcal{C}_\alpha$ is different from, but not necessarily more general than, $\mathcal{B}(K)$, for arbitrary values of K.
In contrast, the class $\mathcal{G}(K)$ introduced in (4.2) above is more general than $\mathcal{B}(K)$, for all K, and also contains $\mathcal{C}_\alpha$ for an appropriate value of K. Let $K_\alpha \lesssim \left(\frac{n}{\log p}\right)^{\frac{1}{2\alpha+2}}$. Then it is immediate to see that, for any α > 0, $\mathcal{C}_\alpha \subset \mathcal{G}(K_\alpha)$. To gain further insight into particular types of matrices in $\mathcal{G}(K_\alpha)$, notice that it contains, for instance, matrices in which neighboring elements do not have to start decaying immediately, but only when they are far enough from the main diagonal. Specifically, for some positive constants $M_1 < M_0$ and any α > 0, let $\mathcal{G}_\alpha = \left\{ \Sigma \in \mathcal{S}_p : |\Sigma_{ij}| \le M_0 \ \text{if} \ |i - j| \le K_\alpha, \ \text{and} \ |\Sigma_{ij}| \le M_1 (|i - j| - K_\alpha)^{-(\alpha+1)} \ \text{if} \ |i - j| > K_\alpha \right\}$. Then we have, for any fixed α > 0 and appropriately adjusted constants, that $\mathcal{C}_\alpha \subset \mathcal{G}_\alpha \subset \mathcal{G}(K_\alpha)$ and $\mathcal{B}(K_\alpha) \subset \mathcal{G}_\alpha \subset \mathcal{G}(2K_\alpha)$. The following corollary gives the rate of estimation over $\mathcal{G}(K_\alpha)$.
Suppose $\Sigma^* \in \mathcal{G}(K_\alpha)$. The convex banding estimator of (2.1), with weights given by either (2.4) or (2.6), and with $\lambda = x\sqrt{\log p/n}$, where x is as in Theorem 6, satisfies
The upper bound in (ii) matches, up to logarithmic factors, the minimax lower bound derived in Cai et al. (2010) over $\mathcal{C}_\alpha$ and $\mathcal{D}_\alpha$. Therefore, Corollary 3 shows that the minimax rate, with respect to the Frobenius norm, can be achieved, adaptively over α > 0, not only over the classes of approximately banded matrices, but also over the larger class of semi-banded matrices $\mathcal{G}(K_\alpha)$. In contrast, the two existing estimators with established minimax rates of convergence in Frobenius norm require knowledge of α prior to estimation. This is the case for the banding estimator of Bickel & Levina (2008) and the tapering estimator of Cai et al. (2010), over the class $\mathcal{D}_\alpha$. Adaptivity to the minimax rate, over $\mathcal{D}_\alpha$, with respect to the Frobenius norm, is alluded to in Section 5 of Cai & Yuan (2012), who proposed a block-thresholding estimator, but a formal statement and proof are left open.
In this section we study the operator norm optimality of our estimator over classes of matrices that are close, in operator norm, to banded matrices, with bandwidths that might increase with n or p: $\mathcal{G}_{op}(K) = \left\{ \Sigma \in \mathcal{S}_p : \|\Sigma - \mathbf{B}\|_{op} \lesssim K\sqrt{\log p/n} \ \text{for some} \ \mathbf{B} \in \mathcal{B}(K) \right\}$.
Minimax lower bounds in operator norm over the classes of approximately sparse matrices and approximately banded matrices are provided in Cai & Zhou (2012) and Cai et al. (2010). Moreover, an inspection of the proof of Theorem 3 in Cai et al. (2010) shows that the proof of the bound stated in Theorem 8 below follows immediately from it, by the same arguments, by letting their k be our K. We therefore state the bound for clarity and completeness, and omit the proof. The following theorem gives the minimax lower bound, in probability, for estimating a covariance matrix $\Sigma^* \in \mathcal{G}_{op}(K)$, in operator norm, from a sample $\mathbf{X}_1, \ldots, \mathbf{X}_n$. Let $\mathcal{P}_K$ be the class of probability distributions of p-dimensional vectors satisfying Assumption 1 and with covariance matrix in $\mathcal{B}(K)$.
If $K < \sqrt{n}$, there exist absolute constants c > 0 and β ∈ (0, 1) such that
Having established the best possible rate, we next show that our proposed estimator achieves adaptively, up to logarithmic factors, the target rate.
The convex banding estimator of (2.1) with weights given by either (2.4) or (2.6), and with $\lambda =2x\sqrt{\mathrm{log}p/n}$, where x is as in Theorem 6, satisfies
for any $\Sigma^* \in \mathcal{G}_{op}(K)$ that meets the signal strength condition $\min_{\ell > L(\Sigma^*)} \|\Sigma^*_{s_\ell}\|_2/\sqrt{2\ell} \ge c$ for some constant c > 0.
See Section A.8 of the supplementary materials.□
We show in Section A.9 of the supplementary materials that if the signal strength condition fails and p > n, then the estimator $\widehat{\Sigma}$ may not even be consistent in operator norm.
Our focus in this section was on the behavior of our estimator over a class that generalizes that of banded matrices, such as $\mathcal{G}_{op}$, since no minimax adaptive estimators in operator norm have been constructed, to the best of our knowledge, for this class. Our estimator is, however, slightly suboptimal, in operator norm, over different classes, for instance over the class of approximately banded matrices. It is immediate to show that it achieves, adaptively, the slightly suboptimal rate $O\{(\log p/n)^{\frac{\alpha}{2\alpha+2}}\}$ over the class $\mathcal{D}_\alpha$, when a signal strength condition similar to that given in Theorem 9 is satisfied. However, the minimax optimal rate over this class is $O\{(\log p/n)^{\frac{\alpha}{2\alpha+1}}\}$, and has been shown to be achieved, non-adaptively, by the tapering estimator of Cai et al. (2010) and, in Xiao & Bunea (2014), by the banding estimator. For the latter case, a SURE-type bandwidth selection is proposed, as an alternative to the usual cross-validation-type criteria, for the practical bandwidth choice. Moreover, the optimal rate of convergence in operator norm can be achieved, adaptively, by a block-thresholding estimator proposed by Cai & Zhou (2012). For these reasons, we do not further pursue the issue of operator norm minimax adaptive estimation over $\mathcal{D}_\alpha$ in this work.
In this section, we study the positive definiteness of our estimator. Empirically, we have observed $\widehat{\Sigma}$ to be positive definite except in quite extreme circumstances (e.g., very low n settings). We begin by proving that $\widehat{\Sigma}$ is positive definite with high probability under certain conditions.
Under Assumptions 1 and 2, the convex banding estimator with weights given by either (2.4) or (2.6) and with $\lambda = 2x\sqrt{\log p/n}$ (for appropriately large x) has minimum eigenvalue at least $C'K\sqrt{\log p/n} > 0$ with probability larger than 1 − c/p, for some c > 0, provided that (i) $\lambda_{\min}(\Sigma^*) \ge 2C'K\sqrt{\log p/n}$ and (ii) $\min_{\ell > L(\Sigma^*)} \|\Sigma^*_{s_\ell}\|_2/\sqrt{2\ell} \ge c'$ for some constant c′ > 0.
See Section A.10 of the supplementary materials.□
Although we find empirically that $\widehat{\Sigma}$ is reliably positive definite, it is reasonable to desire an estimator that is always positive definite. In such a case, we suggest using the estimator $\widetilde{\Sigma}$ defined in (2.3). Section A.10 of the supplementary materials provides a BCD algorithm for computing it that is similar to Algorithm 1.
Under the assumptions of Theorem 10, $\widetilde{\Sigma} = \widehat{\Sigma}$ with high probability as long as we choose $\delta \le C'K\sqrt{\frac{\log p}{n}}$.
We conclude this section by noting that, in light of the corollary, $\widetilde{\Sigma}$ has all the same properties as $\widehat{\Sigma}$ under the additional assumptions of Theorem 10.
In this section, we study the empirical performance of our estimator. In Section 5.1, we investigate how our estimator’s performance depends on K, n, and p, and we compare our method to several existing methods for banded covariance estimation. Section 5.2 considers using the convex banding estimator and several other banded estimators within the context of a classification problem.
We fix n = 100, and for p ∈ {500, 1000, 2000} and ten equally spaced values of K between 10 and 500, we simulate data according to an instance of a moving-average(K) process. In particular, we draw n multivariate normal vectors with covariance matrix $\Sigma^*_{ij} = (1 - |i - j|/K) \cdot 1\{|i - j| \le K\}$. We draw 20 samples of size n to compute the average scaled squared Frobenius norm, $\|\widehat{\Sigma} - \Sigma^*\|_F^2/p$, and operator norm, $\|\widehat{\Sigma} - \Sigma^*\|_{op}$. Figure 2 shows how the squared Frobenius norm varies with K when we take $\lambda = 2\sqrt{\log(p)/n}$. It is seen to grow approximately linearly in K, in agreement with our bound for this quantity given in Corollary 2. In the right panel, we scale the error by log p, which appears on the right-hand side of Corollary 2. The fact that with this scaling the curves become more aligned shows that the p dependence of the bound holds (approximately). Also, we see that the behavior matches that of the bounds most closely for large p and small K. In Section A.11 of the supplementary materials we show that the empirical behavior of our estimator in operator norm likewise corroborates the rate derived in Theorem 9. In Figure 3 we repeat the same simulation for $\widehat{\Sigma}$ with the simple weighting scheme, $w_{\ell m} = \sqrt{2\ell}$. We find that this weighting scheme performs empirically worse than the general weighting scheme of (2.6); this is also seen in further simulations included in Section A.11 of the supplementary materials.
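A scaled-down version of this experiment can be reproduced with the sketches from earlier sections; the snippet below (our own illustration, with smaller p and K than in the paper so that it runs quickly) uses the simple-weights hier_band_simple() sketch from Section 3.

```r
# Sketch: a small version of the MA(K) simulation above.
set.seed(1)
n <- 100; p <- 200; K <- 10
D <- abs(outer(1:p, 1:p, "-"))
Sigma_star <- (1 - D / K) * (D <= K)        # MA(K)-type covariance
X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma_star)
Xc <- scale(X, center = TRUE, scale = FALSE)
S <- crossprod(Xc) / n                      # sample covariance matrix
lambda <- 2 * sqrt(log(p) / n)
Sigma_hat <- hier_band_simple(S, lambda)    # sketch from Section 3
sum((Sigma_hat - Sigma_star)^2) / p         # scaled squared Frobenius error
```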
To investigate the dependence on sample size, we vary n along an equally spaced grid from 10 to 500, fixing K = 50 and taking, as before, p ∈ {500, 1000, 2000}. Simulating as before and taking, again, $\lambda = 2\sqrt{\log(p)/n}$, Figure 4 exhibits the 1/n dependence suggested by the bound in Corollary 2. In Section A.11 of the supplementary materials we observe the $1/\sqrt{n}$ behavior for the operator norm, in agreement with Theorem 9.
In this section, we compare the performance of the following methods that perform banded covariance estimation: (i) Hier Band: our method with weights given in (2.6) and no eigenvalue constraint; (ii) Simple Hier Band: our method with weights given in (2.5) and no eigenvalue constraint; (iii) GP Band: group lasso without hierarchy (weighting scheme given in (2.4)) and no eigenvalue constraint; (iv) Block Thresh: Cai & Yuan (2012)'s block-thresholding approach (our implementation); (v) BandChol: Rothman et al. (2010)'s banding of the Cholesky factor approach via solving successive regression problems; (vi) Banding: Bickel & Levina (2008)'s $\mathbf{T} \ast \mathbf{S}$ estimator with $T_{jk} = 1_{\{|j - k| \le K\}}$.
Each method has a single tuning parameter that is varied to its best-case value. In the case of Banding, this makes it equivalent to an oracle-like procedure in which all non-signal elements are set to zero and all signal elements are left unshrunken.
While we focus on estimation error as a means of comparison, it should be noted that these methods could be evaluated in other ways as well. For example, Banding and Block Thresh frequently lead to estimators that are not positive semidefinite. This means that these covariance estimates cannot be directly used in other procedures requiring a covariance matrix (see, e.g., Section 5.2 for an example where not being positive definite would be problematic).
We simulate four scenarios as the basis of our comparison: (i) MA(5): with p = 200, n = 100; see Section 5.1.1 for a description. (ii) MA(80): with p = 500, n = 100; see Section 5.1.1 for a description. (iii) CY: a setting used in Cai & Yuan (2012) in which $\Sigma^*$ is only approximately banded: $\Sigma^*_{ij} = 0.6|i - j|^{-2}U_{ij}$, where $U_{ij} = U_{ji} \sim \text{Uniform}(0, 1)$; we take p = 200, n = 100. (iv) Increasing operator norm: we produce a sequence of $\Sigma^*$ having increasing operator norm. In particular, we take a block-diagonal matrix with p/K blocks of size K: $\Sigma = \mathbf{I}_{p/K} \otimes \mathbf{B}_K$; each block is $\mathbf{B}_K = (0.8)\mathbf{1}_K\mathbf{1}_K^T + (0.2)\mathbf{I}_K$; we take p = 1000, n = 100, and vary K between 20 and 400 (a construction sketch for this covariance follows below). We can make several observations from this comparison. We find that the hierarchical group lasso methods outperform the regular group lasso. This is to be expected since in our simulation scenarios, the true covariance matrix is banded (or approximately banded). As noted before, since we report the best performance of each method over all values of the tuning parameter, Banding is effectively an "oracle" method in that the banding is being performed exactly where it should be for MA(5). So it is no surprise that this method does so well in this scenario. However, in the approximately banded scenario it suffers, likely because it does not do as well as methods that do shrink the nonzero values. We find that Block Thresh does not perform very well in any of these scenarios (though Figure 5 does show it substantially improves on the sample covariance matrix). Finally, we find that the BandChol method performs well (although not shown here, we also observed that the related nested lasso method of Rothman et al. (2010) performs quite well).
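For concreteness, scenario (iv)'s covariance can be built as follows (our own sketch; `kronecker` is base R). The operator norm of each equicorrelated block is $0.2 + 0.8K$, so it grows linearly in K.

```r
# Sketch: block-diagonal covariance of scenario (iv),
# Sigma = I_{p/K} (x) B_K with B_K = 0.8 * 1 1^T + 0.2 * I.
make_block_cov <- function(p, K, rho = 0.8) {
  stopifnot(p %% K == 0)
  B <- matrix(rho, K, K)
  diag(B) <- 1
  kronecker(diag(p / K), B)
}
norm(make_block_cov(100, 20), type = "2")  # operator norm = 0.2 + 0.8 * 20
```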
Figure 6 shows the results for the last scenario, in which we consider a sequence of $\Sigma^*$'s that are block diagonal with operator norm increasing linearly in K. Upon examination of the Block Thresh method's estimates, we find that the reason this estimator performs poorly in the regime of large K is that it sets all elements to zero that are sufficiently far from the main diagonal, without regard to their observed size in S (this is referred to as step b in their paper). Banding performs very well in terms of operator norm, which makes sense since $\Sigma^*$ is a banded matrix, and so simple banding with the correct bandwidth is essentially an oracle procedure.
In this section, we develop an application of our banded estimator. In general, one would expect any procedure that requires an estimate of the covariance matrix to benefit from convex banding when the true covariance matrix is banded (or approximately banded). Consider the classification setting in which we observe n i.i.d. pairs, $(\mathbf{x}_i, y_i)$, where $\mathbf{x}_i \in \mathbb{R}^p$ is a vector of predictors and $y_i$ labels the class of the ith point. In quadratic discriminant analysis (QDA), one assumes that $\mathbf{x}_i \mid y_i = k \sim N_p(\mu^{(k)}, \Sigma^{(k)})$. The QDA classification rule is to assign $\mathbf{x} \in \mathbb{R}^p$ to the class k maximizing $\widehat{\mathbb{P}}(y = k \mid \mathbf{x})$. Here $\widehat{\mathbb{P}}$ denotes the estimate of $\mathbb{P}(y = k \mid \mathbf{x})$ given by replacing the parameters $\mu^{(k)}$, $\Sigma^{(k)}$, and $\mathbb{P}(y = k)$ with their maximum likelihood estimates.
To demonstrate the use of our covariance estimate in QDA, we consider a binary classification problem described in Hastie et al. (2009). The dataset consists of short sound recordings of male voices saying one of two similar-sounding phonemes, and the goal is to build a classifier for automatic labeling of the sounds. The predictor vectors, $\mathbf{x}$, are log-periodograms, representing the (log) intensity of the recordings across p = 256 frequencies. Because the predictors are naturally ordered, one might expect a regularized estimate of the within-class covariance matrix to be appropriate, and inspection of the sample covariance matrices (see Figure A.3 in Section A.12 of the supplementary materials) supports this. There are $n_1 = 695$ and $n_2 = 1022$ phonemes in the two classes. We randomly split the data into equal-sized training and test sets. On the training set, fivefold cross-validation is performed to select tuning parameters $(\lambda_1, \lambda_2)$, and then the convex banding estimates, $\widehat{\Sigma}^{(1)}_{\lambda_1}$ and $\widehat{\Sigma}^{(2)}_{\lambda_2}$, are used in place of the sample covariances to form the basis of the prediction rule, $\widehat{\mathbb{P}}(y = k \mid \mathbf{x})$. The left panel of Figure 7 shows that QDA can be substantially improved by using a regularized estimate (and, in particular, that convex banding does quite well). Linear discriminant analysis (LDA) is like QDA but assumes a common covariance matrix between classes, $\Sigma^{(k)} = \Sigma$, which is typically estimated as $\mathbf{S}_w = \frac{1}{n_1 + n_2 - 2}[(n_1 - 1)\mathbf{S}^{(1)} + (n_2 - 1)\mathbf{S}^{(2)}]$, where $\mathbf{S}^{(k)} = (n_k - 1)^{-1}\sum_{i=1}^{n_k}(\mathbf{x}_i - \overline{\mathbf{x}}^{(k)})(\mathbf{x}_i - \overline{\mathbf{x}}^{(k)})^T$. The right panel of Figure 7 shows that LDA does better than the QDA methods, as expected in this regime of p and n (see, e.g., Cheng 2004), and can itself be improved by a regularized version of $\mathbf{S}_w$, in which we replace $\mathbf{S}^{(k)}$ in the above expression with $\widehat{\Sigma}^{(k)}_{\lambda_k}$ (again, cross-validation is performed to select the tuning parameters).
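To illustrate how a banded covariance estimate plugs into the discriminant rule, here is a minimal R sketch of the per-class Gaussian score (our own hypothetical helper, not the paper's code); it assumes the supplied covariance estimate is positive definite, which, as shown in Section 4.3, holds for convex banding with high probability (or always for $\widetilde{\Sigma}$).

```r
# Sketch: Gaussian discriminant score for one class, up to additive constants:
# log N_p(x; mu, Sigma) + log(prior). Classify x to the class with largest score.
qda_score <- function(x, mu, Sigma, prior) {
  ev <- eigen(Sigma, symmetric = TRUE)   # requires positive definiteness
  z <- crossprod(ev$vectors, x - mu)     # rotate into the eigenbasis
  -0.5 * sum(log(ev$values)) - 0.5 * sum(z^2 / ev$values) + log(prior)
}
# For LDA, use the same score with a common (regularized) Sigma for all classes.
```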
We have introduced a new kind of banding estimator of the covariance matrix that has strong practical and theoretical performance. Formulating our estimator through a convex optimization problem admits precise theoretical results and efficient computational algorithms. We prove that our estimator is minimax rate adaptive in Frobenius norm over the class of semi-banded matrices introduced in Section 4.2.1 and minimax rate adaptive in operator norm over the class of banded matrices, up to multiplicative logarithmic factors. Both classes of matrices over which optimality is established are allowed to have bandwidth growing with n, p, or both. The construction of adaptive estimators over these classes, with established rate optimality, is, to the best of our knowledge, new. Moreover, the proposed estimator recovers the true bandwidth with high probability, for truly banded matrices, under minimal signal strength conditions. In contrast to existing estimators, our proposed estimator, which enjoys all the above theoretical properties, is guaranteed to be both exactly banded and positive definite, with high probability. We also propose a version of the estimator that is guaranteed to be positive definite in finite samples. Simulation studies show that the hierarchically banded estimator is a strong and fast competitor of the existing estimators. Through a data example, we demonstrate how our estimator can be effectively used to improve the classification error of both quadratic and linear discriminant analysis. Indeed, we expect that many common statistical procedures that require a covariance matrix may be improved by using our estimator when the assumption of semi-bandedness is appropriate. An R (R Core Team 2013) package, named hierband, will be made available, implementing our estimator.
Jacob Bien was supported by National Science Foundation grant DMS-1405746, Florentina Bunea by National Science Foundation grant DMS-10007444, and Luo Xiao by National Institute of Neurological Disorders and Stroke grant R01NS060910. We thank Adam Rothman for providing R code for the nested lasso method, and also Xiaohan Yan and Guo Yu for finding typos in an earlier version.
Jacob Bien, Cornell University, Department of Biological Statistics and Computational Biology, 1178 Comstock Hall, Cornell University, Ithaca, 14853 United States.
Florentina Bunea, Cornell University, Statistical Science, Ithaca, 14850 United States.
Luo Xiao, Johns Hopkins University, Department of Biostatistics, Baltimore, United States.