J Mach Learn Res. Author manuscript; available in PMC 2017 September 19.
PMCID: PMC5603346
NIHMSID: NIHMS842539

Guarding against Spurious Discoveries in High Dimensions

Abstract

Many data-mining and statistical machine learning algorithms have been developed to select a subset of covariates to associate with a response variable. Spurious discoveries can easily arise in high-dimensional data analysis due to the enormous number of such possible selections. How can we know statistically that our discoveries are better than those arising by chance? In this paper, we define a measure of goodness of spurious fit, which quantifies how well a response variable can be fitted by an optimally selected subset of covariates under the null model, and propose a simple and effective LAMM algorithm to compute it. It coincides with the maximum spurious correlation for linear models and can be regarded as a generalized maximum spurious correlation. We derive the asymptotic distribution of such goodness of spurious fit for generalized linear models and L1-regression. The asymptotic distribution depends on the sample size, the ambient dimension, the number of variables used in the fit, and the covariance information. It can be consistently estimated by multiplier bootstrapping and used as a benchmark to guard against spurious discoveries. It can also be applied to model selection, which considers only candidate models whose goodness of fit is better than that of spurious fits. The theory and method are convincingly illustrated by simulated examples and an application to the binary outcomes from the German Neuroblastoma Trials.

Keywords: Bootstrap, Gaussian approximation, generalized linear models, model selection, spurious correlation

1 Introduction

Technological developments in science and engineering lead to collections of massive amounts of high-dimensional data. Scientific advances have become more and more data-driven, and researchers have been making efforts to understand the contemporary large-scale and complex data. Among these efforts, variable selection plays a pivotal role in high-dimensional statistical modeling, where the goal is to extract a small set of explanatory variables that are associated with given responses such as biological, clinical, and societal outcomes. Toward this end, in the past two decades, statisticians have developed many data learning methods and algorithms, and have applied them to solve problems arising from diverse fields of sciences, engineering and humanities, ranging from genomics, neurosciences and health sciences to economics, finance and machine learning. For an overview, see Bühlmann and van de Geer (2011) and Hastie, Tibshirani and Wainwright (2015).

To guard against spurious discoveries, one naturally asks how well a response can be fitted by optimally selected subsets of covariates, even when the response variable is independent of all the covariates. Such a measure of the goodness of spurious fit (GOSF) is a random variable whose distribution can provide a benchmark to gauge whether the discoveries made by statistical machine learning methods are any better than a spurious fit (chance). Measuring such a goodness of spurious fit and estimating its theoretical distribution are the aims of this paper. This problem arises not only from high-dimensional linear models and generalized linear models, but also from robust regression and other statistical model fitting, as well as model selection, in which we restrict candidate models to those that fit better than spurious fits.

Linear regression is often used to investigate the relationship between a response variable Y and explanatory variables X = (X1, …, Xp)⊤. For the high-dimensional linear model Y = X⊤β* + ε, the coefficient β* is assumed to be sparse with support S0 = supp(β*). Variable selection techniques such as forward stepwise regression, the Lasso (Tibshirani, 1996) and folded concave penalized least squares (Fan and Li, 2001; Zou and Li, 2008) are frequently used. However, it has recently been noted in Fan, Guo and Hao (2012) that high dimensionality introduces large spurious correlations between the response and unrelated covariates, which may lead to wrong statistical inference and false scientific discoveries. As an illustration, Fan, Shao and Zhou (2015) considered a real data example using the gene expression data from the international ‘HapMap’ project (Thorisson et al., 2005). There, the sample correlation between the observed and post-Lasso fitted responses is as large as 0.92. While it is conventionally believed that a correlation of 0.92 between the response and a fit is noteworthy, in high-dimensional scenarios this intuition may no longer be true. In fact, even if the response and all the covariates are scientifically independent in the sense that β* = 0, simply by chance, some covariates will appear to be highly correlated with the response. As a result, the findings of any variable selection technique are hardly impressive unless they are proven to be better than those obtained by chance. To formally measure the degree of spurious fit, Fan, Shao and Zhou (2015) derived the distributions of maximum spurious correlations, which provide a benchmark to assess the strength of spurious associations (between a response and independent covariates) and to judge whether discoveries by a certain variable selection technique are better than those by chance.

The response, however, is not always a quantitative value. Instead, it is often binary; for example, positive or negative, presence or absence and success or failure. In this regard, generalized linear models (GLIM) serve as a flexible parametric approach to modeling the relationship between explanatory and response variables (McCullagh and Nelder, 1989). Prototypical examples include linear, logistic and Poisson regression models which are frequently encountered in practice.

In GLIM, the relationship between the response and covariates is more complicated and cannot be effectively measured via the Pearson correlation coefficient, which is essentially a measure of the linear correlation between two variables. We need to extend the concept of spurious correlation, or the measure of goodness of spurious fit, to more general models and study its null distribution. A natural measure of goodness of fit is the likelihood ratio statistic, denoted by LRn(s, p), where n is the sample size and s is the size of the optimally fitted model. It measures the goodness of spurious fit when X and Y are independent. This generalization is consistent with the spurious correlation studied in Fan, Shao and Zhou (2015); that is, applying LRn(s, p) to linear regression yields the maximum spurious correlation. We plan to study the limiting null distribution of 2LRn(s, p) under various scenarios. This reference distribution then serves as a benchmark to determine whether the discoveries are spurious.

To gain further insights, let us illustrate the issue by using the gene expression profiles for 10,707 genes from 251 patients in the German Neuroblastoma Trials NB90-NB2004 (Oberthuer et al., 2006). The response labeled as “3-year event-free survival” (3-year EFS) is a binary outcome indicating whether each patient survived 3 years after the diagnosis of neuroblastoma. Excluding five outlier arrays, there are 246 subjects (101 females and 145 males) with 3-year EFS information available. Among them, 56 are positives and 190 are negatives. We apply the Lasso using the logistic regression model with the tuning parameter selected via ten-fold cross validation (40 genes are selected). The fitted likelihood ratio is $2\widehat{\mathrm{LR}} = 211.9586$. To judge the credibility of the finding of these 40 genes, we should compare the value 211.9586 with the distribution of the Goodness Of Spurious Fit (GOSF) 2LRn(s, p) when X and Y are indeed independent, where n = 246, p = 10,707 and s = 40. This requires some new methodology and technical work. Figure 1 shows the distribution of the GOSF estimated by our proposed method and indicates how unusual the value 211.9586 is relative to that distribution. It can be concluded that the goodness of fit to the binary outcome is not statistically significantly better than GOSF.

Figure 1
Lasso fitted likelihood ratio $2\widehat{\mathrm{LR}}$ in comparison with the distribution of GOSF $2\mathrm{LR}_n(s, p)$ with n = 246, p = 10,707 and s = 40.

The above result shows that the 10-fold cross-validation chooses too large a model, with 40 variables. This prompts us to reduce the model sizes along the Lasso path so that their fits are better than GOSF. The results are reported in Table 2. The largest model selected by cross-validation that fits better than GOSF has model size 17. We can use cross-validation to select a model with model size no more than 17, or to select the best model among all models that fit better than GOSF. This is another important application of our method.

Table 2
Lasso fitted square-root likelihood ratio statistic, the mean cross-validated error, and upper 0.1- and 0.05-quantiles of the multiplier bootstrap approximation based on 2000 bootstrap samples.

1.1 Structure of the paper

In Section 2, we introduce a general measure of spurious fit via generalized likelihood ratios, which extends the concept of spurious correlation in the linear model to more general models, including generalized linear models and robust linear regression. We also introduce a local adaptive majorization-minimization (LAMM) algorithm for computing the GOSF. Section 3 presents the main results on the limiting laws of goodness of spurious fit and their bootstrap approximations. For conducting inference, we use the proposed LAMM algorithm to compute the bootstrap statistic. In Section 4, we discuss an application of our theoretical findings to high-dimensional statistical inference and model selection. Section 5 presents numerical studies. Proofs of the main results, Theorems 3.1 and 3.2, are provided in Section 6; in each case, we break down the key steps in a series of lemmas with proofs deferred to the appendix.

1.2 Notations

We collect standard pieces of notation here for the reader's convenience. For two sequences $\{a_n\}$ and $\{b_n\}$ of positive numbers, we write $a_n = O(b_n)$ or $a_n \lesssim b_n$ if there exists a constant $C > 0$ such that $a_n/b_n \le C$ for all sufficiently large $n$; we write $a_n \asymp b_n$ if there exist constants $C_1, C_2 > 0$ such that, for all $n$ large enough, $C_1 \le a_n/b_n \le C_2$; and we write $a_n = o(b_n)$ if $\lim_{n\to\infty} a_n/b_n = 0$. For $a, b \in \mathbb{R}$, we write $a \vee b = \max(a, b)$.

For every positive integer $\ell$, we write $[\ell] = \{1, 2, \ldots, \ell\}$, and for any set $S$, we use $S^c$ to denote its complement and $|S|$ its cardinality. For any real-valued random variable $X$, its sub-Gaussian norm is defined by $\|X\|_{\psi_2} = \sup_{\ell \ge 1}\ell^{-1/2}(\mathbb{E}|X|^\ell)^{1/\ell}$. We say that a random variable $X$ is sub-Gaussian if $\|X\|_{\psi_2} < \infty$.

Let $p, q$ be two positive integers. For every $p$-vector $u = (u_1, \ldots, u_p)^\top$, we define its $\ell_q$-norm to be $\|u\|_q = (\sum_{i=1}^p |u_i|^q)^{1/q}$, and set $\|u\|_0 = \sum_{i=1}^p I\{u_i \ne 0\}$. Let $S^{p-1} = \{u \in \mathbb{R}^p : \|u\|_2 = 1\}$ be the unit sphere in $\mathbb{R}^p$. Moreover, for each subset $S \subseteq [p]$ with $|S| = s \in [p]$, we denote by $u_S$ the $s$-variate sub-vector of $u$ containing only the coordinates indexed by $S$. We use $\|M\|$ to denote the spectral norm of a matrix $M$.

2 Goodness of spurious fit

Let Y, Y1, …, Yn be independent and identically distributed (i.i.d.) random variables with mean zero and variance σ2 > 0, and X, X1, …, Xn be i.i.d. p-dimensional random vectors. Write

$$X = (X_1, \ldots, X_p)^\top, \quad \mathbb{X} = (X_1, \ldots, X_n)^\top \in \mathbb{R}^{n\times p} \quad \text{and} \quad X_i = (X_{i1}, \ldots, X_{ip})^\top, \quad i = 1, \ldots, n.$$

For s [set membership] [p], the maximum s-multiple correlation between Y and X is given by

$$\widehat{R}_n(s,p) = \max_{\alpha \in \mathbb{R}^p : \|\alpha\|_0 \le s} \widehat{\operatorname{corr}}_n(Y, \alpha^\top X),$$
(2.1)

where $\widehat{\operatorname{corr}}_n(\cdot,\cdot)$ denotes the sample Pearson correlation coefficient. When $Y$ and $X$ are independent, we regard $\widehat{R}_n(s, p)$ as the maximum spurious (multiple) correlation. The limiting distribution of $\widehat{R}_n(s, p)$ is studied in Cai and Jiang (2012) and Fan, Guo and Hao (2012) when $s = 1$ and $X \sim N(0, I_p)$ (the standard normal distribution), and later in Fan, Shao and Zhou (2015) under a general setting where $s \ge 1$ and $X$ is sub-Gaussian with an arbitrary covariance matrix.
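To make the magnitude of spurious correlation concrete, here is a small Monte Carlo sketch (ours, not the paper's; the dimensions and seed are arbitrary illustrative choices) that draws $Y$ independently of $X$ and records the maximum absolute marginal correlation, i.e. the $s = 1$ case of (2.1):

```python
# Sketch: maximum spurious correlation R_n(1, p) under the null model.
import numpy as np

rng = np.random.default_rng(0)
n, p, n_sim = 50, 1000, 200
max_abs_corr = np.empty(n_sim)

for b in range(n_sim):
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)            # independent of X: the null model
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    max_abs_corr[b] = np.abs(corr).max()  # best single spurious predictor

print("median maximum spurious correlation:", np.median(max_abs_corr))
```

Even though the response is pure noise, the best single covariate typically exhibits a correlation far from zero, which is exactly the phenomenon the null distributions studied below quantify.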

For binary data, the sample Pearson correlation is not effective for measuring the regression effect. We need a new metric. In classical regression analysis, the multiple correlation coefficient, also known as the R2, is the proportion of variance explained by the regression model. For each submodel S [subset, dbl equals] [p], its R2 statistic can be computed as

$$R_S^2 = \max_{\theta \in \mathbb{R}^s}\widehat{\operatorname{corr}}_n^2(Y, X_S^\top\theta).$$
(2.2)

Then the squared maximum $s$-multiple correlation $\widehat{R}_n^2(s, p)$ can be expressed as the maximum $R^2$ statistic:

$$\widehat{R}_n^2(s,p) = \max_{S \subseteq [p]: |S| = s} R_S^2.$$
(2.3)

The concept of $R^2$ can be extended to more general models. For binary response models, Maddala (1983) suggested the following generalization: $\log(1 - R^2) = -\frac{2}{n}\{\ell(\hat\beta) - \ell(0)\}$, where $\ell(\hat\beta) = \log L(\hat\beta)$ and $\ell(0) = \log L(0)$ denote the log-likelihoods of the fitted and the null model, respectively. This motivates us to use the likelihood ratio as a generalization of the goodness of fit beyond the linear model.
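As a quick sanity check (a sketch, assuming the dispersion parameter is profiled out of the Gaussian log-likelihood), Maddala's formula reduces to the usual $R^2$ in the linear model:

```latex
% Gaussian linear model with sigma^2 profiled out:
%   ell(hat beta) = -(n/2){log(2 pi hat sigma^2) + 1},   hat sigma^2   = n^{-1} ||Y - X hat beta||_2^2,
%   ell(0)        = -(n/2){log(2 pi hat sigma_0^2) + 1}, hat sigma_0^2 = n^{-1} ||Y - bar Y||_2^2.
% Hence Maddala's generalization evaluated in the linear case gives
\frac{2}{n}\bigl\{\ell(\hat\beta)-\ell(0)\bigr\}
  = \log\!\bigl(\hat\sigma_0^2/\hat\sigma^2\bigr)
  = -\log\bigl(1-R^2\bigr).
```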

Let $L_n(\beta)$, $\beta \in \mathbb{R}^p$, be the negative logarithm of a quasi-likelihood of the sample $\{(Y_i, X_i)\}_{i=1}^n$. For a given model size $s \in [p]$, the best subset fit is $\hat\beta(s) := \arg\min_{\beta \in \mathbb{R}^p: \|\beta\|_0 \le s} L_n(\beta)$. The goodness of such a fit, in comparison with the baseline fit $L_n(0)$, can be measured by

$$\mathrm{LR}_n(s,p) := L_n(0) - L_n(\hat\beta(s)) = L_n(0) - \min_{\beta \in \mathbb{R}^p: \|\beta\|_0 \le s} L_n(\beta).$$
(2.4)

When $X$ and $Y$ are independent, it becomes the Goodness Of Spurious Fit (GOSF). According to (2.2) and (2.3), this definition is consistent with the maximum spurious correlation when it is applied to the linear model with the Gaussian quasi-likelihood, where $L_n(\beta; \beta_0, \sigma) = \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2}\|Y - \beta_0 - \mathbb{X}\beta\|_2^2/\sigma^2$.

Throughout, we refer to $L_n(\cdot)$ as the loss function, which is assumed to be convex. This setup encompasses the generalized linear models (McCullagh and Nelder, 1989) with $L_n(\beta) = \sum_{i=1}^n\{b(X_i^\top\beta) - Y_iX_i^\top\beta\}$ under the canonical link, where $b(\cdot)$ is a model-dependent convex function (we take the dispersion parameter as one, since we do not consider the dispersion issue); robust regression with $L_n(\beta) = \sum_{i=1}^n|Y_i - X_i^\top\beta|$; the hinge loss $L_n(\beta) = \sum_{i=1}^n(1 - Y_iX_i^\top\beta)_+$ in the support vector machine (Vapnik, 1995); and the exponential loss $L_n(\beta) = \sum_{i=1}^n\exp(-Y_iX_i^\top\beta)$ in AdaBoost (Freund and Schapire, 1997) for classification with $Y$ taking values $\pm 1$.
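For concreteness, here is a minimal sketch of these losses, assuming the data are stored in NumPy arrays `X` (n × p) and `y`; the function names are ours, not the paper's:

```python
# Convex losses L_n(beta) listed above; b(u) = log(1 + e^u) gives logistic regression.
import numpy as np

def glim_loss(beta, X, y, b=lambda u: np.logaddexp(0.0, u)):
    """Negative log quasi-likelihood: sum_i { b(x_i'beta) - y_i x_i'beta }."""
    u = X @ beta
    return np.sum(b(u) - y * u)

def lad_loss(beta, X, y):
    """Robust (least absolute deviation) loss: sum_i |y_i - x_i'beta|."""
    return np.sum(np.abs(y - X @ beta))

def hinge_loss(beta, X, y):
    """Hinge loss with labels y in {-1, +1}."""
    return np.sum(np.maximum(0.0, 1.0 - y * (X @ beta)))

def exp_loss(beta, X, y):
    """Exponential (AdaBoost) loss with labels y in {-1, +1}."""
    return np.sum(np.exp(-y * (X @ beta)))
```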

The prime goal of the present paper is to derive the limiting laws of GOSF LRn(s, p) in the null setting where the response Y and the explanatory variables X are independent. Here, both s and p can depend on n, as we shall use double-array asymptotics. We will mainly focus on the GLIM and robust linear regression that are of particular interest in statistics.

2.1 Generalized linear models

Recall that $(Y_1, X_1), \ldots, (Y_n, X_n)$ are i.i.d. copies of $(Y, X)$. Assume that the conditional distribution of $Y$ given $X = x \in \mathbb{R}^p$ belongs to the canonical exponential family, with probability density function of the form (McCullagh and Nelder, 1989)

$$f(y; x, \beta^*) = \exp\left[\{y\,x^\top\beta^* - b(x^\top\beta^*)\}/\phi + c(y, \phi)\right],$$
(2.5)

where $\beta^* = (\beta_1^*, \ldots, \beta_p^*)^\top$ is the unknown $p$-dimensional vector of regression coefficients, and $\phi > 0$ is the dispersion parameter. The log-likelihood of the data $\{(Y_i, X_i)\}_{i=1}^n$ is $\sum_{i=1}^n c(Y_i, \phi) + \phi^{-1}\sum_{i=1}^n\{Y_iX_i^\top\beta - b(X_i^\top\beta)\}$. For simplicity, we take $\phi = 1$, with the exception that in the linear model with Gaussian noise, $\phi = \sigma^2$ is the variance. Two other showcases are

  1. Logistic regression: b(u) = log(1 + eu), u [set membership] R and ϕ = 1.
  2. Poisson regression: b(u) = eu, u [set membership] R and ϕ = 1.

In GLIM, the loss function is $L_n(\beta) = \sum_{i=1}^n\{b(X_i^\top\beta) - Y_iX_i^\top\beta\}$. By (2.4), the generalized measure of goodness of fit for GLIM is

$$\mathrm{LR}_n(s,p) = n b(0) - \min_{\beta \in \mathbb{R}^p: \|\beta\|_0 \le s} L_n(\beta).$$
(2.6)

In Section 3, we derive under mild regularity conditions the limiting distribution of GOSF LRn(s, p) in the null model. This extends the classical Wilks theorem (Wilks, 1938). Here, we interpret LRn(s, p) as the degree of spuriousness caused by the high-dimensionality.

2.2 L1-regression

In this section, we revisit the high-dimensional linear model

$$Y = \mathbb{X}\beta^* + \varepsilon \quad \text{or} \quad Y_i = X_i^\top\beta^* + \varepsilon_i, \quad i = 1, \ldots, n,$$
(2.7)

where $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^\top$ is the $n$-vector of measurement errors. Robustness considerations lead to least absolute deviation (LAD) regression and, more generally, quantile regression (Koenker, 2005). For simplicity, we consider the $\ell_1$-loss $L_n(\beta) = \sum_{i=1}^n|Y_i - X_i^\top\beta|$, $\beta \in \mathbb{R}^p$. The generalized measure of goodness of fit (2.4) now becomes

$$\mathrm{LR}_n(s,p) = \|Y\|_1 - \min_{\beta \in \mathbb{R}^p: \|\beta\|_0 \le s} L_n(\beta).$$
(2.8)

The limiting distribution of GOSF LRn(s, p) is studied in Section 3.4.

In particular, if $\varepsilon_1, \ldots, \varepsilon_n$ in (2.7) are i.i.d. from the double exponential distribution with density $f_\varepsilon(u) = \frac{1}{2}e^{-|u|}$, $u \in \mathbb{R}$, the $\ell_1$-loss $L_n(\cdot)$ corresponds to the negative log-likelihood function. In general, we assume that the regression error $\varepsilon_i$ has median zero, that is, $P(\varepsilon_i \le 0) = \frac{1}{2}$. Hence, the conditional median of $Y_i$ given $X_i$ is $X_i^\top\beta^*$ for $i \in [n]$, and $\beta^* = \arg\min_{\beta \in \mathbb{R}^p}\mathbb{E}_X\{L_n(\beta)\}$, where $\mathbb{E}_X(\cdot) = \mathbb{E}(\cdot \mid X_1, \ldots, X_n)$ denotes the conditional expectation given $\{X_i\}_{i=1}^n$.

2.3 An LAMM algorithm

The computation of the best subset regression coefficient $\hat\beta(s)$ in (2.4) requires solving a combinatorial optimization problem with a cardinality constraint, and is therefore NP-hard. In the following, we suggest a fast and easily implementable method, which combines the forward selection (stepwise addition) algorithm with a local adaptive majorization-minimization (LAMM) algorithm (Lange, Hunter and Yang, 2000; Fan et al., 2015) to provide an approximate solution.

Our optimization problem is $\min_{\beta \in \mathbb{R}^p: \|\beta\|_0 \le s} f(\beta)$, where $f(\beta) = L_n(\beta)$. We say that a function $g(\beta \mid \beta^{(k)})$ majorizes $f(\beta)$ at the point $\beta^{(k)}$ if $f(\beta^{(k)}) = g(\beta^{(k)} \mid \beta^{(k)})$ and $f(\beta) \le g(\beta \mid \beta^{(k)})$ for all $\beta \in \mathbb{R}^p$. A majorization-minimization (MM) algorithm initializes at $\beta^{(0)}$ and then iteratively computes $\beta^{(k+1)} = \arg\min_{\beta \in \mathbb{R}^p: \|\beta\|_0 \le s} g(\beta \mid \beta^{(k)})$. The target value of such an algorithm is non-increasing since

$$f(\beta^{(k+1)}) \overset{\text{majorization}}{\le} g(\beta^{(k+1)} \mid \beta^{(k)}) \overset{\text{minimization}}{\le} g(\beta^{(k)} \mid \beta^{(k)}) \overset{\text{initialization}}{=} f(\beta^{(k)}).$$
(2.9)

We now majorize f(β) at [beta](k) by an isotropic quadratic function

$$g_\lambda(\beta \mid \hat\beta^{(k)}) = f(\hat\beta^{(k)}) + \langle\nabla f(\hat\beta^{(k)}),\, \beta - \hat\beta^{(k)}\rangle + \frac{\lambda}{2}\|\beta - \hat\beta^{(k)}\|_2^2, \quad \beta \in \mathbb{R}^p.$$
(2.10)

This is a valid majorization as long as $\lambda \ge \max_\beta\|\nabla^2 f(\beta)\|$ (this requirement will be relaxed below). The isotropic form on the right-hand side of (2.10) admits the simple analytic solution

$$\hat\beta_\lambda^{(k+1)} = \arg\min_{\beta \in \mathbb{R}^p: \|\beta\|_0 \le s} g_\lambda(\beta \mid \hat\beta^{(k)}) = \{\hat\beta^{(k)} - \lambda^{-1}\nabla f(\hat\beta^{(k)})\}_{[1:s]}.$$

Here, we used the notation that for any β [set membership] Rp, β[1:s] [set membership] Rp retains the s largest (in magnitude) entries of β and assigns the rest to zero.

We propose to use the stepwise forward selection algorithm to compute an initial estimator [beta](0). As the MM algorithm decreases the target value as shown in (2.9), the resulting target value is no larger than that produced by the stepwise forward selection algorithm.

To choose the isotropic parameter $\lambda > 0$ properly without computing the maximum eigenvalue, we use the local adaptive procedure of Fan et al. (2015). Note that, in order to have a non-increasing target value, exact majorization is not actually required: as long as $f(\beta^{(k+1)}) \le g(\beta^{(k+1)} \mid \beta^{(k)})$, the arguments in (2.9) go through. Starting from a prespecified value $\lambda = \lambda_0$, we successively inflate $\lambda$ by a factor $\rho > 1$; after the $\ell$th inflation, $\lambda = \lambda_\ell = \rho^{\ell-1}\lambda_0$. We take the first $\ell$ such that $f(\hat\beta_\lambda^{(k+1)}) \le g_\lambda(\hat\beta_\lambda^{(k+1)} \mid \hat\beta^{(k)})$ and set $\hat\beta^{(k+1)} = \hat\beta_\lambda^{(k+1)}$. Such an $\ell$ always exists because a sufficiently large $\lambda$ majorizes the function $f$. We then continue with the MM iteration. A simple criterion for stopping the iteration is $|f(\hat\beta^{(k+1)}) - f(\hat\beta^{(k)})| \le \epsilon$ for a sufficiently small $\epsilon$, say $10^{-5}$.
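The following is a minimal sketch of the LAMM iteration just described, assuming a generic differentiable convex loss `f` with gradient `grad_f`; for brevity it initializes at zero rather than at the forward-selection warm start, so it illustrates the iteration rather than reproducing the full algorithm:

```python
import numpy as np

def truncate(beta, s):
    """beta_[1:s]: keep the s largest entries in magnitude, set the rest to zero."""
    out = np.zeros_like(beta)
    keep = np.argsort(np.abs(beta))[-s:]
    out[keep] = beta[keep]
    return out

def lamm(f, grad_f, p, s, lam0=1.0, rho=2.0, tol=1e-5, max_iter=500):
    beta = np.zeros(p)                 # illustrative initializer (paper: forward selection)
    for _ in range(max_iter):
        g = grad_f(beta)
        lam = lam0
        while True:
            cand = truncate(beta - g / lam, s)       # minimizer of g_lambda under ||.||_0 <= s
            diff = cand - beta
            upper = f(beta) + g @ diff + 0.5 * lam * diff @ diff
            if f(cand) <= upper:                     # local majorization check
                break
            lam *= rho                               # inflate lambda and retry
        if abs(f(cand) - f(beta)) <= tol:
            return cand
        beta = cand
    return beta
```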

While the LAMM algorithm can be applied to compute $\hat\beta(s)$ in a general setting, in our application it is mainly used to compute the GOSF under the null model (see Figure 1 and Section 3.5). In our simulation experience, the algorithm delivers a sufficiently good solution under the null model. It always provides an upper certificate $f(\hat\beta_0)$ for the problem $\min_{\|\beta\|_0 \le s} f(\beta)$, where $\hat\beta_0$ is the output of the LAMM algorithm. As in Bertsimas, King and Mazumder (2015), if needed to verify the accuracy of our method, a lower certificate is $f(\hat\beta_1)$, where $\hat\beta_1$ is the solution to the convex problem $\min_{\|\beta\|_1 \le B_s} f(\beta)$, and $B_s$ is a sufficiently large constant such that the $\ell_0$-solution satisfies $\|\hat\beta(s)\|_1 \le B_s$. For example, under the null model, it is well known that $\|\hat\beta(s)\|_1 = O_P\{s\sqrt{(\log p)/n}\}$. Therefore, we can take $B_s = C_s\,s\sqrt{(\log p)/n}$ for a sufficiently large constant $C_s$. A data-driven heuristic is to take $B_s = 2\|\hat\beta_1(s)\|_1$, where $\hat\beta_1(s)$ is the Lasso solution along its path with $\|\hat\beta_1(s)\|_0 = s$.

Note that the minimum target value falls in the interval [f([beta]1), f([beta]0)]. If this interval is very tight, we have certified that [beta]0 is an accurate solution.

3 Asymptotic distribution of goodness of spurious fit

3.1 Preliminaries

Define p × p covariance matrices

$$\Sigma = \mathbb{E}(XX^\top) \quad \text{and} \quad \hat\Sigma = n^{-1}\sum_{i=1}^n X_iX_i^\top.$$
(3.1)

For s [set membership] [p], we say that S [subset, dbl equals] [p] is an s-subset if |S| = s. For every s-subset S [subset, dbl equals] [p], let ΣSS and [Sigma]SS be the s × s sub-matrices of Σ and [Sigma] containing the entries indexed by S × S, i.e.

$$\Sigma_{SS} = \mathbb{E}(X_SX_S^\top), \qquad \hat\Sigma_{SS} = n^{-1}\sum_{i=1}^n X_{iS}X_{iS}^\top.$$
(3.2)

Condition 3.1

The covariates are standardized to have unit second moment, i.e. $\mathbb{E}(X_j^2) = 1$ for $j = 1, \ldots, p$. There exists a random vector $U \in \mathbb{R}^p$ satisfying $\mathbb{E}(UU^\top) = I_p$, such that $X = \Sigma^{1/2}U$ and $A_0 := \sup_{v \in S^{p-1}}\|v^\top U\|_{\psi_2} < \infty$.

For 1 ≤ sp, the s-sparse condition number of Σ is given by

γs=γs(Σ)=λmax(s)/λmin(s),
(3.3)

where λmax(s) = maxu[set membership]Sp−1:‖u0s u[top top]Σu and λmin(s) = minu[set membership]Sp−1:‖u0s u[top top]Σu denote the s-sparse largest and smallest eigenvalues of Σ, respectively.

Let G = (G1, …, Gp)[top top] ~ N(0, Σ) be a p-dimensional centered Gaussian random vector with covariance matrix Σ. For any s-subset S [subset, dbl equals] [p], GS ~ N(0, ΣSS). Define the random variable

$$R_0(s,p) = \max_{S \subseteq [p]: |S| = s}\|\Sigma_{SS}^{-1/2}G_S\|_2,$$
(3.4)

which is the maximum of the square roots of a collection of dependent chi-squared random variables, each with $s$ degrees of freedom. The distribution of $R_0(s, p)$ depends on the unknown $\Sigma$ and can be estimated by the multiplier bootstrap in Section 3.5. It will be shown that this distribution is the asymptotic distribution of the GOSF. In particular, in the isotropic case where $\Sigma = I_p$, $R_0^2(s,p) = G_{(1)}^2 + \cdots + G_{(s)}^2$, the sum of the $s$ largest order statistics of $p$ independent $\chi_1^2$ random variables.
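In the isotropic case this null distribution can be simulated directly; the following sketch (with arbitrary illustrative values of $s$, $p$ and the number of replications) draws $R_0^2(s, p)$ as the sum of the $s$ largest of $p$ independent $\chi_1^2$ variables:

```python
import numpy as np

def r0_squared_isotropic(s, p, n_sim=5000, seed=0):
    """Simulate R_0^2(s, p) when Sigma = I_p."""
    rng = np.random.default_rng(seed)
    g2 = rng.standard_normal((n_sim, p)) ** 2   # p independent chi^2_1 draws per replication
    g2.sort(axis=1)
    return g2[:, -s:].sum(axis=1)               # sum of the s largest order statistics

samples = r0_squared_isotropic(s=5, p=1000)
print("upper 5% quantile of R_0^2(5, 1000):", np.quantile(samples, 0.95))
```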

3.2 Generalized linear models

For i.i.d. observations $\{(Y_i, X_i)\}_{i=1}^n$ from the distribution in (2.5), define the individual residuals $\varepsilon_i = Y_i - \mathbb{E}_X(Y_i) = Y_i - b'(X_i^\top\beta^*)$ with conditional variance $\operatorname{Var}_X(\varepsilon_i) = \phi\,b''(X_i^\top\beta^*)$, where $\operatorname{Var}_X(\cdot) = \mathbb{E}_X\{\cdot - \mathbb{E}_X(\cdot)\}^2$. In particular, under the null model, $Y$ is independent of $X$ with mean $\mu_Y := \mathbb{E}(Y) = b'(0)$ and variance $\sigma_Y^2 := \operatorname{Var}(Y) = \phi\,b''(0)$.

Condition 3.2

There exists $a_0 > 0$ such that $\mathbb{E}\exp\{u\sigma_Y^{-1}(Y - \mu_Y)\} \le \exp(a_0u^2/2)$ holds for all $u \in \mathbb{R}$. The function $b(\cdot)$ in (2.5) satisfies

$$\min_{u: |u| \le 1} b''(u) \ge a_1 \quad \text{and} \quad \max_{u: |u| \le 1}|b'''(u)| \le A_1$$
(3.5)

for some constants a1, A1 > 0.

Condition 3.2 is satisfied by a wide class of GLIMs, including the logistic and Poisson regression models. The following theorem shows that, under certain moment and regularity conditions, the distribution of the generalized likelihood ratio statistic 2LRn(s, p) can be consistently approximated by that of R02(s,p) given in (3.4).

Theorem 3.1

Let Conditions 3.1 and 3.2 be satisfied. Assume that $\phi = 1$ in (2.5), $p, n \ge 3$ and $1 \le s \le \min(p, n)$. Then, under the null model in (2.5) with $\beta^* = 0$,

$$\sup_{t \ge 0}\left|P\{2\mathrm{LR}_n(s,p) \le t\} - P\{R_0^2(s,p) \le t\}\right| \le C\left[\{s\log(\gamma_s pn)\}^{7/8}n^{-1/8} + \gamma_s^{1/2}\{s\log(\gamma_s pn)\}^{2}n^{-1/2}\right],$$
(3.6)

where C > 0 is a constant depending only on a0, a1, A0, A1 in Conditions 3.1 and 3.2.

Remark 3.1

We regard Theorem 3.1 as a nonasymptotic, high-dimensional version of the celebrated Wilks theorem. In the low-dimensional setting where $s = p$ is fixed, Theorem 3.1 reduces to the conventional Wilks theorem, which asserts that the generalized likelihood ratio statistic converges in distribution to $\chi_p^2$. In addition, (3.6) provides a Berry–Esseen-type bound.

3.3 Linear least squares regression

As a specific case of GLIM, we consider the linear regression model (2.7) with the loss function $L_n(\beta) = \frac{1}{2}\|Y - \mathbb{X}\beta\|_2^2$. The corresponding likelihood ratio statistic

$$\mathrm{LR}_n(s,p) = \frac{1}{2}\|Y\|_2^2 - \min_{\beta \in \mathbb{R}^p: \|\beta\|_0 \le s} L_n(\beta)$$
(3.7)

then coincides with that in (2.6) with $b(u) = \frac{1}{2}u^2$. We state the null limiting distribution of $\mathrm{LR}_n(s, p)$ in a general case, where $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. copies of a sub-Gaussian random variable $\varepsilon$. Specifically, we assume that

Condition 3.3

$\varepsilon$ is a centered, sub-Gaussian random variable with $\operatorname{Var}(\varepsilon) = \sigma^2 > 0$ and $K_0 := \|\varepsilon\|_{\psi_2} < \infty$. Moreover, write $v_\ell = \mathbb{E}(|\varepsilon|^\ell)$ for $\ell \ge 3$.

The following corollary is a particular case of the general result Theorem 3.1 with $b(u) = \frac{1}{2}u^2$, $u \in \mathbb{R}$, and $\phi = \sigma^2$. By examining the proof of Theorem 3.1 and noting that $b''' \equiv 0$, it can easily be shown that the second term on the right-hand side of (3.6) vanishes. Hence, the proof is omitted.

Corollary 3.1

Let Conditions 3.1 and 3.3 hold. Assume that p, n ≥ 3 and 1 ≤ s ≤ min(p, n). Then, under the null model (2.7) with β* = 0,

$$\sup_{t \ge 0}\left|P\{2\mathrm{LR}_n(s,p) \le t\} - P\{\sigma^2R_0^2(s,p) \le t\}\right| \le C\{s\log(\gamma_s pn)\}^{7/8}n^{-1/8},$$

where C > 0 is a constant depending only on A0 and K0 in Conditions 3.1 and 3.3.

Remark 3.2

Under the null model, the variance $\sigma^2$ can be consistently estimated by $\hat\sigma_0^2 = n^{-1}\sum_{i=1}^n(Y_i - \bar Y)^2$, where $\bar Y = n^{-1}\sum_{i=1}^n Y_i$. Under the same conditions as Corollary 3.1, it can be proved that

$$\sup_{t \ge 0}\left|P\{2\mathrm{LR}_n(s,p) \le t\} - P\{\hat\sigma_0^2R_0^2(s,p) \le t\}\right| \lesssim \{s\log(\gamma_s pn)\}^{7/8}n^{-1/8},$$

which is in line with Theorem 3.1 in Fan, Shao and Zhou (2015). To see this, note that

$$2\mathrm{LR}_n(s,p) = \|Y\|_2^2 - \min_{S \subseteq [p]: |S| = s}\,\min_{\theta \in \mathbb{R}^s}\|Y - \mathbb{X}_S\theta\|_2^2 = \max_{S \subseteq [p]: |S| = s} Y^\top\mathbb{X}_S(\mathbb{X}_S^\top\mathbb{X}_S)^{-1}\mathbb{X}_S^\top Y = \max_{\alpha \in \mathbb{R}^p: \|\alpha\|_0 \le s}\frac{(Y^\top\mathbb{X}\alpha)^2}{\|\mathbb{X}\alpha\|_2^2}.$$

The estimator $\hat\sigma_0^2$, used in computing the maximum spurious correlation, can be seriously biased beyond the null model and hence can adversely affect the power. Thus, we suggest using either the refitted cross-validation procedure (Fan, Guo and Hao, 2012) or the scaled Lasso estimator (Sun and Zhang, 2012) to estimate $\sigma^2$.

3.4 Linear median regression

We now state an analogous result to Theorem 3.1 regarding the [ell]1-loss considered in Section 2.2.

Condition 3.4

The noise variables $\varepsilon_1, \ldots, \varepsilon_n$ in (2.7) are i.i.d. copies of a random variable $\varepsilon$ satisfying $\mathbb{E}|\varepsilon|^\kappa < \infty$ for some $1 < \kappa \le 2$. There exist positive constants $a_2 < (\mathbb{E}|\varepsilon|)^{-1}$, $A_2$ and $A_3$ such that the distribution function $F_\varepsilon(\cdot)$ and the density function $f_\varepsilon(\cdot)$ of $\varepsilon$ satisfy

$$2\max\{1 - F_\varepsilon(u),\, F_\varepsilon(-u)\} \le (1 + a_2u)^{-1} \quad \text{for all } u \ge 0,$$
(3.8)

$$\max_u f_\varepsilon(u) \le A_2 \quad \text{and} \quad \max_{u: |u| \le 1}\max\{|f_\varepsilon'(u+)|, |f_\varepsilon'(u-)|\} \le A_3.$$
(3.9)

Theorem 3.2

If p, n ≥ 3 and 1 ≤ s ≤ min(p, n), then under the null model (2.7) with β* = 0 and Conditions 3.1 and 3.4, we have

$$\sup_{t \ge 0}\left|P\{2\mathrm{LR}_n(s,p) \le t\} - P\left[R_0^2(s,p)/\{2f_\varepsilon(0)\} \le t\right]\right| \le C_1n^{1-\kappa} + C_2\left[\{s\log(\gamma_s pn)\}^{7/8}n^{-1/8} + \gamma_s^{1/4}\{s\log(\gamma_s pn)\}^{3/2}n^{-1/4}\right],$$
(3.10)

where LRn(s, p) is given by (2.8), C1 > 0 is a constant depending on a2, κ, E|ε|, E|ε|κ and C2 > 0 is a constant depending on a2, A0, A2 and A3 in Conditions 3.1 and 3.4.

Remark 3.3

Under the null model, the unknown parameter $f_\varepsilon(0)$ can be consistently estimated by the kernel density estimator $\hat f_\varepsilon(0) = (nh)^{-1}\sum_{i=1}^n K(Y_i/h)$, where $K(\cdot)$ is a kernel function and $h = h_n > 0$ is the bandwidth. For simplicity, we may use the Epanechnikov kernel $K_{\mathrm{Epa}}(u) = \frac{3}{4}(1 - u^2)I(|u| \le 1)$ along with the rule-of-thumb bandwidth $h_{\mathrm{ROT}} = 2.34\,\hat\sigma_0 n^{-1/5}$, where $\hat\sigma_0^2 = n^{-1}\sum_{i=1}^n(Y_i - \bar Y)^2$.
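A small sketch of this estimator, assuming the responses are held in a NumPy array `y`, is:

```python
import numpy as np

def f_eps_at_zero(y):
    """Epanechnikov kernel estimate of f_eps(0) with the rule-of-thumb bandwidth."""
    n = y.size
    sigma0 = np.sqrt(np.mean((y - y.mean()) ** 2))
    h = 2.34 * sigma0 * n ** (-0.2)               # h_ROT = 2.34 * sigma0 * n^{-1/5}
    u = y / h
    k = 0.75 * (1.0 - u ** 2) * (np.abs(u) <= 1)  # K_Epa(u) = (3/4)(1 - u^2) I(|u| <= 1)
    return k.sum() / (n * h)
```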

3.5 Multiplier bootstrap procedure

The distribution of the random variable $R_0(s, p)$ given by (3.4) depends on the unknown covariance matrix $\Sigma$. In practice, it is natural to replace $\Sigma$ by $\hat\Sigma = n^{-1}\sum_{i=1}^n X_iX_i^\top$ and $G \sim N(0, \Sigma)$ by $\hat G \sim N(0, \hat\Sigma)$ in the definition of $R_0(s, p)$. With this substitution, the distribution of $R_0(s, p)$ can be simulated. In particular, $\hat G$ can be generated as $n^{-1/2}\sum_{i=1}^n e_iX_i$, where $e_1, \ldots, e_n$ are i.i.d. standard normal random variables independent of $\{X_i\}_{i=1}^n$. The resulting statistic is

$$R_n(s,p) = \max_{S \subseteq [p]: |S| = s}\|\hat\Sigma_{SS}^{-1/2}\hat G_S\|_2,$$
(3.11)

which is a multiplier bootstrap version of R0(s, p). The following proposition follows directly from Theorem 3.2 in Fan, Shao and Zhou (2015).

Proposition 3.1

Assume that Condition (3.1) holds, 1 ≤ s ≤ min(p, n) and s log(γspn) = o(n1/5) as n → ∞. Then supt≥0 |P{R0(s, p) ≤ t} − P{Rn(s, p) ≤ t |X1, …, Xn}| → 0 in probability.

The computation of $R_n(s, p)$ requires solving a combinatorial optimization problem. This can be alleviated by using the LAMM algorithm in Section 2.3. To begin with, by Remark 3.2, we write $R_n(s, p)$ in (3.11) as

$$R_n^2(s,p) = \max_{S \subseteq [p]: |S| = s} e^\top\mathbb{X}_S(\mathbb{X}_S^\top\mathbb{X}_S)^{-1}\mathbb{X}_S^\top e = \|e\|_2^2 - \min_{\beta \in \mathbb{R}^p: \|\beta\|_0 \le s}\|e - \mathbb{X}\beta\|_2^2,$$

where $e = (e_1, \ldots, e_n)^\top$ and $\mathbb{X}_S = (X_{1S}, \ldots, X_{nS})^\top$ for every subset $S \subseteq [p]$. This can be computed approximately by the LAMM algorithm in Section 2.3 applied to the quadratic loss, yielding a solution $\hat\beta(s)$. Finally, we set $R_n^2(s,p) = \|e\|_2^2 - \|e - \mathbb{X}\hat\beta(s)\|_2^2$.
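A minimal sketch of the resulting bootstrap is given below; for brevity it approximates the best $s$-subset fit to the multipliers by greedy stepwise addition (our simplification of the forward-selection-plus-LAMM scheme above), so the reported quantiles are approximate:

```python
import numpy as np

def bootstrap_rn(X, s, n_boot=1000, seed=0):
    """Draw multiplier-bootstrap replicates of R_n(s, p)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    stats = np.empty(n_boot)
    for b in range(n_boot):
        e = rng.standard_normal(n)                     # Gaussian multipliers
        S, resid = [], e.copy()
        for _ in range(s):                             # greedy stepwise addition
            scores = (X.T @ resid) ** 2 / (X ** 2).sum(axis=0)
            scores[S] = -np.inf
            S.append(int(np.argmax(scores)))
            theta, *_ = np.linalg.lstsq(X[:, S], e, rcond=None)
            resid = e - X[:, S] @ theta
        stats[b] = np.sqrt(e @ e - resid @ resid)      # R_n(s, p) for this draw
    return stats

# Upper alpha-quantile used as the GOSF benchmark, e.g. alpha = 0.05:
# q_alpha = np.quantile(bootstrap_rn(X, s), 0.95)
```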

The numerical performance may be improved by employing mixed integer optimization formulations (Bertsimas, King and Mazumder, 2015). Such an attempt, however, is beyond the scope of the paper and we leave it for future research.

4 Spurious discoveries and model selection

Based on the theoretical developments in Section 3, here we address the question whether discoveries by machine learning and data mining techniques for GLIM are any better than by chance. For simplicity, we focus on the Lasso. Let qα(s, p) be the upper α-quantile of the random variable R0(s, p) defined by (3.4). Assume that the dispersion parameter ϕ in (2.5) equals 1. By Theorem 3.1, we see that for any prespecified α [set membership] (0, 1),

$$P\{2\mathrm{LR}_n(s,p) \le q_\alpha^2(s,p)\} \to 1 - \alpha,$$
(4.1)

where LRn(s, p) is as in (2.6).

Let $\hat\beta_\lambda = \arg\min_\beta\{L_n(\beta) + \lambda\|\beta\|_1\}$ be the $\ell_1$-penalized maximum likelihood estimator with $\hat s_\lambda = |\hat S_\lambda| = |\operatorname{supp}(\hat\beta_\lambda)|$, where $\lambda > 0$ is the regularization parameter. Since $\hat s_\lambda$ covariates are selected, the goodness of fit is the likelihood ratio $L_n(0) - L_n(\hat\beta_\lambda)$, which should be compared with the distribution of the GOSF $\mathrm{LR}_n(s, p)$ with $s = \hat s_\lambda$. In view of (4.1), if

$$L_n(\hat\beta_\lambda) \ge L_n(0) - q_\alpha^2(\hat s_\lambda, p)/2 = nb(0) - q_\alpha^2(\hat s_\lambda, p)/2,$$

then we may regard the discovery of variables Ŝλ as spurious, no better than fitting by chance.

In practice, the unknown quantile $q_\alpha(s, p)$ is replaced by its bootstrap version $q_{n,\alpha}(s, p)$, the upper $\alpha$-quantile of $R_n(s, p)$ defined by (3.11). This leads to the following data-driven criterion for judging whether the discovery $\hat S_\lambda$ is spurious:

$$L_n(\hat\beta_\lambda) \ge nb(0) - q_{n,\alpha}^2(\hat s_\lambda, p)/2.$$
(4.2)

The theoretical justification is given by Theorem 3.1 and Proposition 3.1. In particular, when the loss is quadratic, this reduces to the case studied by Fan, Shao and Zhou (2015).

The concept of GOSF and its theoretical quantile provide important guidelines for model selection. Let $\hat\beta_{cv}$ be a cross-validated Lasso estimator, which selects $\hat s_{cv} = \|\hat\beta_{cv}\|_0$ important variables. Due to the bias of the $\ell_1$ penalty, the Lasso typically selects a far larger model, since the visible bias in the Lasso forces the cross-validation procedure to choose a smaller value of $\lambda$. This phenomenon is documented in the simulation studies; see Table 1 in Section 5.2. With an over-selected model, the goodness of fit $\hat{\mathrm{LR}}_\lambda = L_n(0) - L_n(\hat\beta_\lambda)$ can sometimes be worse than the spurious fit. To avoid over-selection, we suggest an alternative procedure that uses the quantity $q_{n,\alpha}(s, p)$ as a guide for choosing the tuning parameter, which guards against spurious discoveries; a sketch appears below. More specifically, for each $\lambda$ in the Lasso solution path, we compute $\hat{\mathrm{LR}}_\lambda$ and $q_{n,\alpha}(s, p)|_{s=\hat s_\lambda}$ with a prespecified $\alpha$. Starting from the largest $\lambda$, we stop the Lasso path the first time the sign of $2\hat{\mathrm{LR}}_\lambda - q_{n,\alpha}^2(\hat s_\lambda, p)$ changes from positive to negative, and let $\hat\lambda_{\mathrm{fit}}$ be the smallest $\lambda$ satisfying $2\hat{\mathrm{LR}}_\lambda \ge q_{n,\alpha}^2(\hat s_\lambda, p)$. Denote by $\hat s_{\mathrm{fit}}$ the corresponding selected model size. This value can be regarded as the maximum model size for cross-validation to choose from. Another viable alternative is to select the best cross-validated model only among those whose fits are better than GOSF. We will show by simulation in Section 5.2 that this procedure selects a much smaller model size, closer to the truth.
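A minimal sketch of this selection rule follows; `lasso_path_fits` and `gosf_quantile` are hypothetical stand-ins for the user's Lasso path output and the bootstrap quantile $q_{n,\alpha}(s, p)$:

```python
def largest_non_spurious_size(lasso_path_fits, gosf_quantile, alpha=0.1):
    """lasso_path_fits: iterable of (s_hat, LR_hat) ordered by decreasing lambda.
    gosf_quantile(s, alpha): upper alpha-quantile of the bootstrap statistic R_n(s, p).
    Returns s_fit, the largest model size along the path whose fit beats the GOSF benchmark."""
    s_fit, seen_positive = 0, False
    for s_hat, lr_hat in lasso_path_fits:
        if s_hat == 0:
            continue
        beats_gosf = 2.0 * lr_hat >= gosf_quantile(s_hat, alpha) ** 2
        if beats_gosf:
            s_fit, seen_positive = s_hat, True
        elif seen_positive:
            break                 # first positive-to-negative sign change: stop the path
    return s_fit
```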

Table 1
The empirical power and the median size of the selected models, with its robust standard deviation (RSD) in parentheses, based on 200 simulations when p = 400 and α = 10%. RSD is the interquartile range divided by 1.34.

5 Numerical studies

5.1 Accuracy of the Gaussian approximation

First, we ran a simulation study to examine how accurately the Gaussian approximation $R_0^2(s, p)$ approximates the generalized likelihood ratio statistic $2\mathrm{LR}_n(s, p)$ in the null model. To illustrate the method, we focus on the logistic regression model: $P(Y = 1 \mid X) = \exp(X^\top\beta^*)/\{1 + \exp(X^\top\beta^*)\}$. Under the null model $\beta^* = 0$, $Y_1, \ldots, Y_n$ are i.i.d. Bernoulli random variables with success probability $1/2$. Independently of the $Y_i$'s, we generate $X_i \sim N(0, \Sigma)$ with two different covariance matrices: $\Sigma_1 = (\rho^{|j-k|})_{1 \le j,k \le p}$ and $\Sigma_2 = (\sigma_{2,jk})_{1 \le j,k \le p}$, where

$$\sigma_{2,jk} = \left(||j-k|+1|^{2\rho} + ||j-k|-1|^{2\rho} - 2|j-k|^{2\rho}\right)/2, \quad 1 \le j, k \le p.$$

The first design has an AR(1) correlation structure (a short-memory process), whereas the second design reflects strong long memory dependence. We take ρ = 0.8 in both cases.
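For completeness, a small sketch that builds the two covariance designs (our helper; $\rho = 0.8$ as in the text) is:

```python
import numpy as np

def make_designs(p, rho=0.8):
    """Sigma_1: AR(1) correlation; Sigma_2: long-memory (fractional-noise type) covariance."""
    d = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))   # |j - k|
    sigma1 = rho ** d
    sigma2 = (np.abs(d + 1.0) ** (2 * rho) + np.abs(d - 1.0) ** (2 * rho)
              - 2.0 * d ** (2.0 * rho)) / 2.0
    return sigma1, sigma2
```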

Figure 2 reports the distributions of generalized likelihood ratios (GLRs) and their Gaussian approximations (GARs) when n = 400, p = 1000 and s [set membership] {1, 2, 5, 10}. The results show that the accuracy of Gaussian approximation is fairly reasonable and is affected by the size of s as well as the dependence between the coordinates of X.

Figure 2
Distributions of generalized likelihood ratios (red) and Gaussian approximations (blue) based on 5000 simulations for n = 400, p = 1000 and s = 1, 2, 5, 10 when Σ is equal to Σ1 (upper panel) or Σ2 (lower panel).

5.2 Detection of spurious discoveries

In this section, we conduct a moderate scale simulation study to examine how effective the multiplier bootstrap quantile qn(s, p) serves as a benchmark for judging whether the discovery is spurious. To illustrate the main idea, again we restrict our attention to the logistic regression model.

The results reported here are based on 200 simulations with the ambient dimension p = 400 and the sample size n taking values in {120, 160, 200}. The true regression coefficient vector β* ∈ Rp is (3, −1, 3, −1, 3, 0, …, 0)⊤. We consider two random designs: Σ = Ip (independent) and Σ = (0.5|j−k|)1≤j,k≤p (dependent).

Let [beta]cv be the five-fold cross-validated Lasso estimator, which selects a model of size ŝcv = ‖[beta]cv0. For a given α [set membership] (0, 1), consider the spurious discovery probability (SDP)

$$P\left\{n\log(2) - L_n(\hat\beta_{cv}) \le q_{n,\alpha}^2(\hat s_{cv}, p)/2\right\},$$

which is basically the probability of a type II error, since the simulated model is not null. We take $\alpha = 0.1$ and compute the empirical SDP based on 200 simulations. For each simulated data set, $q_{n,\alpha}(s, p)|_{s=\hat s_{cv},\,p=400}$ is computed based on 1000 bootstrap replications. The results are reported in Table 1.

As reflected by Table 1, the empirical power, which is one minus the empirical SDP, increases rapidly as the sample size n grows. This is in line with our intuition that the more data we have, the less likely that the discovery by a variable selection method is spurious. When the sample size is small, the SDP can be high and hence the discovery Ŝcv = supp([beta]cv) should be interpreted with caution. We need either more samples or more powerful variable selection methods.

We see from Table 1 that the Lasso with cross-validation selects a far larger model size than the true one, which is 5. This is because the intrinsic bias of the Lasso forces the cross-validation procedure to choose a smaller value of λ. We now use our procedure in Section 4 to choose the tuning parameter from the Lasso solution path. As before, we take α = 0.1 in $q_{n,\alpha}(s, p)$ to provide an upper bound on the model size from the perspective of guarding against spurious discoveries. Over 200 simulations, the empirical median of $\hat s_{\mathrm{fit}}$ and its robust standard deviation are 9 and 1.87 when (n, p) = (200, 400) and Σ = (0.5|j−k|)1≤j,k≤p. The over-selection phenomenon is considerably alleviated.

5.3 Neuroblastoma data

In this section, we apply the idea of detecting spurious discoveries to the neuroblastoma data reported in Oberthuer et al. (2006). This data set consists of 251 patients of the German Neuroblastoma Trials NB90-NB2004, diagnosed between 1989 and 2004. The complete data set, obtained via the MicroArray Quality Control phase-II (MAQC-II) project (Shi et al., 2012), includes gene expression over 10,707 probe sites. There are 246 subjects with 3-year event-free survival information available (56 positive and 190 negative). See Oberthuer et al. (2006) for more details about the data sets.

For each λ > 0, we apply the Lasso with the logistic regression model to select $\hat s_\lambda$ genes. In particular, the ten-fold cross-validated Lasso selects $\hat s_{cv} = 40$ genes. Then we calculate the goodness of fit $\hat{\mathrm{LR}}_\lambda := L_n(0) - L_n(\hat\beta_\lambda) = n\log(2) - L_n(\hat\beta_\lambda)$. Along the Lasso path, we record in Table 2 the number of selected probes, the corresponding square-root goodness of fit $(2\hat{\mathrm{LR}}_\lambda)^{1/2}$, and the upper α-quantiles of the multiplier bootstrap approximation $R_n(s, p)|_{s=\hat s_\lambda,\,p=10{,}707}$ with α = 10% and 5%, based on 2000 bootstrap replications. For illustrative purposes, we display only the part of the Lasso solution path with selected model size $\hat s_\lambda$ between 20 and 40. From Table 2, we observe that only the discovery of 17 probes has a generalized measure of goodness of fit better than GOSF at α = 5%, whereas the finding (of the 40 probes) via the cross-validation procedure is likely to be over-selected.

6 Proofs

We now turn to the proofs of Theorems 3.1 and 3.2. In each proof, we provide the primary steps, with more technical details stated as lemmas and proved in the appendix.

6.1 Proof of Theorem 3.1

Throughout the proof, we work with the quasi-likelihood $\ell_n(\beta) = -L_n(\beta) = \sum_{i=1}^n\{Y_iX_i^\top\beta - b(X_i^\top\beta)\}$ and consider the general case where the dispersion parameter $\phi$ in (2.5) is specified (not necessarily equal to 1, to facilitate the derivations for the normal case). For a given $s \in [p]$, define

$$Q_n(s,p) = \max_{\beta \in \mathbb{R}^p: \|\beta\|_0 \le s}\ell_n(\beta) \quad \text{and} \quad Q_n^* = \ell_n(0).$$

We divide the proof into three steps. First, for each $s$-subset $S \subseteq [p]$, we prove the Wilks result for the $S$-restricted model, in which only the covariates indexed by $S$ are included. Specifically, we show that the square root of the deviation of the $S$-restricted maximum log-likelihood from its baseline value under the null model can be well approximated by the $\ell_2$-norm of the normalized score vector. Second, based on a high-dimensional invariance principle, we prove the Gaussian/chi-squared approximation for the maximum of the $\ell_2$-norms of the normalized score vectors. Finally, we apply an anti-concentration argument to obtain a non-asymptotic Wilks approximation for $2\{Q_n(s,p) - Q_n^*\}$.

Step 1: Wilks approximation

In the null model where $Y$ and $X$ are independent, the true parameter $\beta^*$ in (2.5) is zero, and thus the density function of $Y$ has the form $f(y) = \exp\{-\phi^{-1}b(0) + c(y, \phi)\}$. Moreover, we have

$$\arg\max_{\beta \in \mathbb{R}^p}\mathbb{E}_X\{\ell_n(\beta)\} = \arg\max_{\beta \in \mathbb{R}^p}\sum_{i=1}^n\mathbb{E}_X\{Y_iX_i^\top\beta - b(X_i^\top\beta)\} = 0.$$

To see this, note that in model (2.5) with $\beta^* = 0$, $\mathbb{E}(Y) = b'(0)$ and $\operatorname{Var}(Y) = \phi b''(0)$. This implies that $\mathbb{E}_X\{\ell_n(\beta)\} = \sum_{i=1}^n\{b'(0)X_i^\top\beta - b(X_i^\top\beta)\}$. This function is strictly concave in $\beta$, and $\beta = 0$ satisfies its first-order condition; hence it is the maximizer.

For each $s$-subset $S \subseteq [p]$, define the $S$-restricted log-likelihood $\ell_n^S(\theta) = \sum_{i=1}^n\{Y_iX_{iS}^\top\theta - b(X_{iS}^\top\theta)\}$ and the score function $\nabla\ell_n^S(\theta) = \sum_{i=1}^n\{Y_i - b'(X_{iS}^\top\theta)\}X_{iS}$, $\theta \in \mathbb{R}^s$. In this notation, it can be seen from (2.6) that

$$Q_n(s,p) = \max_{S \subseteq [p]: |S| = s}\,\max_{\theta \in \mathbb{R}^s}\ell_n^S(\theta) = \max_{S \subseteq [p]: |S| = s}\ell_n^S(\hat\theta_S),$$
(6.1)

where

$$\hat\theta_S = (\hat\theta_{S,1}, \ldots, \hat\theta_{S,s})^\top = \arg\max_{\theta \in \mathbb{R}^s}\ell_n^S(\theta)$$
(6.2)

denotes the maximum likelihood estimate of the target parameter for the $S$-restricted model, which is given by $\theta_S^* := \arg\max_{\theta \in \mathbb{R}^s}\mathbb{E}_X\{\ell_n^S(\theta)\} = 0$.

Given the i.i.d. observations $\{(Y_i, X_i)\}_{i=1}^n$, $\nabla\mathbb{E}_X\{\ell_n^S(\theta)\} = \sum_{i=1}^n\{b'(0) - b'(X_{iS}^\top\theta)\}X_{iS}$ and $H_S(\theta) := -\nabla^2\mathbb{E}_X\{\ell_n^S(\theta)\} = \sum_{i=1}^n b''(X_{iS}^\top\theta)X_{iS}X_{iS}^\top$ for $\theta \in \mathbb{R}^s$. In particular, write

$$H_S^* := H_S(0) = nb''(0)\,\hat\Sigma_{SS}$$
(6.3)

for $\hat\Sigma_{SS}$ as in (3.2). Further, define the $S$-restricted normalized score

$$\hat\xi_S = H_S^{*-1/2}\nabla\ell_n^S(0) = \{nb''(0)\}^{-1/2}\,\hat\Sigma_{SS}^{-1/2}\sum_{i=1}^n\varepsilon_iX_{iS}, \qquad \varepsilon_i = Y_i - b'(0).$$
(6.4)

The following result is a conditional analogue of Corollary 1.12 in the supplement of Spokoiny (2012), which provides an exponential inequality for the $\ell_2$-norm of $\hat\xi_S$ given $\{X_i\}_{i=1}^n$. The proofs of this lemma and the other lemmas can be found in the appendix.

Lemma 6.1

Assume that Conditions 3.1 and 3.2 hold. Then, for every t ≥ 0,

$$P_X\left\{\|\hat\xi_S\|_2^2 \ge 2a_0\phi\,\Delta(s,t)\right\} \le 2e^{-t}$$
(6.5)

holds almost surely on the event {[Sigma]SS [succeeds] 0}, where

$$\Delta(s,t) = \begin{cases} s + (8ts)^{1/2}, & \text{if } 0 \le t \le \frac{1}{18}(2s)^{1/2},\\ s + 6t, & \text{if } t > \frac{1}{18}(2s)^{1/2}. \end{cases}$$
(6.6)

The following lemma characterizes the Wilks phenomenon from a non-asymptotic point of view. Recall that $\hat\theta_S$ in (6.2) is the $S$-restricted maximum likelihood estimator and that, in the null model, $\ell_n^S(0) = \ell_n(0) = -nb(0)$ and $\sigma_Y^2 = \operatorname{Var}(Y) = \phi b''(0)$. For every $\tau > 0$, define the event

$$\mathcal{E}_0(\tau) = \bigcap_{S \subseteq [p]: |S| = s}\left\{\hat\Sigma_{SS} \succ 0,\ \max_{1 \le i \le n} X_{iS}^\top\hat\Sigma_{SS}^{-1}X_{iS} \le \tau\right\}.$$
(6.7)

Lemma 6.2

Assume that Conditions 3.1 and 3.2 hold. Then, on the event $\mathcal{E}_0(\tau)$ for any $\tau > 0$,

$$P_X\left(\max_{S \subseteq [p]: |S| = s}\left|\left[2\{\ell_n^S(\hat\theta_S) - \ell_n(0)\}\right]^{1/2} - \|\hat\xi_S\|_2\right| \ge C_1\phi\,\tau^{1/2}\,\frac{s\log(pn)}{\sqrt{n}}\right) \le 5n^{-1}$$
(6.8)

whenever $n \ge C_2\,\phi\,\tau\,s\log(pn)$, where $C_1$ and $C_2$ are positive constants depending only on $a_0$, $a_1$, $A_1$ and $b''(0)$.

To apply Lemma 6.2, we first show that, for properly chosen $\tau$, the event $\mathcal{E}_0(\tau)$ occurs with high probability. Applying Theorem 5.39 in Vershynin (2012) to the random vectors $\Sigma_{SS}^{-1/2}X_{1S}, \ldots, \Sigma_{SS}^{-1/2}X_{nS}$ yields that, for every $t \ge 0$,

$$\left\|\Sigma_{SS}^{-1/2}\hat\Sigma_{SS}\Sigma_{SS}^{-1/2} - I_s\right\| = \left\|n^{-1}\Sigma_{SS}^{-1/2}\mathbb{X}_S^\top\mathbb{X}_S\Sigma_{SS}^{-1/2} - I_s\right\| \le \max(\delta, \delta^2)$$
(6.9)

holds with probability at least $1 - 2e^{-t}$, where $\delta = C_3(s \vee t)^{1/2}n^{-1/2}$ and $C_3 > 0$ is a constant depending only on $A_0$. This, together with Boole's inequality, implies by taking $t = s\log\frac{ep}{s} + \log n$ that, with probability at least $1 - 2n^{-1}$,

$$\max_{S \subseteq [p]: |S| = s}\left\|\Sigma_{SS}^{-1/2}\hat\Sigma_{SS}\Sigma_{SS}^{-1/2} - I_s\right\| \le C_3\left(\frac{s\log\frac{ep}{s} + \log n}{n}\right)^{1/2} \le \frac{1}{2}$$
(6.10)

whenever $n \ge 4C_3^2\left(s\log\frac{ep}{s} + \log n\right)$. Provided (6.10) holds, the smallest eigenvalue of $\Sigma_{SS}^{-1/2}\hat\Sigma_{SS}\Sigma_{SS}^{-1/2}$ is bounded from below by $\frac{1}{2}$, so that $\lambda_{\min}(\hat\Sigma_{SS}) \ge \frac{1}{2}\lambda_{\min}(\Sigma_{SS})$. Moreover,

$$X_{iS}^\top\hat\Sigma_{SS}^{-1}X_{iS} \le 2\lambda_{\min}^{-1}(\Sigma_{SS})\|X_{iS}\|_2^2 \le 2s\,\lambda_{\min}^{-1}(\Sigma_{SS})\max_{j\in S}X_{ij}^2.$$
(6.11)

For the last term on the right-hand side of (6.11), let $e_j = (0, \ldots, 0, 1, 0, \ldots, 0)^\top$ be the unit vector in $\mathbb{R}^p$ with 1 in the $j$th position, and note that $X_{ij} = e_j^\top X_i = e_j^\top\Sigma^{1/2}U_i$ with $\|e_j^\top\Sigma^{1/2}\|_2 = 1$, where $U_1, \ldots, U_n$ are i.i.d. $p$-dimensional random vectors with covariance matrix $I_p$. By Condition 3.1, $\|X_{ij}\|_{\psi_2} = \|e_j^\top\Sigma^{1/2}U_i\|_{\psi_2} \le A_0$ and hence, for every $t \ge 0$,

$$P\left(\max_{1\le i\le n}\max_{1\le j\le p}X_{ij}^2 \ge t\right) \le 2\sum_{i=1}^n\sum_{j=1}^p\exp(-C_4^{-1}t) \le 2\exp\{\log(pn) - C_4^{-1}t\},$$

where $C_4 > 0$ is a constant depending only on $A_0$. This, together with (6.11), implies by taking $t = 2C_4\log(pn)$ that, with probability at least $1 - 3n^{-1}$,

$$\max_{1\le i\le n}\max_{S \subseteq [p]: |S| = s}X_{iS}^\top\hat\Sigma_{SS}^{-1}X_{iS} \le 2\lambda_{\min}^{-1}(s)\{1 + 2C_4\,s\log(pn)\}.$$
(6.12)

Now, by (6.7) and (6.12), we take $\tau_0 = 2\lambda_{\min}^{-1}(s)\{1 + 2C_4\,s\log(pn)\}$, so that the event $\mathcal{E}_0(\tau_0)$ occurs with probability greater than $1 - 3n^{-1}$ as long as $n \ge 4C_3^2(s\log\frac{ep}{s} + \log n)$. This, together with Lemma 6.2, yields that with probability at least $1 - 8n^{-1}$,

$$\max_{S \subseteq [p]: |S| = s}\left|\left[2\{\ell_n^S(\hat\theta_S) - \ell_n(0)\}\right]^{1/2} - \|\hat\xi_S\|_2\right| \le C_5\,\phi\,\lambda_{\min}^{-1/2}(s)\{s\log(pn)\}^{3/2}n^{-1/2}$$
(6.13)

whenever $n \ge C_6(1 \vee \phi)\lambda_{\min}^{-1}(s)\{s\log(pn)\}^2$, where $C_5, C_6 > 0$ are constants depending only on $a_0$, $a_1$, $A_0$, $A_1$ and $b''(0)$.

Step 2: Gaussian approximation

For $i = 1, \ldots, n$ and $S \subseteq [p]$, define $Z_i = \{b''(0)\}^{-1/2}\varepsilon_iX_i$ and $Z_{iS} = \{b''(0)\}^{-1/2}\varepsilon_iX_{iS}$, so that $\hat\xi_S = n^{-1/2}\sum_{i=1}^n\hat\Sigma_{SS}^{-1/2}Z_{iS}$. Moreover, define

$$\xi = n^{-1/2}\sum_{i=1}^n Z_i \quad \text{and} \quad \xi_S = n^{-1/2}\sum_{i=1}^n\Sigma_{SS}^{-1/2}Z_{iS}.$$
(6.14)

The following result shows that, for each $s$-subset $S \subseteq [p]$, the $\ell_2$-norm of the $S$-restricted normalized score $\hat\xi_S$ is close to that of $\xi_S$ with overwhelmingly high probability.

Lemma 6.3

Assume that Condition 3.1 holds. Then, for every $s$-subset $S \subseteq [p]$ and every $0 \le t \le \frac{3}{4}(n - 2s)$,

$$P\left[\left|\|\hat\xi_S\|_2 - \|\xi_S\|_2\right| > C_7\{(s + t)\,\phi\,\Delta(s,t)\}^{1/2}n^{-1/2}\right] \le 12.4\,e^{-t},$$
(6.15)

provided that nC8(s + t), where Δ(s, t) is as in (6.6) and C7, C8 > 0 are constants depending only on a0 and A0.

Using the union bound and taking $t = s\log\frac{ep}{s} + \log n$ in Lemma 6.3, we see that with probability at least $1 - 12.4\,n^{-1}$,

$$\max_{S \subseteq [p]: |S| = s}\left|\|\hat\xi_S\|_2 - \|\xi_S\|_2\right| \le C_7\,\phi^{1/2}\left(s\log\frac{ep}{s} + \log n\right)n^{-1/2}$$
(6.16)

whenever $n \ge C_9\left(s\log\frac{ep}{s} + \log n\right)$.

Note that the random vectors $\xi$ and $\xi_S$, $S \subseteq [p]$, defined in (6.14) satisfy $\mathbb{E}(\xi) = 0$, $\mathbb{E}(\xi\xi^\top) = \phi\Sigma$, $\mathbb{E}(\xi_S) = 0$ and $\mathbb{E}(\xi_S\xi_S^\top) = \phi I_s$. The following lemma provides a coupling inequality, showing that the random variable $\max_{S \subseteq [p]: |S| = s}\|\phi^{-1/2}\xi_S\|_2$ can be well approximated, with high probability, by a random variable distributed as the maximum of the $\ell_2$-norms of a sequence of normalized Gaussian random vectors, i.e. $\{\|\Sigma_{SS}^{-1/2}G_S\|_2 : S \subseteq [p], |S| = s\}$.

Lemma 6.4

Assume that Condition 3.1 holds. Then there exists a random variable $T_0 \overset{d}{=} R_0(s,p)$ such that, for any $\delta \in (0, 1]$,

$$\left|\max_{S \subseteq [p]: |S| = s}\|\phi^{-1/2}\xi_S\|_2 - T_0\right| \le C_{10}\left[\delta + \{s\log(\gamma_s pn)\}^{1/2}n^{-1/2} + \{s\log(\gamma_s pn)\}^{2}n^{-3/2}\right]$$
(6.17)

holds with probability greater than $1 - C_{11}\left[\delta^{-3}n^{-1/2}\{s\log(\gamma_s pn)\}^{2} \vee \delta^{-4}n^{-1}\{s\log(\gamma_s pn)\}^{5}\right]$, where $C_{10}, C_{11} > 0$ are constants depending only on $a_0$ and $A_0$.

Step 3: Completion of the proof

We now apply an anti-concentration argument to establish the Berry–Esseen bound for the square root of the excess $2\phi^{-1}\{Q_n(s,p) - Q_n^*\}$. To this end, taking $\delta = \{s\log(\gamma_s pn)\}^{3/8}n^{-1/8}$ in Lemma 6.4 shows that, with probability at least $1 - C_{11}\{s\log(\gamma_s pn)\}^{7/8}n^{-1/8}$,

$$\left|\max_{S \subseteq [p]: |S| = s}\|\phi^{-1/2}\xi_S\|_2 - T_0\right| \le C_{12}\{s\log(\gamma_s pn)\}^{3/8}n^{-1/8}$$
(6.18)

whenever n ≥ {s log(γspn)}3. Further, for R0(s, p) in (3.4), note that

$$R_0^2(s,p) = \max_{S \subseteq [p]: |S| = s}\,\max_{u \in \mathbb{S}^{s-1}}\frac{(u^\top G_S)^2}{u^\top\Sigma_{SS}u} = \max_{u \in \mathcal{F}(s,p)}\frac{(u^\top G)^2}{u^\top\Sigma u},$$

where $G \sim N(0, \Sigma)$ and $\mathcal{F}(s, p) := \{x \mapsto u^\top x : u \in S^{p-1}, \|u\|_0 \le s\}$ is a class of linear functions from $\mathbb{R}^p$ to $\mathbb{R}$. Hence, it follows from Lemma 7.3 in Fan, Shao and Zhou (2015), with slight modification, and Lemma A.1 in the supplement of Chernozhukov, Chetverikov and Kato (2014) that, for every $t > 0$,

$$\sup_{u \ge 0}P(|T_0 - u| \le t) = \sup_{u \ge 0}P\{|R_0(s,p) - u| \le t\} \le C_{13}\left(s\log\frac{\gamma_s ep}{s}\right)^{1/2}t,$$
(6.19)

where C13 > 0 is an absolute constant. Combining (6.19) with the preceding results (6.13), (6.16) and (6.18) proves (3.6).

6.2 Proof of Theorem 3.2

The main strategy of the proof is similar to that of Theorem 3.1, but the technical details are substantially different. As before, we work with the quasi-likelihood $\ell_n(\beta) = -\sum_{i=1}^n|Y_i - X_i^\top\beta|$, $\beta \in \mathbb{R}^p$. To begin with, observe that $\max_{\beta \in \mathbb{R}^p: \|\beta\|_0 \le s}\ell_n(\beta) = \max_{S \subseteq [p]: |S| = s}\max_{\theta \in \mathbb{R}^s}\ell_n^S(\theta)$, where $\ell_n^S(\theta) = -\sum_{i=1}^n|Y_i - X_{iS}^\top\theta|$. In the null model (2.7) with $\beta^* = 0$, we have, for each $s$-subset $S \subseteq [p]$, $\arg\max_\theta\mathbb{E}_X\{\ell_n^S(\theta)\} = 0$ by the first-order condition and concavity, and the $S$-restricted least absolute deviation estimator can be written as

$$\hat\theta_S = \arg\max_{\theta \in \mathbb{R}^s}\ell_n^S(\theta).$$
(6.20)

We first establish in Lemma 6.5 an upper bound for the maximum [ell]2-risks of [theta w/ hat]S.

Lemma 6.5

Assume that (3.8) holds and that $\mathbb{E}|\varepsilon|^\kappa < \infty$ for some $1 < \kappa \le 2$. Then, on the event $\mathcal{E}_0(\tau)$ for $\tau > 0$, the sequence of LAD estimators $\{\hat\theta_S : S \subseteq [p], |S| = s\}$ satisfies

$$\max_{S \subseteq [p]: |S| = s}\|\hat\Sigma_{SS}^{1/2}\hat\theta_S\|_2 \le C_1a_2^{-1}\{s\log(pn)\}^{1/2}n^{-1/2}$$
(6.21)

with conditional probability (over the randomness of $\{\varepsilon_i\}_{i=1}^n$) greater than $1 - c_1n^{-1} - c_2n^{1-\kappa}$, where $C_1, c_1 > 0$ are absolute constants and $c_2 > 0$ is a constant depending only on $a_2$, $\kappa$, $\mathbb{E}|\varepsilon|$ and $\mathbb{E}|\varepsilon|^\kappa$.

Based on Lemma 6.5, we further study the concentration property of the Wilks expansion for the excess $\ell_n^S(\hat\theta_S) - \ell_n^S(0)$. Since the function $\ell_n^S(\cdot)$ is concave, we use $\nabla\ell_n^S(\cdot)$ to denote a subgradient. For $\theta \in \mathbb{R}^s$, let $\zeta_S(\theta) = \ell_n^S(\theta) - \mathbb{E}_X\ell_n^S(\theta)$ be the stochastic component of $\ell_n^S(\theta)$. Then it is easy to see that

$$\nabla\zeta_S(\theta) = -2\sum_{i=1}^n w_{iS}(\theta)X_{iS}, \qquad \nabla\mathbb{E}_X\ell_n^S(\theta) = -\sum_{i=1}^n\{2P_X(Y_i \le X_{iS}^\top\theta) - 1\}X_{iS},$$
(6.22)

where $w_{iS}(\theta) := I(Y_i \le X_{iS}^\top\theta) - P_X(Y_i \le X_{iS}^\top\theta)$. In particular, we have $\nabla\zeta_S(0) = -\sum_{i=1}^n\{2I(\varepsilon_i \le 0) - 1\}X_{iS}$. Recall that $f_\varepsilon$ and $F_\varepsilon$ denote, respectively, the density function and the cumulative distribution function of $\varepsilon$. By the second expression in (6.22), $\nabla\mathbb{E}_X\ell_n^S(\theta) = -\sum_{i=1}^n\{2F_\varepsilon(X_{iS}^\top\theta) - 1\}X_{iS}$ and

$$H_S(\theta) := -\nabla^2\mathbb{E}_X\ell_n^S(\theta) = 2\sum_{i=1}^n f_\varepsilon(X_{iS}^\top\theta)X_{iS}X_{iS}^\top.$$
(6.23)

In line with (6.3), we have $H_S^* = H_S(0) = 2nf_\varepsilon(0)\hat\Sigma_{SS}$, which is the negative Hessian of $\mathbb{E}_X\ell_n^S(\cdot)$ at $0$. As in (6.4), define the normalized score

$$\hat\xi_S = H_S^{*-1/2}\nabla\ell_n^S(0) = -\{2nf_\varepsilon(0)\}^{-1/2}\hat\Sigma_{SS}^{-1/2}\sum_{i=1}^n\{2I(\varepsilon_i \le 0) - 1\}X_{iS}.$$
(6.24)

The following result is a non-asymptotic, conditional version of the Wilks theorem, stating that with high probability the square root of the excess $\max_\theta\ell_n^S(\theta) - \ell_n^S(0)$ and the $\ell_2$-norm of the normalized score $\hat\xi_S$ are sufficiently close, uniformly over all $s$-subsets $S \subseteq [p]$.

Lemma 6.6

Assume that Conditions 3.1 and 3.4 are satisfied. Then

$$\max_{S \subseteq [p]: |S| = s}\left|\left[2\{\ell_n^S(\hat\theta_S) - \ell_n^S(0)\}\right]^{1/2} - \|\hat\xi_S\|_2\right| \le C_2\{f_\varepsilon(0)\}^{-1/2}\left[\lambda_{\min}^{-1/2}(s)\{s\log(pn)\}^{3/2}n^{-1/2} + \lambda_{\min}^{-1/4}(s)\,s\log(pn)\,n^{-1/4}\right]$$
(6.25)

holds with probability greater than $1 - c_2n^{1-\kappa} - c_3n^{-1}$ whenever $n \ge C_3\lambda_{\min}^{-1}(s)\{s\log(pn)\}^2$, where $C_2 > 0$ is a constant depending only on $a_2$, $A_2$ and $A_3$, $c_2$ is as in Lemma 6.5, $c_3 > 0$ is an absolute constant and $C_3 > 0$ is a constant depending only on $a_2$ and $A_2$.

Further, write $\tilde\varepsilon_i = 2I(\varepsilon_i \le 0) - 1$ and $\tilde X_i = \tilde\varepsilon_iX_i$. Note that $\tilde\varepsilon_1, \ldots, \tilde\varepsilon_n$ are i.i.d. Rademacher random variables and thus $\tilde X_1, \ldots, \tilde X_n$ are sub-exponential random vectors. In this notation, we have $\hat\xi_S = -\{2nf_\varepsilon(0)\}^{-1/2}\sum_{i=1}^n\hat\Sigma_{SS}^{-1/2}\tilde X_{iS}$. For each $S \subseteq [p]$, define

$$\xi_S = \{2nf_\varepsilon(0)\}^{-1/2}\sum_{i=1}^n\Sigma_{SS}^{-1/2}\tilde X_{iS}.$$

Then, applying Lemma 6.3 with slight modification and the union bound, we obtain that, with probability at least $1 - c_4n^{-1}$,

$$\max_{S \subseteq [p]: |S| = s}\left|\|\hat\xi_S\|_2 - \|\xi_S\|_2\right| \le C_4\{f_\varepsilon(0)\}^{-1/2}\,s\log(pn)\,n^{-1/2}$$
(6.26)

for all nC5 s log(pn), where c4 > 0 is an absolute constant and C4, C5 > 0 are constants depending only on A0.

Observe that $\mathbb{E}(\tilde X_i) = \mathbb{E}[X_i\{2P(\varepsilon_i \le 0 \mid X_i) - 1\}] = 0$ and $\mathbb{E}(\tilde X_i\tilde X_i^\top) = \mathbb{E}(X_iX_i^\top) = \Sigma$. Hence, it follows from Lemma 6.4 that there exists a random variable $T_0 \overset{d}{=} R_0(s,p)$ such that, for any $\delta \in (0, 1]$,

$$\left|\sqrt{2f_\varepsilon(0)}\,\max_{S \subseteq [p]: |S| = s}\|\xi_S\|_2 - T_0\right| \le C_6\left[\delta + \{s\log(\gamma_s pn)\}^{1/2}n^{-1/2} + \{s\log(\gamma_s pn)\}^{2}n^{-3/2}\right]$$
(6.27)

holds with probability at least $1 - C_7\left[\delta^{-3}n^{-1/2}\{s\log(\gamma_s pn)\}^{2} \vee \delta^{-4}n^{-1}\{s\log(\gamma_s pn)\}^{5}\right]$, where $C_6, C_7 > 0$ are constants depending only on $A_0$.

Finally, combining (6.25), (6.26), (6.27) and (6.19) proves (3.10).

Acknowledgments

The research of Jianqing Fan was supported in part by NSF Grants DMS-1206464, DMS-1406266, and NIH Grant R01-GM072611-10.

The research of Wen-Xin Zhou was supported in part by NIH Grant R01-GM100474-5.

Appendix: Proofs of technical lemmas

7.1 Proof of Lemma 6.1

Define the loss function $\ell(y, z) = yz - b(z)$ for $y, z \in \mathbb{R}$. For each $s$-subset $S \subseteq [p]$ and $\theta \in \mathbb{R}^s$, define $\zeta_S(\theta) = \ell_n^S(\theta) - \mathbb{E}_X\ell_n^S(\theta) = \sum_{i=1}^n\zeta_{iS}(\theta)$, where $\zeta_{iS}(\theta) = \ell(Y_i, X_{iS}^\top\theta) - \mathbb{E}_X\ell(Y_i, X_{iS}^\top\theta)$. Note that $\nabla\zeta_{iS}(\theta)|_{\theta = 0} = \varepsilon_iX_{iS}$ with $\varepsilon_i = Y_i - b'(0)$. Thus, we have $V_0^2 := \operatorname{Var}_X\{\nabla\zeta_S(0)\} = n\phi\,b''(0)\hat\Sigma_{SS}$.

For every $u \in \mathbb{R}^s \setminus \{0\}$ and $\mu \in \mathbb{R}$,

$$\mathbb{E}_X\exp\left\{\mu\frac{u^\top\nabla\zeta_S(0)}{\|V_0u\|_2}\right\} = \prod_{i=1}^n\mathbb{E}_X\exp\left(\mu\frac{u^\top X_{iS}}{\|V_0u\|_2}\varepsilon_i\right) = \prod_{i=1}^n\mathbb{E}_X\exp\left\{\frac{\mu}{\sqrt{n}}\times\frac{u^\top X_{iS}}{(u^\top\hat\Sigma_{SS}u)^{1/2}}\times\frac{\varepsilon_i}{(\operatorname{Var}\varepsilon_i)^{1/2}}\right\} \le \exp\left(\frac{a_0\mu^2}{2}\times\frac{1}{n}\sum_{i=1}^n\frac{(u^\top X_{iS})^2}{u^\top\hat\Sigma_{SS}u}\right) = \exp(a_0\mu^2/2).$$

This verifies condition (ED$_0$) with $\nu_0^2 = a_0$ in Theorem B.3 of the supplement of Spokoiny and Zhilova (2015). Consequently, taking $\mathbb{B}^2 = H_S^{*-1/2}V_0^2H_S^{*-1/2} = \phi I_s$ and $g = \{C\operatorname{tr}(\mathbb{B}^2)\}^{1/2}$ for some $C \ge 2$ there, we have $\lambda_{\max}(\mathbb{B}^2) = \phi$, $\operatorname{tr}(\mathbb{B}^2) = \phi s$, $\operatorname{tr}(\mathbb{B}^4) = \phi^2s$ and $x_c = \frac{1}{2}\left(\frac{3}{2}C - 1 - \log 3\right)s \ge \frac{3}{4}(C - 2)s$. This implies that, almost surely on the event $\{\hat\Sigma_{SS} \succ 0\}$, with conditional probability at least $1 - 2e^{-t} - 8.4e^{-x_c}$,

$$\|\hat\xi_S\|_2^2 \le 2a_0\phi \times \begin{cases} s + (8ts)^{1/2}, & \text{if } 0 \le t \le \frac{1}{18}(2s)^{1/2},\\ s + 6t, & \text{if } \frac{1}{18}(2s)^{1/2} < t \le x_c. \end{cases}$$

Finally, letting C → ∞ proves (6.5).

7.2 Proof of Lemma 6.2

We prove this lemma by applying the conditional version of Theorem 2.3 in Spokoiny (2013). To this end, we need to verify conditions (ED$_0$), (ED$_2$), (L$_0$), ($\mathcal{I}$) and (Lr). In line with the notation used there, we fix $S \subseteq [p]$ and write

$$D^2(\theta) = -\nabla^2\mathbb{E}_X\{\ell_n^S(\theta)\} = \sum_{i=1}^n b''(X_{iS}^\top\theta)X_{iS}X_{iS}^\top, \qquad D_0^2 = D^2(0) = nb''(0)\hat\Sigma_{SS}.$$

The validity of (ED$_0$) is guaranteed by the proof of Lemma 6.1, and (ED$_2$) is automatically satisfied with $\omega \equiv 0$ since $\nabla^2\zeta_S(\theta)$ vanishes for all $\theta \in \mathbb{R}^s$. Turning to (L$_0$), observe that

$$\left\|D_0^{-1}D^2(\theta)D_0^{-1} - I_s\right\| = \left\|D_0^{-1}\sum_{i=1}^n\{b''(X_{iS}^\top\theta) - b''(0)\}X_{iS}X_{iS}^\top D_0^{-1}\right\| = \left\|D_0^{-1}\sum_{i=1}^n b'''(\eta_i)\,X_{iS}^\top\theta\,X_{iS}X_{iS}^\top D_0^{-1}\right\|,$$
(7.1)

where $\eta_i$ lies between 0 and $X_{iS}^\top\theta$. For $r > 0$, define $\Theta_0(r) = \{\theta \in \mathbb{R}^s : \|D_0\theta\|_2 \le r\}$. On the event $\mathcal{E}_0(\tau)$ for some $\tau > 0$ and for $\theta \in \Theta_0(r)$,

$$|X_{iS}^\top\theta| = |\theta^\top D_0D_0^{-1}X_{iS}| \le \|D_0\theta\|_2\|D_0^{-1}X_{iS}\|_2 \le \{nb''(0)\}^{-1/2}\tau^{1/2}r.$$
(7.2)

This together with (7.1) implies that

$$\left\|D_0^{-1}D^2(\theta)D_0^{-1} - I_s\right\| \le \max_{|t| \le \{nb''(0)\}^{-1/2}\tau^{1/2}r}|b'''(t)|\;\{b''(0)\}^{-3/2}\tau^{1/2}r\,n^{-1/2} =: \delta(\tau, r).$$
(7.3)

Recalling that $V_0^2 = \operatorname{Var}_X\{\nabla\zeta_S(0)\} = \phi D_0^2$, condition ($\mathcal{I}$) is satisfied with $a = \phi^{1/2}$.

To verify (Lr), define $g(t) = b'(0)t - b(t)$, so that $g'(t) = b'(0) - b'(t)$ and $g''(t) = -b''(t)$. Then, for any $\theta \in \mathbb{R}^s$ satisfying $\|D_0\theta\|_2 = r > 0$, it follows from the second-order Taylor expansion that

$$-2\{\mathbb{E}_X\ell_n^S(\theta) - \mathbb{E}_X\ell_n^S(0)\} = -2\sum_{i=1}^n\{g(X_{iS}^\top\theta) - g(0)\} = -2\sum_{i=1}^n\left\{g'(0)X_{iS}^\top\theta + \tfrac{1}{2}g''(\eta_i)(X_{iS}^\top\theta)^2\right\} = \sum_{i=1}^n b''(\eta_i)(X_{iS}^\top\theta)^2,$$
(7.4)

where $\eta_i$ is a point lying between 0 and $X_{iS}^\top\theta$. On the event $\mathcal{E}_0(\tau)$, the right-hand side of (7.4) is further bounded from below by

$$r^2\{b''(0)\}^{-1}\min_{|t| \le \{nb''(0)\}^{-1/2}\tau^{1/2}r}b''(t).$$

When $\|D_0\theta\|_2 = r \le \{nb''(0)/\tau\}^{1/2}$, $-2\{\mathbb{E}_X\ell_n^S(\theta) - \mathbb{E}_X\ell_n^S(0)\}$ is bounded from below by $a_1r^2$ for $a_1$ as in (3.5). Further, from the convexity of the function $\theta \mapsto -\mathbb{E}_X\{\ell_n^S(\theta) - \ell_n^S(0)\}$, we see that $-\mathbb{E}_X\{\ell_n^S(\theta) - \ell_n^S(0)\} \ge a_1r\{nb''(0)/\tau\}^{1/2}$ for all $\theta$ satisfying $\|D_0\theta\|_2 = r \ge \{nb''(0)/\tau\}^{1/2}$. Define the function $r \mapsto \mathbf{b}(r)$ as

$$\mathbf{b}(r) = \begin{cases} a_1, & \text{if } 0 \le r \le \{nb''(0)/\tau\}^{1/2},\\ a_1r^{-1}\{nb''(0)/\tau\}^{1/2}, & \text{if } r > \{nb''(0)/\tau\}^{1/2}. \end{cases}$$
(7.5)

By definition, $r\,\mathbf{b}(r)$ is non-decreasing in $r \ge 0$ and, for $\theta \in \mathbb{R}^s$ satisfying $\|D_0\theta\|_2 = r$,

$$-2\,\mathbb{E}_X\{\ell_n^S(\theta) - \ell_n^S(0)\} \ge \|D_0\theta\|_2^2\,\mathbf{b}(r).$$
(7.6)

With the above preparations, we apply Theorem 2.3 in Spokoiny (2013) with a slight modification of the constants. In view of (6.6) and (7.5), set

$$r_0 = 2(\phi a_0)^{1/2}a_1^{-1}\left[s + 6\left(s\log\frac{ep}{s} + \log n\right)\right]^{1/2},$$
(7.7)

such that Condition 2.3 there is satisfied on $\mathcal{E}_0(\tau)$ whenever $n \ge \{b''(0)\}^{-1}r_0^2\,\tau$. Hence, it follows from Theorem 2.3 in Spokoiny (2013) and the union bound that, conditional on the event $\mathcal{E}_0(\tau)$,

$$P_X\left(\max_{S \subseteq [p]: |S| = s}\left|\left[2\{\ell_n^S(\hat\theta_S) - \ell_n^S(0)\}\right]^{1/2} - \|\hat\xi_S\|_2\right| \ge 5\,\delta(\tau, r_0)\,r_0\right) \le 5n^{-1},$$
(7.8)

where $\delta(\tau, r)$ and $r_0$ are as in (7.3) and (7.7), respectively. This proves (6.8) with properly chosen $C_1$ and $C_2$.

7.3 Proof of Lemma 6.3

To begin with, note that for each s-subset S [subset, dbl equals] [p], Z1S, …, ZnS are i.i.d. s-dimensional random vectors with mean zero and covariance matrix ϕΣSS. By (6.4) and (6.14),

$$\|\hat\xi_S\|_2^2 - \|\xi_S\|_2^2 = \xi_S^\top\left(\Sigma_{SS}^{1/2}\hat\Sigma_{SS}^{-1}\Sigma_{SS}^{1/2} - I_s\right)\xi_S.$$

Write $\mathbb{X}_S = (X_{1S}, \ldots, X_{nS})^\top \in \mathbb{R}^{n\times s}$; then $\mathbb{X}_S\Sigma_{SS}^{-1/2}$ is an $n \times s$ matrix whose rows are independent sub-Gaussian random vectors in $\mathbb{R}^s$. Further, observe that $X_{iS} = P_SX_i$ and $\Sigma_{SS} = P_S\Sigma P_S^\top$, where $P_S \in \mathbb{R}^{s\times p}$ is a coordinate-selection matrix. Under Condition 3.1, $\|u^\top\Sigma_{SS}^{-1/2}X_{iS}\|_{\psi_2} = \|u^\top\Sigma_{SS}^{-1/2}P_S\Sigma^{1/2}U\|_{\psi_2} \le A_0\|\Sigma^{1/2}P_S^\top\Sigma_{SS}^{-1/2}u\|_2 = A_0$ for $u \in S^{s-1}$. Then it follows from (6.9) that, for all sufficiently large $n$ so that $\delta \le \frac{1}{2}$, $\|\Sigma_{SS}^{1/2}\hat\Sigma_{SS}^{-1}\Sigma_{SS}^{1/2} - I_s\| \le 2\delta$ and hence,

$$\left|\|\hat\xi_S\|_2 - \|\xi_S\|_2\right| = \frac{\left|\|\hat\xi_S\|_2^2 - \|\xi_S\|_2^2\right|}{\|\hat\xi_S\|_2 + \|\xi_S\|_2} \le \|\xi_S\|_2^{-1}\left|\|\hat\xi_S\|_2^2 - \|\xi_S\|_2^2\right| \le 2C_3(s \vee t)^{1/2}n^{-1/2}\,\|\xi_S\|_2.$$
(7.9)

Next we bound the quadratic term $\|\xi_S\|_2$ from above. First we show that $\Sigma_{SS}^{-1/2}Z_{iS} = \phi^{1/2}\Sigma_{SS}^{-1/2}\tilde\varepsilon_iX_{iS}$ are sub-exponential random vectors, where $\tilde\varepsilon_i := \varepsilon_i/(\operatorname{Var}\varepsilon_i)^{1/2}$. In fact, for every $u \in S^{s-1}$, $\|u^\top\Sigma_{SS}^{-1/2}Z_{iS}\|_{\psi_1} \le 2\|\tilde\varepsilon_i\|_{\psi_2}\|u^\top\Sigma_{SS}^{-1/2}X_{iS}\|_{\psi_2} \le 2A_0A_0'$, where $A_0' > 0$ is a constant depending only on $a_0$ in Condition 3.1. Following the proof of Lemma 5.15 in Vershynin (2012), we deduce that for every $u \in \mathbb{R}^s$ satisfying $\|u\|_2 \le \phi^{-1/2}(4eA_0A_0')^{-1}\sqrt{n}$,

$$\log\mathbb{E}\exp(u^\top\xi_S) = \sum_{i=1}^n\log\mathbb{E}\exp(n^{-1/2}u^\top\Sigma_{SS}^{-1/2}Z_{iS}) \le 2e^2\,\frac{\|u\|_2^2}{2}\,n^{-1}\sum_{i=1}^n\|(u/\|u\|_2)^\top\Sigma_{SS}^{-1/2}Z_{iS}\|_{\psi_1}^2 \le \frac{(4eA_0A_0')^2\phi\|u\|_2^2}{2}.$$

Consequently, applying Corollary 1.12 in the supplement of Spokoiny (2012) with $g = \sqrt{n}$, $\mathbb{B} = I_s$ and $x_c = \frac{3}{4}n - \frac{1}{2}(1 + \log 3)s \ge \frac{3}{4}n - \frac{3}{2}s$ to the random vector $(4eA_0A_0')^{-1}\phi^{-1/2}\xi_S$ yields that, for every $0 \le t \le x_c$,

$$P\left[\|\xi_S\|_2 \ge 4eA_0A_0'\{\phi\Delta(s,t)\}^{1/2}\right] \le 2e^{-t} + 8.4\,e^{-x_c}.$$
(7.10)

Finally, combining (7.9) and (7.10) completes the proof of (6.15).

7.4 Proof of Lemma 6.4

First, observe that

$$\max_{S \subseteq [p]: |S| = s}\|\xi_S\|_2 = \max_{u \in \mathcal{F}(s,p)}\frac{n^{-1/2}\sum_{i=1}^n u^\top Z_i}{(u^\top\Sigma u)^{1/2}},$$

where $\mathcal{F}(s, p) = \{x \mapsto u^\top x : u \in S^{p-1}, \|u\|_0 \le s\}$. Recall that $Z_1, \ldots, Z_n$ are i.i.d. $p$-dimensional centered random vectors with covariance matrix $\mathbb{E}(Z_iZ_i^\top) = \phi\Sigma$. As in the proof of Lemma 6.3, we have, for any $u \in S^{p-1}$,

$$\phi^{-1/2}\|u^\top Z_i\|_{\psi_1} \le 2\|\varepsilon_i/(\operatorname{Var}\varepsilon_i)^{1/2}\|_{\psi_2}\,\|u^\top\Sigma^{1/2}U_i\|_{\psi_2} \le 2A_0A_0'(u^\top\Sigma u)^{1/2}.$$

Consequently, it follows from Lemma 7.5 in Fan, Shao and Zhou (2015) that there exists a random variable $T_0 \overset{d}{=} R_0(s,p) = \max_{u \in \mathcal{F}(s,p)}\frac{u^\top G}{(u^\top\Sigma u)^{1/2}}$ for $G \sim N(0, \Sigma)$ such that, for any $\delta \in (0, 1]$,

$$P\left\{\left|\max_{S \subseteq [p]: |S| = s}\|\phi^{-1/2}\xi_S\|_2 - T_0\right| \ge C_1A_0A_0'\left(\delta + \frac{\gamma_{s,p,n}^{1/2}}{\sqrt{n}} + \frac{\gamma_{s,p,n}^{2}}{n^{3/2}}\right)\right\} \le C_2\left[\frac{\{s\log(\gamma_s pn)\}^2}{\delta^3\sqrt{n}} + \frac{\{s\log(\gamma_s pn)\}^5}{\delta^4 n}\right],$$

where $\gamma_{s,p,n} = s\log\frac{\gamma_s ep}{s} + \log n$ and $C_1, C_2 > 0$ are absolute constants. This proves (6.17).

7.5 Proof of Lemma 6.5

The proof employs techniques from empirical process theory which modify the arguments used in Wang (2013). To begin with, note that

$$\hat\theta_S = \arg\min_{\theta \in \mathbb{R}^s}f(\theta) := \arg\min_{\theta \in \mathbb{R}^s}\|Y - \mathbb{X}_S\theta\|_1.$$

Under the null model, $Y = \mathbb{X}\beta^* + \varepsilon = \mathbb{X}_S\theta^* + \varepsilon$ with $\theta^* = 0$. Then the sub-differential of $f(\theta)$ at $\theta = 0$ can be written as $\nabla f(0) = -\mathbb{X}_S^\top\operatorname{sgn}(\varepsilon)$, where $\operatorname{sgn}(\varepsilon) = (\operatorname{sgn}(\varepsilon_1), \ldots, \operatorname{sgn}(\varepsilon_n))^\top$ with $\operatorname{sgn}(u) := I(u > 0) - I(u < 0)$. Define $z = (z_1, \ldots, z_n)^\top = \operatorname{sgn}(\varepsilon)$, and note that $z_1, \ldots, z_n$ are i.i.d. random variables satisfying $P(z_i = 1) = P(z_i = -1) = 1/2$.

Since $\hat\theta_S$ minimizes $\|Y - \mathbb{X}_S\theta\|_1$ over $\mathbb{R}^s$, we have the following basic inequality

$$\|Y - \mathbb{X}_S\hat\theta_S\|_1 = \|\mathbb{X}_S\hat\theta_S - \varepsilon\|_1 \le \|\varepsilon\|_1.$$
(7.11)

Further, define a random process {Q(θ)} indexed by θ [set membership] Rs:

Q(θ)=n1/2i=1n(|XiSθεi||εi|).
(7.12)

In what follows, we prove that with overwhelmingly high probability, Q(θ) is concentrated around its expectation QX(θ) := EX{Q(θ)} uniformly over θ [set membership] Rs via a straightforward adaptation of the peeling argument.

For δ_1 > 0 and ℓ = 1, 2, …, consider the following sequence of events

𝒢(δ_1) = {θ ∈ ℝ^s : ‖Σ̂_{SS}^{1/2}θ‖_2 ≥ δ_1},   𝒢_ℓ(δ_1) = {θ ∈ ℝ^s : α^{ℓ−1}δ_1 ≤ ‖Σ̂_{SS}^{1/2}θ‖_2 ≤ α^ℓδ_1},
(7.13)

where α = √2. Here, δ_1 can be regarded as a tolerance parameter, and it is easy to see that 𝒢(δ_1) = ∪_{ℓ=1}^∞ 𝒢_ℓ(δ_1). For R > 0, set 𝒱(R) = {θ ∈ 𝒢(δ_1) : ‖Σ̂_{SS}^{1/2}θ‖_2 ≤ R} and let Δ(R) be the maximum deviation over the elliptic vicinity 𝒱(R):

Δ(R) = max_{θ∈𝒱(R)} |Q(θ) − Q_𝕏(θ)|.
(7.14)

For every θ ∈ ℝ^s, define the rescaled vector θ̃ = Σ̂_{SS}^{1/2}θ such that

Δ(R) = max_{δ_1≤‖θ̃‖_2≤R} |Q(Σ̂_{SS}^{−1/2}θ̃) − Q_𝕏(Σ̂_{SS}^{−1/2}θ̃)|.

For every 0 < ε ≤ R, there exists an ε-net 𝒩_ε of the Euclidean ball 𝔹_2^s(R) with cardinality bounded by (1 + 2R/ε)^s. For θ̃_1, θ̃_2 ∈ 𝔹_2^s(R) satisfying ‖θ̃_1 − θ̃_2‖_2 ≤ ε, observe that

|Q(Σ̂_{SS}^{−1/2}θ̃_1) − Q(Σ̂_{SS}^{−1/2}θ̃_2)| ≤ n^{−1/2} ∑_{i=1}^n |X_{iS}^⊤Σ̂_{SS}^{−1/2}(θ̃_1 − θ̃_2)| ≤ ‖𝕏_SΣ̂_{SS}^{−1/2}(θ̃_1 − θ̃_2)‖_2 ≤ εn^{1/2}.

Then, it is easy to see that

Δ(R) ≤ max_{θ̃∈𝒩_ε} |Q(Σ̂_{SS}^{−1/2}θ̃) − Q_𝕏(Σ̂_{SS}^{−1/2}θ̃)| + 2εn^{1/2}.
(7.15)

For each fixed θ̃ ∈ 𝔹_2^s(R), Q(Σ̂_{SS}^{−1/2}θ̃) − Q_𝕏(Σ̂_{SS}^{−1/2}θ̃) is a sum of independent random variables with zero means and, for i = 1, …, n, ||X_{iS}^⊤Σ̂_{SS}^{−1/2}θ̃ − ε_i| − |ε_i|| ≤ |X_{iS}^⊤Σ̂_{SS}^{−1/2}θ̃|. Therefore, it follows from Hoeffding's inequality that for every t > 0,

ℙ_X{ |Q(Σ̂_{SS}^{−1/2}θ̃) − Q_𝕏(Σ̂_{SS}^{−1/2}θ̃)| ≥ t } ≤ 2 exp{ −nt^2 / (2∑_{i=1}^n (X_{iS}^⊤Σ̂_{SS}^{−1/2}θ̃)^2) } = 2 exp( −t^2/(2‖θ̃‖_2^2) ).

In other words, for every θ̃ ∈ 𝔹_2^s(R) and δ > 0,

|Q(Σ̂_{SS}^{−1/2}θ̃) − Q_𝕏(Σ̂_{SS}^{−1/2}θ̃)| ≤ (2δ)^{1/2}‖θ̃‖_2 ≤ (2δ)^{1/2}R

holds with probability at least 1 − 2e^{−δ}. This, together with the union bound, yields

ℙ_X{ max_{θ̃∈𝒩_ε} |Q(Σ̂_{SS}^{−1/2}θ̃) − Q_𝕏(Σ̂_{SS}^{−1/2}θ̃)| ≥ (2δ)^{1/2}R } ≤ 2 exp{ s log(1 + 2R/ε) − δ }.
(7.16)

In particular, by taking ε = Rn^{−1} in (7.15) and δ = s log(1 + 2R/ε) + t ≤ 2s log n + t in (7.16), we conclude that

ℙ_X{ Δ(R) ≥ R(2t)^{1/2} + 2R(s log n)^{1/2} + 2Rn^{−1/2} } ≤ 2e^{−t}
(7.17)

holds almost surely on the event ℰ_0(τ) for any τ > 0.
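A quick numerical sanity check of the conditional Hoeffding bound used above can be run as follows. This is a minimal sketch of ours, with arbitrary sizes and standard normal errors, and with the conditional mean Q_𝕏 approximated by a Monte Carlo average; it compares the simulated tail of Q(θ) − Q_𝕏(θ) at one fixed θ with the envelope 2 exp{−t^2/(2‖θ̃‖_2^2)}.

import numpy as np

rng = np.random.default_rng(2)
n, s = 500, 3
X = rng.standard_normal((n, s))                     # the design is held fixed (we condition on X)
Sigma_hat = X.T @ X / n
theta = np.array([0.3, -0.2, 0.1])
tilde_norm = np.sqrt(theta @ Sigma_hat @ theta)     # ||Sigma_hat^{1/2} theta||_2

def Q(eps):
    """Q(theta) = n^{-1/2} sum_i (|x_i'theta - eps_i| - |eps_i|)."""
    return np.sum(np.abs(X @ theta - eps) - np.abs(eps)) / np.sqrt(n)

QX = np.mean([Q(rng.standard_normal(n)) for _ in range(20000)])    # conditional mean, by simulation
dev = np.array([Q(rng.standard_normal(n)) - QX for _ in range(20000)])
for t in (0.5, 1.0, 1.5):
    print(f"t = {t}: empirical tail {np.mean(np.abs(dev) >= t):.4f}"
          f"  <=  Hoeffding envelope {2 * np.exp(-t**2 / (2 * tilde_norm**2)):.4f}")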

In particular, by taking t = cnR^2 in (7.17) for some c > 0 to be specified below (7.22) and the union bound, we have

ℙ_X[ ∃ θ ∈ 𝒢(δ_1) such that |Q(θ) − Q_𝕏(θ)| ≥ 2^{3/2}‖θ̃‖_2{‖θ̃‖_2(cn)^{1/2} + (s log n)^{1/2} + n^{−1/2}} ]
  ≤ ∑_{ℓ=1}^∞ ℙ_X[ ∃ θ ∈ 𝒢_ℓ(δ_1) such that |Q(θ) − Q_𝕏(θ)| ≥ (α^ℓδ_1)^2(2cn)^{1/2} + 2α^ℓδ_1{(s log n)^{1/2} + n^{−1/2}} ]
  ≤ ∑_{ℓ=1}^∞ ℙ_X[ Δ(α^ℓδ_1) ≥ (α^ℓδ_1)^2(2cn)^{1/2} + 2α^ℓδ_1{(s log n)^{1/2} + n^{−1/2}} ]
  ≤ ∑_{ℓ=1}^∞ 2 exp{−cn(α^ℓδ_1)^2} ≤ 2 ∑_{ℓ=1}^∞ exp{−2ℓ c log(α) nδ_1^2} ≤ 2 exp(−c_0nδ_1^2) / {1 − exp(−c_0nδ_1^2)},

where c_0 = c log 2. This implies that with probability at least 1 − 4 exp(−c_0nδ_1^2),

|Q(θ) − Q_𝕏(θ)| ≤ 2^{3/2}(cn)^{1/2}‖Σ̂_{SS}^{1/2}θ‖_2^2 + 2^{3/2}‖Σ̂_{SS}^{1/2}θ‖_2{(s log n)^{1/2} + n^{−1/2}}
(7.18)

holds for all θ ∈ 𝒢(δ_1) whenever n ≥ c^{−1}δ_1^{−2}.

For the (conditional) expectation

Q_𝕏(θ) = n^{−1/2} ∑_{i=1}^n 𝔼_X(|X_{iS}^⊤θ − ε_i| − |ε_i|) = n^{−1/2}(𝔼_X‖𝕏_Sθ − ε‖_1 − 𝔼‖ε‖_1),

applying Lemmas 5 and 6 in Wang (2013) with slight modifications gives

Q_𝕏(θ) ≥ (4√n)^{−1}‖𝕏_Sθ‖_1 = (√n/4) n^{−1}‖𝕏_Sθ‖_1   if ‖𝕏_Sθ‖_1 ≥ 2n/a_2,   and
Q_𝕏(θ) ≥ {a_2/(8√n)}‖𝕏_Sθ‖_2^2 = (a_2√n/8)‖Σ̂_{SS}^{1/2}θ‖_2^2   if ‖𝕏_Sθ‖_1 < 2n/a_2,
(7.19)

where a_2 is as in Condition 3.4. For the sequence of LAD estimators {θ̂_S : S ⊆ [p], |S| = s}, from (7.11) it can be seen that ‖𝕏_Sθ̂_S‖_1 ≤ ‖𝕏_Sθ̂_S − ε‖_1 + ‖ε‖_1 ≤ 2‖ε‖_1, and hence

max_{S⊆[p]:|S|=s} n^{−1}‖𝕏_Sθ̂_S‖_1 ≤ 2{𝔼|ε| + n^{−1}∑_{i=1}^n(|ε_i| − 𝔼|ε_i|)}.

For every t > 0 and 1 < κ ≤ 2, by Markov's inequality we have

ℙ{ ∑_{i=1}^n(|ε_i| − 𝔼|ε_i|) ≥ t } ≤ t^{−κ} 𝔼|∑_{i=1}^n(|ε_i| − 𝔼|ε_i|)|^κ ≤ 4^{2−κ} t^{−κ} n 𝔼|ε|^κ,

where we used the inequality |1 + x|^κ ≤ 1 + κx + 2^{2−κ}|x|^κ for 1 < κ ≤ 2 and x ∈ ℝ. The last two displays together imply that, with probability at least 1 − δ_2,

max_{S⊆[p]:|S|=s} n^{−1}‖𝕏_Sθ̂_S‖_1 ≤ 2𝔼|ε|{1 + 4^{(2−κ)/κ}(𝔼|ε|)^{−1}(𝔼|ε|^κ)^{1/κ} δ_2^{−1/κ} n^{−1+1/κ}}.

By Condition 3.4, we have a_2𝔼|ε| < 1. Therefore, as long as the sample size n satisfies

n ≥ {4^{2−κ} a_2^κ 𝔼|ε|^κ (1 − a_2𝔼|ε|)^{−κ}}^{1/(κ−1)} δ_2^{−1/(κ−1)},
(7.20)

the event

ℰ_1 := { max_{S⊆[p]:|S|=s} n^{−1}‖𝕏_Sθ̂_S‖_1 ≤ 2a_2^{−1} }
(7.21)

occurs with probability at least 1 − δ_2.
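The Markov-inequality step above only requires κ ∈ (1, 2] absolute moments of the errors. The small simulation below is an illustration of ours (the t_{1.5} error law, κ = 1.2, the threshold t and all sample sizes are arbitrary, and the two moments are themselves approximated by Monte Carlo); it shows that the resulting tail bound, while crude, remains valid for quite heavy-tailed errors.

import numpy as np

rng = np.random.default_rng(3)
n, kappa = 200, 1.2                                   # only E|eps|^kappa with kappa in (1, 2] is assumed finite
draw = lambda size: rng.standard_t(df=1.5, size=size) # heavy-tailed errors, for illustration

aux = np.abs(draw(2_000_000))                         # approximate E|eps| and E|eps|^kappa
m1, mk = aux.mean(), np.mean(aux ** kappa)

t = 2000.0
sums = np.array([np.sum(np.abs(draw(n)) - m1) for _ in range(20000)])
empirical = np.mean(sums >= t)
bound = 4.0 ** (2 - kappa) * t ** (-kappa) * n * mk   # the Markov/moment bound from the display above
print(f"empirical tail {empirical:.4f}  <=  bound {bound:.4f}")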

Now, by (7.11), we have Q(θ̂_S) ≤ 0 and thus −{Q(θ̂_S) − Q_𝕏(θ̂_S)} ≥ Q_𝕏(θ̂_S) holds for every s-subset S ⊆ [p]. Together with (7.18)–(7.21) and the union bound, this implies that on the event ℰ_0(τ) ∩ ℰ_1 for any τ > 0,

max_{S⊆[p]:|S|=s} ‖Σ̂_{SS}^{1/2}θ̂_S‖_2 ≤ max[ δ_1, 32√2 a_2^{−1}{(s log n/n)^{1/2} + 1/n} ]
(7.22)

holds with (conditional) probability at least 1 − 4\binom{p}{s} exp(−c_0nδ_1^2) − δ_2, provided that the sample size n satisfies n ≥ 2·32^2(a_2δ_1)^{−2} and (7.20).

Finally, taking

δ_1 = 32 a_2^{−1}{2/log(2)}^{1/2} {(s log(ep/s) + log n)/n}^{1/2}   and   δ_2 = 4^{2−κ} a_2^κ 𝔼|ε|^κ (1 − a_2𝔼|ε|)^{−κ} n^{−(κ−1)}

in (7.22) proves (6.21).
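For small (s, p) the statement just proved can be checked directly. The sketch below is an illustration of ours (all sizes and the seed are arbitrary, the constant in front of the rate is ignored, and the LAD fits are computed through the standard linear-programming reformulation); it enumerates every size-s model under the null and reports the largest fitted signal max_S ‖Σ̂_{SS}^{1/2}θ̂_S‖_2 next to the rate {(s log(ep/s) + log n)/n}^{1/2}.

import itertools
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(4)
n, p, s = 120, 10, 2                       # small enough to enumerate all subsets
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)                 # null model: y is independent of X

def lad_fit(Xs, y):
    """LAD estimate via the LP: minimize sum(t) subject to -t <= y - Xs @ theta <= t."""
    n_, s_ = Xs.shape
    c = np.concatenate([np.zeros(s_), np.ones(n_)])
    A_ub = np.block([[ Xs, -np.eye(n_)],
                     [-Xs, -np.eye(n_)]])
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * s_ + [(0, None)] * n_
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:s_]

def scaled_norm(Xs, theta):
    """|| Sigma_hat_SS^{1/2} theta ||_2 with Sigma_hat_SS = Xs' Xs / n."""
    return np.sqrt(theta @ (Xs.T @ Xs / len(Xs)) @ theta)

vals = [scaled_norm(X[:, list(S)], lad_fit(X[:, list(S)], y))
        for S in itertools.combinations(range(p), s)]
rate = np.sqrt((s * np.log(np.e * p / s) + np.log(n)) / n)
print("max_S ||Sigma_hat^{1/2} theta_hat_S||_2 =", round(max(vals), 3),
      "   rate {(s log(ep/s) + log n)/n}^{1/2} =", round(rate, 3))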

7.6 Proof of Lemma 6.6

We prove this lemma by employing arguments similar to those used in Spokoiny (2013), where the likelihood function L(θ) is assumed to be twice differentiable with respect to θ. It is worth noticing that neither Condition (L) nor Condition (ED_2) in Spokoiny (2013) is satisfied in the current situation. We provide here a self-contained proof in which Lemma 6.5 also plays an important role.

Step 1: Local linear approximation of ∇ℓ_{nS}(θ)

Let χ_{1S}(θ) be the normalized residual of the local linear approximation of ∇ℓ_{nS}(θ), given by

χ_{1S}(θ) = D_0^{−1}{∇ℓ_{nS}(θ) − ∇ℓ_{nS}(0) + D_0^2θ} = D_0^{−1}{U(θ) + ∇𝔼_Xℓ_{nS}(θ) − ∇𝔼_Xℓ_{nS}(0) + D_0^2θ},
(7.23)

where U(θ) = ∇ζ_S(θ) − ∇ζ_S(0) and D_0^2 = −∇^2𝔼_X{ℓ_{nS}(0)} = 2f_ε(0) ∑_{i=1}^n X_{iS}X_{iS}^⊤. Then it follows from the mean value theorem that

𝔼_X{χ_{1S}(θ)} = {I_s − D_0^{−1}D^2(θ̃)D_0^{−1}}D_0θ,
(7.24)

where D^2(θ) = −∇^2𝔼_X{ℓ_{nS}(θ)} = 2 ∑_{i=1}^n f_ε(X_{iS}^⊤θ)X_{iS}X_{iS}^⊤ and θ̃ = λθ for some 0 ≤ λ ≤ 1. As before, for every r ≥ 0, define the local elliptic neighborhood of 0 as

Θ_0(r) = {θ ∈ ℝ^s : ‖D_0θ‖_2 ≤ r}.

On the event ℰ_0(τ) for some τ > 0,

|X_{iS}^⊤θ| ≤ ‖D_0θ‖_2 ‖D_0^{−1}X_{iS}‖_2 ≤ {2nf_ε(0)}^{−1/2} τ^{1/2} r
(7.25)

for all θ ∈ Θ_0(r). Thus it follows from the Taylor expansion that, for r ≤ {2nf_ε(0)/τ}^{1/2},

‖I_s − D_0^{−1}D^2(θ̃)D_0^{−1}‖ = ‖2D_0^{−1} ∑_{i=1}^n {f_ε(X_{iS}^⊤θ̃) − f_ε(0)}X_{iS}X_{iS}^⊤ D_0^{−1}‖ ≤ A_3 2^{−1/2} f_ε^{−3/2}(0) τ^{1/2} r n^{−1/2} =: δ(τ, r).
(7.26)

Together, (7.24) and (7.26) imply that, under the same constraint as in (7.26),

‖𝔼_X{χ_{1S}(θ)}‖_2 ≤ δ(τ, r) r.
(7.27)

Turning to the stochastic component D_0^{−1}U(θ) = χ_{1S}(θ) − 𝔼_X{χ_{1S}(θ)}, we aim to bound max_{θ∈Θ_0(r)} ‖D_0^{−1}U(θ)‖_2, which can be written as

max_{θ∈Θ_0(r), ‖u‖_2≤1} u^⊤D_0^{−1}U(θ) = r^{−1} max_{v,θ∈Θ_0(r)} v^⊤U(θ).
(7.28)

Note that {v^⊤U(θ) : v, θ ∈ ℝ^s} is a bivariate process indexed by (v^⊤, θ^⊤)^⊤ ∈ ℝ^{2s}. Define

θ̄ = (v^⊤, θ^⊤)^⊤ ∈ ℝ^{2s},   D̄_0 = diag(D_0, D_0) ∈ ℝ^{(2s)×(2s)},

Ū(θ̄) = v^⊤U(θ),   Θ̄_0(r) = {θ̄ ∈ ℝ^{2s} : ‖D̄_0θ̄‖_2 ≤ r}.

In this notation, from (7.28) and the identity ‖D̄_0θ̄‖_2^2 = ‖D_0v‖_2^2 + ‖D_0θ‖_2^2, it is easy to see that

max_{θ∈Θ_0(r)} ‖D_0^{−1}U(θ)‖_2 ≤ r^{−1} max_{θ̄∈Θ̄_0(√2 r)} Ū(θ̄).
(7.29)

Recall that ∇ζ_S(θ) − ∇ζ_S(0) = −2 ∑_{i=1}^n {I(Y_i ≤ X_{iS}^⊤θ) − I(Y_i ≤ 0) + 1/2 − F_ε(X_{iS}^⊤θ)}X_{iS}, where for i = 1, …, n, I(Y_i ≤ X_{iS}^⊤θ) − I(Y_i ≤ 0) + 1/2 − F_ε(X_{iS}^⊤θ) is equal to

I(0 < Y_i ≤ X_{iS}^⊤θ) − ℙ_X(0 < Y_i ≤ X_{iS}^⊤θ)   if X_{iS}^⊤θ ≥ 0,   and
−I(X_{iS}^⊤θ < Y_i ≤ 0) + ℙ_X(X_{iS}^⊤θ < Y_i ≤ 0)   if X_{iS}^⊤θ < 0.

For θ ∈ ℝ^s, define random variables ε_{i,θ} = I(0 < Y_i ≤ X_{iS}^⊤θ) − I(X_{iS}^⊤θ < Y_i ≤ 0) satisfying

  1. conditional on X_{iS}^⊤θ ≥ 0, ε_{i,θ} = 1 with probability P_{i,θ} − 1/2 and ε_{i,θ} = 0 with probability 3/2 − P_{i,θ};
  2. conditional on X_{iS}^⊤θ < 0, ε_{i,θ} = −1 with probability 1/2 − P_{i,θ} and ε_{i,θ} = 0 with probability 1/2 + P_{i,θ},

where P_{i,θ} = F_ε(X_{iS}^⊤θ). In this notation, ∇ζ_S(θ) − ∇ζ_S(0) = −2 ∑_{i=1}^n (Id − 𝔼_X)ε_{i,θ}X_{iS}. For every λ ∈ ℝ and u ∈ ℝ^s, we have

𝔼_X exp[λu^⊤{∇ζ_S(θ) − ∇ζ_S(0)}]
  = ∏_{i=1}^n [ 𝔼_X{e^{−2λu^⊤X_{iS}(I−𝔼_X)ε_{i,θ}}}I(X_{iS}^⊤θ ≥ 0) + 𝔼_X{e^{−2λu^⊤X_{iS}(I−𝔼_X)ε_{i,θ}}}I(X_{iS}^⊤θ < 0) ]
  = ∏_{i=1}^n [ {e^{−2λu^⊤X_{iS}(3/2−P_{i,θ})}(P_{i,θ} − 1/2) + e^{2λu^⊤X_{iS}(P_{i,θ}−1/2)}(3/2 − P_{i,θ})}I(X_{iS}^⊤θ ≥ 0)
      + {e^{2λu^⊤X_{iS}(1/2+P_{i,θ})}(1/2 − P_{i,θ}) + e^{2λu^⊤X_{iS}(P_{i,θ}−1/2)}(1/2 + P_{i,θ})}I(X_{iS}^⊤θ < 0) ].

Further, using the inequalities |e^u − 1 − u| ≤ (1/2)u^2e^{u∨0} and 1 + u ≤ e^u, which hold for all u ∈ ℝ, the last term above can be bounded by

∏_{i=1}^n [ {1 + 2λ^2(u^⊤X_{iS})^2(P_{i,θ} − 1/2)(3/2 − P_{i,θ})e^{2|λ||u^⊤X_{iS}|}}I(X_{iS}^⊤θ ≥ 0) + {1 + 2λ^2(u^⊤X_{iS})^2(1/2 − P_{i,θ})(1/2 + P_{i,θ})e^{2|λ||u^⊤X_{iS}|}}I(X_{iS}^⊤θ < 0) ]
  ≤ ∏_{i=1}^n {1 + 2λ^2(u^⊤X_{iS})^2|P_{i,θ} − 1/2|e^{2|λ||u^⊤X_{iS}|}} ≤ ∏_{i=1}^n exp{ 2λ^2(u^⊤X_{iS})^2|P_{i,θ} − 1/2|e^{2|λ||u^⊤X_{iS}|} }.

Consequently, for every θ̄ = (v^⊤, θ^⊤)^⊤ ∈ Θ̄_0(√2 r),

log 𝔼_X exp{ λ(Ū(θ̄) − Ū(0))/‖D̄_0θ̄‖_2 } = log 𝔼_X exp{ λ v^⊤{∇ζ_S(θ) − ∇ζ_S(0)}/‖D̄_0θ̄‖_2 }
  ≤ 2λ^2(‖D_0v‖_2^2 + ‖D_0θ‖_2^2)^{−1} ∑_{i=1}^n (v^⊤X_{iS})^2|P_{i,θ} − 1/2| exp( 2|λ||v^⊤X_{iS}|/‖D̄_0θ̄‖_2 ).
(7.30)

On the event ℰ_0(τ) for some τ > 0, we have |P_{i,θ} − 1/2| ≤ 2A_2{2nf_ε(0)}^{−1/2}τ^{1/2}r and |v^⊤X_{iS}| ≤ ‖D_0v‖_2‖D_0^{−1}X_{iS}‖_2 ≤ ‖D_0v‖_2{2nf_ε(0)}^{−1/2}τ^{1/2}. Together with (7.30), this yields that for all |λ| ≤ {2nf_ε(0)/τ}^{1/2},

log 𝔼_X exp{ λ(Ū(θ̄) − Ū(0))/‖D̄_0θ̄‖_2 } ≤ (λ^2/2)·4e^2A_2 r f_ε^{−1}(0){τ/(2nf_ε(0))}^{1/2}.
(7.31)

In view of (7.31), define

w_0(τ) = 2e{A_2 r_0 f_ε^{−1}(0)}^{1/2}{τ/(2nf_ε(0))}^{1/4}
(7.32)

for some r_0 > 0 to be specified (see (7.38) below), such that for any θ̄ = (v^⊤, θ^⊤)^⊤ ∈ Θ̄_0(√2 r) with 0 ≤ r ≤ r_0,

𝔼_X exp{ (λ/w_0(τ))(Ū(θ̄) − Ū(0))/‖D̄_0θ̄‖_2 } ≤ exp(λ^2/2)
(7.33)

holds almost surely on ℰ_0(τ) for all

|λ| ≤ 2e{A_2 r_0 f_ε^{−1}(0)}^{1/2}{2nf_ε(0)/τ}^{1/4} =: g_0(τ).
(7.34)

By (7.33), it follows from Corollary 2.2 in the supplement of Spokoiny (2012) and (7.29) that, for any τ > 0, 0 ≤ r ≤ r_0 and 0 < t ≤ (1/2)g_0^2(τ) − 2s,

ℙ_X{ max_{θ∈Θ_0(r)} ‖D_0^{−1}U(θ)‖_2 ≥ 6w_0(τ)(2t + 4s)^{1/2} } ≤ e^{−t}
(7.35)

holds almost surely on ℰ_0(τ), where g_0 is given in (7.34).

Combining (7.24) and (7.35), we obtain that for any τ > 0, 0 ≤ r ≤ r_0 ≤ {2nf_ε(0)/τ}^{1/2} and 0 < t ≤ (1/2)g_0^2(τ) − 2s,

ℙ_X{ max_{θ∈Θ_0(r)} ‖χ_{1S}(θ)‖_2 ≥ δ(τ, r)r + 6w_0(τ)(2t + 4s)^{1/2} } ≤ e^{−t}
(7.36)

almost surely on ℰ_0(τ). For a given triplet (τ, r, t), define the event

Ω_{0S}(τ, r, t) = { max_{θ∈Θ_0(r)} ‖χ_{1S}(θ)‖_2 ≤ δ(τ, r)r + 6w_0(τ)(2t + 4s)^{1/2} }.
(7.37)

Step 2: Fisher approximation

By Lemma 6.5,

max_{S⊆[p]:|S|=s} ‖D_0θ̂_S‖_2 = {2nf_ε(0)}^{1/2} max_{S⊆[p]:|S|=s} ‖Σ̂_{SS}^{1/2}θ̂_S‖_2 ≤ C_1a_2^{−1}{2f_ε(0) s log(pn)}^{1/2} =: r_0
(7.38)

holds with probability at least 1 − c_1n^{−1} − c_2n^{1−κ}. Moreover, since θ̂_S maximizes ℓ_{nS}(θ) over θ ∈ ℝ^s for each s-subset S ⊆ [p], we have ∇ℓ_{nS}(θ̂_S) = 0 and χ_{1S}(θ̂_S) = D_0θ̂_S − ξ̂_S. This, together with (7.37), implies that on the event {θ̂_S ∈ Θ_0(r_0)} ∩ Ω_{0S}(τ, r_0, t),

‖D_0θ̂_S − ξ̂_S‖_2 ≤ δ(τ, r_0)r_0 + 6w_0(τ)(2t + 4s)^{1/2}
(7.39)

whenever n ≥ {2f_ε(0)}^{−1} τ r_0^2.

Step 3: Wilks approximation

For θ_1, θ_2 ∈ Θ_0(r), define

χ_{2S}(θ_1, θ_2) = ℓ_{nS}(θ_1) − ℓ_{nS}(θ_2) − (θ_1 − θ_2)^⊤∇ℓ_{nS}(θ_2) + (1/2)‖D_0(θ_1 − θ_2)‖_2^2.
(7.40)

Noting that ∇_{θ_1}χ_{2S}(θ_1, θ_2) = ∇ℓ_{nS}(θ_1) − ∇ℓ_{nS}(θ_2) + D_0^2(θ_1 − θ_2) = D_0{χ_{1S}(θ_1) − χ_{1S}(θ_2)}, we have

|χ_{2S}(θ_1, θ_2)| = |χ_{2S}(θ_1, θ_2) − χ_{2S}(θ_2, θ_2)| = |(θ_1 − θ_2)^⊤D_0{χ_{1S}(θ̃) − χ_{1S}(θ_2)}| ≤ 2‖D_0(θ_1 − θ_2)‖_2 max_{u∈Θ_0(r)} ‖χ_{1S}(u)‖_2,
(7.41)

where θ̃ = λθ_1 + (1 − λ)θ_2 for some 0 ≤ λ ≤ 1. Let r_0 > 0 be as in (7.38). Then, it follows from (7.41) that on Ω_{0S}(τ, r_0, t) with n ≥ {2f_ε(0)}^{−1} τ r_0^2,

max_{θ_1,θ_2∈Θ_0(r_0)} |χ_{2S}(θ_1, θ_2)| / ‖D_0(θ_1 − θ_2)‖_2 ≤ 2δ(τ, r_0)r_0 + 12w_0(τ)(2t + 4s)^{1/2}.

In view of (7.40), ℓ_{nS}(θ̂_S) − ℓ_{nS}(0) − (1/2)‖D_0θ̂_S‖_2^2 = −χ_{2S}(0, θ̂_S). Therefore, on the event {θ̂_S ∈ Θ_0(r_0)} ∩ Ω_{0S}(τ, r_0, t) we have

|[2{ℓ_{nS}(θ̂_S) − ℓ_{nS}(0)}]^{1/2} − ‖D_0θ̂_S‖_2| ≤ |2{ℓ_{nS}(θ̂_S) − ℓ_{nS}(0)} − ‖D_0θ̂_S‖_2^2| / ‖D_0θ̂_S‖_2 ≤ 2|χ_{2S}(0, θ̂_S)| / ‖D_0θ̂_S‖_2 ≤ 4{δ(τ, r_0)r_0 + 6w_0(τ)(2t + 4s)^{1/2}},

provided that n ≥ {2f_ε(0)}^{−1} τ r_0^2. Together with (7.39), this implies that conditional on the event ∩_{S⊆[p]:|S|=s} [{θ̂_S ∈ Θ_0(r_0)} ∩ Ω_{0S}(τ, r_0, t)],

max_{S⊆[p]:|S|=s} |[2{ℓ_{nS}(θ̂_S) − ℓ_{nS}(0)}]^{1/2} − ‖ξ̂_S‖_2| ≤ 5{δ(τ, r_0)r_0 + 6w_0(τ)(2t + 4s)^{1/2}}
(7.42)

whenever n ≥ {2f_ε(0)}^{−1} r_0^2 τ, where δ(τ, r), r_0 and w_0(τ) are as in (7.26), (7.38) and (7.32).

Finally, taking τ = τ_0 = λ_min^{−1}(s) s log(pn) as in (6.13) and setting t = s log(ep/s) + log n in the concentration bound (7.36) proves (6.25) using Boole's inequality.
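The Wilks-type expansion just established is easy to visualize numerically. The sketch below is an illustration of ours for a single subset S, with standard normal errors (so that f_ε(0) = 1/√(2π)) and the LAD fit again computed by linear programming; it compares the two sides of the approximation, namely [2{ℓ_{nS}(θ̂_S) − ℓ_{nS}(0)}]^{1/2} and ‖ξ̂_S‖_2 with ξ̂_S = D_0^{−1}∇ℓ_{nS}(0), which should agree closely for moderate n.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(5)
n, s = 400, 3
X = rng.standard_normal((n, s))
y = rng.standard_normal(n)                 # null model with N(0, 1) errors
f0 = 1.0 / np.sqrt(2.0 * np.pi)            # f_eps(0) for the standard normal density

def lad_fit(Xs, y):
    """LAD estimate via the LP: minimize sum(t) subject to -t <= y - Xs @ theta <= t."""
    n_, s_ = Xs.shape
    c = np.concatenate([np.zeros(s_), np.ones(n_)])
    A_ub = np.block([[ Xs, -np.eye(n_)],
                     [-Xs, -np.eye(n_)]])
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * s_ + [(0, None)] * n_
    return linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs").x[:s_]

theta_hat = lad_fit(X, y)
lik_gain = 2.0 * (np.sum(np.abs(y)) - np.sum(np.abs(y - X @ theta_hat)))   # 2{l_nS(theta_hat) - l_nS(0)}

grad0 = X.T @ np.sign(y)                   # grad l_nS(0) for l_nS(theta) = -||y - X theta||_1
D0_sq = 2.0 * f0 * (X.T @ X)               # D_0^2 = 2 f_eps(0) X'X
xi_norm = np.sqrt(grad0 @ np.linalg.solve(D0_sq, grad0))   # ||xi_hat_S||_2

print("[2{l(theta_hat) - l(0)}]^{1/2} =", round(np.sqrt(lik_gain), 4))
print("||xi_hat_S||_2                =", round(xi_norm, 4))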

References

  • Bertsimas D, King A, Mazumder R. Best subset selection via a modern optimization lens. 2015 Available at arXiv:1507.03133.
  • Bühlmann P, van de Geer SA. Statistics for High-Dimensional Data: Methods, Theory and Applications. New York: Springer; 2011.
  • Cai TT, Jiang T. Phase transition in limiting distributions of coherence of high-dimensional random matrices. J. Multivariate Anal. 2012;107:24–39.
  • Cai TT, Fan J, Jiang T. Distributions of angles in random packing on spheres. J. Mach. Learn. Res. 2013;14:1837–1864.
  • Chernozhukov V, Chetverikov D, Kato K. Gaussian approximation of suprema of empirical processes. Ann. Statist. 2014;42:1564–1597.
  • Fan J, Guo S, Hao N. Variance estimation using refitted cross-validation in ultrahigh dimensional regression. J. R. Stat. Soc. Ser. B. 2012;74:37–65.
  • Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 2001;96:1348–1360.
  • Fan J, Liu H, Sun Q, Zhang T. TAC for sparse learning: Simultaneous control of algorithmic complexity and statistical error. 2015 Available at arXiv:1507.01037.
  • Fan J, Shao Q-M, Zhou W-X. Are discoveries spurious? Distributions of maximum spurious correlations and their applications. 2015 Available at arXiv:1502.04237.
  • Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 1997;55:119–139.
  • Hastie T, Tibshirani R, Wainwright M. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall, CRC Press; 2015.
  • Koenker R. Quantile Regression. Cambridge: Cambridge University Press; 2005.
  • Lange K, Hunter DR, Yang I. Optimization transfer using surrogate objective functions. J. Comput. Graph. Statist. 2000;9:1–20.
  • Maddala GS. Limited-Dependent and Qualitative Variables in Econometrics. Cambridge: Cambridge University Press; 1983.
  • McCullagh P, Nelder JA. Generalized Linear Models. London: Chapman & Hall; 1989.
  • Oberthuer A, Berthold F, Warnat P, Hero B, Kahlert Y, Spitz R, Ernestus K, König R, Haas S, Eils R, Schwab M, Brors B, Westermann F, Fischer M. Customized oligonucleotide microarray gene expression based classification of neuroblastoma patients outperforms current clinical risk stratification. J. Clin. Oncol. 2006;24:5070–5078.
  • Shi L, et al. MAQC Consortium. The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nature Biotechnology. 2010;28:827–841.
  • Spokoiny V. Parametric estimation. Finite sample theory. Ann. Statist. 2012;40:2877–2909.
  • Spokoiny V. Bernstein-von Mises theorem for growing parameter dimension. 2013 Available at arXiv:1302.3430.
  • Spokoiny V, Zhilova M. Bootstrap confidence sets under model misspecification. Ann. Statist. 2015;43:2653–2675.
  • Sun T, Zhang C-H. Scaled sparse linear regression. Biometrika. 2012;99:879–898.
  • Thorisson GA, Smith AV, Krishnan L, Stein LD. The International HapMap Project Web site. Genome Res. 2005;15:1592–1593.
  • Tibshirani R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 1996;58:267–288.
  • Vapnik VN. The Nature of Statistical Learning Theory. New York: Springer-Verlag; 1995.
  • Vershynin R. Introduction to the non-asymptotic analysis of random matrices. In: Eldar Y, Kutyniok G, editors. Compressed Sensing: Theory and Applications. Cambridge: Cambridge University Press; 2012. pp. 210–268.
  • Wang L. The L1 penalized LAD estimator for high dimensional linear regression. J. Multivariate Anal. 2013;120:135–151.
  • Wilks SS. The large-sample distribution of the likelihood ratio for testing composite hypotheses. Ann. Math. Statist. 1938;9:60–62.
  • Zou H, Li R. One-step sparse estimates in nonconcave penalized likelihood models. Ann. Statist. 2008;36:1509–1533.