Int J Biostat. 2010 January 1; 6(1): Article 12.
Published online 2010 March 29. doi:  10.2202/1557-4679.1231
PMCID: PMC2854087

A Note on the Effect on Power of Score Tests via Dimension Reduction by Penalized Regression under the Null*


We consider the problem of score testing for certain low dimensional parameters of interest in a model that could include finite but high dimensional secondary covariates and associated nuisance parameters. We investigate the potential gain in power from reducing the dimensionality of the secondary variables via oracle estimators such as the Adaptive Lasso. As an application, we use a recently developed framework for score tests of association of a disease outcome with an exposure of interest in the presence of a possible interaction of the exposure with other co-factors of the model. We derive the local power of such tests and show that if the primary and secondary predictors are independent, then having an oracle estimator does not improve the local power of the score test. Conversely, if they are dependent, there is the potential for power gain. Simulations are used to validate the theoretical results and explore the extent of correlation needed between the primary and secondary covariates to observe an improvement of the power of the test by using the oracle estimator. Our conclusions are likely to hold more generally beyond the model of interactions considered here.

Keywords: Adaptive Lasso, gene-environment interactions, Lasso, model selection, oracle estimation, score tests

1. Introduction

Dimension reduction in regression models via regularization has emerged as a topic that has undergone vigorous development, with oracle methods such as SCAD (Fan and Li, 2001) and the Adaptive Lasso (Zou, 2006) and other popular methods such as the Lasso (Tibshirani, 1996). We will say that an oracle method is one which is asymptotically consistent, i.e., selects the correct model with probability tending to 1.0 as the sample size n → ∞, and is asymptotically efficient for estimating the non-zero variables.

We are interested in testing for the effect of a low-dimensional covariate Z when the model contains other higher dimensional covariates (X, S) which govern nuisance parameters. As in Fan and Li (2001) and Zou (2006), we work in the context that the number of covariates is large but smaller than the sample size, i.e., the dimensionality is fixed and does not increase with the sample size. For example, in logistic regression with a binary response Y, the model would be

$$\mathrm{pr}(Y = 1 \mid X, S, Z) = H(X^{T}\kappa + S^{T}\eta + Z^{T}\theta), \qquad (1)$$

where H(·) is the logistic distribution function. The interest here is in testing the null hypothesis H0 : θ = 0 and in producing an asymptotically valid significance level for that test. In model (1), (κ, η) is a nuisance parameter relating to (X, S). One cannot use an oracle method to make valid probability statements about the null hypothesis, i.e., run SCAD or the Adaptive Lasso using all of (X, S, Z) and then somehow test for θ. This has been emphasized in a series of papers beginning with Leeb and Pötscher (2005), who note that the limit distribution of the oracle estimate of θ is not asymptotically normal under contiguous alternatives.

Our emphasis here is very different from the usual dimension reduction framework. The parameters (κ, η) are simply nuisance parameters and in this framework are not the main focus of interest.

As a motivating example for this article, we consider an extension of the model (1) that Chatterjee et al. (2006) described for testing the association of a disease outcome with an exposure when that exposure may be interacting with other co-factors that influence the risk of the disease. See also Maity et al. (2009) for a semiparametric extension. In Chatterjee et al., Z could be a specific genetic or environmental exposure of interest, X could be an array of genetic and/or environmental covariates of higher dimension that may be potentially interacting with Z, and S could be certain basic covariates, such as age and sex, that the model needs to be adjusted for. The interest focuses on whether Z is associated with the disease outcome Y. To improve power, Chatterjee et al. proposed using a parsimonious, Tukey’s one-degree-of-freedom style, interaction model between Z and X of the form

$$\mathrm{pr}(Y = 1 \mid X, S, Z) = H\{X^{T}\kappa + S^{T}\eta + Z^{T}\theta\,(1 + \gamma X^{T}\kappa)\}, \qquad (2)$$

where the scalar γ is meant to capture the interaction. A full description of this method is given in the Appendix. The null hypothesis of no association between Y and Z corresponds to testing H0 : θ = 0. A complication, however, is that under θ = 0, the parameter γ also disappears from the model and hence is not identifiable from the data. Nevertheless, Chatterjee et al. noticed that for each fixed value of γ, the model (2) can be used to construct a valid score test for H0 : θ = 0. They proposed to use the maximum of such score statistics over a range of the parameter γ as the final test statistic. They observed that the score test has particular computational advantages, because under the null hypothesis the model (2) reduces to a standard logistic regression model involving only main effects of X and S.

Motivated by model (2) and Chatterjee et al. (2006), we study in this paper the following algorithm. Let the loglikelihood function be denoted by $L(Y, X^{T}\kappa, S^{T}\eta, Z^{T}\theta, \zeta)$, where for example in linear regression ζ would be the regression error variance. The usual score test of course would fit the model $L(Y, X^{T}\kappa, S^{T}\eta, 0, \zeta)$. An oracle would do the following, which we call an oracle score test.

  • Under the null hypothesis H0 : θ = 0, fit the parameters (κ, η) by using an oracle estimator that consistently chooses the correct null model.
  • Perform a score test using the selected components of (X, S).
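The two steps above can be sketched in code. The following is only an illustrative sketch under several assumptions that are not in the paper: Z is scalar, the model is the logistic model (1), the null fit is an unpenalized Newton-Raphson fit, an intercept column is treated as part of S, and the "oracle" is imitated by simply passing the columns known to be non-null (a real analysis would select them with, e.g., the Adaptive Lasso). The names `fit_logistic` and `score_test` are hypothetical.

```python
import numpy as np

def fit_logistic(W, y, n_iter=50, tol=1e-10):
    """Unpenalized logistic regression MLE by Newton-Raphson."""
    beta = np.zeros(W.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-W @ beta))
        grad = W.T @ (y - p)
        hess = (W * (p * (1.0 - p))[:, None]).T @ W
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def score_test(y, W, z):
    """Score statistic for H0: theta = 0 (scalar z), nuisance covariates W."""
    n = len(y)
    beta = fit_logistic(W, y)              # fit under the null model only
    p = 1.0 / (1.0 + np.exp(-W @ beta))
    v = p * (1.0 - p)                      # logistic variance function
    s_n = z @ (y - p) / np.sqrt(n)         # normalized score for theta
    i_tt = np.mean(z * z * v)              # estimate of I_theta,theta
    i_tb = (z * v) @ W / n                 # estimate of I_theta,beta
    i_bb = (W * v[:, None]).T @ W / n      # estimate of I_beta,beta
    omega = i_tt - i_tb @ np.linalg.solve(i_bb, i_tb)
    return s_n ** 2 / omega

# Simulated illustration (all values hypothetical).
rng = np.random.default_rng(0)
n = 2000
W = np.column_stack([np.ones(n), rng.normal(size=(n, 6))])
z = rng.normal(size=n)                     # independent of W
beta_true = np.array([-0.5, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0])
prob = 1.0 / (1.0 + np.exp(-(W @ beta_true + 0.1 * z)))
y = rng.binomial(1, prob).astype(float)

t_all = score_test(y, W, z)                # ordinary score test, all covariates
t_oracle = score_test(y, W[:, :3], z)      # "oracle": only the non-null columns
print(t_all, t_oracle)
```

Each statistic would be compared with a chi-squared critical value with one degree of freedom; the sketch only shows where the model selection step enters, namely in which columns of W are passed to the null fit.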

This paper addresses the simple question: does having access to an oracle under the null model improve the local power of the test, i.e., is the oracle score test more powerful than the ordinary score test that does not do any model selection under the null? We will show that the answer to this depends on how (X, S) are correlated with Z: when they are independent, there is no point in having access to an oracle. We will demonstrate however that if (X, S) has information-carrying elements that are highly correlated with Z, then having access to an oracle can be valuable.

An outline of this paper is as follows. In Section 2, we develop the general framework and state the main result on the local power of the score test. In Section 3, we discuss the implications of this result for oracle methods. In Section 4, we give simulation evidence that suggests that having access to an oracle may not be much help in a variety of practical problems. In Section 5, we briefly describe a setting in which having access to an oracle will be useful, although we point out some peculiar behavior of the Adaptive Lasso implemented with 10-fold crossvalidation. All technical details are given in the Appendix.

2. Local Power of the Score Test

Here we compute the local power of the score test against alternatives of the form $\theta = \lambda n^{-1/2}$.

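Suppose that the loglikelihood function is $L(Y, X^{T}\kappa, S^{T}\eta, Z^{T}\theta, \zeta)$ and that the null hypothesis is H0 : θ = 0. In the general form, the function $L(Y, \nu_1, \nu_2, \nu_3, \nu_4)$ has first derivatives $L_j(Y, \nu_1, \nu_2, \nu_3, \nu_4) = \partial L(Y, \nu_1, \nu_2, \nu_3, \nu_4)/\partial \nu_j$ and second derivatives $L_{jk}(Y, \nu_1, \nu_2, \nu_3, \nu_4) = \partial^2 L(Y, \nu_1, \nu_2, \nu_3, \nu_4)/\partial \nu_j \partial \nu_k$. Let $\beta = (\kappa^{T}, \eta^{T}, \zeta^{T})^{T}$. Let the Fisher information matrix for (θ, β) be partitioned to have diagonal elements $I_{\theta\theta}$ and $I_{\beta\beta}$ with top right hand off-diagonal element $I_{\theta\beta}$. Under the null hypothesis, with $(\bullet) = (Y, X^{T}\kappa, S^{T}\eta, 0, \zeta)$, the Fisher information matrix is

$$
I = \begin{bmatrix} I_{\theta\theta} & I_{\theta\beta} \\ I_{\beta\theta} & I_{\beta\beta} \end{bmatrix}
= -E \begin{bmatrix}
ZZ^{T}L_{33}(\bullet) & ZX^{T}L_{13}(\bullet) & ZS^{T}L_{23}(\bullet) & ZL_{34}(\bullet) \\
XZ^{T}L_{13}(\bullet) & XX^{T}L_{11}(\bullet) & XS^{T}L_{12}(\bullet) & XL_{14}(\bullet) \\
SZ^{T}L_{23}(\bullet) & SX^{T}L_{12}(\bullet) & SS^{T}L_{22}(\bullet) & SL_{24}(\bullet) \\
Z^{T}L_{34}(\bullet) & X^{T}L_{14}(\bullet) & S^{T}L_{24}(\bullet) & L_{44}(\bullet)
\end{bmatrix}.
$$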

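Further, define $\Omega = I_{\theta\theta} - I_{\theta\beta} I_{\beta\beta}^{-1} I_{\beta\theta}$. Let $\hat\beta(\theta_n)$ be the maximum likelihood estimate of β assuming that the null hypothesis is true but when $\theta_n$ is the true parameter value. Let $\hat\Omega$ be the usual estimate of Ω with the Fisher information matrix estimated under the null hypothesis. Define

$$S_n = n^{-1/2} \sum_{i=1}^{n} Z_i L_3\{Y_i, X_i^{T}\hat\kappa(\theta_n), S_i^{T}\hat\eta(\theta_n), 0, \hat\zeta(\theta_n)\}.$$

The score test statistic is $T_n = S_n^{T} \hat\Omega^{-1} S_n$. Under the null hypothesis the score test statistic converges in distribution to $\chi^2_{p_\theta}(0)$, the central chi-squared distribution with $p_\theta$ degrees of freedom, where $p_\theta$ is the dimension of θ. The following is a well-known but useful result.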

Result 1 Under the local alternatives $\theta = \lambda n^{-1/2}$, $T_n \Rightarrow \chi^2_{p_\theta}(\lambda^{T}\Omega\lambda/2)$, the non-central chi-squared distribution with $p_\theta$ degrees of freedom and noncentrality parameter $\lambda^{T}\Omega\lambda/2$.
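For scalar θ the limiting power implied by this result can be computed with the standard library alone. The sketch below is an illustration, not part of the paper: it assumes $p_\theta = 1$ and uses the limit $S_n \Rightarrow \mathrm{Normal}(\Omega\lambda, \Omega)$ stated in Result 3 of the Appendix, so that $S_n/\Omega^{1/2}$ is asymptotically normal with mean $\lambda\Omega^{1/2}$ and unit variance. It is parametrized by λ and Ω directly, which sidesteps the question of which noncentrality convention a given software package uses. The function name is hypothetical.

```python
from statistics import NormalDist

def local_power_scalar(lam, omega, alpha=0.05):
    """Limiting power of the level-alpha score test when theta is scalar.

    Rejects when T_n = S_n^2 / omega exceeds the chi-squared(1) critical
    value, i.e. when |S_n / sqrt(omega)| exceeds the normal quantile.
    """
    std = NormalDist()
    z_crit = std.inv_cdf(1.0 - alpha / 2.0)
    delta = lam * omega ** 0.5             # mean of S_n / sqrt(omega)
    return std.cdf(delta - z_crit) + std.cdf(-delta - z_crit)

# A larger Omega gives a larger local power for the same lambda.
for omega in (0.5, 0.75, 1.0):
    print(omega, round(local_power_scalar(2.0, omega), 3))
```

At λ = 0 the function returns the nominal level, and the power increases with $\lambda^{2}\Omega$, which is why the comparison between the oracle and the ordinary score test reduces to a comparison of their values of Ω.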

3. Implications for Oracles

We are going to show that if Z and (X, S) are independent, under two simple assumptions that are usually satisfied, using an oracle to reduce the dimension of the fit yields no increase in local power when compared to using all components of (X, S), and not just the ones that predict the response. This has an important implication. In perhaps the vast majority of cases, one will not expect information-carrying components of (X, S) to be highly correlated with Z. Thus, in much of actual practice, oracle estimation when doing a score test will not improve power substantially.

We will make the following assumptions. Both Assumptions 1–2 hold for the score tests for models (1) and (2).

Assumption 1 Under the null hypothesis, the distribution of Y depends on (X, S, Z) only through $(X^{T}\kappa, S^{T}\eta)$.

Assumption 2 The score test statistic is invariant to location changes in Z, so that for example it has the same value whether Z or $Z - E(Z)$ is used.

Result 2 Suppose that Z and (X, S) are independent, and that Assumptions 1–2 hold. Then under the contiguous alternatives that $\theta = \theta_n = \lambda n^{-1/2}$, the local power of the score test depends on W = (X, S) only through $(X^{T}\kappa, S^{T}\eta)$.

Using Result 2 our main result follows:

Theorem 1 Make the same assumptions as in Result 2, including that Z is independent of (X, S). Partition β into information and non-information carrying components, so that β = (β1, β2 = 0). Partition the components of W = (X, S) similarly as (W1, W2). Then, in the score test, knowing that β2 = 0 and using only W1 does not affect the local power of the score test. It follows that oracle estimators of β will have no more local power than the naive test that does not attempt dimension reduction.

Corollary 1 Using Results 1–2 and Result 3 in Appendix A.1, if Z is not independent of (X, S), then in general oracle estimators will have greater power.
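Theorem 1 and Corollary 1 can be illustrated with a small worked example. The setting is an assumption chosen for transparency, not the model of the paper: a linear model with known unit error variance, scalar Z, and two secondary covariates $W_1$ (information-carrying) and $W_2$ (with zero coefficient), all with mean zero, unit variance and common correlation ρ. In that case the Fisher information is the second-moment matrix of $(Z, W)$ and Ω is the residual variance of Z given the covariates used in the null fit.

```latex
\begin{aligned}
\Omega_{\mathrm{all}}
  &= 1 - (\rho,\ \rho)
     \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}^{-1}
     \begin{pmatrix} \rho \\ \rho \end{pmatrix}
   = 1 - \frac{2\rho^{2}}{1+\rho},
  &&\text{(null fit uses } W_1 \text{ and } W_2\text{)} \\
\Omega_{\mathrm{oracle}}
  &= 1 - \rho^{2},
  &&\text{(null fit uses } W_1 \text{ only)} \\
\Omega_{\mathrm{oracle}} - \Omega_{\mathrm{all}}
  &= \frac{\rho^{2}(1-\rho)}{1+\rho} \;\ge\; 0 .
\end{aligned}
```

With ρ = 0 both quantities equal 1, so the local power is the same with or without the oracle, as in Theorem 1. With ρ = 0.5 they are 2/3 and 3/4, so the oracle has the larger noncentrality, as in Corollary 1, although the gain is modest.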

It is tempting to believe that just because (X, S) are independent of Z, then estimates of $\beta = (\kappa^{T}, \eta^{T}, \zeta^{T})^{T}$ are independent of estimates of θ. This is not the case under alternatives in general, including in our example.

4. Simulations

In this section, we report two simulations of the Chatterjee method of model (2). The method itself is described in Appendix A.3. In all cases, 1,000 data sets were randomly generated. In all cases, we first generated X* to have dimension $p_\kappa = 100$, S* to have dimension $p_\eta = 10$, and Z* to be scalar. In all cases, marginally X* and S* were normally distributed with mean zero, variance 1 and common correlation 0.50, while Z* was normally distributed with mean zero and variance one. We then converted X* and Z* to SNP-type data X and Z, which take on the values 0, 1 and 2. More precisely, the individual components of X and Z had value 1.0 with probability 0.27 and value 2.0 with probability 0.20, and hence value 0 with probability 0.53. The data were then generated by model (2) with γ = 2.
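A covariate-generation step of this kind can be sketched as follows. The paper does not say how the normal variables were converted to SNP-type values, so the thresholding rule below (cutting each standard normal at the quantiles that give the stated probabilities) is an assumption, as are the shared-factor construction of the equicorrelated normals and the function names.

```python
import random
from statistics import NormalDist

def equicorrelated_normals(dim, rho, rng):
    """One draw of `dim` standard normals with common pairwise correlation rho."""
    shared = rng.gauss(0.0, 1.0)           # factor shared by all components
    return [rho ** 0.5 * shared + (1.0 - rho) ** 0.5 * rng.gauss(0.0, 1.0)
            for _ in range(dim)]

def to_snp(value, p1=0.27, p2=0.20):
    """Map a standard normal value to 0/1/2 with P(1) = p1 and P(2) = p2."""
    std = NormalDist()
    cut01 = std.inv_cdf(1.0 - p1 - p2)     # values at or below this become 0
    cut12 = std.inv_cdf(1.0 - p2)          # values above this become 2
    if value <= cut01:
        return 0
    return 1 if value <= cut12 else 2

rng = random.Random(1)
x_star = equicorrelated_normals(100, 0.5, rng)   # one row of X*
x = [to_snp(v) for v in x_star]                  # corresponding row of X
print(x[:10])
```

Thresholding preserves the sign of the dependence among components but not the exact correlation of 0.50, so the correlation among the SNP-type values differs from that of the underlying normal variables.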

The sample size was n = 5,000. All components of κ were zero except for the first three, which had values (1.0, 1.0, 0.5). All components of η were zero except the first, which had value 0.75. The values of θ varied from the null, θ = 0, to θ = 0.07 in steps of 0.01. The Lasso, using the penalized package in R (Goeman, 2009a, 2009b), and the Adaptive Lasso, using the glmnet package in R (Friedman et al., 2009), were used to estimate β under the null hypothesis. We also refitted β using only the four information-carrying predictors, as well as using all 110 components of (X, S).

We point out that while our emphasis is on statistical power, we also evaluate the Type I error of the tests. As the results below show, all the methods have reasonable Type I error rate control.

4.1. The Case that (X, S, Z) are Mutually Independent

In the first simulation, (X*, S*, Z*) were mutually independent. According to our Theorem 1, we expect no gain in power from having access to an oracle. This is confirmed by Table 1, where the largest single difference in power was 0.026, although there is consistently a very small power increase with an oracle.

Table 1:
Simulation of power for a 5%-level test in the case that (X, S, Z) are mutually independent. Here the theory says that all methods should have the same local power.

4.2. The Case that (X, S, Z) are Mutually Dependent

In the second simulation, we set all components of (X*, S*, Z*) to have correlation 0.50. In this case, the theory predicts that the oracle method and the method that uses all 110 predictors will have different power; see Corollary 1. The results are given in Table 2. There is some evidence of very modest increased power for the oracle method, with the biggest difference in power, an increase of 0.051, occurring when θ = 0.05. The Adaptive Lasso is very close to the oracle. Disappointingly, in the second simulation, the Lasso method actually loses considerable power compared to using all 110 predictors, or compared to the Adaptive Lasso. It is not clear why this is happening.

Table 2:
Simulation of power for a 5%-level test in the case that (X, S, Z) are mutually correlated. Here the theory says that oracle methods should have higher local power than the method that uses all 110 predictors.

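The lack of a great increase in power for the oracle estimator in this second simulation is predictable from our theory. Write $W = (X^{T}, S^{T})^{T}$ and remember that $\beta = (\kappa^{T}, \eta^{T})^{T}$. Fix γ. Let $H^{(1)}(x) = H(x)\{1 - H(x)\}$. Detailed calculations based on Theorem 1 show that the elements of the information matrix are given as

$$
\begin{aligned}
I_{\theta\theta} &= E\{ZZ^{T}(1 + \gamma X^{T}\kappa)^{2} H^{(1)}(W^{T}\beta)\}, \\
I_{\theta\beta} &= E\{Z(1 + \gamma X^{T}\kappa) H^{(1)}(W^{T}\beta) W^{T}\}, \\
I_{\beta\beta} &= E\{WW^{T} H^{(1)}(W^{T}\beta)\},
\end{aligned}
$$

from which Ω and the noncentrality parameter can be calculated from Result 1.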

In our simulation study, with γ = 2 and θ = 0.05, the noncentrality parameter when using all 110 predictors is 0.42, while it is 0.44 for the oracle. Such a modest increase in noncentrality is reflected in the very modest power gains for the oracle in the second simulation. If instead we make the common correlation in the second simulation equal to 0.90, the noncentrality parameters are 0.20 and 0.22 for the 110 variables and the oracle, respectively.

5. High Correlation and Information-Carrying Covariates

Our main purpose has been to show that there will be little to be gained by having access to an oracle if the secondary variables are not highly correlated with the variables of interest. However, as a secondary matter, there are cases in which having access to an oracle does help in terms of power. Using our theory, we developed cases where X had 15 components, only 2 of which were information-carrying. In addition, Z had three components, one of which was also a predictor. One of the information-carrying components of X was uncorrelated with Z, while the other had correlation > 0.85 with each component of Z.

We investigated two cases, one in which the information-carrying components of X had relative risks ≈ 4.0, and the other in which the relative risks were a more modest ≈ 0.5. Here there was a decided increase in power if an oracle was available. Disappointingly, the Adaptive Lasso with 10-fold crossvalidation was only modestly more powerful than the approach that used all covariates, and much less powerful than the oracle. The Lasso was essentially equivalent to using all the covariates. It is possible, and indeed we hope, that newer oracle methods such as the Adaptive Elastic Net (Zou and Zhang, 2009) that deal with collinearity better than the Adaptive Lasso will lead to improved performance in these cases, although currently there is no software available in R for this method. The elastic net (Zou and Hastie, 2005) is its non-oracle version, and in these simulations was essentially equivalent to the Adaptive Lasso.

6. Discussion

We believe that using regularization methods such as the Adaptive Lasso is a natural idea in score testing for a primary variable when there are many secondary variables. Our simulations and empirical work demonstrate that if the secondary variables are not highly correlated with the variables of interest, or if these secondary variables are not informative about the response, then little will be gained by having access to an oracle. Our theoretical numerical exercises reveal the possibility of power increases with strong correlation between highly informative secondary variables and the primary variables of interest. This situation may not commonly occur. In settings where power gains are possible, the Adaptive Lasso with 10-fold crossvalidation had fairly disappointing behavior, which is a point for future research.

It would be interesting to find empirical examples of score testing where oracle methods actually do achieve significantly greater power in practice.

Appendix: Sketch of Technical Arguments

A.1.  General Considerations

Consider a general problem where the loglikelihood function is $L(Y, \theta, \beta)$. The null hypothesis is H0 : θ = 0, and we consider contiguous alternatives of the form $\theta = \theta_n = \lambda n^{-1/2}$. Let $\hat\beta(\theta_n)$ be the estimator computed as if the null hypothesis were true with data generated under the contiguous alternative.

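Subscripts will denote derivatives, e.g., $L_\theta(Y, \theta, \beta) = \partial L(Y, \theta, \beta)/\partial\theta$ and $L_{\theta\beta}(Y, \theta, \beta) = \partial^2 L(Y, \theta, \beta)/\partial\theta\,\partial\beta^{T}$. Define $I_{\theta\theta}(\theta, \beta) = -E\{L_{\theta\theta}(Y, \theta, \beta)\}$ and similarly for $I_{\theta\beta}(\theta, \beta)$, $I_{\beta\theta}(\theta, \beta)$, and $I_{\beta\beta}(\theta, \beta)$. These components of the Fisher information are estimated as

$$
\hat I_{\theta\theta} = -n^{-1}\sum_{i=1}^{n} L_{\theta\theta}\{Y_i, 0, \hat\beta(\theta_n)\}, \quad
\hat I_{\theta\beta} = -n^{-1}\sum_{i=1}^{n} L_{\theta\beta}\{Y_i, 0, \hat\beta(\theta_n)\}, \quad
\hat I_{\beta\beta} = -n^{-1}\sum_{i=1}^{n} L_{\beta\beta}\{Y_i, 0, \hat\beta(\theta_n)\}.
$$

The numerator of the score statistic is

$$S_n = n^{-1/2}\sum_{i=1}^{n} L_\theta\{Y_i, 0, \hat\beta(\theta_n)\}.$$

Define $\Omega = I_{\theta\theta}(0, \beta_0) - I_{\theta\beta}(0, \beta_0) I_{\beta\beta}^{-1}(0, \beta_0) I_{\beta\theta}(0, \beta_0)$ and let $\hat\Omega$ be its estimate. The score test statistic is $T_n = S_n^{T}\hat\Omega^{-1}S_n$.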

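Result 3 Let $p_\theta$ be the dimension of θ. Under contiguous alternatives $\theta = \theta_n = \lambda n^{-1/2}$,

$$S_n \Rightarrow \mathrm{Normal}(\Omega\lambda, \Omega); \qquad T_n \Rightarrow \chi^2_{p_\theta}(\lambda^{T}\Omega\lambda/2),$$

where $\chi^2_p(a)$ is the noncentral chi-squared distribution with p degrees of freedom and noncentrality parameter a.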
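The step from the normal limit of $S_n$ to the chi-squared limit of $T_n$ can be written out as follows. This is a sketch of a standard argument; the remark about conventions is an inference from the stated noncentrality parameter rather than something the text states.

```latex
\begin{aligned}
\Omega^{-1/2} S_n &\Rightarrow \mathrm{Normal}\bigl(\Omega^{1/2}\lambda,\ I_{p_\theta}\bigr), \\
T_n = S_n^{T}\hat\Omega^{-1} S_n
  &= \bigl\|\Omega^{-1/2} S_n\bigr\|^{2} + o_p(1), \\
\text{sum of squared means}
  &= \bigl(\Omega^{1/2}\lambda\bigr)^{T}\bigl(\Omega^{1/2}\lambda\bigr)
   = \lambda^{T}\Omega\lambda .
\end{aligned}
```

The stated noncentrality $\lambda^{T}\Omega\lambda/2$ corresponds to the convention in which the noncentrality parameter is half the sum of squared means. Software that defines the noncentrality parameter as the sum of squared means itself would take $\lambda^{T}\Omega\lambda$ as its argument.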

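Here is a sketch of the argument for Result 3. Under the null hypothesis, it is well known that the numerator of the score statistic has the expansion

$$S_n = n^{-1/2}\sum_{i=1}^{n}\bigl\{L_\theta(Y_i, 0, \beta_0) - I_{\theta\beta}(0, \beta_0) I_{\beta\beta}^{-1}(0, \beta_0) L_\beta(Y_i, 0, \beta_0)\bigr\} + o_p(1),$$

where $\beta_0$ is the true value of β.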

To apply LeCam’s third Lemma to obtain the distribution of the numerator of the score statistic under the alternatives θ = θn, we note that by a Taylor series expansion, under the null hypothesis,

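$$
\begin{aligned}
Q_n &= \sum_{i=1}^{n}\{L(Y_i, \theta_n, \beta_0) - L(Y_i, 0, \beta_0)\} \\
    &= \sum_{i=1}^{n} L_\theta^{T}(Y_i, 0, \beta_0)(\theta_n - 0)
       + (1/2)(\theta_n - 0)^{T}\sum_{i=1}^{n} L_{\theta\theta}^{T}(Y_i, 0, \beta_0)(\theta_n - 0) + o_p(1) \\
    &= n^{-1/2}\sum_{i=1}^{n}\lambda^{T} L_\theta(Y_i, 0, \beta_0) - (1/2)\lambda^{T} I_{\theta\theta}(0, \beta_0)\lambda + o_p(1).
\end{aligned}
$$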

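This means that under the null hypothesis, $(S_n, Q_n)$ are jointly asymptotically normally distributed with means 0 and $-(1/2)\lambda^{T} I_{\theta\theta}(0, \beta_0)\lambda$, respectively. Their variances are $\Omega = I_{\theta\theta}(0, \beta_0) - I_{\theta\beta}(0, \beta_0) I_{\beta\beta}^{-1}(0, \beta_0) I_{\beta\theta}(0, \beta_0)$ and $\lambda^{T} I_{\theta\theta}(0, \beta_0)\lambda$, respectively, and their covariance is $\Omega\lambda$. Applying LeCam’s third Lemma shows that when $\theta = \theta_n$, $S_n \Rightarrow \mathrm{Normal}(\Omega\lambda, \Omega)$. It then follows that $T_n \Rightarrow \chi^2_{p_\theta}(\lambda^{T}\Omega\lambda/2)$, as claimed.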

Result 1 is a simple consequence of these results.

A.2.  Proof of Theorem 1

We merely need to show that $I_{\theta\beta} = 0$ and that $I_{\theta\theta}$ depends on the distribution of (X, S) only through $(X^{T}\kappa, S^{T}\eta)$.

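Recalling Assumption 2, we can assume that E(Z) = 0. Because of Assumption 1, under the null hypothesis, $E\{L_{jk}(Y, X^{T}\kappa, S^{T}\eta, 0, \zeta) \mid X, S, Z\} = G_{jk}(X^{T}\kappa, S^{T}\eta, \zeta)$, say. Letting $(\bullet) = (X^{T}\kappa, S^{T}\eta, \zeta)$, we can now compute the Fisher information matrix under the null hypothesis as

$$
I = \begin{bmatrix} I_{\theta\theta} & I_{\theta\beta} \\ I_{\beta\theta} & I_{\beta\beta} \end{bmatrix}
= -E \begin{bmatrix}
ZZ^{T}G_{33}(\bullet) & ZX^{T}G_{13}(\bullet) & ZS^{T}G_{23}(\bullet) & ZG_{34}(\bullet) \\
XZ^{T}G_{13}(\bullet) & XX^{T}G_{11}(\bullet) & XS^{T}G_{12}(\bullet) & XG_{14}(\bullet) \\
SZ^{T}G_{23}(\bullet) & SX^{T}G_{12}(\bullet) & SS^{T}G_{22}(\bullet) & SG_{24}(\bullet) \\
Z^{T}G_{34}(\bullet) & X^{T}G_{14}(\bullet) & S^{T}G_{24}(\bullet) & G_{44}(\bullet)
\end{bmatrix}.
$$

Since Z is independent of (X, S), and since E(Z) = 0, it is obvious that $I_{\theta\beta} = 0$, and hence that $\Omega = I_{\theta\theta} = -\mathrm{cov}(Z)\,E\{G_{33}(X^{T}\kappa, S^{T}\eta, \zeta)\}$, completing the proof.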
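The factorization behind this last step can be spelled out. Each off-diagonal block is the expectation of a product of Z with a function of (X, S) alone, so independence lets the expectation split, and the same argument handles the leading block.

```latex
\begin{aligned}
E\{Z X^{T} G_{13}(\bullet)\} &= E(Z)\,E\{X^{T} G_{13}(\bullet)\} = 0, \\
E\{Z S^{T} G_{23}(\bullet)\} &= E(Z)\,E\{S^{T} G_{23}(\bullet)\} = 0, \\
E\{Z\, G_{34}(\bullet)\}     &= E(Z)\,E\{G_{34}(\bullet)\} = 0, \\
E\{Z Z^{T} G_{33}(\bullet)\} &= E(ZZ^{T})\,E\{G_{33}(\bullet)\}
                              = \mathrm{cov}(Z)\,E\{G_{33}(\bullet)\}.
\end{aligned}
```

The only surviving term involves (X, S) through $G_{33}(X^{T}\kappa, S^{T}\eta, \zeta)$, which is unchanged when components of (X, S) with zero coefficients are dropped. This is why knowing which coefficients are zero cannot change Ω in the independent case.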

A.3.  The Chatterjee Method of Score Testing in Genomics

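Here are the details of the Chatterjee procedure for model (2). Let $\beta = (\kappa^{T}, \eta^{T})^{T}$ and $W = (X^{T}, S^{T})^{T}$. The normalized score for estimating $\theta_0$ when evaluated at the null hypothesis $\theta_0 = 0$ in the logistic context is

$$S_n(\beta_0, \gamma) = n^{-1/2}\sum_{i=1}^{n} Z_i (1 + \gamma X_i^{T}\kappa_0)\{Y_i - H(W_i^{T}\beta_0)\}.$$

The idea is that for each fixed γ, estimate $\beta_0 = (\kappa_0, \eta_0)$ by maximum likelihood at the null model, calling that estimate $\hat\beta$. Let $H^{(1)}(x) = H(x)\{1 - H(x)\}$. Define

$$
\begin{aligned}
A(\beta_0) &= E\{WW^{T} H^{(1)}(W^{T}\beta_0)\}, \\
B(\beta_0, \gamma) &= E\{Z(1 + \gamma X^{T}\kappa_0) H^{(1)}(W^{T}\beta_0) W^{T}\}, \\
G(X, Z, S, \beta_0, \gamma) &= Z(1 + \gamma X^{T}\kappa_0) - B(\beta_0, \gamma) A^{-1}(\beta_0) W, \\
K(\beta_0, \gamma) &= E\{G(X, Z, S, \beta_0, \gamma)\, G^{T}(X, Z, S, \beta_0, \gamma)\, H^{(1)}(W^{T}\beta_0)\}.
\end{aligned}
$$

Further define $U(\gamma) = S_n^{T}(\beta, \gamma) K^{-1}(\beta, \gamma) S_n(\beta, \gamma)$. All these terms are estimated by replacing true parameters by their estimates under the null model and expectations by averages over the data.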
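The plug-in computation of U(γ) on a grid of γ values can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: Z is scalar, the normalized score is taken to be $n^{-1/2}\sum_i Z_i(1 + \gamma X_i^{T}\hat\kappa)\{Y_i - H(W_i^{T}\hat\beta)\}$ (the form implied by the definition of G), the null fit is an unpenalized Newton-Raphson fit, an intercept is included in S, and the γ grid and all function names are hypothetical.

```python
import numpy as np

def fit_null_logistic(W, y, n_iter=50):
    """Null-model logistic MLE (theta fixed at 0) by Newton-Raphson."""
    beta = np.zeros(W.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-W @ beta))
        h1 = p * (1.0 - p)
        beta = beta + np.linalg.solve((W * h1[:, None]).T @ W, W.T @ (y - p))
    return beta

def u_stat(y, X, S, z, kappa, eta, gamma):
    """Plug-in U(gamma) for scalar z, given null estimates (kappa, eta)."""
    n = len(y)
    W = np.column_stack([X, S])
    beta = np.concatenate([kappa, eta])
    p = 1.0 / (1.0 + np.exp(-W @ beta))      # H(W^T beta)
    h1 = p * (1.0 - p)                       # H^(1)(W^T beta)
    d = z * (1.0 + gamma * (X @ kappa))      # Z (1 + gamma X^T kappa)
    s_n = d @ (y - p) / np.sqrt(n)           # normalized score
    A = (W * h1[:, None]).T @ W / n          # sample version of A
    B = (d * h1) @ W / n                     # sample version of B
    G = d - W @ np.linalg.solve(A, B)        # G_i = d_i - B A^{-1} W_i
    K = np.mean(G * G * h1)                  # sample version of K
    return s_n ** 2 / K

# Simulated null data (all values hypothetical).
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))
S = np.column_stack([np.ones(n), rng.normal(size=n)])
z = rng.normal(size=n)
lin = X @ np.array([0.5, 0.3, 0.0]) + S @ np.array([-0.3, 0.4])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-lin))).astype(float)

beta_hat = fit_null_logistic(np.column_stack([X, S]), y)
kappa_hat, eta_hat = beta_hat[:3], beta_hat[3:]
gammas = np.linspace(-2.0, 2.0, 9)           # hypothetical grid for gamma
u_vals = [u_stat(y, X, S, z, kappa_hat, eta_hat, g) for g in gammas]
u_max = max(u_vals)                          # maximum over the grid
print(u_max)
```

The null model is fitted once and reused for every γ, which is the computational advantage of the score test noted in the Introduction: only the score and its variance change with γ.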

Chatterjee et al. propose as a test statistic to reject the null hypothesis for large values of

$$U_n = \max_{\gamma} \hat U(\gamma), \qquad (A.4)$$

with the maximum taken over a range of γ, where they show that for each γ, under the null hypothesis, $\hat U(\gamma) \Rightarrow \chi^2_{p_\theta}$, where $p_\theta$ is the dimension of θ. They also show how to compute p-values using only simulation, as follows. Let b = 1, ..., B, and for any b, let $(Z_{b1}, \ldots, Z_{bn})$ be randomly generated standard normal random variables. Define



Then, asymptotically, under the null hypothesis, $U_b$ has the same limit distribution as does $U_n$ in (A.4). Hence, the p-value is just

$$B^{-1}\sum_{b=1}^{B} I(U_b \ge U_n).$$
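One common reading of a simulation-based p-value of this kind is the proportion of simulated statistics that are at least as large as the observed one. The sketch below shows only that final step. It takes the simulated values $U_b$ as given and does not attempt to reproduce how they are constructed. The function name and the numbers are hypothetical.

```python
def monte_carlo_p_value(u_obs, u_sim):
    """Proportion of simulated statistics at least as large as the observed one."""
    if not u_sim:
        raise ValueError("need at least one simulated statistic")
    return sum(1 for u in u_sim if u >= u_obs) / len(u_sim)

u_sim = [0.5, 1.2, 3.4, 0.1]      # hypothetical simulated values U_1, ..., U_B
p_value = monte_carlo_p_value(1.0, u_sim)
print(p_value)
```

In practice B would be in the hundreds or thousands, since the resolution of such a p-value is 1/B.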



*Martinez was supported by a Postdoctoral Training grant from the National Cancer Institute (CA90301). Carroll’s research was supported by a grant from the National Cancer Institute (R37-CA057030). Chatterjee’s research was supported by a Gene-Environment Initiative (GEI) grant from the National Heart Lung and Blood Institute (NHLBI) and by the Intramural research program of the National Cancer Institute. This paper benefited from the constructive comments of two referees and an associate editor.


References

  • Chatterjee N, Kalaylioglu Z, Moslehi R, Peters U, Wacholder S. Powerful multi-locus tests for genetic association in the presence of gene-gene and gene-environment interactions. American Journal of Human Genetics. 2006;79:1002–1016. doi: 10.1086/509704.
  • Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360. doi: 10.1198/016214501753382273.
  • Friedman J, Hastie T, Tibshirani R. glmnet: Lasso and elastic-net regularized generalized linear models. R package version 1.1-3. 2009.
  • Goeman JJ. Penalized R package, version 0.9-24. 2009a.
  • Goeman JJ. L1 penalized estimation in the Cox proportional hazards model. Biometrical Journal. 2009b;51, in press.
  • Leeb H, Pötscher BM. Model selection and inference: facts and fiction. Econometric Theory. 2005;21:21–59. doi: 10.1017/S0266466605050036.
  • Maity A, Carroll RJ, Mammen E, Chatterjee N. Powerful multi-locus tests for genetic association with semiparametric gene-environment interactions. Journal of the Royal Statistical Society, Series B. 2009;71:75–96. doi: 10.1111/j.1467-9868.2008.00671.x.
  • Tibshirani R. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
  • Zou H. The Adaptive Lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429. doi: 10.1198/016214506000000735.
  • Zou H, Hastie T. Regularization and variable selection via the Elastic Net. Journal of the Royal Statistical Society, Series B. 2005;67:301–320. doi: 10.1111/j.1467-9868.2005.00503.x.
  • Zou H, Zhang HH. On the adaptive elastic-net with a diverging number of parameters. Annals of Statistics. 2009;37:1733–1751. doi: 10.1214/08-AOS625.

Articles from The International Journal of Biostatistics are provided here courtesy of Berkeley Electronic Press