Dimension reduction in regression models via regularization has undergone vigorous development, with oracle methods such as SCAD (Fan and Li, 2001) and the Adaptive Lasso (Zou, 2006), and other popular methods such as the Lasso (Tibshirani, 1996). We say that a method is an oracle method if it is asymptotically consistent, i.e., it selects the correct model with probability tending to 1 as the sample size n → ∞, and is asymptotically efficient for estimating the nonzero coefficients.
We are interested in testing for the effect of a low-dimensional covariate Z when the model contains other higher dimensional covariates (X, S) which govern nuisance parameters. As in Fan and Li (2001) and Zou (2006), we work in the setting where the number of covariates is large but smaller than the sample size, i.e., the dimensionality is fixed and does not increase with the sample size. For example, in logistic regression with a binary response Y, the model would be

pr(Y = 1 | X, S, Z) = H(Xᵀκ + Sᵀη + Zᵀθ),    (1)

where H(·) is the logistic distribution function. The interest here is in testing the null hypothesis H0 : θ = 0 and in producing an asymptotically valid significance level for that test. In model (1), (κ, η) is a nuisance parameter relating to (X, S). One cannot use an oracle method to make valid probability statements about the null hypothesis, i.e., one cannot run SCAD or the Adaptive Lasso using all of the covariates and then somehow test for θ. This has been emphasized in a series of papers beginning with Leeb and Pötscher (2005), who note that the limit distribution of the oracle estimate of θ is not normal under contiguous alternatives.
Our emphasis here is very different from the usual dimension reduction framework. The parameters (κ, η) are simply nuisance parameters and in this framework are not the main focus of interest.
As a motivating example for this article, we consider an extension of model (1) that Chatterjee et al. (2006) described for testing the association of a disease outcome with an exposure when that exposure may interact with other co-factors that influence the risk of the disease. See also Maity et al. (2009) for a semiparametric extension. In Chatterjee et al., Z could be a specific genetic or environmental exposure of interest, X could be an array of genetic and/or environmental covariates of higher dimension that may potentially interact with Z, and S could be certain basic covariates, such as age and sex, for which the model needs to be adjusted. The interest focuses on whether Z is associated with the disease outcome Y. To improve power, Chatterjee et al. proposed using a parsimonious interaction model between Z and X, in the style of Tukey's one-degree-of-freedom test, of the form

pr(Y = 1 | X, S, Z) = H{Xᵀκ + Sᵀη + Zᵀθ + γ(Xᵀκ)(Zᵀθ)},    (2)

where the scalar γ is meant to capture the interaction. A full description of this method is given in the Appendix. The null hypothesis of no association between Y and Z corresponds to testing H0 : θ = 0. A complication, however, is that under θ = 0 the parameter γ also disappears from the model and hence is not identifiable from the data. Nevertheless, Chatterjee et al. noticed that for each fixed value of γ, model (2) can be used to construct a valid score test for H0 : θ = 0. They proposed using the maximum of such score statistics over a range of values of γ as the final test statistic. They observed that the score test has particular computational advantages, because under the null hypothesis model (2) reduces to a standard logistic regression model involving only main effects of X and S.
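The fixed-γ score statistic can be sketched as follows. This is an illustration under an assumed form of model (2), namely pr(Y = 1 | X, S, Z) = H{Xᵀκ + Sᵀη + Zᵀθ + γ(Xᵀκ)(Zᵀθ)} with H the logistic distribution function. The symbols U(γ), V̂(γ), T, γ_L and γ_U are notation introduced here for the sketch and are not taken from the original.

```latex
% Assumed linear predictor: X^T\kappa + S^T\eta + Z^T\theta + \gamma (X^T\kappa)(Z^T\theta).
% Its derivative with respect to \theta, evaluated at \theta = 0, is Z (1 + \gamma X^T\kappa),
% so \gamma enters the score even though it drops out of the null model.
\begin{align*}
U(\gamma) &= \sum_{i=1}^{n} Z_i \,\bigl(1 + \gamma X_i^{T}\hat\kappa\bigr)\,
             \bigl\{ Y_i - H\bigl(X_i^{T}\hat\kappa + S_i^{T}\hat\eta\bigr) \bigr\}, \\
T &= \max_{\gamma_L \le \gamma \le \gamma_U}
     U(\gamma)^{T}\, \hat V(\gamma)^{-1}\, U(\gamma).
\end{align*}
```

Here (κ̂, η̂) are the estimates from the main-effects logistic regression of Y on (X, S), which is fitted once and reused for every γ, and V̂(γ) denotes an estimate of the null variance of U(γ). Because T is a maximum over γ, its null distribution is not that of a single chi-squared variable.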
Motivated by model (2) and Chatterjee et al. (2006), we study in this paper the following algorithm. Let the loglikelihood function depend on (κ, η, θ) and on a further nuisance parameter ζ, where, for example, in linear regression ζ would be the regression error variance. The usual score test would of course fit the null model θ = 0 using all of (X, S). An oracle would instead do the following, which we call an oracle score test:
- Under the null hypothesis H0 : θ = 0, fit the parameters (κ, η) by using an oracle estimator that consistently chooses the correct null model.
- Perform a score test using the selected components of (X, S).
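The two steps above can be sketched in the linear regression case, where the score statistic has a closed form. This is a minimal illustration, not the authors' implementation: the selection step is replaced by handing the test the true support directly (a stand-in for SCAD or the Adaptive Lasso), the matrix W stacks the nuisance covariates (X, S), and the function names, sample size, and coefficients are all hypothetical choices made for the example.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for c in reversed(range(n)):
        x[c] = (M[c][n] - sum(M[c][k] * x[k] for k in range(c + 1, n))) / M[c][c]
    return x

def residuals(v, W):
    """Residuals from least-squares regression of v on an intercept and the columns of W."""
    rows = [[1.0] + list(w) for w in W]
    p = len(rows[0])
    A = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
    b = [sum(r[j] * vi for r, vi in zip(rows, v)) for j in range(p)]
    beta = solve(A, b)
    return [vi - sum(bj * rj for bj, rj in zip(beta, r)) for r, vi in zip(rows, v)]

def score_stat(y, z, W, keep):
    """Score statistic for H0: theta = 0, adjusting only for the columns of W in `keep`."""
    Wk = [[w[j] for j in keep] for w in W]
    r = residuals(y, Wk)    # residuals from the fitted null model
    zt = residuals(z, Wk)   # Z adjusted for the retained nuisance covariates
    sigma2 = sum(e * e for e in r) / len(y)   # null estimate of the error variance
    U = sum(a * e for a, e in zip(zt, r))     # score for theta at theta = 0
    return U * U / (sigma2 * sum(a * a for a in zt))

# Hypothetical simulated data: only columns 0 and 1 of W matter, and Z is
# correlated with column 0.
random.seed(0)
n, p = 200, 8
support = [0, 1]
W = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
z = [0.7 * w[0] + random.gauss(0, 1) for w in W]
y = [w[0] - w[1] + 0.2 * zi + random.gauss(0, 1) for w, zi in zip(W, z)]

t_full = score_stat(y, z, W, list(range(p)))   # ordinary score test: all covariates
t_oracle = score_stat(y, z, W, support)        # oracle score test: true support only
print(t_full, t_oracle)
```

Each statistic would be referred to a chi-squared distribution with one degree of freedom. A single draw only shows the mechanics; comparing power would require repeating the simulation many times under local alternatives.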
This paper addresses a simple question: does having access to an oracle under the null model improve the local power of the test, i.e., is the oracle score test more powerful than the ordinary score test, which does no model selection under the null? We will show that the answer depends on how (X, S) is correlated with Z: when they are independent, there is no point in having access to an oracle. We will demonstrate, however, that if (X, S) has information-carrying elements that are highly correlated with Z, then having access to an oracle can be valuable.
An outline of this paper is as follows. In Section 2, we develop the general framework and state the main result on the local power of the score test. In Section 3, we discuss the implications of this result for oracle methods. In Section 4, we give simulation evidence suggesting that having access to an oracle may not be much help in a variety of practical problems. In Section 5, we briefly describe a setting in which having access to an oracle is useful, although we point out some peculiar behavior of the Adaptive Lasso implemented with 10-fold cross-validation. All technical details are given in the Appendix.