


Can J Stat. Author manuscript; available in PMC 2010 December 1.

Published in final edited form as:

Can J Stat. 2009 December 1; 37(4): 625–644.

doi: 10.1002/cjs.10039. PMCID: PMC2848082

NIHMSID: NIHMS168156

Rui WANG, Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA;

Rui WANG: rwang@hsph.harvard.edu; Stephen W. LAGAKOS: lagakos@hsph.harvard.edu


When confronted with multiple covariates and a response variable, analysts sometimes apply a variable-selection algorithm to the covariate-response data to identify a subset of covariates potentially associated with the response, and then wish to make inferences about parameters in a model for the marginal association between the selected covariates and the response. If an independent data set were available, the parameters of interest could be estimated by using standard inference methods to fit the postulated marginal model to the independent data set. However, when applied to the same data set used by the variable selector, standard (“naive”) methods can lead to distorted inferences. The authors develop testing and interval estimation methods for parameters reflecting the marginal association between the selected covariates and response variable, based on the same data set used for variable selection. They provide theoretical justification for the proposed methods, present results to guide their implementation, and use simulations to assess and compare their performance to a sample-splitting approach. The methods are illustrated with data from a recent AIDS study.

Let *X*_{1}, *X*_{2}, ···, *X*_{p} denote a set of covariates and *Y* a response variable.

Past work on inference after variable selection has focused on inferences about the parameters in models for the association between *Y* and all *p* candidate covariates. Rather than fitting a single regression model involving all *p* covariates, the idea is to use a variable selection step to reduce the number of covariates and then base inferences on a more parsimonious model which acts as if the selected covariates can fully explain the association of *Y* with all *p* covariates. Several authors have noted that application of standard (“naive”) inference methods to the same data set used for variable selection can lead to invalid inferences, including distorted Type I errors, biased estimators, and confidence intervals with distorted coverage probabilities (cf. Miller 1984; Chatfield 1995; Hurvich & Tsai 1990; Zhang 1992; Kabaila 1995; Pötscher & Novák 1998; Danilov & Magnus 2004; Leeb & Pötscher 2005; Kabaila & Leeb 2006; Giri & Kabaila 2008), as well as overly optimistic prediction errors (cf. Efron 1986; Gong 1986; Breiman 1992; Leeb 2009). Although the bias of naive approaches can become negligible when the probability of selecting all the important covariates approaches 1 as the sample size increases (Pötscher 1991), Chatfield (1995), Leeb & Pötscher (2003, 2005), Leeb (2005), and others have noted that such results are not useful for making inferences in finite samples because severe biases can still occur. Alternative analytical methods, including the bootstrap and jackknife, can sometimes reduce, but do not in general eliminate, these biases (cf. Faraway 1992; Veall 1992; Shen, Huang & Ye 2004). Conditions under which these methods are valid have not been identified, and their poor performance in some settings was noted in Freedman, Navidi & Peters (1988) and Dijkstra & Veldkamp (1988). As Leeb & Pötscher (2005) note, “… a proper theory of inference post model selection is only slowly emerging …”.
Recently, Shen *et al.* (2004) developed approximate inference methods for the regression setting where the response is normally distributed, a consistent or over-consistent variable selector is used (that is, a variable selector that asymptotically selects or includes the true model with probability 1), and the parameters of interest correspond to covariates that are not candidates for exclusion by the variable selector. Leeb (2009) proposed a prediction interval which is approximately valid and short with high probability in finite samples for the linear regression setting where the error and the explanatory variables are jointly normally distributed.

We are unaware of any literature addressing the issue of using the same data set to first select a subset of covariates and then to make inferences for the marginal association between this selected subset and the response. For linear models, the parameters in the marginal model sometimes coincide with those in the model for the association between all covariates and the response, but in general they differ (Cochran 1938, Cox 2007). Naive methods which ignore the fact that the same data set is used both for variable selection and for subsequent inferences are commonly used in practice. As with inferences for the full model following variable selection, naive methods are in general biased in the settings we consider, as we will illustrate. Our interest in inferences about the marginal association is motivated by a common situation in medical research of wishing to understand the relationship between a specific set of covariates and a response, while recognizing that other covariates might also be associated with the response. For example, Lossos *et al.* (2004) examined 36 candidate genes and a disease outcome using a variable selection algorithm to arrive at a subset of 6 genes, and then assessed their association with the disease outcome. Here the scientific goal is to use these selected genes to develop an improved staging system for guiding patient management, and thus the main statistical interest is the marginal association between the selected genes and the clinical outcome, not about whether all the important genes were selected or whether a better variable selector could have been used. Similarly, in determining the genetic correlates of efavirenz hypersusceptibility, Shulman *et al.* (2004) used stepwise regression to identify specific reverse transcriptase (RT) mutations at three codons, and then considered a logistic regression model relating efavirenz hypersusceptibility and the RT mutations at these three codons. 
Having identified these codons, the scientific question is whether they can be used to guide the use of antiretroviral therapy by identifying efavirenz hypersusceptibility, and thus the statistical focus is the marginal association between these three codons and clinical response.

The proposed approach involves transforming the covariate-response data matrix to one that can be partitioned into two components that are independent under a null hypothesis, then forming new matrices by permuting the rows of one component while holding the other component fixed, and then basing inferences on a permutation distribution formed from a specific subset of the resulting matrices. Exact tests and confidence intervals are obtained for some settings and approximate inferences for others. In section 2 we describe the general approach, present the theorem that justifies it, and discuss its implementation. In section 3 we use simulations to assess and compare the performance of the proposed approach to naive methods and to a sample-splitting approach, and in section 4 we illustrate the methods with data from a recent AIDS study. In section 5 we discuss related issues and areas in need of further development. All proofs are included in the Appendix.

We begin this section by introducing notation and conceptually describing the proposed approach. This is followed by a theorem that justifies the approach and results that guide its implementation.

Let 𝒳 = {*X*_{1}, *X*_{2}, ···, *X*_{p}} denote a set of candidate covariates, including interaction terms of interest, and let **D** = (**X**, **Y**) denote the *n* × (*p* + 1) matrix consisting of *n* i.i.d. copies of (*X*_{1}, ···, *X*_{p}, *Y*).

Suppose ℛ : **D** → 𝒮 denotes a variable selector that maps **D** into a subset, 𝒮 = {*X*_{l1}, *X*_{l2}, ···, *X*_{ls}}, of 𝒳. Let *X*_{S} = (*X*_{l1}, *X*_{l2}, ···, *X*_{ls}).

Consider testing a hypothesis, *H*_{0}, about the marginal association of the selected covariates with the response *Y*. The approach consists of the following two steps.

Based on the hypothesis of interest and conditions on the underlying joint distribution of (*X*_{1}, ···, *X*_{p}, *Y*), we first identify a one-to-one transformation, *g*(·), such that **D̃** = *g*(**D**) can be partitioned as **D̃** = (**D̃**_{P}, **D̃**_{F}), where the elements of **D̃**_{P} and **D̃**_{F} are independent under *H*_{0}.

Consider the *n*! matrices of the form $\stackrel{\sim}{\mathbf{D}}(l)=({\stackrel{\sim}{\mathbf{D}}}_{P}^{l},{\stackrel{\sim}{\mathbf{D}}}_{F})$, where ${\stackrel{\sim}{\mathbf{D}}}_{P}^{l}\stackrel{p}{=}{\stackrel{\sim}{\mathbf{D}}}_{P}$ denotes a row permutation of ${\stackrel{\sim}{\mathbf{D}}}_{P}$, and identify the subset, Π_{R}, of these matrices for which ℛ(*g*^{−1}(**D̃**(*l*))) = *X*_{S}. Inferences are then based on the permutation distribution of a test statistic computed over the matrices in Π_{R}.

The validity of the proposed approach is based on the following theorem. In Section 2.3, we will discuss ways to choose the transformation *g*(·) and the partition based on a given hypothesis of interest and conditions on the underlying joint distribution of (*X*_{1}, ···, *X*_{p}, *Y*).

**Theorem 1.** Let **D** = (**X**, **Y**) denote the n × (p + 1) matrix consisting of n i.i.d. copies of (X_{1}, X_{2}, ···, X_{p}, Y) and suppose there exists a one-to-one transformation, g(·), of (X_{1}, X_{2}, ···, X_{p}, Y) such that **D̃** = g(**D**) can be partitioned as **D̃** = (**D̃**_{P}, **D̃**_{F}) for some **D̃**_{P} and **D̃**_{F} whose elements are independent under H_{0}. For any **d** in the support of **D**, define x_{S} = ℛ(**d**) and **d̃** = g(**d**) = (**d̃**_{P}, **d̃**_{F}). For each row permutation,
${\stackrel{\sim}{\mathbf{d}}}_{P}^{l}$, of **d̃**_{P}, define
$\stackrel{\sim}{\mathbf{d}}(l)=({\stackrel{\sim}{\mathbf{d}}}_{P}^{l},{\stackrel{\sim}{\mathbf{d}}}_{F})$ and suppose g^{−1}(**d̃**(l)) is in the support of **D**. Let Π_{R} = {**d̃**(l) | ℛ(g^{−1}(**d̃**(l))) = x_{S}}. Then under H_{0} and for any **d̃**(l) ∈ Π_{R},

$$P(\stackrel{\sim}{\mathbf{D}}=\stackrel{\sim}{\mathbf{d}}(l)\mid {\stackrel{\sim}{\mathbf{D}}}_{P}\stackrel{p}{=}{\stackrel{\sim}{\mathbf{d}}}_{P},{\stackrel{\sim}{\mathbf{D}}}_{F}={\stackrel{\sim}{\mathbf{d}}}_{F},\mathcal{R}(\mathbf{D})={x}_{S})=1/M,$$

where M is the number of matrices in Π_{R}.

By construction, the unpermuted matrix **d̃** = *g*(**d**) is an element of the restricted set Π_{R}. Theorem 1 shows that, under *H*_{0}, the *M* matrices in Π_{R} are equally likely, conditional on the components of the partition and on the selection outcome; this conditional uniform distribution is the basis for the proposed tests and confidence intervals.

To give a flavor of the method, suppose *X* is scalar, so that **D** is the *n* × 2 matrix with *i*^{th} row (*X*_{i}, *Y*_{i}), and suppose the marginal association of *X* and *Y* is given by *Y* = *βX* + *ε*, where *ε* ⫫ *X*.

- (a) First suppose there is no variable selection; that is, ℛ(**D**) ≡ {*X*}, and consider testing the hypothesis *H*_{0}: *β* = 0 of no association between *X* and *Y*. Let *g* denote the identity transformation and partition **D̃** = **D** by taking **D̃**_{P} = **X** and **D̃**_{F} = **Y**. If **X**^{l} denotes a row permutation of **X**, then **D̃**(*l*) = (**X**^{l}, **Y**) and Π_{R} is the set of all *n*! such matrices. Let *T*(**D̃**) be some test statistic, say **X**^{T}**Y**. We can then test *H*_{0} by comparing its observed value to the permutation distribution formed by the *n*! values of *T*(**D̃**(*l*)) = (**X**^{l})^{T}**Y**. The proposed method thus reduces to the classical permutation test for the association between *X* and *Y*. Formally, inference is based on the null permutation distribution $P(\stackrel{\sim}{\mathbf{D}}=\stackrel{\sim}{\mathbf{d}}(l)\mid {\stackrel{\sim}{\mathbf{d}}}_{P}\stackrel{P}{=}\mathbf{x},{\stackrel{\sim}{\mathbf{d}}}_{F}=\mathbf{y},\mathcal{R}(\mathbf{D})=\{X\})=P(\stackrel{\sim}{\mathbf{D}}=({\mathbf{x}}^{l},\mathbf{y})\mid \mathbf{X}\stackrel{p}{=}\mathbf{x},\mathbf{Y}=\mathbf{y})=1/n!$, for each of the *M* = *n*! matrices in Π_{R}.
- (b) Now consider a variable selector that sometimes selects {*X*} and otherwise selects nothing, according to some rule; that is, ℛ(**D**) = {*X*} or ∅. Suppose ℛ(**D**) = {*X*} for a particular **D**, and consider testing *H*_{0}: *β* = 0. We again take *g* to be the identity transformation, so that **D̃**(*l*) = (**X**^{l}, **Y**) as before, but now base inferences on only those **D̃**(*l*) such that ℛ(*g*^{−1}(**D̃**(*l*))) = ℛ((**X**^{l}, **Y**)) = {*X*}; that is, the method applies the classical permutation test but restricted to only those matrices (**X**^{l}, **Y**) for which ℛ selects *X*.
- (c) Finally, consider the same variable selector as in (b) and suppose that ℛ(**D**) = {*X*} for a particular **D**, but now consider testing *H*_{0}: *β* = *β*^{0} for some *β*^{0} ≠ 0. Here we can take *g*(*x*, *y*) = (*x*, *y* − *β*^{0}*x*), so that **D̃** = (**X**, **Y** − *β*^{0}**X**). Note that the two columns of **D̃** are independent under *H*_{0} and that **D̃**(*l*) = (**X**^{l}, **Y** − *β*^{0}**X**). The inverse mapping is *g*^{−1}(*a*, *b*) = (*a*, *b* + *β*^{0}*a*), and thus *g*^{−1}(**D̃**(*l*)) = (**X**^{l}, **Y** − *β*^{0}(**X** − **X**^{l})). If *T*(**D̃**) is some test statistic, the observed value *T*(**D̃**) is then compared to the permutation distribution formed by {*T*(**D̃**(*l*)) | **D̃**(*l*) ∈ Π_{R}} = {*T*(**D̃**(*l*)) | ℛ((**X**^{l}, **Y** − *β*^{0}(**X** − **X**^{l}))) = {*X*}}. Inverting this test yields a confidence interval for *β*.
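To make example (b) concrete, the following is a small Monte Carlo sketch in Python (the paper's own simulations used R). The selector rule, its threshold, and the use of random sampling over permutations rather than full enumeration of the *n*! matrices are our illustrative assumptions, not the paper's; any selector ℛ and statistic *T* could be substituted.

```python
import numpy as np

rng = np.random.default_rng(0)

def selector(x, y, threshold=2.0):
    """Toy stand-in for the variable selector R(D): 'select' the scalar X
    when the absolute t-statistic for its correlation with Y exceeds a
    threshold.  The rule and threshold are illustrative assumptions."""
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    t = r * np.sqrt((n - 2) / (1 - r * r))
    return abs(t) > threshold

def restricted_perm_pvalue(x, y, n_perm=2000):
    """Example (b): permutation test of H0: beta = 0, restricted to those
    row permutations x^l of x for which the selector still selects X.
    Monte Carlo over permutations instead of enumerating all n! of them."""
    t_obs = x @ y                      # test statistic T = X'Y
    kept = []
    for _ in range(n_perm):
        xl = rng.permutation(x)
        if selector(xl, y):            # the restriction defining Pi_R
            kept.append(xl @ y)
    kept = np.asarray(kept)
    # p-value within the restricted permutation distribution; the +1 terms
    # account for the observed matrix, which is always a member of Pi_R
    return (np.sum(np.abs(kept) >= abs(t_obs)) + 1) / (len(kept) + 1)
```

Note that under a strong association most permuted matrices fail the restriction, so the restricted set is far smaller than the full set of sampled permutations.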

In more complex settings than the example discussed above, finding a transformation *g*(·) that leads to a desired partition may not be obvious. The proposed methods are invariant to reversing the roles of **D̃**_{P} and **D̃**_{F}. Two useful sufficient conditions are the following.

**Condition A.** (*X*, *Y*) can be transformed one-to-one to some (*X̃*, *Ỹ*) = (*X̃*_{P}, *X̃*_{F}, *Ỹ*) such that *X̃*_{P} ⫫ (*X̃*_{F}, *Ỹ*) under *H*_{0}.

With this condition, Theorem 1 holds by taking **D̃**_{P} = **X̃**_{P} and **D̃**_{F} = (**X̃**_{F}, **Ỹ**).

**Condition B.** (*X*, *Y*) can be transformed one-to-one to some (*X*, *Ỹ*) so that *Ỹ* ⫫ *X* under *H*_{0}.

With this condition, Theorem 1 holds by taking **D̃**_{P} = **X** and **D̃**_{F} = **Ỹ**.

Suppose the response *Y* is related to the candidate covariates *X*_{1}, ···, *X _{p}* by a standard linear model:

$$Y={\beta}^{\ast}X+{\epsilon}^{\ast},$$

(2.1)

where *ε*^{*} ⫫ *X* and where some of the components of *β*^{*} may be zero. Consider the following conditions:

- (C1) *X*_{S} ⫫ *X*_{\S}.
- (C2) *Y* ⫫ *X*_{\S} | *X*_{S}.
- (C3) The components of *X*_{\S} are continuous and *X*_{\S} = *γ*_{S}*X*_{S} + *ε*_{S}, where *ε*_{S} ⫫ *X*_{S}. A special case is when *X* = (*X*_{1}, ···, *X*_{p}) is normally distributed, in which case ${\gamma}_{S}={\mathrm{\sum}}_{\backslash S,S}{\mathrm{\sum}}_{S}^{-1}$, where Σ_{\S,S} = *cov*(*X*_{\S}, *X*_{S}) and Σ_{S} = *var*(*X*_{S}).

**Proposition 1.** Assume (2.1). If any of (C1), (C2), or (C3) holds, the marginal association between X_{S} and Y is of the form

$$Y={\beta}_{S}{X}_{S}+\epsilon $$

(2.2)

for some β_{S}, where ε ⫫ X_{S}. Furthermore, when (C2) holds, ε ⫫ X.

Note that (2.2) refers to the marginal association of *X*_{S} and *Y*; in general, *β*_{S} differs from the corresponding components of *β*^{*} in (2.1).

Below we give several results specifying conditions under which the proposed methods can be used to make an inference about a global hypothesis (Propositions 2–4) or an individual covariate (Propositions 5–7) from (2.2). Propositions 5–7 generalize in a straightforward way to any subset of the selected covariates (see the Appendix).

Consider the global hypothesis ${H}_{0}:{\beta}_{S}={\beta}_{S}^{0}$ for some ${\beta}_{S}^{0}$.

**Proposition 2.** Suppose that (C1) holds. Let g(X, Y) = (X, Ỹ), where
$\stackrel{\sim}{Y}=Y-{\beta}_{S}^{0}{X}_{S}$, and let **D̃**_{P} = **X**_{S} and **D̃**_{F} = (**X**_{\S}, **Ỹ**). Then g(·) and the partition (**D̃**_{P}, **D̃**_{F}) satisfy the conditions of Theorem 1.

**Proposition 3.** Suppose that (C2) holds under H_{0}. Let g(X, Y) = (X, Ỹ), where
$\stackrel{\sim}{Y}=Y-{\beta}_{S}^{0}{X}_{S}$, and let **D̃**_{P} = **X** and **D̃**_{F} = **Ỹ**. Then g(·) and the partition (**D̃**_{P}, **D̃**_{F}) satisfy the conditions of Theorem 1.

**Proposition 4.** Suppose that (C3) holds. Define Z_{\S} = X_{\S} − γ_{S}X_{S}. Let g(X, Y) = (X_{S}, Z_{\S}, Ỹ), where
$\stackrel{\sim}{Y}=Y-{\beta}_{S}^{0}{X}_{S}$, and define **D̃**_{P} = **X**_{S} and **D̃**_{F} = (**Z**_{\S}, **Ỹ**). Then g(·) and the partition (**D̃**_{P}, **D̃**_{F}) satisfy the conditions of Theorem 1.

When multiple covariates are selected, it is often of interest to test a hypothesis about a single covariate. Without loss of generality, let *X*_{1} ∈ 𝒮, and consider
${H}_{0}:{\beta}_{1}={\beta}_{1}^{0}$ for some
${\beta}_{1}^{0}$. Below we use *X*_{S\1} and *X*_{\1} to denote the random vectors formed by the variables in 𝒮\{*X*_{1}} and 𝒳\{*X*_{1}}, respectively.

**Proposition 5.** Suppose that (C2) holds under H_{0} and X_{1} ⫫ X_{\1}. Let g(X, Y) = (X, Ỹ), where
$\stackrel{\sim}{Y}=Y-{\beta}_{1}^{0}{X}_{1}$, and let **D̃**_{P} = **X**_{1} and **D̃**_{F} = (**X**_{\1}, **Ỹ**). Then g(·) and the partition (**D̃**_{P}, **D̃**_{F}) satisfy the conditions of Theorem 1.

**Proposition 6.** Suppose that (C2) holds under H_{0}, X_{1} is a continuous covariate, and X_{1} = γ_{\1}X_{\1} + ε_{1}, where ε_{1} ⫫ X_{\1}. Define Z_{1} = X_{1} − γ_{\1}X_{\1}. Let g(X, Y) = (Z_{1}, X_{\1}, Ỹ), where
$\stackrel{\sim}{Y}=Y-{\beta}_{1}^{0}{Z}_{1}$, and let **D̃**_{P} = **Z**_{1} and **D̃**_{F} = (**X**_{\1}, **Ỹ**). Then g(·) and the partition (**D̃**_{P}, **D̃**_{F}) satisfy the conditions of Theorem 1.

**Proposition 7.** Suppose (C3) holds, X_{1} is continuous, and X_{1} = δ_{S\1}X_{S\1} + ε_{1}, where ε_{1} ⫫ X_{S\1}. Define Z_{1} = X_{1} − δ_{S\1}X_{S\1} and Z_{\S} = X_{\S} − γ_{1}Z_{1}. Let g(X, Y) = (Z_{1}, X_{S\1}, Z_{\S}, Ỹ), where
$\stackrel{\sim}{Y}=Y-{\beta}_{1}^{0}{Z}_{1}$, and let **D̃**_{P} = **Z**_{1} and **D̃**_{F} = (**X**_{S\1}, **Z**_{\S}, **Ỹ**). Then g(·) and the partition (**D̃**_{P}, **D̃**_{F}) satisfy the conditions of Theorem 1. In the special case where X is normally distributed,
${\delta}_{S\backslash 1}={\mathrm{\sum}}_{1,S\backslash 1}{\mathrm{\sum}}_{S\backslash 1}^{-1}$, where Σ_{1,S\1} = cov(X_{1}, X_{S\1}) and Σ_{S\1} = var(X_{S\1}).

In practice, the covariance matrices in Propositions 4 and 7 are usually not known, in which case they can be replaced by their empirical counterparts, resulting in approximate inferences. Similarly, in Propositions 4, 6, and 7, the *γ*_{S}, *γ*_{\1}, and *δ*_{S\1} can be replaced by their least-squares estimates. The linear representation for *X*_{1} assumed in Propositions 6 and 7 can be assessed empirically by plotting residuals against the values of the other covariates.
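For instance, the least-squares replacement of *γ*_{\1} in Proposition 6 amounts to residualizing *X*_{1} on the other covariates. A minimal sketch in Python (function and variable names are ours; the paper used R):

```python
import numpy as np

def residualize(x1, x_rest):
    """Empirical version of the Proposition-6 transformation: estimate gamma
    in X1 = gamma' X_{\\1} + eps_1 by least squares and return the residual
    Z1 = X1 - gamma-hat' X_{\\1}.  Using gamma-hat instead of the true gamma
    makes the resulting permutation inference approximate."""
    X = np.column_stack([np.ones(len(x1)), x_rest])  # intercept + other covariates
    gamma_hat, *_ = np.linalg.lstsq(X, x1, rcond=None)
    z1 = x1 - X @ gamma_hat
    return gamma_hat, z1
```

By construction the least-squares residual is exactly orthogonal to each column of the design, mimicking in-sample the independence of *ε*_{1} and *X*_{\1} assumed in Proposition 6.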

Each of Propositions 2–7 gives a transformation and partition that satisfy the conditions of Theorem 1 for specific hypotheses and assumptions about the distribution of (*X*_{1}, ···, *X*_{p}, *Y*). Propositions 2, 4, 5, 6, 7 represent applications of Condition A while Proposition 3 represents an application of Condition B. Inferences then follow from the transformation/partition and permutation/restriction steps described in Section 2.2.

In the above, we began by assuming a linear model relating *Y* to the entire set of candidate covariates. Alternatively, we could have begun by assuming a linear model for the marginal association between *Y* and the selected covariates, and then imposed other conditions on the error term of this model and/or on the joint distribution of the covariates to apply the proposed methods.

The proposed methods can sometimes be applied without specifying a particular parametric model. For example, consider the null hypothesis *H*_{0} : *Y* ⫫ *X*_{S} that *Y* and *X*_{S} are independent. If *X*_{S} ⫫ *X*_{\S}, Condition A holds with *X̃*_{P} = *X*_{S}, *X̃*_{F} = *X*_{\S}, and *Ỹ* = *Y*, so that we can apply Theorem 1 with **D̃**_{P} = **X**_{S} and **D̃**_{F} = (**X**_{\S}, **Y**). Alternatively, suppose *Y* ⫫ *X*_{\S} | *X*_{S}, which can be viewed as saying that 𝒮 includes all of the important covariates. Here Condition B would hold under *H*_{0}, so that we can apply Theorem 1 with **D̃**_{P} = **X** and **D̃**_{F} = **Y**.

In this section we use simulations to assess and compare the performance of the proposed methods to the naive and sample-splitting approaches for several specific settings. For a given outcome, *x*_{S}, of the variable selection step, we estimate the Type I error and power for the proposed methods and naive approach by first generating independent data sets and then, from among those that result in *x*_{S} being selected, computing the proportion that reject the null hypothesis. For the sample-splitting approach, we first generate independent data sets and randomly split each in half, using the first half for variable selection. From among those data sets that result in *x*_{S}, we then use the second half to fit the marginal model and record the proportion that reject the null hypothesis. Confidence intervals are obtained similarly; the coverage probability is estimated by the proportion of the resulting intervals that include the true parameter value. All simulations were done using R, version 2.2.0 or later.
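The conditional simulation design described above can be sketched as follows (an illustrative Python harness under our own naming; the paper's simulations were done in R). The callables `gen_data`, `selector`, and `test` are placeholders for a data-generating mechanism, a variable selector ℛ, and a p-value-producing test.

```python
import numpy as np

rng = np.random.default_rng(1)

def conditional_rejection_rate(gen_data, selector, test, target_S,
                               n_sim=2000, alpha=0.05):
    """Estimate the rejection rate *conditional on the selection outcome*:
    generate independent data sets, keep only those for which the variable
    selector returns target_S, and report the proportion of kept data sets
    whose test p-value falls below alpha.  The interface is illustrative,
    not the paper's code."""
    rejections, kept = 0, 0
    for _ in range(n_sim):
        X, y = gen_data(rng)
        if selector(X, y) != target_S:
            continue                   # condition on R(D) = target_S
        kept += 1
        if test(X, y) < alpha:
            rejections += 1
    return rejections / kept if kept else float("nan")
```

With a null data-generating mechanism, the returned proportion estimates the conditional Type I error; under an alternative, it estimates conditional power.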

We begin this section with an example to illustrate the difference between the parameters of interest in this paper and those in models for the association between the response variable and all the candidate covariates. Similar to the setting considered by Shen *et al.* (2004), suppose (*X*_{1}*,X*_{2}*,X*_{3}*,X*_{4}) follows a multivariate normal distribution with mean **0** and covariance matrix having (*i, j*)^{th} element *ρ*^{|i−j|}, and that conditional on (*X*_{1}*,X*_{2}*,X*_{3}*,X*_{4}),

$$Y={\beta}_{1}^{\ast}{X}_{1}+{\beta}_{2}^{\ast}{X}_{2}+{\beta}_{3}^{\ast}{X}_{3}+0\cdot {X}_{4}+{\epsilon}^{\ast},$$

(3.1)

where *ε*^{*} ~ *N*(0, 1) is independent of (*X*_{1},*X*_{2},*X*_{3},*X*_{4}). In Shen *et al.* (2004), the variable selector must always include *X*_{1} and the parameter of interest is
${\beta}_{1}^{\ast}$, regardless of which subset of (*X*_{2},*X*_{3},*X*_{4}) is also selected. Our focus is on the marginal association between the selected covariates and the response. When *X*_{1}, *X*_{2} and *X*_{3} are selected, the parameters we consider are the same as those in (3.1), but otherwise they differ. For example, when only *X*_{1} is selected, we are no longer interested in
${\beta}_{1}^{\ast}$, the coefficient for *X*_{1} from the linear model that adjusts for *X*_{2} and *X*_{3}, but instead in *β*_{1} from the marginal model *Y* = *β*_{1}*X*_{1} + *ε*, where
${\beta}_{1}={\beta}_{1}^{\ast}+\rho {\beta}_{2}^{\ast}+{\rho}^{2}{\beta}_{3}^{\ast},\epsilon ={\beta}_{2}^{\ast}({X}_{2}-\rho {X}_{1})+{\beta}_{3}^{\ast}({X}_{3}-{\rho}^{2}{X}_{1})+{\epsilon}^{\ast}$, and *X*_{1} ⫫ *ε* (Proposition 1).
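The marginal-slope formula can be checked numerically with a quick Monte Carlo (our illustration, in Python rather than the paper's R; the parameter values and large *n* here are ours, chosen only to make the check sharp):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative parameter values (ours)
rho, b1, b2, b3 = 0.5, 0.1, 0.1, 0.01
n = 200_000

idx = np.arange(4)
Sigma = rho ** np.abs(idx[:, None] - idx[None, :])   # (i,j) entry rho^{|i-j|}
X = rng.multivariate_normal(np.zeros(4), Sigma, size=n)
Y = b1 * X[:, 0] + b2 * X[:, 1] + b3 * X[:, 2] + rng.normal(size=n)

# OLS slope of Y on X1 alone estimates the marginal beta1
slope = np.cov(X[:, 0], Y)[0, 1] / np.var(X[:, 0])
target = b1 + rho * b2 + rho**2 * b3   # beta1* + rho*beta2* + rho^2*beta3*
```

The fitted marginal slope agrees with *β*_{1}^{*} + *ρβ*_{2}^{*} + *ρ*^{2}*β*_{3}^{*} up to Monte Carlo error.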

Suppose 𝒮 = {*X*_{1}}. Table 1 displays the empirical Type I errors for testing
${H}_{0}:{\beta}_{1}={\beta}_{1}^{0}$ using the naive approach, the sample-splitting approach, and restricted permutation test A using both actual (*c*) and estimated (*ĉ*) covariances, where
$c={\mathrm{\sum}}_{\backslash S,S}{\mathrm{\sum}}_{S}^{-1}$ (Proposition 4). The variable selector is a step-down procedure (step() function in R), starting with the model that includes all *p* covariates, and using the Akaike Information Criterion (Akaike 1973) to eliminate covariates. Here
${\beta}_{1}^{\ast}=.1,{\beta}_{2}^{\ast}=.1,.05,{\beta}_{3}^{\ast}=.01$, *ρ* = .5, −.5, .8, −.8 and, as in Shen *et al.* (2004), *n* = 22. The Type I error estimates using the restricted permutation test with the actual *c* or sample-splitting are very close to the nominal level of .05; those based on the restricted permutation test using *ĉ* are close to the nominal level and offer a substantial improvement over those obtained from the naive approach.

Suppose (*X*_{1}, ···, *X*_{5}) has a multivariate normal distribution with mean (.1, .2, .3, .4, .5), *X*_{1} ⫫ *X*_{5}, (*X*_{1},*X*_{5}) ⫫ (*X*_{2},*X*_{3},*X*_{4}), *var*(*X*_{j}) = 1 for *j* = 1, ···, 5, and *corr*(*X*_{j},*X*_{k}) = .1 for *j, k* = 2, 3, 4 and *j* ≠ *k*. Assume that the model for *Y*, given (*X*_{1},*X*_{2}, ···, *X*_{5}), is
$Y=.5+{\beta}_{1}^{\ast}{X}_{1}+{\epsilon}^{\ast}$, where *ε*^{*} ~ *N*(0, 1) is independent of (*X*_{1}, ···, *X*_{5}). We take *n* = 100 and use the same variable selector as in section 3.1.1.

Suppose 𝒮 = {*X*_{1},*X*_{5}}. From Proposition 1, the marginal association of (*X*_{1},*X*_{5}) and *Y* is given by *Y* = *β*_{0} + *β*_{1}*X*_{1} + *β*_{5}*X*_{5} + *ε*. Consider testing the hypothesis *H*_{0} : *β*_{1} = *β*_{5} = 0 using the two-degrees-of-freedom likelihood ratio test statistic. Figure 1 gives the power of restricted permutation test A (Proposition 2), permutation test B (Proposition 3), and sample-splitting for increasing values of *β*_{1} and for *β*_{5} = 0. Results are based on 1000 simulations for Type I error and 200 simulations for power. The Type I error of the naive test is highly distorted (= 0.559), whereas those for the permutation and sample-splitting approaches are very close to the nominal .05 level. The power curves of the two permutation tests are similar, especially in power ranges of typical interest, and larger than that of the sample-splitting approach.

Suppose *p* = 3 and *X* = (*X*_{1},*X*_{2},*X*_{3}) has a multivariate normal distribution with mean vector (.1, .2, .3) and covariance matrix Σ = (*σ*_{ij}), where *σ*_{ii} = 1 and *σ*_{ij} = *ρ* for *i* ≠ *j*. Assume the conditional distribution of *Y*, given (*X*_{1},*X*_{2},*X*_{3}), is given by *Y* = .5 + .3*X*_{1} + 0*X*_{2} + 0*X*_{3} + *ε*^{*}, where *ε*^{*} ~ *N*(0, 1) is independent of (*X*_{1},*X*_{2},*X*_{3}), and consider the same variable selector as in section 3.1.1. Suppose ℛ selects {*X*_{1},*X*_{2}}, and we wish to test hypotheses about parameters from the marginal model *Y* = *β*_{0} + *β*_{1}*X*_{1} + *β*_{2}*X*_{2} + *ε* (Proposition 1).

To test
${H}_{0}:{\beta}_{1}={\beta}_{1}^{0}$, we use Proposition 7 and define *Z*_{1} = *X*_{1} − *c*_{1}*X*_{2}, where *c*_{1} = *cov*(*X*_{1}*,X*_{2})/*var*(*X*_{2}) =*ρ*, and *Z*_{3} = *X*_{3} − *c*_{2}*Z*_{1}, where *c*_{2} = *cov*(*X*_{3}, *Z*_{1})/*var*(*Z*_{1}) = *ρ*/(1 + *ρ*). Let
$\stackrel{\sim}{Y}=Y-{\beta}_{1}^{0}{Z}_{1}$. To test
${H}_{0}:{\beta}_{2}={\beta}_{2}^{0}$, we transform *X*_{2} to *Z*_{2} = *X*_{2} − *c*_{1}*X*_{1}, where *c*_{1} = *cov*(*X*_{1}*,X*_{2})/*var*(*X*_{1}) = *ρ*, and *Z*_{3} = *X*_{3} − *c*_{2}*Z*_{2}, where *c*_{2} = *cov*(*X*_{3}, *Z*_{2})/*var*(*Z*_{2}) = *ρ*/(1+*ρ*), and
$\stackrel{\sim}{Y}=Y-{\beta}_{2}^{0}{Z}_{2}$. Note that we use a different transformation for testing *β*_{2} to ensure that this remains the coefficient being tested after the transformation.

Table 2 gives the empirical Type I errors for the naive test, the restricted permutation test A, and the sample-splitting approach for different choices of *n* and *ρ*, and using the actual (*c*) and empirically-estimated (*ĉ*) values of *c* = (*c*_{1}, *c*_{2}), based on the test statistic
${\sum}_{j=1}^{n}({Y}_{j}-\overline{Y})({Z}_{ij}-{\overline{Z}}_{i})$, where *i* = 1, 2 refer to *Z*_{1} and *Z*_{2}. The permutation test using the actual values of *c* and the sample-splitting approach lead to Type I errors that are very close to the nominal .05 level. Use of empirical estimates leads to somewhat conservative tests while the naive test gives distorted Type I errors.
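The transformation for testing *β*_{1} can be sketched as follows (illustrative Python with *ρ* treated as known; function names are ours):

```python
import numpy as np

def transform_for_beta1(X, y, rho, beta1_0):
    """Transformations from the text for testing H0: beta1 = beta1^0 when
    {X1, X2} is selected: Z1 = X1 - c1*X2 with c1 = rho, Z3 = X3 - c2*Z1
    with c2 = rho/(1 + rho), and Y~ = Y - beta1^0 * Z1.  Here rho is
    treated as known (Proposition 7); replacing it by an estimate gives a
    somewhat conservative, approximate inference."""
    x1, x2, x3 = X[:, 0], X[:, 1], X[:, 2]
    z1 = x1 - rho * x2
    z3 = x3 - (rho / (1.0 + rho)) * z1
    y_t = y - beta1_0 * z1
    return z1, z3, y_t

def centered_stat(y, z):
    """Test statistic sum_j (Y_j - Ybar)(Z_j - Zbar) from the text."""
    return np.sum((y - y.mean()) * (z - z.mean()))
```

One can verify that *Z*_{1} is uncorrelated with *X*_{2} and *Z*_{3} is uncorrelated with *Z*_{1} under the stated covariance structure, which is what the choices *c*_{1} = *ρ* and *c*_{2} = *ρ*/(1 + *ρ*) accomplish.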

Suppose that we have *p* binary covariates *X*_{1}, ···, *X*_{p}. Define *p*_{j} = *P*(*X*_{j} = 1) and define *β*_{k} to be the difference in mean response for the two levels of *X*_{k}; that is, *β*_{k} = *E*(*Y*|*X*_{k} = 0) − *E*(*Y*|*X*_{k} = 1), *k* = 1, 2, ···, *p*. Suppose that ℛ(**D**) = {*X*_{j}} for some *j* ∈ {1, 2, ···, *p*} and we want to test
${H}_{0}:{\beta}_{j}={\beta}_{j}^{0}$. A natural transformation to eliminate the mean location difference induced by *X*_{j} under *H*_{0} is
$\stackrel{\sim}{Y}=Y+{\beta}_{j}^{0}{X}_{j}$, since then we have *E*(*Ỹ*|*X*_{j} = 0) = *E*(*Ỹ*|*X*_{j} = 1).

Suppose that *X*_{3}, ···, *X*_{p} are mutually independent and independent of (*X*_{1}*,X*_{2}), and *corr*(*X*_{1}*,X*_{2}) = *ρ*. Suppose that *f*(*Y*|*X*_{1}, ···, *X*_{p}) = *f*(*Y*|*X*_{1}) and *f*(*y*|*X*_{1} = *x*) = *f*_{0}(*y* − *xβ*) for some *β* and density *f*_{0}. It is easily verified that
${\beta}_{1}=\beta ,\phantom{\rule{0.16667em}{0ex}}{\beta}_{2}=\beta \rho \sqrt{{p}_{1}(1-{p}_{1})}/\sqrt{{p}_{2}(1-{p}_{2})}$, and *β*_{j} = 0 for *j* = 3, 4, ···, *p*.

When *X*_{1} is selected, Condition (C2) holds and we can apply Proposition 3 (Permutation Test B). When *X*_{j} (*j* ≥ 3) is selected, Condition (C1) holds and we can apply Proposition 2 (Permutation Test A). Now suppose *X*_{2} is selected. When *ρ* ≠ 0, *X*_{2} is marginally associated with *Y* because of its correlation with *X*_{1}. Condition (C1) does not hold because *X*_{2} is not independent of *X*_{1}. Neither does Condition (C2) hold for this *Ỹ*, because *Ỹ* is not independent of *X* (*Var*(*Ỹ*|*X*_{2}) ≠ *Var*(*Ỹ*)). We include this case below to assess the robustness of the permutation tests when the conditions for their validity are not satisfied.

Consider the variable selector defined by ℛ(**D**) = {*X*_{j}} if

Figure 2: Type I Error Estimates. Panel (a): 𝒮 = {*X*_{1}}, testing ${H}_{0}:{\beta}_{1}={\beta}_{1}^{0}$. Panel (b): 𝒮 = {*X*_{2}}, testing ${H}_{0}:{\beta}_{2}={\beta}_{2}^{0}$, where ${\beta}_{2}=2\rho {\beta}_{1}\sqrt{{p}_{1}(1-{p}_{1})}$. Panel (c): 𝒮 = {*X*_{3}}, testing ${H}_{0}:{\beta}_{3}={\beta}_{3}^{0}$.

In panel (a), the naive test is severely biased for smaller values of *β*_{1} and improves for increasing *β*_{1} because the probability of selecting *X*_{1} approaches 1. The Type I error rates for both permutation tests and for the sample-splitting approach are close to the nominal values for all values of *β*_{1}, even though the conditions for restricted permutation test A are not met. The small distortion for permutation test A is in part due to the small dependence between *X*_{1} and other covariates.

In panel (b), where *X*_{2} is selected, the naive test has highly distorted Type I error for all *β*_{1}. The conditions for both permutation tests are violated, and their Type I errors climb above .05 when *β*_{1} > .5 (*β*_{2} > .06), increasing to .15 when *β*_{1} = 1 (*β*_{2} = .12). Compared to the case where *X*_{1} is selected, the larger distortion using permutation test A is likely due to the fact that *X*_{2} is dependent on both *X*_{1} and *Ỹ*, while when = {*X*_{1}}, *X*_{1} is dependent only on *X*_{2}.

In panel (c), where *X*_{3} is selected, the curves represent the Type I errors for the hypothesis *H*_{0}: *β*_{3} = 0. Here the condition for permutation test A is satisfied and the actual Type I error is very close to the nominal value. The condition for permutation test B is not satisfied, resulting in an inflated Type I error which begins for *β*_{1} ≈ .4 and increases to about .15 when *β*_{1} = 1. This occurs because as *β*_{1} increases, the association between *Ỹ* = *Y* and *X*_{1} increases, resulting in stronger dependency between **D̃**_{P} = **X** and **D̃**_{F} = **Ỹ**.

We compared the power of permutation test B to the sample-splitting approach for testing *H*_{0}: *β*_{1} = 0 when *X*_{1} is selected, for *β*_{1} ranging from 0 to 1.0, and for different choices of *p*_{1}, *p*_{2}, *ρ*, and *n*. Results are based on 2000 simulations for Type I error and 1000 simulations for power. In all cases that we examined, the power of the restricted permutation test exceeds that of the sample-splitting approach. A typical pattern of the comparisons is shown in Figure 3 for *p*_{1} = *p*_{2} = .1 and varying *ρ* (= 0, .2, .5) and *n* (= 100, 500). Very similar results were obtained for other choices of parameter values (data available upon request). Also shown in Figure 3 are the power curves for the unachievable test of association between *X*_{1} and *Y* without model selection (that is, if we fit the marginal model relating *X*_{1} and *Y* without a variable selection step). The lower power of the restricted permutation test relative to the test that does not undertake model selection can be viewed as the “cost” of variable selection.

Figure 3: Power Comparison of Permutation Test B (Solid) to Sample-Splitting (Dotted), when 𝒮 = {*X*_{1}} and Testing *H*_{0}: *β*_{1} = 0 in Example 1, for *p*_{1} = *p*_{2} = .1, with *n* = 500 (panels (a)–(c)) or 100 (panels (d)–(f)) and *ρ* = 0, .2, .5.

Table 3 summarizes the performance of the nominal 95% confidence intervals obtained from the naive approach, the sample-splitting approach, and from permutation tests A and B when *n* = 100, 500 and *β*_{1} = .05, .2. The shaded areas represent settings where the conditions (specified in Propositions 2 and 3) for the restricted permutation tests are satisfied, and thus the confidence intervals are exact. For the remaining values, the corresponding coverage results reflect the robustness of the methods to violations of these conditions. The coverage of the naive confidence intervals is often substantially lower than the nominal value. The coverage of the restricted permutation intervals is very close to the nominal value when the conditions for these tests hold, and offers a substantial improvement over naive methods when they are violated. Note also that the widths of the restricted permutation intervals are only moderately greater than those of the naive intervals. Because the width of the naive approach coincides with the width for the approach without variable selection, the relative increase in the width of the permutation intervals can be viewed as the “cost” paid for applying this variable selector. While the coverage for the sample-splitting approach is always close to the nominal value, its intervals are wider than those obtained from the restricted permutation methods, reflecting a loss of efficiency because only 50% of the sample was used in their construction.

The proposed methods allow comparisons of different variable selectors. Consider the same setting as above but now with *p* = 3 covariates, and with *p*_{1} = .1, *p*_{2} = .2, *p*_{3} = .3, *ρ* = .5, and *β*_{1} = 0, .2. Let ℛ_{1} be the selector used in Section 3.2.1, and let ℛ_{2} select those covariates whose marginal association with *Y* yields a t-statistic greater in absolute value than 1.96. Table 4 gives the actual coverage probabilities of the nominal 95% confidence intervals from the naive approach, the sample-splitting approach, and permutation test B when the selected set is {*X*_{1}}. The coverage probabilities of the intervals obtained from the restricted permutation methods and the sample-splitting approach are all very close to the nominal levels, while those of the naive approach fall below the nominal values. The intervals obtained from sample-splitting are wider than those obtained from the restricted permutation methods for both variable selectors. The permutation intervals following use of ℛ_{2} are about 10% wider than those following use of ℛ_{1}. For ℛ_{2}, where the distortion of the naive intervals is greater, the exact permutation intervals are 16%–18% wider than the naive intervals. In contrast, for ℛ_{1}, where the naive intervals are less distorted, the exact permutation intervals are only 3%–5% wider. Because the expected width of the naive intervals is the same as if there were no variable selection, one can view the “cost” of correcting for variable selection as being greater for ℛ_{2} than for ℛ_{1}.

We illustrate the proposed methods using results from an immunological study conducted by the AIDS Clinical Trials Group (Malhotra *et al*., 2004) to assess the association between several immunological markers measured at enrollment and *Y*, the change in CD4+ T-cell count by study week 24. There were *n* = 59 patients and *p* = 15 covariates: nine immunological markers obtained from flow cytometry, four stimulation indices obtained from lymphocyte proliferation (LP) assays, and two binary variables describing which of three treatments a patient received. The nine immunological markers were the percentage of CD8 T cells among lymphocytes; the percentages of CD4+ T cells expressing naive markers (nCD4%), expressing activation markers, coexpressing CD28+, and coexpressing Fas; and the percentages of CD8+ T cells expressing naive markers (nCD8%), expressing activation markers, coexpressing CD28+, and coexpressing Fas (CD8+95+%). The four stimulants used in the LP assays were baculovirus control protein, Candida, baculovirus-expressed recombinant HIV-1_{LAI} p24, and HIV-1

We considered the variable selector that separately examines the marginal association of each covariate with *Y*, using least squares, and selects all covariates showing some evidence of association with *Y*, defined as a significance level (p-value) of .10 or less. This led to the selection of three covariates: nCD4% (p=.002), nCD8% (p=.036), and CD8+95+% (p=.021). We then considered a linear model for the marginal association of these covariates with *Y*. As seen in the left side of Table 5, the naive multivariate (least-squares) analysis indicates that *Y* is significantly associated with nCD4% (p=.005), possibly associated with CD8+95+% (p=.09), and not associated with nCD8% (p=.37). The disparate p-values for the univariate and multivariate analyses of nCD8% are likely due to its correlation (.38) with nCD4%. The naive likelihood ratio test of the global hypothesis that none of the three covariates is associated with *Y* gives *p* < .001.
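The marginal-screening selector described above can be sketched in a few lines. The following is an illustrative implementation, not the code used in the study; the function name `select_marginal` and the .10 cutoff default are assumptions for the sketch.

```python
# Illustrative sketch of a marginal-screening variable selector: each covariate
# is regressed on Y separately by least squares, and all covariates whose
# univariate slope test gives a p-value <= alpha are retained.
import numpy as np
from scipy import stats

def select_marginal(X, y, alpha=0.10):
    """Return indices of covariates whose marginal p-value is <= alpha."""
    selected = []
    for j in range(X.shape[1]):
        result = stats.linregress(X[:, j], y)   # univariate least squares
        if result.pvalue <= alpha:
            selected.append(j)
    return selected
```

A liberal cutoff such as .10 makes it more plausible that all important covariates are captured, at the price of admitting some noise covariates.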

We subsequently analyzed the marginal association between the three selected covariates and response using the restricted permutation methods (Table 5, right). We employed Proposition 3 to assess the global hypothesis that none were associated with response, using the log-likelihood ratio statistic. Use of Proposition 3, which assumes that the selected covariates can fully explain the association between *Y* and all the covariates (Condition C2), was motivated by the liberal criterion (*p* ≤ .10) used to select covariates. The global permutation test provides evidence (p=.026) against the hypothesis that all regression coefficients were zero. We then employed Proposition 6 to assess the individual covariates. Proposition 6 also assumes that each of the 3 selected covariates can be expressed as a linear combination of the remaining covariates plus an unrelated error term. This assumption was empirically supported by examining scatter-plots of the residuals from each fitted model relating a selected covariate with the remaining covariates. As a test statistic, we used the least-squares estimator of the coefficient of *Z*_{1} (see Proposition 6) in the model relating *Ỹ* with *Z*_{1} and the other two selected covariates. After correcting for variable selection, the association of nCD4% and response remained significant (p=.027), the association for CD8+95+% became marginally significant (p=.051) and the association for nCD8% remained non-significant. The 95% confidence intervals obtained from the restricted permutation methods are wider than the naive confidence intervals, incorporating the additional uncertainty about the regression coefficients due to variable selection.
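The restricted permutation analysis can be sketched schematically. The code below is a simplified illustration (not the study code) of a global-null test in the spirit of Proposition 3: the response is permuted against the covariates, the variable selector is re-applied to each permuted data set, and only permutations that reproduce the original selected set contribute to the reference distribution. `selector` and `statistic` are assumed user-supplied functions; the names are hypothetical.

```python
import numpy as np

def restricted_perm_pvalue(X, y, selector, statistic, n_perm=2000, seed=0):
    """Permutation p-value computed over the restricted set of permutations
    that reproduce the original variable-selection outcome."""
    rng = np.random.default_rng(seed)
    s0 = selector(X, y)                      # selection outcome on observed data
    t0 = statistic(X[:, s0], y)              # observed test statistic
    kept = []
    for _ in range(n_perm):
        yp = y[rng.permutation(len(y))]      # permute the response against X
        if selector(X, yp) == s0:            # restriction: same selected set
            kept.append(statistic(X[:, s0], yp))
    # add-one correction, a standard finite-sample guard for permutation tests
    return (1 + sum(t >= t0 for t in kept)) / (1 + len(kept))
```

The restriction step is what distinguishes this from a naive permutation test: discarded permutations are those for which the selector would have chosen a different set of covariates.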

With *p* = 15 covariates and only *n* = 59 observations, most analysts would likely not use a sample-splitting approach to analyze these data, out of concern that some important covariates might not be selected, and that, even if selected, there might be inadequate power to demonstrate their association with *Y*. Using the same variable selector and randomly splitting the data into variable-selection (n=30) and testing (n=29) sets resulted in nCD4% being selected 87% of the time. When the same three covariates were selected as with the full data, nCD4% failed to reach significance at the .05 level in 32% of the splits.
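For comparison, the sample-splitting approach referred to above can be sketched as follows. This is an illustrative implementation under assumed names (`split_sample_fit`, a user-supplied `selector`): the selector sees one random half of the data, and the marginal least-squares model for the selected covariates is fit to the other half, so standard inference applies to the second-half fit.

```python
import numpy as np

def split_sample_fit(X, y, selector, seed=0):
    """Select variables on one random half of the data and fit the marginal
    least-squares model for the selected covariates on the other half."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    half = len(y) // 2
    sel_idx, fit_idx = idx[:half], idx[half:]
    s = selector(X[sel_idx], y[sel_idx])          # variable-selection half
    if not s:
        return s, None                            # nothing selected
    Xs = np.column_stack([np.ones(len(fit_idx)), X[fit_idx][:, s]])
    beta, *_ = np.linalg.lstsq(Xs, y[fit_idx], rcond=None)
    return s, beta                                # fit on the independent half
```

Because the fitting half never influenced selection, naive standard errors and p-values are valid there; the price, as the text notes, is that each step uses only about half the sample.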

Despite the recognized biases resulting from the use of naive inference methods that do not account for variable selection, such methods are still commonly used in practice, in large part due to a lack of alternatives. The methods proposed in this paper, based on restricted permutation tests, provide exact or approximate inferences about the marginal association between a subset of covariates and a response variable under specific conditions. The methods are not tied to any particular variable selector and do not require that inference be restricted to covariates that are not candidates for exclusion by the variable selector. The proposed methods do not apply in all settings, as indicated by the assumptions made in the Propositions. An advantage, however, is that the conditions for their validity can be specified, and most can be checked empirically. When a lenient (non-parsimonious) variable selector is used, it is more likely that all of the important covariates in the full model will be selected. In such cases, the regression coefficient for a covariate in the marginal model is the same as its coefficient in the full model, and hence the proposed methods can be used to make inferences about associations in the overall model. However, when some important covariates are not selected, the regression coefficients in the full and marginal models will not, in general, be the same.

The results in Figures 1 and 3 suggest that the proposed methods can provide more powerful tests than a sample-splitting approach. Because sample-splitting does not require the assumptions made by the restricted permutation methods, it would be preferred when the sample size is sufficiently large. However, in many settings it may not be feasible to split a sample, in which case the proposed methods offer an appealing alternative. Further investigation of the relative efficiency of these approaches would be worthwhile. For small or moderate sample sizes, a related advantage of the proposed methods over sample-splitting is that, because they use the full sample, important covariates are more likely to be identified.

Several considerations arise in the implementation of the methods. First, rather than enumerating all *n*! permutations to identify the restricted subset Π_{R}, it is sufficient to sample permutations at random and retain those that fall in Π_{R}.
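This sampling strategy can be sketched as follows; the names are illustrative, and `in_restricted_set` stands in for the check that a permuted, transformed data matrix yields the same selection outcome as the original data.

```python
import numpy as np

def sample_restricted_set(n, in_restricted_set, target=1000,
                          max_draws=10**6, seed=0):
    """Draw random permutations of n indices and keep those lying in the
    restricted set, stopping once `target` have been retained (or after
    `max_draws` attempts, whichever comes first)."""
    rng = np.random.default_rng(seed)
    kept, draws = [], 0
    while len(kept) < target and draws < max_draws:
        perm = rng.permutation(n)
        draws += 1
        if in_restricted_set(perm):
            kept.append(perm)
    return kept, draws
```

The ratio `len(kept) / draws` also estimates the probability that a random permutation respects the restriction, which indicates how computationally demanding a given variable selector makes the procedure.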

When a specific covariate is known to be of interest a priori, the variable selector would be chosen to always include this covariate. For example, in a randomized clinical trial, the covariate denoting the treatment effect would always be included by the variable selector, and other candidate covariates might be selected in the hope of obtaining a more efficient inference about the treatment effect. For linear models, the randomization of treatments ensures that the regression coefficient for the treatment effect is the same regardless of which additional covariates are selected. The proposed procedure would produce valid inference for the treatment effect conditional on each of the candidate models, and therefore would also produce valid inference unconditionally. The proposed methods can also be used if one performs variable selection and then wishes to make an inference about a covariate that was not selected. For example, if the covariate *X*_{2} is selected and one wishes to make an inference about the marginal association between *Y* and the unselected covariate *X*_{1}, the methods still apply, except that the restriction set now consists of all permuted data matrices (properly transformed) that lead to the same selection outcome ({*X*_{2}}) as the original data set.

Throughout we have assumed that the explanatory variables are random. In some settings, such as when Condition B can be utilized, the proposed methods can be applied directly for fixed **X**. In other settings, such as when one wishes to make inferences about an individual coefficient in the regression model, we believe the proposed approach can be modified by adapting permutation methods for fixed **X** developed for settings without variable selection (cf: Huh & Jhun 2001). However, careful investigation and evaluation of the properties of such methods for fixed **X** are needed.

Although we have considered linear regression models, the proposed methods can in principle be applied with any model for which an appropriate transformation and partition can be identified. Extensions for making inferences about discrete covariates in linear regression and about parameters in a Cox model for survival data are under development. It would also be useful to undertake further assessments of the robustness of the proposed methods, to develop criteria for selecting a test statistic, and to develop computationally efficient algorithms, especially when a computationally intensive variable selector is employed (cf: DiRienzo *et al*. 2003).

This research was supported by grants from the US National Institute of Allergy and Infectious Diseases. We thank Professor Paul Gustafson, the Associate Editor and a referee for their comments which have led to an improved version of the paper.

Under *H*_{0}, for any **d̃**(*l*) and **d̃**(*m*) in Π_{R},

$$\begin{array}{l}P(\stackrel{\sim}{\mathbf{D}}=\stackrel{\sim}{\mathbf{d}}(l)\mid {\stackrel{\sim}{\mathbf{D}}}_{P}\stackrel{p}{=}{\stackrel{\sim}{\mathbf{d}}}_{P},{\stackrel{\sim}{\mathbf{D}}}_{F}={\stackrel{\sim}{\mathbf{d}}}_{F},\mathcal{R}(\mathbf{D})={x}_{S})\\ =P({g}^{-1}(\stackrel{\sim}{\mathbf{D}})={g}^{-1}({\stackrel{\sim}{\mathbf{d}}}_{P}^{l},{\stackrel{\sim}{\mathbf{d}}}_{F})\mid {\stackrel{\sim}{\mathbf{D}}}_{P}\stackrel{p}{=}{\stackrel{\sim}{\mathbf{d}}}_{P},{\stackrel{\sim}{\mathbf{D}}}_{F}={\stackrel{\sim}{\mathbf{d}}}_{F},\mathcal{R}(\mathbf{D})={x}_{S})\\ =P(\mathbf{D}={g}^{-1}({\stackrel{\sim}{\mathbf{d}}}_{P}^{l},{\stackrel{\sim}{\mathbf{d}}}_{F})\mid {\stackrel{\sim}{\mathbf{D}}}_{P}\stackrel{p}{=}{\stackrel{\sim}{\mathbf{d}}}_{P},{\stackrel{\sim}{\mathbf{D}}}_{F}={\stackrel{\sim}{\mathbf{d}}}_{F},\mathcal{R}(\mathbf{D})={x}_{S})\\ =\frac{P(\mathbf{D}={g}^{-1}({\stackrel{\sim}{\mathbf{d}}}_{P}^{l},{\stackrel{\sim}{\mathbf{d}}}_{F}),\mathcal{R}(\mathbf{D})={x}_{S}\mid {\stackrel{\sim}{\mathbf{D}}}_{P}\stackrel{p}{=}{\stackrel{\sim}{\mathbf{d}}}_{P},{\stackrel{\sim}{\mathbf{D}}}_{F}={\stackrel{\sim}{\mathbf{d}}}_{F})}{P(\mathcal{R}(\mathbf{D})={x}_{S}\mid {\stackrel{\sim}{\mathbf{D}}}_{P}\stackrel{p}{=}{\stackrel{\sim}{\mathbf{d}}}_{P},{\stackrel{\sim}{\mathbf{D}}}_{F}={\stackrel{\sim}{\mathbf{d}}}_{F})}\\ =\frac{P(\mathbf{D}={g}^{-1}({\stackrel{\sim}{\mathbf{d}}}_{P}^{l},{\stackrel{\sim}{\mathbf{d}}}_{F}),\mathcal{R}({g}^{-1}({\stackrel{\sim}{\mathbf{d}}}_{P}^{l},{\stackrel{\sim}{\mathbf{d}}}_{F}))={x}_{S}\mid {\stackrel{\sim}{\mathbf{D}}}_{P}\stackrel{p}{=}{\stackrel{\sim}{\mathbf{d}}}_{P},{\stackrel{\sim}{\mathbf{D}}}_{F}={\stackrel{\sim}{\mathbf{d}}}_{F})}{P(\mathcal{R}(\mathbf{D})={x}_{S}\mid {\stackrel{\sim}{\mathbf{D}}}_{P}\stackrel{p}{=}{\stackrel{\sim}{\mathbf{d}}}_{P},{\stackrel{\sim}{\mathbf{D}}}_{F}={\stackrel{\sim}{\mathbf{d}}}_{F})}\\ 
=\frac{P(\mathbf{D}={g}^{-1}({\stackrel{\sim}{\mathbf{d}}}_{P}^{m},{\stackrel{\sim}{\mathbf{d}}}_{F}),\mathcal{R}({g}^{-1}({\stackrel{\sim}{\mathbf{d}}}_{P}^{m},{\stackrel{\sim}{\mathbf{d}}}_{F}))={x}_{S}\mid {\stackrel{\sim}{\mathbf{D}}}_{P}\stackrel{p}{=}{\stackrel{\sim}{\mathbf{d}}}_{P},{\stackrel{\sim}{\mathbf{D}}}_{F}={\stackrel{\sim}{\mathbf{d}}}_{F})}{P(\mathcal{R}(\mathbf{D})={x}_{S}\mid {\stackrel{\sim}{\mathbf{D}}}_{P}\stackrel{p}{=}{\stackrel{\sim}{\mathbf{d}}}_{P},{\stackrel{\sim}{\mathbf{D}}}_{F}={\stackrel{\sim}{\mathbf{d}}}_{F})},\end{array}$$

because of the row independence of **D̃** and the independence of **D̃**_{P} and **D̃**_{F}. Therefore,

$$\begin{array}{l}P(\stackrel{\sim}{\mathbf{D}}=\stackrel{\sim}{\mathbf{d}}(l)\mid {\stackrel{\sim}{\mathbf{D}}}_{P}\stackrel{p}{=}{\stackrel{\sim}{\mathbf{d}}}_{P},{\stackrel{\sim}{\mathbf{D}}}_{F}={\stackrel{\sim}{\mathbf{d}}}_{F},\mathcal{R}(\mathbf{D})={x}_{S})\\ =\frac{P(\mathbf{D}={g}^{-1}({\stackrel{\sim}{\mathbf{d}}}_{P}^{m},{\stackrel{\sim}{\mathbf{d}}}_{F}),\mathcal{R}(\mathbf{D})={x}_{S}\mid {\stackrel{\sim}{\mathbf{D}}}_{P}\stackrel{p}{=}{\stackrel{\sim}{\mathbf{d}}}_{P},{\stackrel{\sim}{\mathbf{D}}}_{F}={\stackrel{\sim}{\mathbf{d}}}_{F})}{P(\mathcal{R}(\mathbf{D})={x}_{S}\mid {\stackrel{\sim}{\mathbf{D}}}_{P}\stackrel{p}{=}{\stackrel{\sim}{\mathbf{d}}}_{P},{\stackrel{\sim}{\mathbf{D}}}_{F}={\stackrel{\sim}{\mathbf{d}}}_{F})}\\ =P({g}^{-1}(\stackrel{\sim}{\mathbf{D}})={g}^{-1}({\stackrel{\sim}{\mathbf{d}}}_{P}^{m},{\stackrel{\sim}{\mathbf{d}}}_{F})\mid {\stackrel{\sim}{\mathbf{D}}}_{P}\stackrel{p}{=}{\stackrel{\sim}{\mathbf{d}}}_{P},{\stackrel{\sim}{\mathbf{D}}}_{F}={\stackrel{\sim}{\mathbf{d}}}_{F},\mathcal{R}(\mathbf{D})={x}_{S})\\ =P(\stackrel{\sim}{\mathbf{D}}=\stackrel{\sim}{\mathbf{d}}(m)\mid {\stackrel{\sim}{\mathbf{D}}}_{P}\stackrel{p}{=}{\stackrel{\sim}{\mathbf{d}}}_{P},{\stackrel{\sim}{\mathbf{D}}}_{F}={\stackrel{\sim}{\mathbf{d}}}_{F},\mathcal{R}(\mathbf{D})={x}_{S}).\end{array}$$

Since **d̃**(*l*) and **d̃**(*m*) are arbitrary, it follows that this common probability equals 1/*M*, where *M* is the number of matrices in Π_{R}.

Rewrite (2.1) as
$Y={\beta}_{S}^{\ast}{X}_{S}+{\beta}_{\backslash S}^{\ast}{X}_{\backslash S}+{\epsilon}^{\ast}={\beta}_{S}{X}_{S}+\epsilon $. Under (C1),
${\beta}_{S}={\beta}_{S}^{\ast}$ (Cochran 1938) and
$\epsilon ={\beta}_{\backslash S}^{\ast}{X}_{\backslash S}+{\epsilon}^{\ast}$. (C1) and the independence of *ε*^{*} and *X* = (*X*_{S}, *X*_{\S}) then imply that *ε* is independent of *X*_{S}.

In Propositions 2–7^{*} below, it is easily verified that *g* is one-to-one and that, for any **d** in the support of **D**, *g*^{−1}(**d̃**(*l*)) is in the support of **D**. Thus it is sufficient to show that **D̃** has independent rows and that **D̃**_{P} and **D̃**_{F} are independent.

Suppose (C1) holds. Under *H*_{0},
$\stackrel{\sim}{Y}=Y-{\beta}_{S}{X}_{S}={\beta}_{\backslash S}^{\ast}{X}_{\backslash S}+{\epsilon}^{\ast}$ (see the Proof for Lemma 1), and therefore *Ỹ* ⊥ *X*_{S}.

Under *H*_{0}, if (C2) holds, then *Ỹ* = *ε*^{*}, which is independent of *X*.

Suppose (C3) holds. Under *H*_{0},
$\stackrel{\sim}{Y}={\beta}_{\backslash S}^{\ast}{Z}_{\backslash S}+{\epsilon}^{\ast}$ (see the Proof for Proposition 1), and thus *Ỹ* ⊥ *X*_{S}.

The following propositions, denoted with asterisks, generalize Propositions 5–7 for testing an individual covariate to any proper subset of the selected covariates. We use X_{H} to denote the subset of selected covariates being tested, X_{S\H} the remaining selected covariates, and X_{\H} all covariates other than those in X_{H}.

Suppose that (C2) holds under H_{0} and X_{H} ⊥ X_{\H}. Let g(X, Y) = (X, Ỹ), where
$\stackrel{\sim}{Y}=Y-{\beta}_{H}^{0}{X}_{H}$, and let _{P} = **X**_{H}, and _{F} = (**X**_{\H}, **Ỹ**). Then g(·) and the partition (_{P}, _{F}) satisfy the conditions of Theorem 1.

Write (2.1) as
$Y={\beta}_{H}^{\ast}{X}_{H}+{\beta}_{S\backslash H}^{\ast}{X}_{S\backslash H}+{\beta}_{\backslash S}^{\ast}{X}_{\backslash S}+{\epsilon}^{\ast}$. Since (C2) implies ${\beta}_{\backslash S}^{\ast}=0$ and X_{H} ⊥ X_{\H}, under H_{0} we have $\stackrel{\sim}{Y}=Y-{\beta}_{H}^{0}{X}_{H}={\beta}_{S\backslash H}^{\ast}{X}_{S\backslash H}+{\epsilon}^{\ast}$, which together with X_{\H} is independent of X_{H}.

Suppose that (C2) holds under H_{0} and X_{lj}, j = 1, 2, ···, h, are continuous covariates. Suppose that X_{H} = Γ_{\H}X_{\H} + e, where e ⊥ X_{\H}. Define Z_{H} = X_{H} − Γ_{\H}X_{\H}. Let g(X, Y) = (Z_{H}, X_{\H}, Ỹ), where
$\stackrel{\sim}{Y}=Y-{\beta}_{H}^{0}{Z}_{H}$, and _{P} = **Z**_{H}, _{F} = (**X**_{\H}, **Ỹ**). Then g(·) and the partition (_{P}, _{F}) satisfy the conditions of Theorem 1.

Under (C2), we can write
$Y={\beta}_{H}^{\ast}{X}_{H}+{\beta}_{S\backslash H}^{\ast}{X}_{S\backslash H}+{\epsilon}^{\ast}={\beta}_{H}^{\ast}{Z}_{H}+{\beta}_{H}^{\ast}{\mathrm{\Gamma}}_{\backslash H}{X}_{\backslash H}+{\beta}_{S\backslash H}^{\ast}{X}_{S\backslash H}+{\epsilon}^{\ast}={\beta}_{H}^{\ast}{Z}_{H}+\alpha {X}_{\backslash H}+{\epsilon}^{\ast}$, where the elements of *α* combine coefficient terms from *X*_{\H} and *X*_{S\H}.

Suppose that (C3) holds, X_{l1}, ···, X_{lh} are continuous, and X_{H} = δ_{S\H}X_{S\H} + ε_{H}, where ε_{H} ⊥ X_{S\H}. Define Z_{H} = X_{H} − δ_{S\H}X_{S\H} and Z_{\S} = X_{\S} − γ_{H}Z_{H}. Let g(X, Y) = (Z_{H}, X_{S\H}, Z_{\S}, Ỹ), where
$\stackrel{\sim}{Y}=Y-{\beta}_{H}^{0}{Z}_{H}$, and define _{P} = **Z**_{H} and _{F} = (**X**_{S\H}, **Z**_{\S}, **Ỹ**). Then g(·) and the partition (_{P}, _{F}) satisfy the conditions of Theorem 1. In the special case where X is normally distributed,
${\delta}_{S\backslash H}={\mathrm{\sum}}_{H,S\backslash H}{\mathrm{\sum}}_{S\backslash H}^{-1}$, where Σ_{H,S\H} = cov(X_{H}, X_{S\H}) and Σ_{S\H} = var(X_{S\H}).
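For normally distributed X, the displayed δ_{S\H} is the population least-squares coefficient of X_{H} on X_{S\H}. The following illustrative check (the 3×3 covariance matrix is an arbitrary assumed example, with the first coordinate playing the role of X_{H}) compares the sample least-squares estimate with Σ_{H,S\H}Σ_{S\H}^{-1}:

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed example covariance; coordinate 0 plays X_H, coordinates 1-2 play X_{S\H}.
cov = np.array([[1.0, 0.5, 0.3],
                [0.5, 1.0, 0.4],
                [0.3, 0.4, 1.0]])
X = rng.multivariate_normal(np.zeros(3), cov, size=100000)
x_H, x_SH = X[:, 0], X[:, 1:]
# Sample least-squares coefficient of X_H on X_{S\H} (means are zero, so no intercept).
delta_ls, *_ = np.linalg.lstsq(x_SH, x_H, rcond=None)
# Population value Sigma_{H,S\H} Sigma_{S\H}^{-1} from the proposition.
delta_pop = cov[0, 1:] @ np.linalg.inv(cov[1:, 1:])
```

With a large simulated sample, `delta_ls` agrees with `delta_pop` up to Monte Carlo error.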

Suppose that (C3) holds. (*X _{H}*,


- Akaike H. Information Theory and the Maximum Likelihood Principle. In: Petrov V, Csáki F, editors. International Symposium on Information Theory. Budapest: Akademiai Kiádo; 1973. pp. 267–281.
- Breiman L. The Little Bootstrap and Other Methods for Dimensionality Selection in Regression: X-fixed Prediction Error. Journal of the American Statistical Association. 1992;419:739–754.
- Chatfield C. Model Uncertainty, Data Mining and Statistical Inference (with discussion) Journal of the Royal Statistical Society Series A. 1995;158:419–466.
- Cochran WG. The Omission or Addition of an Independent Variable in Multiple Linear Regression. Supplement to the Journal of the Royal Statistical Society. 1938;5:171–176.
- Cox DR. On a Generalization of a Result of W. G. Cochran. Biometrika. 2007;94:755–759.
- Danilov DL, Magnus JR. On the Harm That Ignoring Pre-testing Can Cause. Journal of Econometrics. 2004;122:27–46.
- Dijkstra YK, Veldkamp JH. Data-Driven Selection of Regressors and the Bootstrap. In: Dijkstra TK, editor. On Model Uncertainty and Its Statistical Implications. Berlin: Springer-Verlag; 1988.
- DiRienzo G, DeGruttola V, Larder B, Hertogs K. Non-Parametric Methods to Predict HIV Drug Susceptibility Phenotype from Genotype. Statistics in Medicine. 2003;22:2785–2798.
- Efron B. How Biased is the Apparent Error Rate of a Prediction Rule? Journal of the American Statistical Association. 1986;81:461–470.
- Faraway JJ. On the Cost of Data Analysis. Journal of Computational and Graphical Statistics. 1992;1:213–229.
- Freedman DA, Navidi W, Peters SC. On the Impact of Variable Selection in Fitting Regression Equations. In: Dijkstra TK, editor. On Model Uncertainty and Its Statistical Implications. Berlin: Springer-Verlag; 1988.
- Giri K, Kabaila P. The Coverage Probability of Confidence Intervals in 2^{r} Factorial Experiments after Preliminary Hypothesis Testing. Australian & New Zealand Journal of Statistics. 2008;50:69–79.
- Gong G. Cross-Validation, the Jackknife, and the Bootstrap: Excess Error Estimation in Forward Logistic Regression. Journal of the American Statistical Association. 1986;81:108–118.
- Huh M, Jhun M. Random Permutation Testing in Multiple Linear Regression. Communications in Statistics – Theory and Methods. 2001;30:2023–2032.
- Hurvich CM, Tsai C. The Impact of Model Selection on Inference in Linear Regression. The American Statistician. 1990;44:214–217.
- Kabaila P. The Effect of Model Selection on Confidence Regions and Prediction Regions. Econometric Theory. 1995;11:537–549.
- Kabaila P. Valid Confidence Intervals in Regression After Variable Selection. Econometric Theory. 1998;14:463–482.
- Kabaila P, Leeb H. On the Large-Sample Minimal Coverage Probability of Confidence Intervals After Model Selection. Journal of the American Statistical Association. 2006;101:619–629.
- Leeb H. The Distribution of a Linear Predictor After Model Selection: Conditional Finite-sample Distributions and Asymptotic Approximations. Journal of Statistical Planning and Inference. 2005;134:64–89.
- Leeb H. Conditional Predictive Inference Post Model Selection. The Annals of Statistics. 2009 forthcoming.
- Leeb H, Pötscher BM. The Finite-sample Distribution of Post-model-selection Estimators, and Uniform Versus Non-uniform Approximations. Econometric Theory. 2003;19:100–142.
- Leeb H, Pötscher BM. Model Selection and Inference: Facts and Fictions. Econometric Theory. 2005;21:21–59.
- Lossos IS, Czerwinski DK, Alizadeh AA, Wechser MA, Tibshirani R, Botstein D, Levy R. Prediction of Survival in Diffuse Large-B-Cell Lymphoma Based on the Expression of Six Genes. New England Journal of Medicine. 2004;350:1828–1837.
- Malhotra U, Bosch RJ, Chan E, Wang R, Fischl MA, Collier AC, McElrath MJ. Association of T Cell Proliferative Responses and Phenotype with Virus Control in Chronic Progressive HIV-1 Disease. The Journal of Infectious Diseases. 2004;189:515–519.
- Miller AJ. Selection of Subsets of Regression Variables (with discussion) Journal of the Royal Statistical Society Series A. 1984;147:398–425.
- Pötscher BM. Model Selection Under Nonstationarity: Autoregressive models and Stochastic Linear Regression Models. Annals of Statistics. 1989;17:1257–1274.
- Pötscher BM. Effects of Model Selection on Inference. Econometric Theory. 1991;7:163–185.
- Pötscher BM. Comment on The Effect of Model Selection on Confidence Regions and Prediction Regions. Econometric Theory. 1995;11:550–559.
- Pötscher BM, Novák AJ. The Distribution of Estimators After Model Selection: Large and Small Sample Results. Journal of Statistical Computation and Simulation. 1998;60:19–56.
- Schwarz G. Estimating the dimension of a model. Annals of Statistics. 1978;6:461–464.
- Shen X, Huang H, Ye J. Inference After Model Selection. Journal of the American Statistical Association. 2004;467:751–762.
- Shulman NS, Bosch RJ, Mellors JW, Albrecht MA, Katzenstein DA. Genetic Correlates of Efavirenz Hypersusceptibility. AIDS. 2004;18:1781–1785.
- Shibata R. Selection of the Order of an Autoregressive Model by Akaike’s Information Criterion. Biometrika. 1976;63:117–126.
- Veall M. Bootstrapping the Process of Model Selection: An Econometric Example. Journal of Applied Econometrics. 1992;7:93–99.
- Zhang P. Inference After Variable Selection in Linear Regression Models. Biometrika. 1992;79:741–746.
