
Can J Stat. Author manuscript; available in PMC 2010 December 1.
Published in final edited form as:
Can J Stat. 2009 December 1; 37(4): 625–644.
PMCID: PMC2848082
NIHMSID: NIHMS168156

# Inference after variable selection using restricted permutation methods

## Abstract

When confronted with multiple covariates and a response variable, analysts sometimes apply a variable-selection algorithm to the covariate-response data to identify a subset of covariates potentially associated with the response, and then wish to make inferences about parameters in a model for the marginal association between the selected covariates and the response. If an independent data set were available, the parameters of interest could be estimated by using standard inference methods to fit the postulated marginal model to the independent data set. However, when applied to the same data set used by the variable selector, standard (“naive”) methods can lead to distorted inferences. The authors develop testing and interval estimation methods for parameters reflecting the marginal association between the selected covariates and response variable, based on the same data set used for variable selection. They provide theoretical justification for the proposed methods, present results to guide their implementation, and use simulations to assess and compare their performance to a sample-splitting approach. The methods are illustrated with data from a recent AIDS study.

Keywords and phrases: variable selector, covariates, regression, sample splitting

## 1. INTRODUCTION

Let X1, X2, ···, Xp denote a set of covariates and Y denote some continuous response, and consider the following scenario: a variable selection procedure is applied to n independent and identically distributed copies of (X1, ···, Xp, Y), resulting in a subset, (Xj1, ···, Xjs), of covariates that appear to be related to Y. One is then interested in making an inference about parameters in a model for the marginal association between (Xj1, ···, Xjs) and Y; that is, about the parameters that would be estimated if standard methods (e.g., least-squares) were applied to fit the marginal model to m additional independent copies of (Xj1, ···, Xjs, Y). In practice, however, an independent data set is often unavailable, and thus one wishes to make inferences about these parameters based on the same data set used by the variable selector. This problem is sometimes referred to as “inference after variable selection”.

Past work in inference after variable selection has focused on inferences about the parameters in models for the association between Y and all p candidate covariates. Rather than fitting a single regression model involving all p covariates, the idea is to use a variable selection step to reduce the number of covariates and then base inferences on a more parsimonious model which acts as if the selected covariates can fully explain the association of Y with all p covariates. Several authors have noted that application of standard (“naive”) inference methods to the same data set used for variable selection can lead to invalid inferences, including distorted Type I errors, biased estimators, and confidence intervals with distorted coverage probabilities (cf. Miller 1984; Chatfield 1995; Hurvich & Tsai 1990; Zhang 1992; Kabaila 1995; Pötscher & Novák 1998; Danilov & Magnus 2004; Leeb & Pötscher 2005; Kabaila & Leeb 2006; Giri & Kabaila 2008), as well as overly optimistic prediction errors (cf. Efron 1986; Gong 1986; Breiman 1992; Leeb 2009). Although the bias of naive approaches can become negligible when the probability of selecting all the important covariates approaches 1 as the sample size increases (Pötscher 1991), Chatfield (1995), Leeb & Pötscher (2003, 2005), Leeb (2005), and others have noted that such results are not useful for making inferences in finite samples because severe biases can still occur. Alternative analytical methods, including the bootstrap and jackknife, can sometimes reduce, but do not in general eliminate, these biases (cf. Faraway 1992; Veall 1992; Shen, Huang & Ye 2004). Conditions under which these methods are valid have not been identified, and their poor performance in some settings was noted in Freedman, Navidi & Peters (1988) and Dijkstra & Veldkamp (1988). As Leeb & Pötscher (2005) note, “… a proper theory of inference post model selection is only slowly emerging …”. Recently, Shen et al. (2004) developed approximate inference methods for the regression setting when the response is normally distributed, a consistent or over-consistent variable selector is used (that is, a variable selector that asymptotically selects or includes the true model with probability 1), and the parameters of interest correspond to covariates that are not candidates for exclusion by the variable selector. Leeb (2009) proposes a prediction interval which is approximately valid and short with high probability in finite samples for the linear regression setting where the error and the explanatory variables are jointly normally distributed.

We are unaware of any literature addressing the issue of using the same data set to first select a subset of covariates and then to make inferences for the marginal association between this selected subset and the response. For linear models, the parameters in the marginal model sometimes coincide with those in the model for the association between all covariates and the response, but in general they differ (Cochran 1938, Cox 2007). Naive methods which ignore the fact that the same data set is used both for variable selection and for subsequent inferences are commonly used in practice. As with inferences for the full model following variable selection, naive methods are in general biased in the settings we consider, as we will illustrate. Our interest in inferences about the marginal association is motivated by a common situation in medical research of wishing to understand the relationship between a specific set of covariates and a response, while recognizing that other covariates might also be associated with the response. For example, Lossos et al. (2004) examined 36 candidate genes and a disease outcome using a variable selection algorithm to arrive at a subset of 6 genes, and then assessed their association with the disease outcome. Here the scientific goal is to use these selected genes to develop an improved staging system for guiding patient management, and thus the main statistical interest is the marginal association between the selected genes and the clinical outcome, not about whether all the important genes were selected or whether a better variable selector could have been used. Similarly, in determining the genetic correlates of efavirenz hypersusceptibility, Shulman et al. (2004) used stepwise regression to identify specific reverse transcriptase (RT) mutations at three codons, and then considered a logistic regression model relating efavirenz hypersusceptibility and the RT mutations at these three codons. 
Having identified these codons, the scientific question is whether they can be used to guide the use of antiretroviral therapy by identifying efavirenz hypersusceptibility, and thus the statistical focus is the marginal association between these 3 codons and clinical response.

The proposed approach involves transforming the covariate-response data matrix to one that can be partitioned into two components that are independent under a null hypothesis, then forming new matrices by permuting the rows of one component while holding the other component fixed, and then basing inferences on a permutation distribution formed from a specific subset of the resulting matrices. Exact tests and confidence intervals are obtained for some settings and approximate inferences for others. In section 2 we describe the general approach, present the theorem that justifies it, and discuss its implementation. In section 3 we use simulations to assess and compare the performance of the proposed approach to naive methods and to a sample-splitting approach, and in section 4 we illustrate the methods with data from a recent AIDS study. In section 5 we discuss related issues and areas in need of further development. All proofs are included in the Appendix.

## 2. METHODS

We begin this section by introducing notation and conceptually describing the proposed approach. This is followed by a theorem that justifies the approach and results that guide its implementation.

### 2.1 Notation

Let $\mathcal{X}$ = {X1, X2, ···, Xp} denote a set of candidate covariates, including interaction terms of interest, and let Y denote some continuous response variable. Suppose that the observations consist of n i.i.d. copies of the random vector (X1, X2, ···, Xp, Y), let X and Y denote the corresponding n × p matrix and n × 1 vector, and let D = (X, Y). In what follows g(·) is a one-to-one function of (X1, ···, Xp, Y), and for an n × (p + 1) matrix M we define g(M) (or g−1(M)) as the n × (p + 1) matrix whose ith row is obtained by applying g (or g−1) to the ith row of M. We use “$\overset{p}{=}$” to denote “is a row permutation of” and “⊥” to denote “is independent of”.

Suppose $\mathcal{R}$: D → $\mathcal{S}$ denotes a variable selector that maps D into a subset, $\mathcal{S}$ = {Xl1, Xl2, ···, Xls}, of $\mathcal{X}$. Let XS = (Xl1, Xl2, ···, Xls) denote the corresponding vector of selected covariates and let X\S denote the vector formed by the random variables in $\mathcal{X} \setminus \mathcal{S}$. The variable selector $\mathcal{R}$ is arbitrary, including selectors where $\mathcal{S}$ is empty with positive probability or where some covariates are selected with probability 1. For example, the variable selector might first assess the univariate association between each of p candidate covariates and Y, and then select all those covariates achieving a significance level below some threshold. Or, if X1 denotes treatment or exposure in a medical study, the variable selector might be a step-up procedure that identifies a subset of possible prognostic factors, which are then included in a model with X1 to make inferences about the treatment effect while controlling for the selected prognostic factors, and about the association of the selected prognostic factors with the response. Here X1 is selected with probability 1, while the selection of each prognostic factor is uncertain.

### 2.2 Conceptual Description of Approach and Theoretical Justification

Consider testing a hypothesis, H0, about the marginal association of the selected covariates with the response Y. The approach consists of the following 2 steps.

#### Transformation and Partition Step

Based on the hypothesis of interest and conditions on the underlying joint distribution of (X1, ···, Xp, Y), we first identify a one-to-one transformation, g(·), of (X1, ···, Xp, Y), whose p + 1 components can be partitioned into two nonempty sets of random variables that are independent under H0. We then use g(·) to form the n × (p + 1) matrix $\tilde{D}$ = g(D) and partition $\tilde{D}$ into $\tilde{D}_P$ and $\tilde{D}_F$, where the variables comprising $\tilde{D}_P$ and $\tilde{D}_F$ are independent under H0.

#### Permutation and Restriction Step

Consider the n! matrices of the form $\tilde{D}^{(l)} = (\tilde{D}_P^{l}, \tilde{D}_F)$, where $\tilde{D}_P^{l} \overset{p}{=} \tilde{D}_P$, and identify the subset, ΠR, of matrices for which $\mathcal{R}(g^{-1}(\tilde{D}^{(l)})) = \mathcal{S}$; that is, the subset of matrices $\tilde{D}^{(l)}$ whose reverse transformation is mapped to $\mathcal{S}$ by the variable selector. Let T = T($\tilde{D}$) denote some test statistic for H0. We then compare the observed value of T to the permutation distribution formed by evaluating T($\tilde{D}^{(l)}$) for those $\tilde{D}^{(l)}$ in the restricted set ΠR.

The validity of the proposed approach is based on the following Theorem. In Section 2.3, we will discuss ways to choose the transformation g(·) and the partition based on a given hypothesis of interest and conditions on the underlying joint distribution of (X1, ···, Xp, Y).

##### Theorem 1

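Let D = (X, Y) denote the n × (p + 1) matrix consisting of n i.i.d. copies of (X1, X2, ···, Xp, Y) and suppose there exists a one-to-one transformation, g(·), of (X1, X2, ···, Xp, Y) such that $\tilde{D}$ = g(D) can be partitioned as $\tilde{D}$ = ($\tilde{D}_P$, $\tilde{D}_F$) for some $\tilde{D}_P$ and $\tilde{D}_F$ whose elements are independent under H0. For any d in the support of D, define xS = $\mathcal{R}$(d) and $\tilde{d}$ = g(d) = ($\tilde{d}_P$, $\tilde{d}_F$). For each row permutation, $\tilde{d}_P^{l}$, of $\tilde{d}_P$, define $\tilde{d}^{(l)} = (\tilde{d}_P^{l}, \tilde{d}_F)$ and suppose g−1($\tilde{d}^{(l)}$) is in the support of D. Let ΠR = {$\tilde{d}^{(l)}$ | $\mathcal{R}$(g−1($\tilde{d}^{(l)}$)) = xS}. Then under H0 and for any $\tilde{d}^{(l)}$ ∈ ΠR,

$P(\tilde{D} = \tilde{d}^{(l)} \mid \tilde{D}_P \overset{p}{=} \tilde{d}_P,\ \tilde{D}_F = \tilde{d}_F,\ \mathcal{R}(D) = x_S) = 1/M,$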

where M is the number of matrices in ΠR.

By construction, the unpermuted matrix $\tilde{d}$ = g(d) is an element of the restricted set ΠR. Theorem 1 shows that under H0 and conditional on 1) the result of the variable selection, 2) the observed value of $\tilde{D}_F$, and 3) knowledge of $\tilde{D}_P$ up to a row permutation, the M matrices that comprise ΠR are equally likely. Thus, the observed value, $\tilde{d}$, of $\tilde{D}$ can be viewed as a randomly selected element from the set ΠR. It follows that if T = T($\tilde{D}$) is any test statistic, the observed value of T can be viewed under H0 as a random sample of size 1 from the resulting (computable) permutation distribution of values {T($\tilde{d}^{(l)}$) | $\tilde{d}^{(l)}$ ∈ ΠR}. This provides the basis for exact inferences about H0 that correct for variable selection. Note that although the choice of test statistic does not affect the Type I error, it does affect power and therefore requires careful consideration. Confidence regions for model parameters can be obtained by inverting the restricted permutation tests; that is, a 100(1 − α)% confidence region is given by those parameter values that are not rejected at the α level of significance.
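
The two steps can be sketched in code. The following is a minimal illustration rather than the authors' implementation: it enumerates all n! row orders, so it is only practical for very small n, and the function and argument names (`restricted_permutation_pvalue`, `g_inv`, `selector`, `stat`) are hypothetical. The one-sided convention that large statistics count against H0 is also an assumption of this sketch.

```python
from itertools import permutations


def restricted_permutation_pvalue(d_p, d_f, g_inv, selector, stat):
    """Exact restricted permutation p-value by full enumeration.

    d_p, d_f : lists of n rows (tuples) holding the permuted and fixed components
    g_inv    : maps one transformed row (p_row + f_row) back to the original scale
    selector : maps a list of original-scale rows to a comparable selection label
    stat     : maps (d_p, d_f) to a number; large values count against H0
    """
    n = len(d_p)

    def select(p_rows):
        # Undo the transformation row by row, then rerun the variable selector.
        original = [g_inv(p + f) for p, f in zip(p_rows, d_f)]
        return selector(original)

    observed_selection = select(d_p)
    observed_stat = stat(d_p, d_f)

    kept = []  # statistics over the restricted set of permutations
    for order in permutations(range(n)):
        p_perm = [d_p[i] for i in order]
        if select(p_perm) == observed_selection:
            kept.append(stat(p_perm, d_f))

    # The identity permutation always reproduces the observed selection,
    # so `kept` is never empty.
    p_value = sum(t >= observed_stat for t in kept) / len(kept)
    return p_value, len(kept)
```

With a selector that always returns the same label, the restricted set contains all n! matrices and the function reduces to the classical permutation test. For realistic n one would sample random permutations instead of enumerating them; that Monte Carlo variant is not shown here.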

To give a flavor of the method, suppose X is scalar, so that D is the n × 2 matrix with ith row (Xi, Yi), and assume that the conditional density function of Y, given X = x, is f0(y + βx) for some f0(·), where β is an unknown parameter.

1. First suppose there is no variable selection, that is, $\mathcal{R}$(D) ≡ {X}, and consider testing the hypothesis H0: β = 0 of no association between X and Y. Let g denote the identity transformation and partition $\tilde{D}$ = D by taking $\tilde{D}_P$ = X and $\tilde{D}_F$ = Y. If $X^l$ denotes a row permutation of X, then $\tilde{D}^{(l)}$ = ($X^l$, Y) and ΠR is the set of all n! such matrices. Let T($\tilde{D}$) be some test statistic, say $X^T Y$. We can then test H0 by comparing its observed value to the permutation distribution formed by the n! values of T($\tilde{D}^{(l)}$) = $(X^l)^T Y$. The proposed method thus reduces to the classical permutation test for the association between X and Y. Formally, inference is based on the null permutation distribution $P(\tilde{D} = \tilde{d}^{(l)} \mid \tilde{D}_P \overset{p}{=} x,\ \tilde{D}_F = y,\ \mathcal{R}(D) = \{X\}) = P(\tilde{D} = (x^l, y) \mid X \overset{p}{=} x,\ Y = y) = 1/n!$, for each of the M = n! matrices in ΠR.
2. Now consider use of a variable selector that sometimes selects {X} but otherwise does not according to some rule; that is, $\mathcal{R}$(D) = {X} or {}. Suppose $\mathcal{R}$(D) = {X} for a particular D, and consider testing H0: β = 0. We again take g to be the identity transformation so that $\tilde{D}^{(l)}$ = ($X^l$, Y) as before, but now base inferences on only those $\tilde{D}^{(l)}$ such that $\mathcal{R}$(g−1($\tilde{D}^{(l)}$)) = $\mathcal{R}$($X^l$, Y) = {X}; that is, the method applies the classical permutation test but restricted to only those matrices ($X^l$, Y) for which $\mathcal{R}$ selects X.
3. Finally, consider the same variable selector as in example 2 and suppose that $\mathcal{R}$(D) = {X} for a particular D, but now consider testing H0: β = β0 for some β0 ≠ 0. Here we can take g(x, y) = (x, y + β0x), so that $\tilde{D}$ = (X, Y + β0X). Note that the 2 columns of $\tilde{D}$ are independent under H0 and that $\tilde{D}^{(l)}$ = ($X^l$, Y + β0X). The inverse mapping is g−1(a, b) = (a, b − β0a) and thus g−1($\tilde{D}^{(l)}$) = ($X^l$, Y + β0(X − $X^l$)). If T($\tilde{D}$) is some test statistic, the observed value T($\tilde{D}$) is then compared to the permutation distribution formed by {T($\tilde{D}^{(l)}$) | $\tilde{D}^{(l)}$ ∈ ΠR} = {T($\tilde{D}^{(l)}$) | $\mathcal{R}$(($X^l$, Y + β0(X − $X^l$))) = {X}}. This test could be used to construct a confidence interval for β.
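
The third example, together with the test inversion mentioned above, might look as follows for a very small data set. This is a sketch under stated assumptions: the absolute centered cross-product statistic, the grid-based inversion, and the names `p_value`, `confidence_set`, and `selects_x` are illustrative choices, and the observed data are assumed to have led to selection of X.

```python
from itertools import permutations


def ls_slope(x, y):
    """Least-squares slope of y on x (helper for building a toy selector)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx


def p_value(x, y, beta0, selects_x):
    """Restricted permutation p-value for H0: beta = beta0 (full enumeration)."""
    n = len(x)
    y_t = [b + beta0 * a for a, b in zip(x, y)]  # g(x, y) = (x, y + beta0 * x)
    mean_t = sum(y_t) / n

    def stat(xp):
        return abs(sum(a * (b - mean_t) for a, b in zip(xp, y_t)))

    observed = stat(x)
    kept = []
    for order in permutations(range(n)):
        xp = [x[i] for i in order]
        # g^{-1}(a, b) = (a, b - beta0 * a): the data the selector would have seen
        y_back = [b - beta0 * a for a, b in zip(xp, y_t)]
        if selects_x(xp, y_back):
            kept.append(stat(xp))
    if not kept:
        raise ValueError("no permuted data set leads to selection of X")
    # Small tolerance so the observed arrangement always counts itself.
    return sum(t >= observed - 1e-12 for t in kept) / len(kept)


def confidence_set(x, y, grid, selects_x, alpha=0.05):
    """Grid values of beta0 that the restricted test does not reject."""
    return [b for b in grid if p_value(x, y, b, selects_x) > alpha]
```

A toy selector could be `lambda xp, yb: abs(ls_slope(xp, yb)) > 0.5`; any rule that maps a data set to "X selected or not" can be plugged in. The grid resolution limits how finely the interval endpoints are located.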

### 2.3 Finding a Transformation and Partition

#### 2.3.1 General Considerations

In more complex settings than the example discussed above, finding a transformation g(·) that leads to a desired partition may not be obvious. The proposed methods are invariant to reversing the roles of P and F, and for a given data matrix there could be multiple partitions, as we illustrate in section 3.1.2. Below we give two general conditions to guide the choice of the transformation and partition. We then discuss the application of these conditions for specific hypotheses when a linear model is assumed for the association between Y and the candidate covariates X1, X2, ···, Xp.

##### Condition A

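(X, Y) can be transformed one-to-one to some ($\tilde{X}$, Ỹ) = ($\tilde{X}_P$, $\tilde{X}_F$, Ỹ) such that $\tilde{X}_P$ ⊥ ($\tilde{X}_F$, Ỹ) under H0. This can be verified using $\tilde{X}_P$ ⊥ $\tilde{X}_F$ and $\tilde{X}_P$ ⊥ Ỹ | $\tilde{X}_F$.

With this condition, Theorem 1 holds by taking $\tilde{D}_P$ = $\tilde{X}_P$ and $\tilde{D}_F$ = ($\tilde{X}_F$, Ỹ). We refer to the resulting test as restricted permutation test A.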

##### Condition B

(X, Y) can be transformed one-to-one to some (X, Ỹ) so that X ⊥ Ỹ under H0.

With this condition, Theorem 1 holds by taking $\tilde{D}_P$ = X and $\tilde{D}_F$ = Ỹ. We refer to the resulting test as restricted permutation test B.

#### 2.3.2 Linear Models

Suppose the response Y is related to the candidate covariates X1, ···, Xp by a standard linear model:

$Y = \beta^{*} X + \varepsilon^{*},$
(2.1)

where ε* ⊥ X and where some of the components of β* may be zero. Consider the following conditions:

• (C1) XS ⊥ X\S
• (C2) Y ⊥ X\S | XS
• (C3) The components of X\S are continuous and X\S = γSXS + ε\S, where ε\S ⊥ XS. A special case is when X = (X1, ···, Xp) is normally distributed, in which case $\gamma_S = \Sigma_{\setminus S, S} \Sigma_S^{-1}$, where Σ\S,S = cov(X\S, XS) and ΣS = var(XS).

##### Proposition 1

Assume (2.1). If any of (C1), (C2), or (C3) holds, the marginal association between XS and Y is of the form

$Y = \beta_S X_S + \varepsilon$
(2.2)

for some βS, where ε ⊥ XS. Furthermore, when (C2) holds, ε ⊥ X.

Note that (2.2) refers to the marginal association of XS and Y, that is, f(Y|XS), and not the association between XS and Y induced by the variable selector, that is, f(Y|XS, $\mathcal{R}$(D) = $\mathcal{S}$). Condition (C1) can be checked empirically, and sometimes covariates can be transformed to make this condition hold approximately. Condition (C2), which indicates that the selected covariates can fully explain the association of Y with X1, ···, Xp, might be expected to hold approximately when using a consistent (such as Bayesian Information Criterion) or over-consistent (such as Akaike Information Criterion) variable selector (Akaike 1973, Shibata 1976, Schwarz 1978, Pötscher 1989). Condition (C3) requires that after eliminating the linear effect of XS from X\S, the error term is independent of XS. This can be assessed by calculating the residuals from a least-squares regression of X\S on XS and plotting these against the covariates in XS.
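
The residual check for (C3) can be sketched numerically for one selected and one unselected covariate. The code below is illustrative only: it fits the regression with an intercept (an assumption, since covariates need not have mean zero), and the correlation between absolute residuals and the selected covariate is a crude numerical stand-in for the residual plot, not a formal test of independence. All function names are hypothetical.

```python
from math import sqrt


def ls_fit(x_sel, x_out):
    """Intercept and slope of the least-squares regression of x_out on x_sel."""
    n = len(x_sel)
    mx, mo = sum(x_sel) / n, sum(x_out) / n
    sxx = sum((a - mx) ** 2 for a in x_sel)
    slope = sum((a - mx) * (b - mo) for a, b in zip(x_sel, x_out)) / sxx
    return mo - slope * mx, slope


def residuals(x_sel, x_out):
    """Residuals of x_out after removing the fitted linear effect of x_sel."""
    intercept, slope = ls_fit(x_sel, x_out)
    return [b - intercept - slope * a for a, b in zip(x_sel, x_out)]


def correlation(u, v):
    """Pearson correlation; both inputs must have nonzero variance."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / sqrt(suu * svv)


def c3_diagnostics(x_sel, x_out):
    """Estimated gamma and a spread-versus-covariate summary of the residuals."""
    res = residuals(x_sel, x_out)
    return {
        "gamma_hat": ls_fit(x_sel, x_out)[1],
        # Residuals are uncorrelated with x_sel by construction, so look at
        # whether their magnitude changes with x_sel instead.
        "corr_abs_resid_x": correlation([abs(e) for e in res], x_sel),
    }
```

A value of `corr_abs_resid_x` far from zero suggests that the residual spread depends on the selected covariate, which would cast doubt on (C3) even when the linear fit itself looks adequate.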

Below we give several results specifying conditions under which the proposed methods can be used to make an inference about a global hypothesis (Propositions 2–4) or an individual covariate (Propositions 5–7) from (2.2). Propositions 5–7 generalize in a straightforward way to any subset of the selected covariates (see the Appendix).

Consider the global hypothesis $H_0: \beta_S = \beta_S^0$ for some $\beta_S^0$.

##### Proposition 2

Suppose that (C1) holds. Let g(X, Y) = (X, Ỹ), where $\tilde{Y} = Y - \beta_S^0 X_S$, and $\tilde{D}_P$ = XS, $\tilde{D}_F$ = (X\S, Ỹ). Then g(·) and the partition ($\tilde{D}_P$, $\tilde{D}_F$) satisfy the conditions of Theorem 1.

##### Proposition 3

Suppose that (C2) holds under H0. Let g(X, Y) = (X, Ỹ), where $\tilde{Y} = Y - \beta_S^0 X_S$, and $\tilde{D}_P$ = X, $\tilde{D}_F$ = Ỹ. Then g(·) and the partition ($\tilde{D}_P$, $\tilde{D}_F$) satisfy the conditions of Theorem 1.

##### Proposition 4

Suppose that (C3) holds. Define Z\S = X\S − γSXS. Let g(X, Y) = (XS, Z\S, Ỹ), where $\tilde{Y} = Y - \beta_S^0 X_S$, and define $\tilde{D}_P$ = XS and $\tilde{D}_F$ = (Z\S, Ỹ). Then g(·) and the partition ($\tilde{D}_P$, $\tilde{D}_F$) satisfy the conditions of Theorem 1.
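
To make the role of g and g−1 in Proposition 4 concrete, the sketch below handles one selected covariate and one unselected covariate, with γS and the null value treated as known scalars. It is an illustration with hypothetical names (`make_transform`, `permuted_original_data`), not the authors' code. The function returns the original-scale data set that the variable selector would be asked to re-examine for a given row permutation of the selected covariate.

```python
def make_transform(gamma, beta0):
    """Return (g, g_inv) for rows of the form (x_s, x_o, y)."""

    def g(row):
        x_s, x_o, y = row
        # Z = X_o - gamma * X_s and Y~ = Y - beta0 * X_s
        return (x_s, x_o - gamma * x_s, y - beta0 * x_s)

    def g_inv(row):
        x_s, z_o, y_t = row
        return (x_s, z_o + gamma * x_s, y_t + beta0 * x_s)

    return g, g_inv


def permuted_original_data(data, order, gamma, beta0):
    """Permute the selected covariate on the transformed scale, then map back.

    data  : list of rows (x_s, x_o, y)
    order : row k of the result receives the x_s value of row order[k]
    """
    g, g_inv = make_transform(gamma, beta0)
    transformed = [g(row) for row in data]
    # The permuted component is X_S; the fixed component (Z, Y~) stays in place.
    permuted = [
        (transformed[i][0], transformed[k][1], transformed[k][2])
        for k, i in enumerate(order)
    ]
    return [g_inv(row) for row in permuted]
```

Note that after mapping back, both the unselected covariate and the response change along with the permuted covariate. This is why the selector must be rerun on the back-transformed data rather than on the transformed matrix.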

When multiple covariates are selected, it is often of interest to test a hypothesis about a single covariate. Without loss of generality, let X1 ∈ $\mathcal{S}$, and consider $H_0: \beta_1 = \beta_1^0$ for some $\beta_1^0$. Below we use XS\1 and X\1 to denote the random vectors formed by the variables in $\mathcal{S}$\{X1} and $\mathcal{X}$\{X1}, respectively.

##### Proposition 5

Suppose that (C2) holds under H0 and X1 ⊥ X\1. Let g(X, Y) = (X, Ỹ), where $\tilde{Y} = Y - \beta_1^0 X_1$, and let $\tilde{D}_P$ = X1 and $\tilde{D}_F$ = (X\1, Ỹ). Then g(·) and the partition ($\tilde{D}_P$, $\tilde{D}_F$) satisfy the conditions of Theorem 1.

##### Proposition 6

Suppose that (C2) holds under H0, X1 is a continuous covariate, and X1 = γ\1X\1 + ε1, where ε1 ⊥ X\1. Define Z1 = X1 − γ\1X\1. Let g(X, Y) = (Z1, X\1, Ỹ), where $\tilde{Y} = Y - \beta_1^0 Z_1$, and let $\tilde{D}_P$ = Z1 and $\tilde{D}_F$ = (X\1, Ỹ). Then g(·) and the partition ($\tilde{D}_P$, $\tilde{D}_F$) satisfy the conditions of Theorem 1.

##### Proposition 7

Suppose (C3) holds, X1 is continuous and X1 = δS\1XS\1 + ε1, where ε1 ⊥ XS\1. Define Z1 = X1 − δS\1XS\1, and Z\S = X\S − γ1Z1. Let g(X, Y) = (Z1, XS\1, Z\S, Ỹ), where $\tilde{Y} = Y - \beta_1^0 Z_1$, and let $\tilde{D}_P$ = Z1 and $\tilde{D}_F$ = (XS\1, Z\S, Ỹ). Then g(·) and the partition ($\tilde{D}_P$, $\tilde{D}_F$) satisfy the conditions of Theorem 1. In the special case where X is normally distributed, $\delta_{S\setminus 1} = \Sigma_{1, S\setminus 1} \Sigma_{S\setminus 1}^{-1}$, where Σ1,S\1 = cov(X1, XS\1) and ΣS\1 = var(XS\1).

In practice, the covariance matrices in Propositions 4 and 7 are usually not known, in which case they can be replaced by their empirical counterparts, resulting in an approximate inference. Similarly, in Propositions 4, 6, and 7, the γS, γ\1, and δS\1 can be replaced by their least-squares estimates. The linear representation for X1 assumed in Propositions 6 and 7 can be assessed empirically by plotting residuals against the values of other covariates.

Each of Propositions 2–7 gives a transformation and partition that satisfy the conditions of Theorem 1 for specific hypotheses and assumptions about the distribution of (X1, ···, Xp, Y). Propositions 2, 4, 5, 6, 7 represent applications of Condition A while Proposition 3 represents an application of Condition B. Inferences then follow from the transformation/partition and permutation/restriction steps described in Section 2.2.

In the above, we began by assuming a linear model relating Y to the entire set of candidate covariates. Alternatively, we could have begun by assuming a linear model for the marginal association between Y and the selected covariates, and then imposed other conditions on the error term of this model and/or on the joint distribution of the covariates to apply the proposed methods.

The proposed methods can sometimes be applied without specifying a particular parametric model. For example, consider the null hypothesis H0 : Y ⊥ XS that Y and XS are independent. If XS ⊥ X\S, Condition A holds with $\tilde{X}_P$ = XS, $\tilde{X}_F$ = X\S, and Ỹ = Y, so that we can apply Theorem 1 with $\tilde{D}_P$ = XS and $\tilde{D}_F$ = (X\S, Y). Alternatively, suppose Y ⊥ X\S | XS, which can be viewed as saying that $\mathcal{S}$ includes all of the important covariates. Here Condition B would hold under H0, so that we can apply Theorem 1 with $\tilde{D}_P$ = X and $\tilde{D}_F$ = Y.

## 3. SIMULATION STUDIES

In this section we use simulations to assess and compare the performance of the proposed methods to the naive and sample-splitting approaches for several specific settings. For a given outcome, xS, of the variable selection step, we estimate the Type I error and power for the proposed methods and naive approach by first generating independent data sets and then, from among those that result in xS being selected, computing the proportion that reject the null hypothesis. For the sample-splitting approach, we first generate independent data sets and randomly split each in half, using the first half for variable selection. From among those data sets that result in xS, we then use the second half to fit the marginal model and record the proportion that reject the null hypothesis. Confidence intervals are obtained similarly; the coverage probability is estimated by the proportion of the resulting intervals that include the true parameter value. All simulations were done using R, version 2.2.0 or later.
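
The conditional estimation scheme described above can be outlined as follows. This is a generic skeleton under stated assumptions, not the authors' R code: the callbacks `generate`, `selector`, and `test_rejects`, the seeding, and the use of a shuffle to split each data set in half are all illustrative.

```python
import random


def conditional_rejection_rate(generate, selector, test_rejects, target,
                               n_sims, seed=0):
    """Rejection proportion among simulated data sets whose selection is `target`."""
    rng = random.Random(seed)
    kept = rejected = 0
    for _ in range(n_sims):
        data = generate(rng)
        if selector(data) != target:
            continue  # condition on the outcome of the variable selection
        kept += 1
        rejected += bool(test_rejects(data))
    rate = rejected / kept if kept else float("nan")
    return rate, kept


def split_rejection_rate(generate, selector, test_rejects, target,
                         n_sims, seed=0):
    """Same estimate for sample splitting: select on one half, test on the other."""
    rng = random.Random(seed)
    kept = rejected = 0
    for _ in range(n_sims):
        data = list(generate(rng))
        rng.shuffle(data)  # random split into two halves
        half = len(data) // 2
        if selector(data[:half]) != target:
            continue
        kept += 1
        rejected += bool(test_rejects(data[half:]))
    rate = rejected / kept if kept else float("nan")
    return rate, kept
```

Generating under the null hypothesis gives an estimate of the Type I error; generating under an alternative gives an estimate of power. The second return value records how many simulated data sets actually led to the target selection, which determines the precision of the estimate.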

### 3.1 Linear regression models following variable selection

#### 3.1.1 Parameters of interest

We begin this section with an example to illustrate the difference between the parameters of interest in this paper and those in models for the association between the response variable and all the candidate covariates. Similar to the setting considered by Shen et al. (2004), suppose (X1,X2,X3,X4) follows a multivariate normal distribution with mean 0 and covariance matrix having (i, j)th element $\rho^{|i-j|}$, and that conditional on (X1,X2,X3,X4),

$Y = \beta_1^{*} X_1 + \beta_2^{*} X_2 + \beta_3^{*} X_3 + 0 \cdot X_4 + \varepsilon^{*},$
(3.1)

where ε* ~ N(0, 1) and is independent of (X1,X2,X3,X4). In Shen et al. (2004), the variable selector must always include X1 and the parameter of interest is $\beta_1^{*}$, regardless of which subset of (X2,X3,X4) is also selected. Our focus is on the marginal association between the selected covariates and the response. When X1, X2 and X3 are selected, the parameters we consider are the same as those in (3.1), but otherwise they differ. For example, when only X1 is selected, we are no longer interested in $\beta_1^{*}$, the coefficient for X1 from the linear model that adjusts for X2 and X3, but instead in β1 from the marginal model Y = β1X1 + ε, where $\beta_1 = \beta_1^{*} + \rho\beta_2^{*} + \rho^{2}\beta_3^{*}$, $\varepsilon = \beta_2^{*}(X_2 - \rho X_1) + \beta_3^{*}(X_3 - \rho^{2} X_1) + \varepsilon^{*}$, and X1 ⊥ ε (Proposition 1).
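
The marginal coefficient follows from E(Xj | X1) = ρ^(j−1) X1 for mean-zero, unit-variance normal covariates with this covariance structure. A short sketch (the function name `marginal_beta1` is hypothetical) evaluates it for given full-model coefficients:

```python
def marginal_beta1(beta_star, rho):
    """Coefficient of X1 in the marginal model Y = beta1 * X1 + eps.

    beta_star : full-model coefficients (beta1*, beta2*, ...) in covariate order
    rho       : correlation parameter, with cov(Xi, Xj) = rho ** |i - j|

    Uses E[Xj | X1] = rho ** (j - 1) * X1, so each full-model coefficient
    contributes in proportion to its covariate's correlation with X1.
    """
    return sum(b * rho ** k for k, b in enumerate(beta_star))


# Full-model coefficients (.1, .1, .01, 0) and the four correlation values
# used in the simulation settings of this section.
for rho in (0.5, -0.5, 0.8, -0.8):
    print(rho, round(marginal_beta1([0.1, 0.1, 0.01, 0.0], rho), 4))
```

For example, with coefficients (.1, .1, .01, 0) and ρ = .5, the marginal coefficient is .1 + .05 + .0025 = .1525, noticeably different from the full-model value of .1.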

Suppose $\mathcal{S}$ = {X1}. Table 1 displays the empirical Type I errors for testing $H_0: \beta_1 = \beta_1^0$ using the naive approach, the sample-splitting approach, and restricted permutation test A using both actual (c) and estimated (ĉ) covariances, where $c = \Sigma_{\setminus S,S}\Sigma_S^{-1}$ (Proposition 4). The variable selector is a step-down procedure (step() function in R), starting with the model that includes all p covariates, and using the Akaike Information Criterion (Akaike 1973) to eliminate covariates. Here $\beta_1^{*} = .1$, $\beta_2^{*} = .1, .05$, $\beta_3^{*} = .01$, ρ = .5, −.5, .8, −.8 and, as in Shen et al. (2004), n = 22. The Type I error estimates using the restricted permutation test with actual c or sample-splitting are very close to the nominal level of .05; those based on the restricted permutation test using ĉ are close to the nominal level and offer a substantial improvement over those obtained from the naive approach.

Table 1. Empirical Type I Errors for Testing $H_0: \beta_1 = \beta_1^0$ at Nominal .05 Level, Using the Naive, Sample-Splitting, and Restricted Permutation Approaches with Actual (c) and Estimated (ĉ) Values of c, for Different ρ. Each Entry Based ...

#### 3.1.2 Testing a global hypothesis

Suppose (X1, ···, X5) has a multivariate normal distribution with mean (.1, .2, .3, .4, .5), X1 ⊥ X5, (X1,X5) ⊥ (X2,X3,X4), var(Xj) = 1 for j = 1, ···, 5, and corr(Xj,Xk) = .1 for j, k = 2, 3, 4 and j ≠ k. Assume that the model for Y, given (X1,X2, ···, X5), is $Y = .5 + \beta_1^{*} X_1 + \varepsilon^{*}$, where ε* ~ N(0, 1) and is independent of (X1, ···, X5). We take n = 100 and use the same variable selector as in section 3.1.1.

Suppose $\mathcal{S}$ = {X1,X5}. From Proposition 1, the marginal association of (X1,X5) and Y is given by Y = β0 + β1X1 + β5X5 + ε. Consider testing the hypothesis H0 : β1 = β5 = 0 using the two degrees of freedom likelihood ratio test statistic. Figure 1 gives the power of restricted permutation test A (Proposition 2), restricted permutation test B (Proposition 3), and sample-splitting for increasing values of β1 and for β5 = 0. Results are based on 1000 simulations for Type I error and 200 simulations for power. The Type I error of the naive test is highly distorted (0.559), whereas those for the permutation and sample-splitting approaches are very close to the nominal .05 level. The power curves of the two permutation tests are similar, especially in power ranges of typical interest, and larger than that of the sample-splitting approach.

Figure 1. Power Comparison of Restricted Permutation Test A (Dashed), Restricted Permutation Test B (Solid), and Sample-Splitting (Dotted) for Testing H0 : β1 = β5 = 0, for β1 = 0, .1, ···, .7. The dot-dash line represents ...

#### 3.1.3 Testing a subset of the coefficients

Suppose p = 3 and X = (X1,X2,X3) has a multivariate normal distribution with mean vector (.1, .2, .3) and covariance matrix Σ = (σij), where σii = 1 and σij = ρ for i ≠ j. Assume the conditional distribution of Y, given (X1,X2,X3), is given by Y = .5 + .3X1 + 0X2 + 0X3 + ε*, where ε* ~ N(0, 1) and is independent of (X1,X2,X3), and consider the same variable selector as in section 3.1.1. Suppose $\mathcal{R}$ selects {X1,X2}, and we wish to test hypotheses about parameters from the marginal model Y = β0 + β1X1 + β2X2 + ε (Proposition 1).

To test $H_0: \beta_1 = \beta_1^0$, we use Proposition 7 and define Z1 = X1 − c1X2, where c1 = cov(X1,X2)/var(X2) = ρ, and Z3 = X3 − c2Z1, where c2 = cov(X3, Z1)/var(Z1) = ρ/(1 + ρ). Let $\tilde{Y} = Y - \beta_1^0 Z_1$. To test $H_0: \beta_2 = \beta_2^0$, we transform X2 to Z2 = X2 − c1X1, where c1 = cov(X1,X2)/var(X1) = ρ, and Z3 = X3 − c2Z2, where c2 = cov(X3, Z2)/var(Z2) = ρ/(1 + ρ), and $\tilde{Y} = Y - \beta_2^0 Z_2$. Note that we use a different transformation for testing β2 to ensure that this remains the coefficient being tested after the transformation.
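
The constants c1 and c2 can be verified with elementary covariance algebra. The sketch below (hypothetical helper names `cov_lin` and `prop7_constants`) represents each transformed variable by its coefficients on (X1, X2, X3) and uses the equicorrelated covariance matrix from this example with the true ρ rather than an estimate.

```python
def cov_lin(a, b, sigma):
    """cov(a'X, b'X) for coefficient vectors a, b and covariance matrix sigma."""
    return sum(
        a[i] * sigma[i][j] * b[j]
        for i in range(len(a))
        for j in range(len(b))
    )


def prop7_constants(rho):
    """c1, c2 and the coefficient vectors of Z1 and Z3 for testing beta1."""
    sigma = [[1.0 if i == j else rho for j in range(3)] for i in range(3)]
    x2, x3 = [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
    c1 = cov_lin([1.0, 0.0, 0.0], x2, sigma) / cov_lin(x2, x2, sigma)
    z1 = [1.0, -c1, 0.0]                      # Z1 = X1 - c1 * X2
    c2 = cov_lin(x3, z1, sigma) / cov_lin(z1, z1, sigma)
    z3 = [-c2 * z1[0], -c2 * z1[1], 1.0]      # Z3 = X3 - c2 * Z1
    return c1, c2, z1, z3, sigma
```

Here cov(X3, Z1) = ρ − ρ² and var(Z1) = 1 − ρ², which gives c2 = ρ/(1 + ρ). By construction Z1 is uncorrelated with both X2 and Z3, and under joint normality uncorrelated components are independent, which is what the partition in Proposition 7 requires.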

Table 2 gives the empirical Type I errors for the naive test, the restricted permutation test A, and the sample-splitting approach for different choices of n and ρ, and using the actual (c) and empirically-estimated (ĉ) values of c = (c1, c2), based on the test statistic $\sum_{j=1}^{n} (Y_j - \bar{Y})(Z_{ij} - \bar{Z}_i)$, where i = 1, 2 refer to Z1 and Z2. The permutation test using the actual values of c and the sample-splitting approach lead to Type I errors that are very close to the nominal .05 level. Use of empirical estimates leads to somewhat conservative tests while the naive test gives distorted Type I errors.

Table 2. Empirical Type I Errors for the Naive, Sample-Splitting, and Restricted Permutation Approaches Using Actual (c) and Estimated (ĉ) Values of (c = (c1, c2)) for Different Sample Size (n) and ρ = cov(Xi,Xj), i ≠ j, Based on Nominal ...

### 3.2 Two-sample problems following variable selection

Suppose that we have p binary covariates X1, ···, Xp. Define pj = P(Xj = 1) and define βk to be the difference in mean response for the two levels of Xk; that is, βk = E(Y|Xk = 0) − E(Y|Xk = 1), k = 1, 2, ···, p. Suppose that $\mathcal{R}$(D) = {Xj} for some j ∈ {1, 2, ···, p} and we want to test $H_0: \beta_j = \beta_j^0$. A natural transformation to eliminate the mean location difference induced by Xj under H0 is $\tilde{Y} = Y + \beta_j^0 X_j$, since then we have E(Ỹ|Xj = 0) = E(Ỹ|Xj = 1).

Suppose that X3, ···, Xp are mutually independent and independent of (X1,X2), and corr(X1,X2) = ρ. Suppose that f(Y|X1, ···, Xp) = f(Y|X1) and f(y|X1 = x) = f0(y + βx) for some β and density f0. It is easily verified that $\beta_1 = \beta$, $\beta_2 = \beta\rho\sqrt{p_1(1-p_1)/(p_2(1-p_2))}$, and βj = 0 for j = 3, 4, ···, p.
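
The expression for β2 can be checked directly from the joint distribution of two correlated binary covariates. Because Y depends on the covariates only through X1 and β1 = E(Y|X1 = 0) − E(Y|X1 = 1) = β, we have E(Y|X2 = k) = μ − βE(X1|X2 = k) for some constant μ, so β2 = β[E(X1|X2 = 1) − E(X1|X2 = 0)]. The sketch below (hypothetical function names) compares this with the closed form βρ√(p1(1 − p1)/(p2(1 − p2))).

```python
from math import sqrt


def beta2_from_joint(beta, rho, p1, p2):
    """beta2 computed from the joint distribution of binary (X1, X2)."""
    # P(X1 = 1, X2 = 1) implied by the marginals and the correlation rho
    p11 = p1 * p2 + rho * sqrt(p1 * (1 - p1) * p2 * (1 - p2))
    e_x1_given_x2_1 = p11 / p2
    e_x1_given_x2_0 = (p1 - p11) / (1 - p2)
    return beta * (e_x1_given_x2_1 - e_x1_given_x2_0)


def beta2_closed_form(beta, rho, p1, p2):
    """Closed form: beta * rho * sqrt(p1(1 - p1) / (p2(1 - p2)))."""
    return beta * rho * sqrt(p1 * (1 - p1) / (p2 * (1 - p2)))
```

With p1 = .1, p2 = .5, and ρ = .2, both functions give β2 = .12β, so β1 = 1 corresponds to β2 = .12.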

When X1 is selected, Condition (C2) holds and we can apply Proposition 3 (Permutation Test B). When Xj (j ≥ 3) is selected, Condition (C1) holds and we can apply Proposition 2 (Permutation Test A). Now suppose X2 is selected. When ρ ≠ 0, X2 is marginally associated with Y because of its correlation with X1. Condition (C1) does not hold because X2 is not independent of X1. Nor does Condition (C2) hold in this case, because Ỹ is not independent of X (Var(Ỹ|X2) ≠ Var(Ỹ)). We include this case below to assess the robustness of the permutation tests when the conditions for their validity are not satisfied.

#### 3.2.1 Validity and Robustness

Consider the variable selector defined by $\mathcal{R}$(D) = {Xj} if Vj = max{V1, ···, Vp}, where Vj is the t-test statistic based on the difference between the average values of Y when Xj = 1 and Xj = 0. We use the usual t-statistic for the permutation tests. For p = 10, p1 = .1, pj = .5 for j ≥ 2, ε ~ N(0, 1), ρ = .2, and n = 100, Figure 2 displays empirical Type I error rates for testing $H_0: \beta_j = \beta_j^0$ (j = 1, 2, 3) using the naive approach, permutation tests A and B, and the sample-splitting approach when the selected covariate is X1 (panel a), X2 (panel b), or X3 (panel c), for β1 varying from 0 to 1. The condition for permutation test A does not hold in panel (a) or (b), and the condition for permutation test B does not hold in panel (b) or (c); hence the respective curves indicate the robustness of these tests to violations of their conditions. In all cases, a nominal .05 level test and 2000 simulations were used.
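
A selector of this kind can be sketched as follows. The pooled-variance form of the t-statistic and the use of the signed (rather than absolute) statistic are assumptions of this illustration, as are the function names; each covariate must take both values 0 and 1 in the data, and the pooled variance must be positive.

```python
from math import sqrt


def two_sample_t(y, x):
    """Pooled-variance t-statistic for mean(Y | X = 1) - mean(Y | X = 0)."""
    g1 = [yi for yi, xi in zip(y, x) if xi == 1]
    g0 = [yi for yi, xi in zip(y, x) if xi == 0]
    n1, n0 = len(g1), len(g0)
    m1, m0 = sum(g1) / n1, sum(g0) / n0
    ss = sum((v - m1) ** 2 for v in g1) + sum((v - m0) ** 2 for v in g0)
    pooled_var = ss / (n1 + n0 - 2)
    return (m1 - m0) / sqrt(pooled_var * (1 / n1 + 1 / n0))


def max_t_selector(columns, y):
    """Index of the binary covariate with the largest t-statistic."""
    stats = [two_sample_t(y, col) for col in columns]
    return max(range(len(stats)), key=lambda j: stats[j])
```

Because exactly one covariate is always selected, conditioning on the selected index partitions the permutations into p groups, and the restricted permutation test uses only the group that reproduces the observed choice.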

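Figure 2. Type I Error Estimates. Panel (a): $\mathcal{S}$ = {X1}, Testing $H_0: \beta_1 = \beta_1^0$. Panel (b): $\mathcal{S}$ = {X2}, Testing $H_0: \beta_2 = \beta_2^0$, Where $\beta_2 = 2\rho\beta_1\sqrt{p_1(1-p_1)}$. Panel (c): $\mathcal{S}$ = {X3}, Testing $H_0: \beta_3 = \beta_3^0$. Horizontal ...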

In panel (a), the naive test is severely biased for smaller values of β1 and improves for increasing β1 because the probability of selecting X1 approaches 1. The Type I error rates for both permutation tests and for the sample-splitting approach are close to the nominal values for all values of β1, even though the conditions for restricted permutation test A are not met. The small distortion for permutation test A is in part due to the small dependence between X1 and other covariates.

In panel (b), where X2 is selected, the naive test has highly distorted Type I error for all β1. The conditions for both permutation tests are violated, and their Type I errors climb above .05 when β1 > .5 (β2 > .06), increasing to .15 when β1 = 1 (β2 = .12). Compared to the case where X1 is selected, the larger distortion using permutation test A is likely due to the fact that X2 is dependent on both X1 and $\tilde{Y}$, while when S = {X1}, X1 is dependent only on X2.
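The values of β2 quoted here can be checked from the slope of the marginal regression of Y on X2. The short derivation below is a sketch that assumes the data-generating model is Y = β1X1 + ε with ε independent of the covariates, and that X1 and X2 are binary with success probabilities p1 and p2 = .5 and correlation ρ; this model is defined outside the present passage and is taken here as an assumption.

$\beta_2 = \frac{\mathrm{Cov}(X_2, Y)}{\mathrm{Var}(X_2)} = \beta_1 \frac{\rho\sqrt{p_1(1-p_1)\,p_2(1-p_2)}}{p_2(1-p_2)} = \frac{\rho\beta_1\sqrt{p_1(1-p_1)}}{\sqrt{p_2(1-p_2)}} = 2\rho\beta_1\sqrt{p_1(1-p_1)} \quad (p_2 = .5)$

With ρ = .2 and p1 = .1, this gives β2 = 2(.2)(.3)β1 = .12β1, so that β2 = .06 when β1 = .5 and β2 = .12 when β1 = 1, in agreement with the values in parentheses above.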

In panel (c), where X3 is selected, the curves represent the Type I errors for the hypothesis H0: β3 = 0. Here the condition for permutation test A is satisfied and the actual Type I error is very close to the nominal value. The condition for permutation test B is not satisfied, resulting in an inflated Type I error which begins at β1 ≈ .4 and increases to about .15 when β1 = 1. This occurs because as β1 increases, the association between $\tilde{Y} = Y$ and X1 increases, resulting in stronger dependency between $\tilde{D}_P = X$ and $\tilde{D}_F = Y$. The naive test is severely distorted for all values of β1.

We compared the power of permutation test B to the sample-splitting approach for testing H0: β1 = 0 when X1 is selected, for β1 ranging from 0 to 1.0, and for different choices of p1, p2, ρ, and n. Results are based on 2000 simulations for Type I error and 1000 simulations for power. In all cases that we examined, the power of the restricted permutation test exceeds that of the sample-splitting approach. A typical pattern of the comparisons is shown in Figure 3 for p1 = p2 = .1 and varying ρ (= 0, .2, .5) and n (= 100, 500). Very similar results were obtained for other choices of parameter values (data available upon request). Also shown in Figure 3 are the power curves for the unachievable test of association between X1 and Y without model selection (that is, if we fit the marginal model relating X1 and Y without a variable selection step). The lower power of the restricted permutation test relative to the test that does not undertake model selection can be viewed as the “cost” of variable selection.

Figure 3. Power Comparison of Permutation Test B (Solid) to Sample-Splitting (Dotted), when S = {X1} and Testing H0: β1 = 0 in Example 1, for p1 = p2 = .1, when n = 500 (panels (a)–(c)) or 100 (panels (d)–(f)) and ρ = 0 (panels (a), ...

Table 3 summarizes the performance of the nominal 95% confidence intervals obtained from the naive approach, the sample-splitting approach, and from permutation tests A and B when n = 100, 500 and β1 = .05, .2. The shaded areas represent settings where the conditions (specified in Propositions 2 and 3) for the restricted permutation tests are satisfied, and thus the confidence intervals are exact. For the remaining values, the corresponding coverage results reflect the robustness of the methods to violations of these conditions. The coverage of the naive confidence intervals is often substantially lower than the nominal value. The coverage of the restricted permutation intervals is very close to the nominal value when the conditions for these tests hold, and offers a substantial improvement over naive methods when they are violated. Note also that the width of the restricted permutation test intervals is only moderately greater than that of the naive intervals. Because the width of the naive approach coincides with the width for the approach without variable selection, the relative increase in the width of the permutation intervals can be viewed as the “cost” paid for applying this variable selector. While the coverage for the sample-splitting approach is always close to the nominal value, the intervals are wider than those obtained from the restricted permutation methods, reflecting a loss of efficiency because only 50% of the sample was used in their construction.

Table 3. Empirical Coverage Probabilities and Mean ± SD of Width of Nominal 95% CIs for βj (j = 1, 2, 3) for Different β1 and Sample Sizes (n). Based on 500 Replications. Underlined Boldfaced Numbers Indicate Situations Where Conditions ...

#### 3.2.2 Two variable selectors

The proposed methods allow comparisons of different variable selectors. Consider the same setting as above but now with p = 3 covariates, and with p1 = .1, p2 = .2, p3 = .3, ρ = .5, and β1 = 0, .2. Let R1 be the selector used in Section 3.2.1, and let R2 select those covariates whose marginal association with Y yields a t-statistic greater in absolute value than 1.96. Table 4 gives the actual coverage probabilities for the nominal 95% confidence intervals from the naive approach, the sample-splitting approach, and for permutation test B when S = {X1}. The coverage probabilities for the intervals obtained from the restricted permutation methods and the sample-splitting approach are all very close to the nominal levels, while those for the naive approach are lower than the nominal values. The intervals obtained from sample-splitting are wider than those obtained from the restricted permutation methods for both variable selectors. The permutation intervals following use of R2 are about 10% wider than those following use of R1. For R2, where the distortion of the naive intervals is greater, the exact permutation intervals are 16%–18% wider than the naive intervals. In contrast, for R1, where the naive intervals are less distorted, the exact permutation intervals are only 3%–5% wider. Because the expected width of the naive intervals is the same as that if there were no variable selection, one can view the “cost” of correcting for variable selection as being greater for R2 than for R1.
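The second selector described above can be sketched in the same way. The Python fragment below returns every binary covariate whose two-sample t-statistic exceeds 1.96 in absolute value. The function names, the pooled-variance statistic, and the toy data are assumptions for illustration only; the text does not state how the t-statistic is computed.

```python
import math
from statistics import mean


def two_sample_t(x, y):
    """Pooled-variance t-statistic for mean(y | x == 1) - mean(y | x == 0)."""
    y1 = [yi for xi, yi in zip(x, y) if xi == 1]
    y0 = [yi for xi, yi in zip(x, y) if xi == 0]
    n1, n0 = len(y1), len(y0)
    if n1 < 2 or n0 < 2:
        return 0.0  # assumption: a degenerate covariate is treated as unassociated
    m1, m0 = mean(y1), mean(y0)
    ss = sum((v - m1) ** 2 for v in y1) + sum((v - m0) ** 2 for v in y0)
    pooled_var = ss / (n1 + n0 - 2)  # assumed to be nonzero
    return (m1 - m0) / math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n0))


def select_threshold(X, y, cutoff=1.96):
    """Return the indices of all covariates with |t| greater than the cutoff."""
    return [j for j, col in enumerate(zip(*X)) if abs(two_sample_t(col, y)) > cutoff]


# Toy data (hypothetical): rows are subjects, columns are binary covariates.
X = [[1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1]]
y = [2.0, 2.5, 3.0, 0.0, 0.5, 1.0]
chosen = select_threshold(X, y)
```

Unlike the maximum-based selector, this rule can return several covariates or none at all, which is one reason the two selectors can carry different costs.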

Table 4. Empirical Coverage Probabilities and Mean ± SD of the Width of the Nominal 95% CIs for β1, for Different Variable Selectors. Based on 1000 Replications.

## 4. AN EXAMPLE

We illustrate the proposed methods using the results from an immunological study conducted by the AIDS Clinical Trials Group (Malhotra et al., 2004) to assess the association between several immunological markers measured at enrollment and Y, the change in CD4+ T-cell count by study week 24. There were n = 59 patients and p = 15 covariates consisting of nine immunological markers obtained from flow cytometry, four stimulation indices obtained from lymphocyte proliferation (LP) assays, and two binary variables to describe which of three treatments a patient received. The nine immunological markers were the percentage of CD8 T cells in lymphocyte; the percentages of CD4+ T cells expressing: naive markers (nCD4%), activation markers, coexpressing CD28+, and coexpressing Fas; and the percentages of CD8+ T cells expressing: naive markers (nCD8%), activation markers, coexpressing CD28+, and coexpressing Fas (CD8+95+%). The four stimulants used in LP assays were baculovirus control protein, Candida, baculovirus-expressed recombinant HIV-1LAI p24, and HIV-1MN gp160.

We considered the variable selector that separately examines the marginal association of each covariate with Y, using least squares, and selects all covariates showing some evidence of being associated with Y, defined as a significance level (p-value) of .10 or less. This led to the selection of 3 covariates: nCD4% (p=.002), nCD8% (p=.036), and CD8+95+% (p=.021). We then considered a linear model for the marginal association of these covariates with Y. As seen in the left side of Table 5, the naive multivariate (least-squares) analysis indicates that Y is significantly associated with nCD4% (p=.005), possibly associated with CD8+95+% (p=.09), and not associated with nCD8% (p=.37). The disparate p-values for the univariate and multivariate analysis of nCD8% are likely due to its correlation (.38) with nCD4%. The naive likelihood ratio test of the global hypothesis that none of the three covariates are associated with Y gives p < .001.

Table 5. Results from the AIDS Study

We subsequently analyzed the marginal association between the three selected covariates and response using the restricted permutation methods (Table 5, right). We employed Proposition 3 to assess the global hypothesis that none were associated with response, using the log-likelihood ratio statistic. Use of Proposition 3, which assumes that the selected covariates can fully explain the association between Y and all the covariates (Condition C2), was motivated by the liberal criterion (p ≤ .10) used to select covariates. The global permutation test provides evidence (p=.026) against the hypothesis that all regression coefficients were zero. We then employed Proposition 6 to assess the individual covariates. Proposition 6 also assumes that each of the 3 selected covariates can be expressed as a linear combination of the remaining covariates plus an unrelated error term. This assumption was empirically supported by examining scatter-plots of the residuals from each fitted model relating a selected covariate with the remaining covariates. As a test statistic, we used the least-squares estimator of the coefficient of Z1 (see Proposition 6) in the model relating $\tilde{Y}$ with Z1 and the other two selected covariates. After correcting for variable selection, the association of nCD4% and response remained significant (p=.027), the association for CD8+95+% became marginally significant (p=.051) and the association for nCD8% remained non-significant. The 95% confidence intervals obtained from the restricted permutation methods are wider than the naive confidence intervals, incorporating the additional uncertainty about the regression coefficients due to variable selection.

With p = 15 covariates and only n = 59 observations, most analysts would not likely use a sample-splitting approach to analyze these data out of concern that some important covariates might not be selected, and that even if selected, there might be inadequate power to demonstrate their association with Y. Using the same variable selector and randomly splitting the data into variable selection (n=30) and testing (n=29) sets resulted in nCD4% being selected 87% of the time. When the same 3 covariates were selected as with the full data, nCD4% was no longer significant at the .05 level 32% of the time.
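The random splitting used in this comparison can be sketched as follows. The Python fragment below splits the row indices at random into a selection set and a testing set, applies a user-supplied selector to the first, and returns the selected columns of the second for a standard analysis. The function names and the `selector` callback are hypothetical, and the sketch makes no claim about how the splits in the study were actually generated.

```python
import random


def split_sample(n, n_select, seed=None):
    """Randomly split indices 0..n-1 into a selection set and a testing set."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return sorted(idx[:n_select]), sorted(idx[n_select:])


def sample_split_analysis(X, y, selector, n_select, seed=None):
    """Select covariates on one part of the data; return the held-out part.

    `selector` is any function (X_rows, y_values) -> list of column indices.
    """
    sel_idx, test_idx = split_sample(len(y), n_select, seed)
    selected = selector([X[i] for i in sel_idx], [y[i] for i in sel_idx])
    # Keep only the selected columns in the held-out data.
    X_test = [[X[i][j] for j in selected] for i in test_idx]
    y_test = [y[i] for i in test_idx]
    return selected, X_test, y_test


# Sizes taken from the example above: 59 patients, 30 used for selection.
selection_rows, testing_rows = split_sample(59, 30, seed=0)
```

Repeating such a split many times and recording how often a covariate is selected, or remains significant in the held-out half, gives the kind of percentages reported in this paragraph.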

## 5. DISCUSSION

Despite the recognized biases resulting from use of naive inference methods that do not account for variable selection, such methods are still commonly used in practice, in large part due to a lack of alternatives. The methods proposed in this paper, based on restricted permutation tests, provide exact or approximate inferences about the marginal association between a subset of covariates and a response variable under specific conditions. The methods are not tied to any specific variable selector and do not require that the inference be restricted to covariates that are not candidates for exclusion by the variable selector. The proposed methods do not apply to all settings, as indicated by the assumptions made in the Propositions. However, an advantage is that conditions for their validity can be specified and most can be checked empirically. When a lenient (non-parsimonious) variable selector is used, it is more likely that all important covariates in the full model will be selected. In such cases, the regression coefficient for a covariate in the marginal model will be the same as the coefficient in the full model, and hence the proposed methods can be used to make inferences on associations in the overall model. However, when some important covariates are not selected, the regression coefficients in the full and marginal models will not, in general, be the same.

The results in Figures 1 and 3 suggest that use of the proposed methods can provide more powerful tests than a sample-splitting approach. Because sample-splitting does not require the assumptions made by the restricted permutation methods, it would be preferred when the sample size is sufficiently large. However, in many settings it may not be feasible to split a sample, in which case the proposed methods offer an appealing alternative. Further investigation of the relative efficiency of these approaches would be worthwhile. For small or moderate sample sizes, a related advantage of the proposed methods over sample-splitting is that, due to the larger sample size on which they are based, it is more likely that important covariates will be identified.

Several considerations arise in the implementation of the methods. Firstly, rather than enumerating all n! permutations to identify the restricted subset ΠR, it is sufficient to sample from the n! permutations. In the results presented in sections 3 and 4, we sampled enough permutations to yield 1000 that met the restriction criterion, and found that this gave results very similar to those obtained when 2000 were used. Secondly, when the variable selector is sequential (cf: DiRienzo et al. 2003) or in certain special cases, the variable selector does not need to be fully evaluated on each reverse-transformed permuted dataset to determine whether it is in the restricted set. For example, in section 3.2 when X3 is selected and Permutation Test A is applied, V1, V2, V4, ···, V10 do not change when evaluating the variable selector for the n! permuted datasets, and thus need not be re-computed. Finally, when constructing a confidence region for a parameter, our experience has been that these regions are intervals and thus computation time can be greatly shortened by searching for the interval endpoints rather than evaluating all possible parameters.
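The sampling strategy in the first point can be sketched as follows. The Python fragment below draws random permutations, reverse-transforms each permuted dataset, keeps only those for which the selector returns the original selection, and stops once a target number has been accepted. It is a minimal sketch in the spirit of Permutation Test B for a single selected binary covariate, assuming that the rows of X are permuted relative to the transformed response; the function names, the maximum-t selector, the two-sided comparison of statistics, and the `max_tries` safeguard are all assumptions rather than details taken from the text.

```python
import math
import random
from statistics import mean


def two_sample_t(x, y):
    """Pooled-variance t-statistic for mean(y | x == 1) - mean(y | x == 0)."""
    y1 = [yi for xi, yi in zip(x, y) if xi == 1]
    y0 = [yi for xi, yi in zip(x, y) if xi == 0]
    n1, n0 = len(y1), len(y0)
    if n1 < 2 or n0 < 2:
        return float("-inf")  # assumption: degenerate covariates are never selected
    m1, m0 = mean(y1), mean(y0)
    ss = sum((v - m1) ** 2 for v in y1) + sum((v - m0) ** 2 for v in y0)
    pooled_var = ss / (n1 + n0 - 2)  # assumed to be nonzero
    return (m1 - m0) / math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n0))


def select_argmax(X, y):
    """Index of the covariate with the largest t-statistic."""
    stats = [two_sample_t(col, y) for col in zip(*X)]
    return max(range(len(stats)), key=stats.__getitem__)


def restricted_permutation_pvalue(X, y, beta0=0.0, n_keep=1000,
                                  max_tries=100000, seed=0):
    """Sampled restricted permutation p-value for H0: beta_s = beta0."""
    rng = random.Random(seed)
    n = len(y)
    s = select_argmax(X, y)  # selection made on the original data
    # Transformed response under H0.
    y_tilde = [yi - beta0 * row[s] for row, yi in zip(X, y)]
    t_obs = two_sample_t([row[s] for row in X], y_tilde)
    kept = []
    for _ in range(max_tries):
        if len(kept) == n_keep:
            break
        perm = rng.sample(range(n), n)  # a random permutation of the rows
        Xp = [X[i] for i in perm]
        # Reverse-transform to the original scale before re-running the selector.
        y_back = [yt + beta0 * row[s] for row, yt in zip(Xp, y_tilde)]
        if select_argmax(Xp, y_back) == s:  # restriction: same selection
            kept.append(two_sample_t([row[s] for row in Xp], y_tilde))
    if not kept:
        raise RuntimeError("no permutation satisfied the restriction")
    p_value = sum(1 for t in kept if abs(t) >= abs(t_obs)) / len(kept)
    return p_value, len(kept)
```

Inverting such a test over a grid of `beta0` values, or searching only for the two endpoints as suggested above, would give a confidence interval; that step is omitted here.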

When a specific covariate is known to be of interest a priori, the variable selector would be chosen to always include this covariate. For example, in a randomized clinical trial, the covariate denoting the treatment effect would always be included by the variable selector, and other candidate covariates might be selected in the hope of leading to a more efficient inference about the treatment effect. For linear models, the randomization of treatments ensures that the regression coefficient for the treatment effect is the same regardless of which additional covariates are selected. The proposed procedure would produce valid inference for the treatment effect conditional on each of the candidate models, and therefore would also produce valid inference unconditionally. If one performs variable selection and then wishes to make an inference about a covariate that was not selected, the proposed methods also can be used. For example, if the covariate X2 is selected and one wishes to make an inference about the marginal association between Y and the unselected covariate X1, the proposed methods still apply, except that now the restriction set will consist of all the permuted data matrices (properly transformed) which lead to the same outcome ({X2}) as the original dataset.

Throughout we have assumed that the explanatory variables are random. In some settings, such as when Condition B can be utilized, the proposed methods can be applied directly for fixed X. In other settings, such as when one wishes to make inference about an individual coefficient in the regression models, we believe that the proposed approach can be modified by adapting permutation methods for fixed X developed for settings where there is no variable selection (cf: Huh & Jhun 2001). However, careful investigation and evaluation of the properties of such methods for fixed X are needed.

Although we consider linear regression models, the proposed methods can in principle be applied with any model for which an appropriate transformation and partition can be identified. Extensions to making inferences about discrete covariates in linear regression and about parameters in a Cox model for survival data are under development. It would also be useful to undertake more assessments of the robustness of the proposed methods, to develop criteria for selecting a test statistic, and to develop computationally efficient algorithms, especially when a computationally intensive variable selector is employed (cf: DiRienzo et al. 2003).

## Acknowledgments

This research was supported by grants from the US National Institute of Allergy and Infectious Diseases. We thank Professor Paul Gustafson, the Associate Editor and a referee for their comments which have led to an improved version of the paper.

## APPENDIX

#### Proof of Theorem 1

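Under H0, for any $\tilde{d}^{(l)}$ and $\tilde{d}^{(m)}$ in ΠR,

$\begin{aligned} P(\tilde{D} = \tilde{d}^{(l)} \mid \tilde{D}_P \overset{p}{=} \tilde{d}_P, \tilde{D}_F = \tilde{d}_F, R(D) = x_S) &= P(g^{-1}(\tilde{D}) = g^{-1}(\tilde{d}_P^{\,l}, \tilde{d}_F) \mid \tilde{D}_P \overset{p}{=} \tilde{d}_P, \tilde{D}_F = \tilde{d}_F, R(D) = x_S) \\ &= P(D = g^{-1}(\tilde{d}_P^{\,l}, \tilde{d}_F) \mid \tilde{D}_P \overset{p}{=} \tilde{d}_P, \tilde{D}_F = \tilde{d}_F, R(D) = x_S) \\ &= \frac{P(D = g^{-1}(\tilde{d}_P^{\,l}, \tilde{d}_F), R(D) = x_S \mid \tilde{D}_P \overset{p}{=} \tilde{d}_P, \tilde{D}_F = \tilde{d}_F)}{P(R(D) = x_S \mid \tilde{D}_P \overset{p}{=} \tilde{d}_P, \tilde{D}_F = \tilde{d}_F)} \\ &= \frac{P(D = g^{-1}(\tilde{d}_P^{\,l}, \tilde{d}_F), R(g^{-1}(\tilde{d}_P^{\,l}, \tilde{d}_F)) = x_S \mid \tilde{D}_P \overset{p}{=} \tilde{d}_P, \tilde{D}_F = \tilde{d}_F)}{P(R(D) = x_S \mid \tilde{D}_P \overset{p}{=} \tilde{d}_P, \tilde{D}_F = \tilde{d}_F)} \\ &= \frac{P(D = g^{-1}(\tilde{d}_P^{\,m}, \tilde{d}_F), R(g^{-1}(\tilde{d}_P^{\,m}, \tilde{d}_F)) = x_S \mid \tilde{D}_P \overset{p}{=} \tilde{d}_P, \tilde{D}_F = \tilde{d}_F)}{P(R(D) = x_S \mid \tilde{D}_P \overset{p}{=} \tilde{d}_P, \tilde{D}_F = \tilde{d}_F)}, \end{aligned}$

because of the row independence of $\tilde{D}$ and independence of $\tilde{D}_P$ and $\tilde{D}_F$. Thus,

$\begin{aligned} P(\tilde{D} = \tilde{d}^{(l)} \mid \tilde{D}_P \overset{p}{=} \tilde{d}_P, \tilde{D}_F = \tilde{d}_F, R(D) = x_S) &= \frac{P(D = g^{-1}(\tilde{d}_P^{\,m}, \tilde{d}_F), R(D) = x_S \mid \tilde{D}_P \overset{p}{=} \tilde{d}_P, \tilde{D}_F = \tilde{d}_F)}{P(R(D) = x_S \mid \tilde{D}_P \overset{p}{=} \tilde{d}_P, \tilde{D}_F = \tilde{d}_F)} \\ &= P(g^{-1}(\tilde{D}) = g^{-1}(\tilde{d}_P^{\,m}, \tilde{d}_F) \mid \tilde{D}_P \overset{p}{=} \tilde{d}_P, \tilde{D}_F = \tilde{d}_F, R(D) = x_S) \\ &= P(\tilde{D} = \tilde{d}^{(m)} \mid \tilde{D}_P \overset{p}{=} \tilde{d}_P, \tilde{D}_F = \tilde{d}_F, R(D) = x_S). \end{aligned}$

Since $\tilde{d}^{(l)}$ and $\tilde{d}^{(m)}$ are arbitrary, it follows that this common probability equals 1/M, where M is the number of matrices in ΠR.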

#### Proof of Proposition 1

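Rewrite (2.1) as $Y = \beta_S^* X_S + \beta_{\setminus S}^* X_{\setminus S} + \varepsilon^* = \beta_S X_S + \varepsilon$. Under (C1), $\beta_S = \beta_S^*$ (Cochran 1938) and $\varepsilon = \beta_{\setminus S}^* X_{\setminus S} + \varepsilon^*$. (C1) and the independence of $\varepsilon^*$ and $X = (X_S, X_{\setminus S})$ imply that $X_S \perp (X_{\setminus S}, \varepsilon^*)$, and hence $X_S$ is independent of $\varepsilon$. If (C2) holds, then $\beta_{\setminus S}^* = 0$ and $\varepsilon = \varepsilon^* \perp X$. If (C3) holds, then $\beta_S = \beta_S^* + \beta_{\setminus S}^* \gamma_S$ and $\varepsilon = \beta_{\setminus S}^* \tilde{\varepsilon} + \varepsilon^*$, where $\tilde{\varepsilon} = X_{\setminus S} - \gamma_S X_S$. Since $\varepsilon^* \perp (X_S, \tilde{\varepsilon})$ and $X_S \perp \tilde{\varepsilon}$, we have $X_S \perp (\tilde{\varepsilon}, \varepsilon^*)$ and therefore $X_S \perp \varepsilon$.

In Propositions 2–7* below, it is easily verified that g is one-to-one, and that for any d in the support of D, $g^{-1}(\tilde{d}^{(l)})$ is in the support of D. Thus it is sufficient to show that the variables consisting of $\tilde{D}_P$ and $\tilde{D}_F$ are independent under H0.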

#### Proof of Proposition 2

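Suppose (C1) holds. Under H0, $\tilde{Y} = Y - \beta_S X_S = \beta_{\setminus S}^* X_{\setminus S} + \varepsilon^*$ (see the Proof of Proposition 1), and therefore $X_S \perp \tilde{Y} \mid X_{\setminus S}$. This, combined with $X_S \perp X_{\setminus S}$, yields $X_S \perp (X_{\setminus S}, \tilde{Y})$.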

#### Proof of Proposition 3

Under H0, if (C2) holds, then $\tilde{Y} = \varepsilon^* \perp X$.

#### Proof of Proposition 4

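Suppose (C3) holds. Under H0, $\tilde{Y} = \beta_{\setminus S}^* Z_{\setminus S} + \varepsilon^*$ (see the Proof of Proposition 1), and thus $X_S \perp \tilde{Y} \mid Z_{\setminus S}$. This, in combination with $X_S \perp Z_{\setminus S}$, implies that $X_S \perp (Z_{\setminus S}, \tilde{Y})$.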

The following propositions, denoted with asterisks, generalize Propositions 5–7 for testing an individual covariate to any proper subset of the selected covariates. We use $X_H$, $X_{S \setminus H}$, or $X_{\setminus H}$ to denote the vector formed by the random variables in a subset $H = \{X_{l_1}, \cdots, X_{l_h}\} \subset S$, in $S \setminus H$, or in the set of all covariates outside $H$, respectively. Consider $H_0: \beta_H = \beta_H^0$, for some $\beta_H^0$.

##### Proposition 5*

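Suppose that (C2) holds under H0 and $X_H \perp X_{\setminus H}$. Let $g(X, Y) = (X, \tilde{Y})$, where $\tilde{Y} = Y - \beta_H^0 X_H$, and let $\tilde{D}_P = X_H$ and $\tilde{D}_F = (X_{\setminus H}, \tilde{Y})$. Then g(·) and the partition $(\tilde{D}_P, \tilde{D}_F)$ satisfy the conditions of Theorem 1.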

#### Proof

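Write (2.1) as $Y = \beta_H^* X_H + \beta_{S \setminus H}^* X_{S \setminus H} + \beta_{\setminus S}^* X_{\setminus S} + \varepsilon^*$. Since $X_H \perp X_{\setminus H}$, $\beta_H^* = \beta_H$ (Cochran 1938), and when (C2) holds, $\beta_{\setminus S}^* = 0$. Thus, under H0, $\tilde{Y} = \beta_{S \setminus H}^* X_{S \setminus H} + \varepsilon^*$, and therefore $X_H \perp \tilde{Y} \mid X_{\setminus H} = (X_{S \setminus H}, X_{\setminus S})$. This combined with $X_H \perp X_{\setminus H}$ implies that $X_H \perp (X_{\setminus H}, \tilde{Y})$.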

##### Proposition 6*

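Suppose that (C2) holds under H0 and $X_{l_j}$, j = 1, 2, ···, h, are continuous covariates. Suppose that $X_H = \Gamma_{\setminus H} X_{\setminus H} + e$, where $e \perp X_{\setminus H}$. Define $Z_H = X_H - \Gamma_{\setminus H} X_{\setminus H}$. Let $g(X, Y) = (Z_H, X_{\setminus H}, \tilde{Y})$, where $\tilde{Y} = Y - \beta_H^0 Z_H$, and $\tilde{D}_P = Z_H$, $\tilde{D}_F = (X_{\setminus H}, \tilde{Y})$. Then g(·) and the partition $(\tilde{D}_P, \tilde{D}_F)$ satisfy the conditions of Theorem 1.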

#### Proof

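Under (C2), we can write $Y = \beta_H^* X_H + \beta_{S \setminus H}^* X_{S \setminus H} + \varepsilon^* = \beta_H^* Z_H + \beta_H^* \Gamma_{\setminus H} X_{\setminus H} + \beta_{S \setminus H}^* X_{S \setminus H} + \varepsilon^* = \beta_H^* Z_H + \alpha X_{\setminus H} + \varepsilon^*$, where the elements of α combine coefficient terms from $X_{\setminus H}$ and $X_{S \setminus H}$. Under H0, $\tilde{Y} = \alpha X_{\setminus H} + \varepsilon^*$, and hence $Z_H \perp \tilde{Y} \mid X_{\setminus H}$. Also note that $Z_H \perp X_{\setminus H}$. It follows that $Z_H \perp (X_{\setminus H}, \tilde{Y})$.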

##### Proposition 7*

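Suppose that (C3) holds, $X_{l_1}, \cdots, X_{l_h}$ are continuous, and $X_H = \delta_{S \setminus H} X_{S \setminus H} + \varepsilon_H$, where $\varepsilon_H \perp X_{S \setminus H}$. Define $Z_H = X_H - \delta_{S \setminus H} X_{S \setminus H}$ and $Z_{\setminus S} = X_{\setminus S} - \gamma_H Z_H$. Let $g(X, Y) = (Z_H, X_{S \setminus H}, Z_{\setminus S}, \tilde{Y})$, where $\tilde{Y} = Y - \beta_H^0 Z_H$, and define $\tilde{D}_P = Z_H$ and $\tilde{D}_F = (X_{S \setminus H}, Z_{\setminus S}, \tilde{Y})$. Then g(·) and the partition $(\tilde{D}_P, \tilde{D}_F)$ satisfy the conditions of Theorem 1. In the special case where X is normally distributed, $\delta_{S \setminus H} = \Sigma_{H, S \setminus H} \Sigma_{S \setminus H}^{-1}$, where $\Sigma_{H, S \setminus H} = \mathrm{cov}(X_H, X_{S \setminus H})$ and $\Sigma_{S \setminus H} = \mathrm{var}(X_{S \setminus H})$.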

#### Proof

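Suppose that (C3) holds. $\tilde{\varepsilon} \perp (X_H, X_{S \setminus H})$ implies that $\tilde{\varepsilon} \perp (Z_H, X_{S \setminus H})$. This combined with $Z_H \perp X_{S \setminus H}$ (by construction) implies that $Z_H \perp (X_{S \setminus H}, \tilde{\varepsilon})$. Note that $Z_{\setminus S} = X_{\setminus S} - \gamma_H Z_H = (\gamma_H \delta_{S \setminus H} + \gamma_{S \setminus H}) X_{S \setminus H} + \tilde{\varepsilon}$, therefore $Z_H \perp (X_{S \setminus H}, Z_{\setminus S})$. Under H0, $\tilde{Y} = (\beta_H^* \delta_{S \setminus H} + \beta_{S \setminus H}^*) X_{S \setminus H} + \beta_{\setminus S}^* Z_{\setminus S} + \varepsilon^*$, therefore $Z_H \perp \tilde{Y} \mid (X_{S \setminus H}, Z_{\setminus S})$. Thus $Z_H \perp (X_{S \setminus H}, Z_{\setminus S}, \tilde{Y})$.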

## Contributor Information

Rui WANG, Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA.

Stephen W. LAGAKOS, Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA.

## References

• Akaike H. Information Theory and the Maximum Likelihood Principle. In: Petrov V, Csáki F, editors. International Symposium on Information Theory. Budapest: Akademiai Kiádo; 1973. pp. 267–281.
• Breiman L. The Little Bootstrap and Other Methods for Dimensionality Selection in Regression: X-fixed Prediction Error. Journal of the American Statistical Association. 1992;419:739–754.
• Chatfield C. Model Uncertainty, Data Mining and Statistical Inference (with discussion) Journal of the Royal Statistical Society Series A. 1995;158:419–466.
• Cochran WG. The Omission or Addition of an Independent Variable in Multiple Linear Regression. Supplement to the Journal of the Royal Statistical Society. 1938;5:171–176.
• Cox DR. On a Generalization of a Result of W. G. Cochran. Biometrika. 2007;94:755–759.
• Danilov DL, Magnus JR. On the Harm That Ignoring Pre-testing Can Cause. Journal of Econometrics. 2004;122:27–46.
• Dijkstra YK, Veldkamp JH. Data-Driven Selection of Regressors and the Bootstrap. In: Dijkstra TK, editor. On Model Uncertainty and Its Statistical Implications. Berlin: Springer-Verlag; 1988.
• DiRienzo G, DeGruttola V, Larder B, Hertogs K. Non-Parametric Methods to Predict HIV Drug Susceptibility Phenotype from Genotype. Statistics in Medicine. 2003;22:2785–2798.
• Efron B. How Biased is the Apparent Error Rate of a Prediction Rule? Journal of the American Statistical Association. 1986;81:461–470.
• Faraway JJ. On the Cost of Data Analysis. Journal of Computational and Graphical Statistics. 1992;1:213–229.
• Freedman DA, Navidi W, Peters SC. On the Impact of Variable Selection in Fitting Regression Equations. In: Dijkstra TK, editor. On Model Uncertainty and Its Statistical Implications. Berlin: Springer-Verlag; 1988.
• Giri K, Kabaila P. The Coverage Probability of Confidence Intervals in 2^r Factorial Experiments after Preliminary Hypothesis Testing. Australian & New Zealand Journal of Statistics. 2008;50:69–79.
• Gong S. Cross-Validation, the Jackknife, and the Bootstrap: Excess Error Estimation in Forward Logistic Regression. Journal of the American Statistical Association. 1986;81:108–118.
• Huh M, Jhun M. Random Permutation Testing in Multiple Linear Regression. Communications in Statistics – Theory and Methods. 2001;30:2023–2032.
• Hurvich CM, Tsai C. The Impact of Model Selection on Inference in Linear Regression. The American Statistician. 1990;44:214–217.
• Kabaila P. The Effect of Model Selection on Confidence Regions and Prediction Regions. Econometric Theory. 1995;11:537–549.
• Kabaila P. Valid Confidence Intervals in Regression After Variable Selection. Econometric Theory. 1998;14:463–482.
• Kabaila P, Leeb H. On the Large-Sample Minimal Coverage Probability of Confidence Intervals After Model Selection. Journal of the American Statistical Association. 2006;101:619–629.
• Leeb H. The Distribution of a Linear Predictor After Model Selection: Conditional Finite-sample Distributions and Asymptotic Approximations. Journal of Statistical Planning and Inference. 2005;134:64–89.
• Leeb H. Conditional Predictive Inference Post Model Selection. The Annals of Statistics. 2009 forthcoming.
• Leeb H, Pötscher BM. The Finite-sample Distribution of Post-model-selection Estimators, and Uniform Versus Non-uniform Approximations. Econometric Theory. 2003;19:100–142.
• Leeb H, Pötscher BM. Model Selection and Inference: Facts and Fictions. Econometric Theory. 2005;21:21–59.
• Lossos IS, Czerwinski DK, Alizadeh AA, Wechser MA, Tibshirani R, Botstein D, Levy R. Prediction of Survival in Diffuse Large-B-Cell Lymphoma Based on the Expression of Six Genes. New England Journal of Medicine. 2004;350:1828–1837.
• Malhotra U, Bosch RJ, Chan E, Wang R, Fischl MA, Collier AC, McElrath MJ. Association of T Cell Proliferative Responses and Phenotype with Virus Control in Chronic Progressive HIV-1 Disease. The Journal of Infectious Diseases. 2004;189:515–519.
• Miller AJ. Selection of Subsets of Regression Variables (with discussion) Journal of the Royal Statistical Society Series A. 1984;147:398–425.
• Pötscher BM. Model Selection Under Nonstationarity: Autoregressive models and Stochastic Linear Regression Models. Annals of Statistics. 1989;17:1257–1274.
• Pötscher BM. Effects of Model Selection on Inference. Econometric Theory. 1991;7:163–185.
• Pötscher BM. Comment on The Effect of Model Selection on Confidence Regions and Prediction Regions. Econometric Theory. 1995;11:550–559.
• Pötscher BM, Novák AJ. The Distribution of Estimators After Model Selection: Large and Small Sample Results. Journal of Statistical Computation and Simulation. 1998;60:19–56.
• Schwarz G. Estimating the dimension of a model. Annals of Statistics. 1978;6:461–464.
• Shen X, Huang H, Ye J. Inference After Model Selection. Journal of the American Statistical Association. 2004;467:751–762.
• Shulman NS, Bosch RJ, Mellors JW, Albrecht MA, Katzenstein DA. Genetic Correlates of Efavirenz Hypersusceptibility. AIDS. 2004;18:1781–1785.
• Shibata R. Selection of the Order of an Autoregressive Model by Akaike’s Information Criterion. Biometrika. 1976;63:117–126.
• Veall M. Bootstrapping the Process of Model Selection: An Econometric Example. Journal of Applied Econometrics. 1992;7:93–99.
• Zhang P. Inference After Variable Selection in Linear Regression Models. Biometrika. 1992;79:741–746.
