Biometrics. Author manuscript; available in PMC 2010 December 1.
PMCID: PMC2794904
NIHMSID: NIHMS88182

Flexible Regression Model Selection for Survival Probabilities: With Application to AIDS

SUMMARY

Clinicians are often interested in the effect of covariates on survival probabilities at pre-specified study times. Because different factors can be associated with the risk of short-term and long-term failure, a flexible modeling strategy is pursued. Given a set of multiple candidate working models, an objective methodology is proposed for constructing consistent and asymptotically normal estimators of regression coefficients and average prediction error for each working model that are free of the nuisance censoring variable. It requires the conditional distribution of censoring given covariates to be modeled. The model selection strategy uses stepup or stepdown multiple hypothesis testing procedures that control either the proportion of false positives or the generalized familywise error rate when comparing models on the basis of estimated average prediction error. The problem can be cast as a missing data problem, in which augmented inverse probability weighted complete case (AIPWCC) estimators of regression coefficients and prediction error can be used (Tsiatis, 2006). A simulation study and an analysis of a recent AIDS trial are provided.

Keywords: Average prediction error, Censored data, Doubly-robust estimator, Restricted moment model, Model misspecification, Nonparametric bootstrap, Simultaneous inference

1. Introduction

Researchers are often interested in evaluating the effects of a set of covariates on a survival time endpoint. In some settings, it is of importance to identify potential treatment by covariate interactions. For example, AIDS Clinical Trials Group (ACTG) study 398 (Hammer, Vaida, et al. 2002) was a randomized study of four salvage therapy regimens for HIV-1 infected subjects with loss of virologic suppression. In addition to a backbone regimen (Amprenavir, Abacavir, Efavirenz and Adefovir Dipivoxil), subjects received one of three protease inhibitors or placebo. Some other baseline covariates of interest were non-nucleoside reverse transcriptase inhibitor (NNRTI) experience (yes or no), log10 HIV-1 RNA copies/mL, and reverse-transcriptase codon 103 (mutant or not). The primary endpoint was defined as the time from randomization to going off-track for suppression of HIV-1 RNA copies/mL (Huang, DeGruttola, et al. 2001). Going off-track is an empirical definition of virologic failure that has been shown to generally agree with protocol-based definitions. Basically, at each follow-up visit, a subject is classified as either on-track for suppression of HIV-1 RNA, or not (off-track). Going off-track generally refers to either a sharp, unexpected rebound in HIV-1 RNA, or failure to achieve a substantial initial drop in HIV-1 RNA. The censoring time in this study was the time from randomization to the first missed follow-up visit, or study closure when no follow-up visits were missed, which was observed in full for each subject. Since treatment efficacy may depend on subgroups of subjects as defined by baseline covariates, it is of interest to investigate potential treatment by covariate interactions on the risk of going off-track. Furthermore, it would be informative to know if factors associated with the risk of early failure (at or before week 8) differ from those associated with the risk of late failure (at or before week 24).

A common way to model survival data is through the semiparametric approaches suggested by Cox (1972), Cheng, Wei, et al. (1995), and Kalbfleisch and Prentice (2002, Section 7). These methods are useful for a global assessment of the effect of covariates on survival. In fact, the Cox (1972) model can allow for dynamic covariate effects by including time-dependent covariates, as nicely illustrated in Kalbfleisch and Prentice (2002, Section 6.4). To compare working models, a desirable criterion is average prediction error (Tian, Cai, et al., 2007 for uncensored outcomes; Uno, Cai, et al., 2007 for censored outcomes). However, with censored survival data it is in general difficult to obtain estimates of average prediction error, because the support of the censoring variable is typically much shorter than that of the survival time (Sinisi, Neugebauer, et al. 2006). Thus a localized approach to model selection may be preferred.

Uno, Cai, et al. (2007) considered a localized approach that focuses on survival prediction at one or more pre-specified failure times. Their methodology permits flexible modeling of the effect of covariates on survival by allowing regression coefficients and link functions to vary arbitrarily with time in binary regression models. Their inference procedures provide consistent estimation of, and large-sample confidence intervals for, regression coefficients and average absolute prediction error that are free of the nuisance censoring variable. Although their results do not require that working survival models be properly specified, Uno, Cai, et al. (2007) do work under the assumption that the distribution of the censoring time is independent of both survival time and covariates.

Overall, the purpose of this paper is to develop an objective, statistically formal yet flexible localized survival model selection methodology under the assumption of conditional independence of censoring and survival times given covariates. Furthermore, the proposed methods provide simultaneous inference over all time points and model comparisons of interest, based on estimates of average prediction error, by using multiple testing procedures that control either the generalized familywise error rate or proportion of false positives.

More specifically, the proposed methodology acts locally by specifying a set of multiple, varying-sized candidate working models for survival prediction at each of several pre-specified failure times. The conditional distribution of censoring time given covariates needs to be modeled so that consistent and asymptotically normal estimators of regression parameters and average prediction error, free of the nuisance censoring variable, can be obtained. This paper assumes that the censoring variable is observed in full for each subject, which allows a flexible semiparametric modeling approach to be suggested. The Discussion addresses how to handle the case in which the censoring variable is not observed in full for each subject. Consistent estimates of regression coefficients and average prediction error are obtained when the working model for censoring is properly specified. When the working model for censoring is misspecified, but the working model for survival given covariates is properly specified, estimated regression coefficients are consistent, but estimated average prediction error in general is not. However, simulations demonstrate that this may not preclude consistent selection (and estimation) of the most parsimonious candidate survival model.

At each pre-specified study time, t, estimation of regression coefficients and average prediction error of working survival models uses so-called inverse probability weighted complete case (IPWCC) estimators (Uno, Cai, et al., 2007). The complete cases here are those subjects for whom the value of I(T ≤ t) can be assessed, where T is the survival time. For subjects censored at or before time t, the variable I(T ≤ t) is missing; otherwise it is observed. As illustrated in this paper, this framework allows the use of augmented inverse probability weighted complete case (AIPWCC) estimators (Tsiatis, 2006), which attempt to gain efficiency by using information from subjects with missing I(T ≤ t), thereby making use of all available data.

The paper is organized as follows. Section 2 describes the background and notation. Estimation of regression parameters is detailed in Section 3 and of average prediction error in Section 4. Section 5 outlines the multiple testing procedures used in the context of the proposed model selection strategy that is detailed in Section 6. A simulation study is presented in Section 7 and an analysis of ACTG 398 in Section 8. Some final remarks are given in the Discussion.

2 Background and Notation

Let T denote the survival time and let C denote the potential censoring time. Denote the unique observed failure times by $t_1^*, \ldots, t_{R^*}^*$, and suppose the time points of interest are t1, …, tR. For simplicity in the exposition below, consider only one time point t ∈ {t1, …, tR}, so-called t-year survival prediction. It should be noted that, in addition to R > 1 time points of interest, it is straightforward to simultaneously accommodate several types of survival times. For example, in ACTG 398, one may be interested in both the time to off-track and the time to treatment toxicity.

Let X = min(T,C), with δ = I(C > T), I(.) the indicator function, and let Z denote the column vector of baseline covariates. The methodology presented below requires that the conditional survival function for censoring given covariates be estimated. It is frequently the case in studies of HIV/AIDS that C is observed in full for each subject. This is assumed to be the case in the main body of the paper; the Discussion addresses the case when C is potentially censored. Suppose that the observed data consists of n independent realizations of (X, δ, C, Z) from the unknown probability distribution P(T, C, Z), denoted (Xi, δi, Ci, Zi), i = 1, … , n.

Consider L pre-specified functions Zℓ of Z, ℓ = 1, …, L. Each Zℓ can be a subset of Z and can also include simple transformations of the covariates, such as logarithms, squares and interaction terms; Zℓ usually will include the constant 1 to correspond to an intercept term. For each Zℓ, denote the corresponding working model for P(T ≤ t|Zℓ) by $g_\ell(\beta_\ell^\top \tilde Z_\ell)$, where $a^\top$ is the transpose of the matrix a and, for fixed ℓ and t, gℓ(x) is a known, strictly increasing, differentiable function of x ∈ (−∞, ∞) and βℓ is an unknown column vector of parameters that is defined below, ℓ = 1, …, L. For example, setting gℓ(.) = 1 − exp{−exp(.)} corresponds to a proportional hazards model; setting gℓ(.) = exp(.)/{1 + exp(.)} corresponds to a proportional odds model. Note that, for ease of exposition, t is suppressed in the notation gℓ(.) and βℓ; however, these quantities are allowed to arbitrarily depend on time (and also survival type), as illustrated in Section 8.
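For concreteness, the two example link functions are sketched below. Python is used for this and the other illustrative sketches in this article's margin notes (the paper's own computations were carried out in Matlab; see Section 7), and all function names are ours.

```python
import numpy as np

def cloglog(x):
    """g(x) = 1 - exp{-exp(x)}: complementary log-log link,
    corresponding to a proportional hazards working model."""
    return 1.0 - np.exp(-np.exp(x))

def antilogit(x):
    """g(x) = exp(x) / {1 + exp(x)}: anti-logit link,
    corresponding to a proportional odds working model."""
    return 1.0 / (1.0 + np.exp(-x))
```

Both functions are strictly increasing and differentiable on (−∞, ∞), as required of gℓ(.).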

To evaluate the fit of each of the L candidate survival models $g_\ell(\beta_\ell^\top \tilde Z_\ell)$, consider the average squared prediction error $d_\ell = E\{I(T \le t) - g_\ell(\beta_\ell^\top \tilde Z_\ell)\}^2$, ℓ = 1, …, L, where the expected value E(.) is with respect to P(T, Z). Average squared prediction error is used here primarily for mathematical convenience when constructing AIPWCC estimators, as shown below. To evaluate the adequacy of the ℓth working survival model, it is required that dℓ and βℓ, which do not depend on the nuisance censoring variable C, be consistently estimated.

This paper shows how to obtain consistent estimators of βℓ and dℓ that are asymptotically normal (and free of the nuisance variable C) when T and C are conditionally independent given Z, written T ⊥ C|Z, and the working model for P(C > x|Z), $x \in \mathcal{T}_t^*$, is properly specified, where $\mathcal{T}_t^* = \{t_1^*, \ldots, \min(t, t_{R^*}^*)\}$, regardless of whether $g_\ell(\beta_\ell^\top \tilde Z_\ell)$ is properly specified.

The construction of AIPWCC estimators for βℓ and dℓ includes an augment term, which requires an estimate of P(T ≤ t|Z), say $\hat P^*(T \le t|Z)$, based on a working model P*(T ≤ t|Z), whereas IPWCC estimators do not. The IPWCC estimator of βℓ will be so-called doubly robust for those specific working survival models that are properly specified, that is, satisfy $P(T \le t|Z) = g_\ell(\beta_\ell^\top \tilde Z_\ell)$; AIPWCC estimators of βℓ in addition require P(T ≤ t|Z) = P*(T ≤ t|Z). Note that properly specified working models may be inefficient in the sense that they include regression terms whose coefficients equal 0. Doubly robust here means that the estimator of βℓ corresponding to these specific working models will be consistent and asymptotically normal (and free of the nuisance variable C) when T ⊥ C|Z, even if the model for P(C > x|Z), $x \in \mathcal{T}_t^*$, is misspecified. These results are shown in Web Appendix A.

When T ⊥ C|Z and the model for P(C > x|Z), $x \in \mathcal{T}_t^*$, is misspecified, the corresponding estimators of dℓ are in general not consistent and asymptotically normal, even when working survival models are properly specified. When estimating βℓ under these conditions, the corresponding estimating equation can be shown to be asymptotically equal to 0 at βℓ, its value in the absence of censoring. When estimating dℓ under these conditions, its asymptotic value equals some positive number that is in general not equal to its value in the absence of censoring, dℓ. However, the value of dℓ is not of intrinsic interest in itself; rather, interest lies in the working survival models corresponding to the smallest dℓ. Thus, if the ranking of the asymptotic values of average prediction error under these conditions is the same as the ranking of the dℓ (at least for the best fitting working models), then the most parsimonious working survival model can still be selected and, if properly specified, its βℓ consistently estimated. This appears to be the case in the simulations reported.

3 Estimation of βℓ

Using the results from Tsiatis (2006, section 10.2), define the AIPWCC estimating equation

$$\hat U_\ell(\beta) = \frac{1}{n}\sum_{i=1}^{n}\Biggl[\frac{\Delta_i}{\hat P^*(C > X_i \wedge t \mid Z_i)}\,\tilde Z_{i,\ell}\bigl\{I(X_i \le t) - g_\ell(\beta^\top \tilde Z_{i,\ell})\bigr\} - \Bigl\{\frac{\Delta_i}{\hat P^*(C > X_i \wedge t \mid Z_i)} - 1\Bigr\}\,\tilde Z_{i,\ell}\bigl\{\hat P^*(T \le t \mid Z_i) - g_\ell(\beta^\top \tilde Z_{i,\ell})\bigr\}\Biggr],$$

where Δi = δiI(Xi ≤ t) + I(Xi > t), Z̃i,ℓ is the value of Z̃ℓ for the ith subject, $\hat P^*(C > x|Z_i)$ is an estimate of P(C > x|Zi) based on the working model P*(C > x|Z), and $\hat P^*(T \le t|Z_i)$ is an estimate of P(T ≤ t|Zi) based on the working model P*(T ≤ t|Z), fit using only those observations with Δ = 1. Note that t is suppressed in the notation for Δ. The term in the second line of Ûℓ(β) is the augment term, which attempts to gain efficiency by using information from subjects with Δ = 0; setting this term to 0 for all subjects yields an IPWCC estimating equation. Note that it is only necessary to evaluate $\hat P^*(C > x|Z_i)$ at the time points $\mathcal{T}_t^*$. Let β̂ℓ satisfy Ûℓ(β̂ℓ) = 0. The estimating equation in Uno, Cai, et al. (2007) differs from the one above in that no augment term is used and the censoring distribution is estimated marginally, not conditionally on Z.
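As a concrete illustration, the following sketch solves Ûℓ(β) = 0, assuming the weights $\hat P^*(C > X_i \wedge t|Z_i)$ and the augment-term estimates $\hat P^*(T \le t|Z_i)$ have already been computed; the function and argument names are illustrative, not the paper's code.

```python
import numpy as np
from scipy.optimize import root

def aipwcc_beta(Z_l, X, Delta, w, p_surv, g, t):
    """Solve the AIPWCC estimating equation U_l(beta) = 0 for one working model.

    Z_l    : (n, p) design matrix Z~_l (first column of ones for the intercept)
    X      : (n,) observed times X_i = min(T_i, C_i)
    Delta  : (n,) complete-case indicators Delta_i
    w      : (n,) estimated weights P*(C > X_i ^ t | Z_i)
    p_surv : (n,) estimated P*(T <= t | Z_i) for the augment term
    g      : link function, e.g. antilogit from the earlier sketch
    """
    y = (X <= t).astype(float)  # I(X_i <= t); equals I(T_i <= t) when Delta_i = 1

    def U(beta):
        gb = g(Z_l @ beta)
        ipw = (Delta / w) * (y - gb)              # complete-case term
        aug = (Delta / w - 1.0) * (p_surv - gb)   # augment term
        return Z_l.T @ (ipw - aug) / len(y)

    return root(U, x0=np.zeros(Z_l.shape[1]), method='hybr').x
```

Setting the augment term to zero inside U recovers the IPWCC version of the equation.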

Note that when the covariate vector Z is categorical, a saturated model for P(C > x|Z) can be constructed for each $x \in \mathcal{T}_t^*$, which is nonparametric and thus properly specified. When a saturated model for P(C > x|Z) is not used for each $x \in \mathcal{T}_t^*$, for instance when Z consists of one or more continuous components, semiparametric approaches can be used. For example, a flexible way to estimate P(C > x|Z) is to specify a separate working model for each $x \in \mathcal{T}_t^*$, $P^*(C > x|Z) = h_x(\gamma_x^\top \tilde Z_C)$, where hx(.) is a known, strictly increasing, differentiable function, γx is an unknown column vector of parameters, defined below, and ZC is a function of Z. Estimates of γx, denoted by γ̂x, are obtained by solving

$$0 = \frac{1}{n}\sum_{i=1}^{n} \tilde Z_{i,C}\bigl\{I(C_i > x) - h_x(\hat\gamma_x^\top \tilde Z_{i,C})\bigr\},$$

where Z̃i,C is the value of Z̃C for the ith subject, i = 1, …, n. The estimator γ̂x converges in probability to the limit γx, $x \in \mathcal{T}_t^*$.

The working model P*(T ≤ t|Z) is estimated using only those observations with Δ = 1, in which case I(X ≤ t) = I(T ≤ t). As a result, one may consider $P^*(T \le t|Z) = h_T(\gamma_T^\top \tilde Z_T)$, where hT(.) is defined as hx(.) is, ZT is some function of Z, and γT is estimated by γ̂T, with

$$0 = \frac{1}{n}\sum_{i=1}^{n} \Delta_i\,\tilde Z_{i,T}\bigl\{I(X_i \le t) - h_T(\hat\gamma_T^\top \tilde Z_{i,T})\bigr\},$$

where Z̃i,T is the value of Z̃T for the ith subject, i = 1, …, n. The estimator γ̂T converges in probability to the limit γT.
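Both nuisance fits are ordinary binary-regression score equations, so one helper suffices for a sketch; names are illustrative, and the anti-logit link is assumed.

```python
import numpy as np
from scipy.optimize import root

def fit_score(Zt, y, h, keep=None):
    """Solve 0 = (1/n) sum_i Z_i { y_i - h(gamma' Z_i) } for gamma,
    optionally restricted to a subset of subjects (keep)."""
    if keep is not None:
        Zt, y = Zt[keep], y[keep]
    score = lambda gam: Zt.T @ (y - h(Zt @ gam)) / len(y)
    return root(score, x0=np.zeros(Zt.shape[1]), method='hybr').x

# Censoring working models, one per x in T_t*: regress I(C_i > x) on Z_C.
# gam_x = {x: fit_score(Z_C, (C > x).astype(float), antilogit) for x in Tt_star}
# Survival working model for the augment term, fit on complete cases only:
# gam_T = fit_score(Z_T, (X <= t).astype(float), antilogit, keep=(Delta == 1))
```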

Now, to show consistency of the estimator β̂ℓ, as given by Uno, Cai, et al. (2007), it suffices to show that $E\{\hat U_\ell(\beta)\} = E[\tilde Z_\ell\{I(T \le t) - g_\ell(\beta^\top \tilde Z_\ell)\}] = u_\ell(\beta)$. Then, under the mild condition that there does not exist a β such that $P(\beta^\top \tilde Z_{i,\ell} > \beta^\top \tilde Z_{j,\ell} \mid T_i \le t < T_j) = 1$, i < j, and using arguments similar to those in Web Appendix A of Tian, Cai, et al. (2007), it can be shown that uℓ(β) = 0 has a unique solution, namely at β = βℓ. Furthermore, if there does not exist a β such that δiI(Xi ≤ t ∧ Xj) = 1 implies $I(\beta^\top \tilde Z_{i,\ell} \le \beta^\top \tilde Z_{j,\ell}) = 1$, i ≠ j, then Ûℓ(β) = 0 has a unique solution, β̂ℓ, for any finite n. Web Appendix A shows the consistency of β̂ℓ when T ⊥ C|Z and either P*(C > x|Z) is properly specified, $x \in \mathcal{T}_t^*$, or $g_\ell(\beta_\ell^\top \tilde Z_\ell)$ (and P*(T ≤ t|Z) when AIPWCC estimators are used) is properly specified.

4 Estimation of dℓ

Define the empirical average squared prediction error

$$\hat D_\ell = \frac{1}{n}\sum_{i=1}^{n}\Biggl[\frac{\Delta_i}{\hat P^*(C > X_i \wedge t \mid Z_i)}\bigl\{I(X_i \le t) - g_\ell(\hat\beta_\ell^\top \tilde Z_{i,\ell})\bigr\}^2 - \Bigl\{\frac{\Delta_i}{\hat P^*(C > X_i \wedge t \mid Z_i)} - 1\Bigr\}\Bigl[\hat P^*(T \le t \mid Z_i)\bigl\{1 - 2 g_\ell(\hat\beta_\ell^\top \tilde Z_{i,\ell})\bigr\} + g_\ell^2(\hat\beta_\ell^\top \tilde Z_{i,\ell})\Bigr]\Biggr] = \frac{1}{n}\sum_{i=1}^{n} A_{i,\ell}.$$

The statistic D̂ℓ is an estimate of the so-called apparent error. The augment term that is subtracted in the second line of D̂ℓ is included in an attempt to gain efficiency by using information from those subjects with missing I(T ≤ t) (Tsiatis, 2006, section 10.2). It is shown in Web Appendix A that D̂ℓ converges uniformly in probability to dℓ when P*(C > x|Z) is properly specified, $x \in \mathcal{T}_t^*$.
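A sketch of the per-subject terms A_{i,ℓ} and of D̂ℓ itself follows, continuing the earlier illustrative code; the augment term uses the identity that E[{I(T ≤ t) − g}² | Z] = P(T ≤ t|Z){1 − 2g} + g².

```python
import numpy as np

def prediction_error(Z_l, X, Delta, w, p_surv, beta_hat, g, t):
    """Per-subject terms A_{i,l} and the AIPWCC apparent-error estimate D_l."""
    y = (X <= t).astype(float)
    gb = g(Z_l @ beta_hat)
    cc = (Delta / w) * (y - gb) ** 2                                 # complete-case term
    aug = (Delta / w - 1.0) * (p_surv * (1.0 - 2.0 * gb) + gb ** 2)  # augment term
    A = cc - aug
    return A, A.mean()
```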

5 Simultaneous Inference on (dℓ)

Suppose it is of interest to simultaneously test the sequence of null hypotheses Hℓ : dℓ = d0 against the alternative dℓ < d0, where d0 corresponds to some reference model. The estimator D̂ℓ has an asymptotic normal distribution and, if P*(C > x|Z) is properly specified, $x \in \mathcal{T}_t^*$, is a consistent estimator of dℓ. A test statistic for Hℓ can thus be defined as $\hat S_\ell = (\hat D_\ell - \hat D_0)/\hat\sigma_\ell$, where $\hat\sigma_\ell = \{n^{-1}\widehat{\mathrm{var}}(A_{i,\ell} - A_{i,0})\}^{1/2}$ and $\widehat{\mathrm{var}}(.)$ denotes the sample variance.
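A minimal sketch of Ŝℓ, reusing the per-subject terms A_{i,ℓ} from the Section 4 sketch; the one-sided p-value Φ(Ŝℓ) returned alongside it is introduced in the next paragraph.

```python
import numpy as np
from scipy.stats import norm

def test_statistic(A_l, A_0):
    """S_l = (D_l - D_0) / sigma_l, sigma_l = {var(A_il - A_i0) / n}^{1/2},
    together with the one-sided p-value Phi(S_l) for the alternative d_l < d_0."""
    diff = A_l - A_0
    S = diff.mean() / np.sqrt(diff.var(ddof=1) / len(diff))
    return S, norm.cdf(S)
```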

When the working model for censoring is properly specified, the statistic Ŝℓ is asymptotically standard normal under Hℓ for any P(T, C, Z), so that π̂ℓ = Φ(Ŝℓ) asymptotically satisfies P(π̂ℓ ≤ u) ≤ u for u ∈ (0,1), with Φ(.) the standard normal cumulative distribution function. As a result, stepup or stepdown multiple hypothesis testing procedures (MTPs) can be used to provide asymptotic strong control of a given error rate, that is, control for any P(T, C, Z); this property is useful in this context since the (π̂ℓ) are likely to be correlated. The error rates considered are the generalized familywise error rate, gFWER(k), and the proportion of false positives, PFP(θ). Romano and Shaikh (2006) provide stepup procedures that asymptotically bound above, at target level α, (i) the probability of more than k rejected true null hypotheses (gFWER(k)), and (ii) the probability that the proportion of true null hypotheses among the rejected hypotheses (PFP) is greater than θ, i.e., P(PFP > θ) < α (PFP(θ)). Since the PFP(θ) retains the same interpretation regardless of the number of null hypotheses tested, it is used exclusively for the remainder of the paper. Note that gFWER(k = 0) and PFP(θ = 0) correspond to the familywise error rate (FWER). The false discovery rate, which requires E(PFP) < α, is not used in this paper because it does not provide any control over the variability of the PFP.

The stepup algorithm given by Romano and Shaikh (2006) to control the PFP(θ) at level α is as follows. Suppose there are B null hypotheses to be tested. Put the observed p-values in descending order to obtain π̂(B) ≥ … ≥ π̂(1), and let H(B), …, H(1) denote the corresponding null hypotheses. If $\hat\pi_{(B)} \le \alpha\,\alpha_B/c(B,\theta)$, then reject all B null hypotheses. Otherwise, reject null hypotheses H(b*), …, H(1), where b* is the smallest index such that

$$\hat\pi_{(B)} > \frac{\alpha\,\alpha_B}{c(B,\theta)}, \;\ldots,\; \hat\pi_{(b^*+1)} > \frac{\alpha\,\alpha_{b^*+1}}{c(B,\theta)},$$

and αb = {int(θb) + 1}/{B − b + int(θb) + 1}, with int(a) the integer part of a. If b* = 0, then no hypotheses are rejected. The formula for the constant c(B, θ) is provided in Romano and Shaikh (2006). Lehmann and Romano (2005) and Romano and Shaikh (2004) provide stepdown algorithms that strongly control the gFWER(k) and PFP(θ).
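A sketch of this stepup rule follows, taking the constant c(B, θ) as supplied by the caller (its formula is given in Romano and Shaikh, 2006); the implementation rejects H(b*), …, H(1) exactly as described above.

```python
import numpy as np

def stepup_pfp(pvals, alpha, theta, c_B_theta):
    """Stepup procedure of Romano and Shaikh (2006) controlling PFP(theta).

    pvals     : array of B p-values
    c_B_theta : the constant c(B, theta), supplied by the caller
    Returns a boolean rejection indicator for each hypothesis.
    """
    B = len(pvals)
    order = np.argsort(pvals)            # ascending: pi_(1) <= ... <= pi_(B)
    sorted_p = pvals[order]
    b = np.arange(1, B + 1)
    alpha_b = (np.floor(theta * b) + 1) / (B - b + np.floor(theta * b) + 1)
    crit = alpha * alpha_b / c_B_theta
    below = sorted_p <= crit
    # b* is the largest b with pi_(b) <= alpha * alpha_b / c(B, theta);
    # reject H_(1), ..., H_(b*) (equivalently H_(b*), ..., H_(1) above)
    b_star = np.max(np.where(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(B, dtype=bool)
    reject[order[:b_star]] = True
    return reject
```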

6 Model Selection Strategy

The following is an approach that attempts to identify the best fitting working survival model with the smallest number of regression terms at each failure time of interest. The algorithm works in two general steps. The first is to identify the model comparisons of interest at each failure time under consideration; once these have been identified, an MTP from Section 5 is then used to simultaneously test all corresponding null hypotheses across all failure times considered. To identify the model comparisons of interest, the algorithm first compares the fit of all working models to the null model, in which only an intercept term is fit. If working models are identified that provide a significantly better fit to the data than the null model, then the reference model is re-defined from the null model to the smallest working model among these with the best fit. Working models offering a better fit than this new reference are then compared to it. The algorithm iterates in this fashion, continuing to re-define the reference model, until no working models are determined to be significantly better than the reference. A step-by-step description is provided below, where, for simplicity, t is considered fixed; an illustrative sketch is given at the end of this section.

  • 0. Set i = 1.
  • 1. Define $H_\ell^{(i)}: d_\ell = d_0^{(i)}$ against the one-sided alternative $d_\ell < d_0^{(i)}$. Here $d_0^{(1)} = E\{I(T \le t) - g_0^{(1)}(\beta_0^{(1)})\}^2$ corresponds to the null model with only an intercept. Test the sequence of null hypotheses $(H_\ell^{(i)})$ using an MTP as outlined in Section 5, denoting the test statistics by $(\hat S_\ell^{(i)})$ and the p-values by $(\hat\pi_\ell^{(i)})$.
  • 2. Re-define the reference model to be the working model rejected from $(H_\ell^{(i)})$ with the smallest number of regressors and the smallest $\hat\pi_\ell^{(i)}$; denote this model index by $\ell_i$ and its population average squared prediction error by $d_0^{(i+1)}$, so that $d_0^{(i+1)} = d_{\ell_i}$. If none of $(H_\ell^{(i)})$ are rejected, then stop.
  • 3. Define $H_\ell^{(i+1)}: d_\ell = d_0^{(i+1)}$ against $d_\ell < d_0^{(i+1)}$, where the index $\ell$ ranges over those working models rejected from $(H_\ell^{(i)})$ with $\hat\pi_\ell^{(i)} < \hat\pi_{\ell_i}^{(i)}$; if no $\hat\pi_\ell^{(i)} < \hat\pi_{\ell_i}^{(i)}$, then stop. Otherwise, test this sequence using an MTP as outlined in Section 5, where $\hat S_\ell^{(i+1)}$ is now defined using the estimate of $d_0^{(i+1)}$ as the reference, with p-value $\hat\pi_\ell^{(i+1)}$.
  • 4. Repeat steps 2 and 3 above, incrementing i = i + 1 after each iteration.

The algorithm above is repeated for each t ∈ {t1, …, tR}; let NHr denote the total number of null hypotheses tested for each tr, r = 1, …, R. The total number of model comparisons of interest identified by this algorithm is NH = NH1 + … + NHR. In order to maintain the global false-positive rate, it is necessary to test all of the NH null hypotheses simultaneously using an MTP as outlined in Section 5. For each t ∈ {t1, …, tR}, the final model choice corresponds to the working model with the smallest number of regressors, with the smallest p-value, in the largest stage i that was rejected.
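The sketch below compacts the stagewise search at one fixed t, reusing the per-subject terms A_{i,ℓ} and helpers from the earlier sketches; all names are illustrative.

```python
import numpy as np

def stagewise_search(model_ids, n_terms, A, A_null, test_stat, mtp_reject):
    """Steps 0-4 of the Section 6 search at one fixed time t.

    model_ids : labels of the L candidate working models
    n_terms   : dict, number of regression terms per model
    A         : dict, per-subject terms A_{i,l} per model; A_null holds the
                terms for the intercept-only (null) reference model
    test_stat : (A_l, A_ref) -> (S, p-value), as in the Section 5 sketch
    mtp_reject: maps an array of p-values to boolean rejection indicators
    Returns all (model, stage) comparisons identified by the search.
    """
    comparisons, ref_A, candidates, stage = [], A_null, list(model_ids), 1
    while candidates:
        p = np.array([test_stat(A[l], ref_A)[1] for l in candidates])
        comparisons += [(l, stage) for l in candidates]
        rej = mtp_reject(p)
        if not rej.any():
            break
        # Step 2: new reference = rejected model with fewest regressors,
        # ties broken by smallest p-value.
        j = min((i for i in range(len(candidates)) if rej[i]),
                key=lambda i: (n_terms[candidates[i]], p[i]))
        ref_A, p_ref = A[candidates[j]], p[j]
        # Step 3: carry forward only models rejected with p-value below p_ref.
        candidates = [candidates[i] for i in range(len(candidates))
                      if rej[i] and p[i] < p_ref]
        stage += 1
    return comparisons
```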

7 Simulation Study

A simulation study was performed to investigate the small-sample properties of the proposed model selection methodology. For simplicity, only comparisons to the null model were considered (stage i = 1). Two probability distributions were initially used. First, for P1(T, C, Z), the vector Z consisted of 7 covariates from the probability distribution P(Z) defined in Web Appendix B. The underlying survival time, say T*, was generated from the exponential distribution with hazard function $\lambda(t|Z) = \tfrac{1}{4}\exp(0.5 Z_7)$, $Z_7 \sim N(0,1)$; the actual survival time was then T = round(T* + 1), so that the range of T was the positive integers. The censoring time C was defined as the minimum of an administrative censoring time, say C1, that was uniformly distributed on (3,10), and a loss to follow-up time, say C2, that was exponential with hazard function $\lambda_{C_2}(t|Z) = \tfrac{1}{8}\exp(0.25 Z_7)$; the actual censoring time was then C = round[min{C1, (C2 + 1)}], so that the range of C was the positive integers. In all cases the outcome of interest was I(T ≤ 3). Using a single random sample with n = 200000 observations, P1(T, C, Z) results in about 50% of survival times being censored and about 74% of subjects with Δ = 1. In the absence of censoring, P1(T, Z) results in about 13%, 33% and 48% of subjects with failure at or before time 1, 2 and 3, respectively. Note this sample of n = 200000 was used only to approximate the parameters of P1(.).
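For concreteness, a sketch of one draw from P1(.) follows, showing only the Z7-dependent pieces (the full P(Z) of Web Appendix B is not reproduced; the seed and function name are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_p1(n):
    """One sample of size n under P1(., ., .), Z7 components only."""
    Z7 = rng.standard_normal(n)
    # T* exponential with hazard (1/4) exp(0.5 Z7), i.e. scale 4 exp(-0.5 Z7)
    T_star = rng.exponential(scale=4.0 * np.exp(-0.5 * Z7))
    T = np.round(T_star + 1)
    # C1 ~ Uniform(3, 10) administrative censoring; C2 exponential with
    # hazard (1/8) exp(0.25 Z7) (loss to follow-up); C = round(min(C1, C2 + 1))
    C1 = rng.uniform(3.0, 10.0, size=n)
    C2 = rng.exponential(scale=8.0 * np.exp(-0.25 * Z7))
    C = np.round(np.minimum(C1, C2 + 1))
    X, delta = np.minimum(T, C), (C > T).astype(int)
    return T, C, X, delta, Z7
```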

The second probability distribution, P2(.), was defined as P1(.) except for Z. Here, Z had 12 covariates, the first 7 defined as Z1 through Z7 from P1(.), and the last 5, Z8 through Z12, defined as Z1 through Z5 from P1(.). This distribution was used to investigate the impact of increasing the number of null hypotheses on the small-sample properties of the methodology. For both P1(.) and P2(.), the value of β1,7 in the working model for P(T ≤ 3|Z7) defined by g(β0,7 + β1,7Z7) is 0.70, where g(.) is the anti-logit link function, g(x) = e^x/(1 + e^x).

In all cases, the working model for P(C > x|Z) was $h(\gamma_{0,x} + \gamma_{1,x}^\top Z)$, for x = 1, 2, 3, with h(.) the anti-logit link function. Two settings were considered in which the working model for censoring was misspecified. First, a case in which the functional form of a covariate was misspecified: the conditional hazard for C2 was $\lambda_{C_2}(t|Z) = \tfrac{1}{8}\exp(0.25|Z_7|)$. Second, a case in which an important covariate was omitted: $\lambda_{C_2}(t|Z) = \tfrac{1}{8}\exp(0.25 Z_7 + 0.5 Z_7^2)$. The working model for P(T ≤ 3|Z), used in the augment terms, was $h(\gamma_{0,T} + \gamma_{1,T}^\top Z)$.

The distributions P1(.) and P2(.) have 7 and 12 covariates, respectively, so that the total number of candidate working models for I(T ≤ 3), using only one-way terms, is $2^7 - 1 = 127$ and $2^{12} - 1 = 4095$. All working models for survival used the anti-logit link function. The best fitting, most parsimonious model for I(T ≤ 3) is the one with Z7 as the only predictor. For both P1(.) and P2(.), the population value of dℓ is 0.225 for models containing Z7 and 0.250 for models not containing Z7.

The simulations were conducted as follows. Separately for each sample size and each distribution, a random sample $(Z_i)_{i=1}^n$ was generated from Pj(Z) a total of 1000 independent times. At each of the 1000 simulation iterations, Zi,1 was randomly generated from Pj(Z1), then Zi,2 from Pj(Z2|Zi,1), and so on. Then, at each simulation iteration, with $(Z_i)_{i=1}^n$ considered fixed, the random samples $(T_i)_{i=1}^n$ and $(C_i)_{i=1}^n$ were independently generated from each respective distribution. The simulations, and the analysis of ACTG 398, were executed using Matlab version 7.4 (R2007a) with 64-bit Intel Xeon 3.2 GHz processors.

The empirical power of several different MTPs for H7 : d7 = d0, where ℓ = 7 indexes the model with Z7 as the only term, is shown in Table 1 for several sample sizes and for P1(.) and P2(.). The target error level was α = 0.1 in all cases. The MTP with the largest power was the stepup procedure using the PFP(θ = 0.1). Figure 1 plots these powers for the stepup algorithm. Notice that, when the number of null hypotheses tested increases, there is only a small reduction in power of the MTP using the PFP(θ = 0.1), whereas the reduction when using the PFP(θ = 0) (FWER) is large. For P1(.), the empirical FWER for all MTPs was 0.004 or less for n = 100, and was 0.002 or less for n = 200 through 1000, with the vast majority equal to 0. For P2(.), the empirical FWER of all MTPs was less than 0.024 for n = 100, less than 0.009 for n = 200, and 0.005 or less for n = 300 through 1000, with the vast majority equal to 0. Note that with 1000 simulation iterations and a nominal error level of α = 0.1, a normal-theory based 95% confidence interval for the population PFP(θ) extends ±0.0186 from the empirical PFP(θ).

Figure 1
Empirical power to reject H7 for stepup procedure with PFP(θ = 0) (‘o’) and PFP(θ = 0.1) (‘* ’) for P1(.) (dashed lines) and P2(.) (solid lines).
Table 1
Simulation results: empirical power of PFP(θ) at target level α = 0.1

Define the probability distributions P3(.) and P4(.) as P1(.) except with the first and second alternate definitions of C2 above, respectively. The empirical FWER and powers to reject H7 corresponding to P3(.) and P4(.) were all very similar to those presented for P1(.), suggesting minor impact from mismodeling P(C > x|Z).

Table 2 presents the empirical mean and standard deviation of estimates of the coefficient β1,7 and the average squared prediction error d7 for the candidate survival model P(T ≤ 3|Z7) = g(β0,7 + β1,7Z7), for P1(.), P3(.) and P4(.) and several sample sizes. Results for P2(.) were very similar to those of P1(.). For all distributions, empirical means of the estimates β̂1,7 and D̂7 are close to their population values of 0.70 and 0.225, respectively, with empirical standard deviations decreasing with increasing sample size. Define the working model index ℓ = 1 by P(T ≤ 3|Z1) = g(β0,1 + β1,1Z1); note that H1 : d1 = d0 is true. As shown in Table 2, the empirical mean and standard deviation of Ŝ1, corresponding to H1, differ somewhat from the target values 0 and 1, with variance overestimation present. The empirical mean of Ŝ7, corresponding to H7, increases with sample size, and, although variance overestimation is present, its empirical standard deviation is closer to 1 than that of Ŝ1. Since the statistic D̂ℓ is asymptotically linear, the nonparametric bootstrap can be used to estimate the variance of D̂ℓ − D̂0 (Dudoit, van der Laan and Pollard, 2004). The numbers in parentheses under Ŝ7 correspond to using the nonparametric bootstrap for standard error estimation instead of σ̂7; here, at each simulation iteration, the estimates P̂*(C > x|Z) and P̂*(T ≤ 3|Z) were calculated once on the observed data and held fixed through the 100 nonparametric bootstrap iterations. Although there were slight improvements to the means and standard deviations from using the bootstrap, the empirical powers and sizes were very similar to those of tests that used σ̂7 for standard error estimation.

Table 2
Simulation results: empirical mean and standard deviation

8 Analysis of ACTG 398

Consider model selection for the risk of going off-track at or before week 8 (early failure) and at or before week 24 (late failure). The time scale used in this analysis was weeks, as subjects were scheduled for follow-up visits at weeks 2, 4, 8, 16 and every 8 weeks thereafter until week 48. Note that the actual visit week was used in the analysis, not the scheduled visit week. The censoring time was the time from randomization to going off-study, or study closure, whichever occurred first, and thus was observed in full for each subject. Full baseline data were available for n = 450 subjects; of these, a total of 331 subjects were observed to go off-track: 163 at or before week 8 and 284 at or before week 24. There were observed off-track events at weeks 1 through 24; the shortest censoring time was at week 1. For early and late failure, the percent of subjects with Δ = 1 was 98.4% and 96.4%, respectively.

The following four baseline covariates were considered: treatment group (TRT), coded as 1=additional PI, 0=placebo; NNRTI experience (NN), coded as 1=naive, 0=experienced; log10 HIV-1 RNA (RNA); and reverse-transcriptase codon 103 (103), coded as 1=wildtype, 0=mutant. Besides these four main effect terms, it was of interest to identify potential treatment by covariate interactions, denoted by TRT*NN, TRT*RNA and TRT*103. By imposing the restriction that working models that include an interaction term also include the corresponding main effect terms, at a given study time of interest, t ∈ {8, 24}, the total number of candidate survival models of one term, two terms, up through seven terms, is 34, as verified below.
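The count of 34 can be checked by enumerating all nonempty term sets that respect the main-effect restriction; a short sketch:

```python
from itertools import chain, combinations

mains = ['TRT', 'NN', 'RNA', '103']
inters = [('TRT', 'NN'), ('TRT', 'RNA'), ('TRT', '103')]

def powerset(s):
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

# A candidate model is any nonempty set of terms in which every interaction
# is accompanied by both of its main effects.
models = [(m, i) for m in powerset(mains) for i in powerset(inters)
          if (m or i) and all(a in m and b in m for a, b in i)]
print(len(models))  # 34
```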

The working model for the censoring distribution, P*(C > x|Z), included all seven regression terms and was estimated separately at each observed off-track time less than or equal to 24 weeks. The working model P*(T ≤ t|Z), used in the augment terms, t = 8, 24, also included all seven regression terms. The link function used in all cases was the anti-logit. The nonparametric bootstrap with 500 realizations was used for variance estimation in the test statistics, for which the estimated working models P̂*(C > x|Z) and P̂*(T ≤ t|Z) were calculated once on the observed dataset and then held fixed for each bootstrap iteration.
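A sketch of this bootstrap variance estimate for D̂ℓ − D̂0, with the nuisance fits entering the A_{i,ℓ} held fixed, as described; names are illustrative.

```python
import numpy as np

def bootstrap_sd(A_l, A_0, n_boot=500, seed=0):
    """Nonparametric bootstrap standard error of D_l - D_0. The nuisance fits
    P*(C > x|Z) and P*(T <= t|Z) entering A_{i,l} are computed once on the
    observed data and held fixed across resamples."""
    rng = np.random.default_rng(seed)
    n, diffs = len(A_l), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample subjects with replacement
        diffs.append(A_l[idx].mean() - A_0[idx].mean())
    return np.std(diffs, ddof=1)
```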

In all cases, the MTP used was the stepup procedure with the PFP(θ = 0.1) and α = 0.1. To identify the model comparisons of interest, first, for week 8, 34 hypotheses were tested with the null model as the reference model (stage i = 1). The results are displayed in Table 3. A total of 30 of the 34 hypotheses were rejected. The smallest working model rejected with the smallest p-value had RNA as the only term ($\hat\pi_{\ell_1}^{(1)} = 0.001$). Using this model as the reference model, the second stage (i = 2) tested 21 hypotheses, corresponding to those models in Table 3 with $\hat\pi_\ell^{(1)} < \hat\pi_{\ell_1}^{(1)}$. Of these 21 hypotheses, the smallest model rejected with the smallest p-value had 2 terms, NN and RNA ($\hat\pi_{\ell_2}^{(2)} = 0.0034$). Using this model as the reference model, stage i = 3 tested 13 hypotheses, none of which were rejected. Thus the total number of model comparisons identified for week 8 was NH1 = 34 + 21 + 13 = 68.

Table 3
Week 8 results from ACTG 398

For week 24, the first stage (i = 1) tested 34 hypotheses using the null model as the reference model. The smallest model rejected with the smallest p-value had NN as the only term ($\hat\pi_{\ell_1}^{(1)} = 0.001$). Using this model as the reference model, stage i = 2 tested 21 hypotheses, corresponding to those models with $\hat\pi_\ell^{(1)} < \hat\pi_{\ell_1}^{(1)}$. Of these 21 hypotheses, the smallest model rejected with the smallest p-value had 3 terms, NN, TRT and RNA ($\hat\pi_{\ell_2}^{(2)} = 0.0088$). Using this model as the reference model, stage i = 3 tested 11 hypotheses, none of which were rejected. For week 24, the total number of model comparisons identified was NH2 = 34 + 21 + 11 = 66.

The final step was to simultaneously test all of the NH = 68 + 66 = 134 hypotheses corresponding to the model comparisons of interest. For week 8, the smallest model with the smallest p-value (0.0034) from the latest rejected stage (i = 2) had 2 terms, NN and RNA. For a fixed value of RNA, being naive to NNRTIs at baseline was estimated to reduce the odds of early failure by 70% compared to being experienced; for a fixed value of NN, a log10 increase in baseline viral load was estimated to increase the odds of early failure 2.2-fold. For week 24, the smallest model with the smallest p-value (0.0054) from the latest rejected stage (i = 2) had 4 terms, NN, TRT, RNA and TRT*NN. Thus a treatment effect was identified on the risk of late failure but not early failure. For a fixed value of RNA: the additional PI was estimated to reduce the odds of late failure by 29% for NNRTI naive subjects, and by 70% for NNRTI experienced subjects, compared to placebo; compared to having NNRTI experience, being naive to NNRTIs was estimated to reduce the odds of late failure by 75% for subjects receiving an additional PI, and by 89% for subjects receiving placebo. For fixed values of NN and TRT, a log10 increase in baseline viral load was estimated to increase the odds of late failure 1.9-fold.

In conclusion, NNRTI experience and HIV-1 RNA level at baseline were found to be significantly associated with the risk of early and late failure. There was a treatment effect detected only on the risk of late failure, where, after adjusting for baseline NNRTI experience and HIV-1 RNA level, the additional PI was more beneficial for NNRTI experienced subjects than for naive subjects.

9 Discussion

This paper provides a statistically formal yet flexible approach to objective model selection for censored survival data, which can be especially useful when covariate effects depend on time. Consistent estimators of regression parameters and average prediction error for candidate survival models, free of the nuisance censoring variable, are obtained when the conditional censoring distribution given covariates is correctly modeled. When the censoring variable is completely observed for each subject in the dataset, this article shows how flexible working models for the conditional censoring distribution can be formed. When the model for censoring is misspecified, consistent estimators of regression parameters require proper specification of the candidate survival model given covariates (and of P*(T ≤ t|Z) when AIPWCC estimators are used); however, estimators of average prediction error are not necessarily consistent in this case. Nevertheless, one may still be able to identify the most parsimonious working survival model and consistently estimate its regression parameters when the censoring distribution is mismodeled, as was evidenced in the simulation study.

The potential censoring time C may not be completely observed for each subject, for example when T is the time to death. In such cases, one can reverse the roles of T and C and use the semiparametric modeling approaches mentioned in the Introduction or those presented in this paper to form working models for P(C > x|Z). However, having to select two censored-data regression models simultaneously can be burdensome.

Apparent error was used instead of cross-validated error (Shao, 1993, 1996) to estimate average squared prediction error because of the savings in computation. The proposed model selection strategy is designed to potentially accommodate a large number of candidate working survival models. The small-sample correction that cross-validation provides to estimates of average prediction error may be negligible in the test statistics Ŝℓ, which are based on the difference between two prediction error estimates. However, one may choose to implement cross-validated prediction error estimates, perhaps as a descriptive statistic for a selected survival model.

Although only the PFP(θ) was used in the simulations and the analysis of ACTG 398, one could have used the gFWER(k). However, unlike the PFP(θ), the interpretation of the gFWER(k) depends on the total number of null hypotheses tested.

When relatively sparse covariates are considered, for example, treatment by categorical covariate interactions, the derivative of the estimating equation with respect to the parameters of the working models for censoring or survival may not be of full rank. In these settings, one may use the trust-region dogleg method (Moré, Garbow, et al., 1980; Powell, 1970) instead of the Newton-Raphson algorithm for root finding.
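As an aside, MINPACK's modified Powell hybrid method (Moré, Garbow, et al., 1980; Powell, 1970), which takes dogleg trust-region steps, is exposed in scipy as method='hybr' (the paper's own computations used Matlab); a minimal sketch with a placeholder score function:

```python
import numpy as np
from scipy.optimize import root

# Placeholder two-dimensional score; in practice one would pass U_l(beta)
# from Section 3. The hybrid method remains stable when the Jacobian of
# the score is nearly rank-deficient, unlike plain Newton-Raphson.
def score(beta):
    return np.array([np.tanh(beta[0]) - 0.5, beta[1] ** 3 - 1.0])

sol = root(score, x0=np.array([0.1, 0.5]), method='hybr')
print(sol.x)  # approximately [0.5493, 1.0]
```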

Acknowledgments

This work was supported by grant R01-AI058836 from the U.S. National Institutes of Health. The author is grateful to an Associate Editor and Referee whose comments improved the clarity and presentation of the paper.

Footnotes

Supplementary Materials

Web Appendices referenced in Sections 2, 3, 4 and 7 are available under the Paper Information link at the Biometrics website http://www.biometrics.tibs.org

References

  • Cheng SC, Wei LJ, Ying Z. Analysis of Transformation Models with Censored Data. Biometrika. 1995;82:835–845.
  • Cox DR. Regression Models and Life Tables (with Discussion) Journal of the Royal Statistical Society, Series B. 1972;34:187–220.
  • Dudoit S, van der Laan M, Pollard K. Multiple Testing. Part I. Single-Step Procedures for Control of General Type I Error Rates. Statistical Applications in Genetics and Molecular Biology. 2004;3(1):Article 13. http://www.bepress.com/sagmb/vol3/iss1/art13
  • Hammer S, Vaida F, Bennett K, et al. Dual vs Single Protease Inhibitor Therapy Following Antiretroviral Treatment Failure: A Randomized Trial. Journal of the American Medical Association. 2002;288(2):169–180.
  • Huang W, DeGruttola V, Fischl M, et al. Patterns of Plasma Human Immunodeficiency Virus Type-1 RNA Response to Antiretroviral Therapy. Journal of Infectious Diseases. 2001;183(10):1455–1465.
  • Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. 2nd ed. New York: John Wiley & Sons, Inc.; 2002.
  • Lehmann EL, Romano JP. Generalizations of the Familywise Error Rate. Annals of Statistics. 2005;33:1138–1154.
  • Moré J, Garbow B, Hillstrom K. User Guide for MINPACK 1. Argonne National Laboratory; 1980. Rept. ANL-80-74.
  • Powell MJD. A Fortran Subroutine for Solving Systems of Nonlinear Algebraic Equations. In: Rabinowitz P, editor. Numerical Methods for Nonlinear Algebraic Equations. 1970.
  • Romano J, Shaikh A. Technical Report No. 2004–31. Dept. of Statistics, Stanford Univ.; 2004. On Control of the False Discovery Proportion.
  • Romano J, Shaikh A. Stepup Procedures for Control of Generalizations of the Familywise Error Rate. Annals of Statistics. 2006;34:1850–1873.
  • Shao J. Linear Model Selection by Cross-Validation. Journal of the American Statistical Association. 1993;88:486–494.
  • Shao J. Bootstrap Model Selection. Journal of the American Statistical Association. 1996;91:655–665.
  • Sinisi SE, Neugebauer R, van der Laan MJ. Cross-validated Bagged Prediction of Survival. Statistical Applications in Genetics and Molecular Biology. 2006;5(1):Article 12.
  • Tian L, Cai T, Goetghebeur E, Wei LJ. Model Evaluation Based on the Distribution of Estimated Absolute Prediction Error. Biometrika. 2007;94:297–311.
  • Tsiatis AA. Semiparametric Theory and Missing Data. New York: Springer; 2006.
  • Uno H, Cai T, Tian L, Wei LJ. Evaluating Prediction Rules for t-Year Survivors with Censored Regression Models. Journal of the American Statistical Association. 2007;102:527–537.