Search tips
Search criteria 


Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
J Am Stat Assoc. Author manuscript; available in PMC 2010 November 4.
Published in final edited form as:
J Am Stat Assoc. 2009 September 1; 104(487): 1192–1202.
doi:  10.1198/jasa.2009.tm08614
PMCID: PMC2972554

Analyzing Length-biased Data with Semiparametric Transformation and Accelerated Failure Time Models


Right-censored time-to-event data are often observed from a cohort of prevalent cases that are subject to length-biased sampling. Informative right censoring of data from the prevalent cohort within the population often makes it difficult to model risk factors on the unbiased failure times for the general population, because the observed failure times are length biased. In this paper, we consider two classes of flexible semiparametric models: the transformation models and the accelerated failure time models, to assess covariate effects on the population failure times by modeling the length-biased times. We develop unbiased estimating equation approaches to obtain the consistent estimators of the regression coefficients. Large sample properties for the estimators are derived. The methods are confirmed through simulations and illustrated by application to data from a study of a prevalent cohort of dementia patients.

Keywords: Accelerated failure time models, Estimating equation, Informative censoring, Length-biased sampling, Prevalent cohort, Right-censored data, Transformation models

1 Introduction

Length-biased data are often encountered in observational studies, when the observed samples are not randomly selected from the population of interest but with probability proportional to their length (Cox and Miller, 1965; Zelen and Feinleib, 1969; Simon, 1980; Vardi, 1982, 1985, 1989; Sansgiry and Akman, 2000; Zelen, 2004). The data set considered is often from a cross-sectional cohort of subjects diagnosed to have the disease at the time of examination, which is then followed for the occurrence of a subsequent event.

Two examples of such data are seen in a study of dementia and subsequent onset of death, and a study of the natural history of lung cancer. In the first example, about 10,000 Canadians over the age of 65 were recruited and screened for prevalence of dementia. Investigators recorded the approximate date of onset of dementia and the subsequent time of death or censoring for individuals within the study population who were found to have dementia (Asgharian and Wolfson, 2005). The second example involves a prevalent cohort for elderly lung cancer patients. Individuals are sampled from Surveillance Epidemiology and End Results (SEER) database. The sampling (or recruitment) times can be considered to follow a uniform distribution independent of failure times, only those individuals who had been diagnosed to have lung cancer before the recruitment time and who had been alive are eligible for inclusion. The data on each subject in the prevalent cohort included an initiating event (e.g., diagnosis of lung cancer) and failure event (e.g., death) for subjects who had been sampled at an intermediate time. A common feature for data from the prevalent cohort is that the time duration measured from the first (or initiating) event to the terminal event is subject to potential left truncation and right censoring. Length-biased sampling occurs because the “observed” time intervals from initiation to failure within the prevalent cohort tend to be longer than those arising from the underlying distribution of the general population, because individuals diagnosed with the disease have to survive to the examination or sampling time.

Extensive literature has focused on estimating the unbiased distribution given length-biased sampling, using methods conditional on the observed truncation times (Turnbull, 1976; Lagakos et al., 1988; Wang 1991) or an unconditional approach (Vardi, 1982, 1985; Gill et al., 1988). The latter approach often assumes that the initiation times follow a stationary Poisson process (known as the stationarity assumption), which is satisfied in many practical applications. Under the stationarity assumption, Asgharian et al. (2002) and Asgharian and Wolfson (2005) provided an unconditional nonparametric maximum likelihood estimation (NMLE) for the unbiased survival function in the presence of right censoring, and derived the asymptotic properties of the NMLE under that setting. Of note, a majority of the related literature has focused on one-sample estimation of the unbiased survival distribution given length-biased data. Only limited literature is available that describes modeling risk factors on the distribution of the underlying population. A seminal work by Wang (1996) described a semiparametric proportional hazards model to assess risk factors on length-biased data using a bias-adjusted risk set in constructing the pseudo-likelihood. However, a constraint not allowing data to be right censored had to be imposed due to the dependent censoring mechanism with the observed length-biased data. This subtle feature for informative censoring induced by the sampling scheme has been avoided by not allowing right censoring (Vardi, 1982, 1985; Wang, 1996) or has simply been ignored, as noted by Asgharian and Wolfson (2005).

In this article, we fill the gap by providing semiparametric models (transformation and accelerated failure time models) to examine risk factors on the population failure time via the observed length-biased data, in the presence of informative right censoring. We use the terminology of ”length-biased” data for left-truncated and right-censored data under the stationarity assumption. The transformation models and the accelerated failure time models have been extensively investigated for traditional right-censored survival data (e.g., Prentice 1978; Buckley and James 1979; Dabrowska and Doksum, 1988; Ritov 1990; Tsiatis 1990; Lai and Ying 1991; Cheng et al., 1995; Fine, 1999; Lin and Ying 1995; Kalbfleisch and Prentice 2002). However, the literature does not indicate that these models have been used for length-biased data. We estimate the covariate coefficients through constructed unbiased estimating equations, with an inverse weight chosen to adjust the bias induced by the length-biased data and dependent censoring. We present the estimation procedures and the large sample properties of the estimators in Section 2. Under some mild regularity conditions, we show that the resulting estimators are consistent and asymptotically normal. In Section 3, we summarize the simulation results, and we include an application of the proposed method to the dementia study in Section 4. We offer a discussion in Section 5.

2 Notations and Models

Assume T to be the time measured from the initiating event to failure within the population of interest, A to be the time of examination for the disease (measured from the initiating event), V to be the time measured from examination to failure, and C to be the time from examination to censoring. With length-biased sampling, one can only observe T among those T > A, where T = A + V is the observed failure time. Here, A is also referred to as the truncation variable (sampling time or backward recurrence time) and V is the residual survival time (or forward recurrence time). See the following diagram describing the observed dementia data.

An external file that holds a picture, illustration, etc.
Object name is nihms-248000-f0001.jpg

Let Yi = min(Ti, Ai + Ci), Ti = Ai + Vi, δi = I(ViCi), and Zi be the observed data for the i-th subject, where Zi is a vector of covariates for the i-th subject and i = 1, 2, ..., n. It is reasonable to assume that the censoring time measured from the recruitment time, C, and (A, V) are independent given covariate Z.

Recall that the observed failure time data (T1, · · · , Tn) are a biased subset for population sample T. The unbiased density function for T, fU can be related to the length-biased density, g (conditional on T > A) under the stationarity assumption,


Given covariates, Z = z, the density of T can be similarly expressed as


where g(t|z) and fU(t|z) denote the length-biased sampling density and the population (unbiased) density given μ(z) < ∞.

2.1 Transformation Models

Under the transformation models, it is assumed that the true underlying failure time T is linearly related to the covariates with various specified error distributions (Cheng, Wei and Ying, 1995). Specifically,


where H is an unknown increasing function, εT has a known density function and β is the vector of regression coefficients. When εT has an extreme value distribution, T follows the proportional hazards model, whereas when εT has a logistic distribution, T follows the proportional odds model. It is of interest to assess the covariate coefficients β on T.

We first consider a special case for the length-biased data without right censoring. For the unbiased variable T, we have the following conditional expectation given covariates,


where the data vectors for the ith and jth subjects are independent, Zij = ZiZj, and


Using an approach similar to that of Cheng et al. (1995), we construct an unbiased estimating function for β, given the observed length-biased data


where q is a positive, scalar weight function. The estimating equation is unbiased because


by using (1) and (2), when there is no right censoring for the length-biased data.

When the observed failure time T from length-biased sampling is subject to potential right censoring, there are two difficulties. First, the indicators {TiTj} are not always observable. Second, the censoring variable can be mechanistically dependent on the failure time, because Cov(T, A + C) = Cov(A + V, A + C) = Var(A) + Cov(V, A) > 0 except for trivial cases, (see, e.g. Asgharian, 2003).

The primary purpose of this section is to derive unbiased estimating functions for covariates β when the length-biased data are subject to dependent right censoring. Under the stationarity assumption, the joint distribution of (A, V) and (A, T) given Z follows,


as shown in earlier work (Zelen, 2004; Asgharian and Wolfson, 2005). We first assume that the censoring variable C is independent of the covariates. By extending the results of Asgharian and Wolfson (2005), we have the covariate-specific joint probability to observe a failure at y in the length-biased sampling, since the residual censoring time C is assumed to be independent of (A, V) given covariate Z = z,


where SC(t) is the survival function for the residual censoring variable, C. Replacing the bivariate density fA,V in the above formula by (3), we have


where w(t)=0tSC(v)dv. Given the above equations, the following conditional expectation for the observed pairwised length-biased failure times is zero, since


The last equation holds by (2). If the censoring distribution is known, the estimating equation can then be constructed as


When the censoring distribution is unknown, it is natural to replace the unknown censoring distribution by its consistent Kaplan-Meier estimator. Thus, an asymptotic unbiased estimating equation follows


where w^(t)=0tS^C(v)dv, and ŜC is the Kaplan-Meier estimator for censoring variable C. Under the following regularity conditions:

  1. ζ(.) < ∞ is a twice continuously differentiable function;
  2. Z is a p × 1 vector of bounded covariates, not contained in a (p − 1)-dimensional hyperplane;
  3. sup[t : Pr(V > t) > 0] sup[t : Pr(C > t) > 0] = t0, and Pr(δ = 1) > 0;
  4. 0t0[{tt0SC(u)du}2{SC2(t)SV(t)}]dSC(t)<, where SV (t) is the survival function for the residual failure time;
  5. ΓTlimn{n2i<jq(Zij)δiδjZij2ξ(ZijTβ^)w^(Yi)w^(Yj)} is nonsingular.

The estimating equations (4) have, asymptotically, a unique solution β if the weight q(.) is positive under regularity conditions (a)-(c) and (e). A sketch of the proof is provided in the Appendix. In addition, we have the following weak convergence property for the estimating function.

Theorem 1. Under regularity conditions (a)-(d),


in distribution as n → ∞, where t0 is a finite time point defined in (c).

The derivation of the weak convergence of n−3/2UT(β0) is obtained by the properties of U-statistics and the asymptotic martingale integral representation of n(w^(t)w(t)) (Pepe and Fleming, 1989, 1991). Because summation of the pairwise indicators in UT(β) is over (n2)=O(n2) items, the rate of convergence is in the order of n−3/2. Based on the above distributional properties, the variance-covariance of n−3/2UT(β0), ΣT can be consistently estimated by the empirical form of the U-statistic variance




and Λ^C(t) is the Nelson estimate for the cumulative hazards function of the censoring variable C,π^(t)=i=1nI(YiAit)n,


Note that regularity condition (c) ensures that the support for the censoring variable C be shorter than the support of V, so that b^(t)π^(t) is well defined for any t [set membership] [0, t0]. Most of observational studies and clinical trials have limited follow-up, thus this condition can be easily satisfied.

By the Taylor series expansion of UT(β^) around β0,n(β^β0) is asymptotically equivalent to n32ΓT1UT(β0) by the Slutsky theorem and the delta method. Thus, under the regularity conditions (a)-(e), the distribution of n(β^β0) is approximated by a normal distribution as


where ΓT is the Hessian matrix of the estimating function (4) and can be consistently estimated by


Provided that the covariate vectors are not all on a hyperplane, the identifiability of β is guaranteed given conditions (c) and (e).

2.2 Accelerated Failure Time Models

The accelerated failure time (AFT) model relates the logarithm of the failure time linearly to the covariates (Cox and Oakes 1984; Kalbfleisch and Prentice 2002),


where εA has a unknown distribution function with mean zero. In contrast to the transformation models, the distribution of εA in the AFT model is not specified other than the mean zero requirement, but the transformation to T is specified. Under the AFT model, the effect of the covariates on the failure time is direct acceleration or deceleration of the time to failure, which is easier to interpret and more relevant to clinicians.

Because of the dependent right censoring induced from length-biased sampling, unconventional approaches to adjust for the non-informative censoring are required under the AFT models. We assume that the censoring time is independent of the covariates. Based on the joint distribution of (A, Y) and C conditional on covariates Z, we can derive the expectation of the following quantity to be zero,


Accordingly, an unbiased estimating equation for α can be constructed as


where q is a positive, scalar weight function. By replacing the unknown quantity, w(Yi), by its consistent estimator ŵ(Yi), an asymptotic unbiased estimating equation follows:


Of note, the above estimating equation has a closed-form solution for α,


Let α0 be the true value of the regression coefficient vector. We impose the following regularity conditions besides conditions (b)-(d) for a rigorous justification of the asymptotic properties of α^:

  • f
  • g
    det(0t0D2(s){SC2(s)}dSC(s))<, where D(t)=E[{q(Z)δZI(Ys)tYSC(u)du(logYZTα0)}{w2(Y)}];
  • h
    ΓAlimn{1ni=1nq(Zi)δiZi2w^(Yi)} is nonsingular. We show the consistency of α^ in the supplementary technical report by an argument similar to that for the consistency of β^ under the transformation models. The asymptotic normality of α^ is stated next.

Theorem 2. Under regularity conditions (b)-(d), and (f)-(h), n(α^α0) converges weakly to a normal distribution with mean zero and variance-covariance matrix ΓA1ΣAΓA1, in which ΓA is the Hessian matrix of the UA(α0) and ΣA is the asymptotic variance-covariance matrix of n−1/2UA(α0).

A detailed proof for Theorem 2 is provided in the Appendix. Essentially, by Taylor series expansion,


where Γn(α0) is the first derivative of UA(α0). By the asymptotic martingale expression of ŵ(y), we show the asymptotic i.i.d. representation of the difference between n−1/2UA(α0) and n−1/2ŨA(α0),


This representation, together with the consistency of 1nΓn(α0) and an application of the classic Central Limit and Slutsky's Theorem, implies that n(α^α0) converges in distribution to a zero-mean normal distribution.

The Hessian matrix ΓA and the variance-covariance matrix ΣA can be consistently estimated by


respectively, where


and Λ^C(u) is the Nelson-Aalen estimator for the cumulative hazard function of the censoring times.

2.3 Remarks

We have restricted our attention to the setting in which the censoring distribution is independent of the covariates for both model structures. However, it is not conceptually difficult to generalize derivations to the setting with a covariate-dependent censoring distribution. Without loss of generality, we assume that the covariate Z can be discretized to a finite number of possible values for the covariate-specific censoring distribution if the censoring variable is not independent of Z. Taking the transformation model as an example, when sample size is large enough within each subgroup according to discretized covariates, the estimating equations UT(β) under the transformation models can be extended accordingly for a covariate-specific censoring distribution as follows,


where w^(tZl)=0tS^C(uZl)du, and ŜC(u|Zl) is the Kaplan-Meier estimator for the censoring variable C given pairs (YlAl, δl) for which Zl = Z(l = 1, · · · , n). Note that covariate Z does not have to be discretized in modeling the failure time. We may also fit a semiparametric (e.g. Cox model, accelerated failure time model) or parametric model to take into consideration the covariate dependence of the censoring variable, and plug covariate-specific censoring distribution SC(t|Z) into the estimating equation (8) for the transformation models or into the estimating equation (7) for the AFT models. Alternatively, one may consider the profile likelihood approach for length-biased data, in which the distribution of censoring variable C is not required to be estimated. Further research is required to study the tradeoff between robustness and efficiency of various alternative methods; but this is beyond the scope of this paper.

Another note is that either the transformation model assumption or the accelerated failure time model assumption is not invariant for population data and length-biased data in general. For example, the proportional hazards model assumption for the population samples would not lead to the same model assumption for the length-biased samples. Unless evaluating trivial cases for which the risk factors have no impact on the failure times, the model assumptions may be satisfied using both the population sample and the length-biased sample. Another special parametric case is when H(.) is of a log-transformation and ε follows the extreme value distribution, the covariate-specific density functions for T and T both follow the Gamma distribution, but with different parameters.

3 Simulations

3.1 Data Generation

Generating length-biased failure time data is often difficult from a renewal process. Using the following approach, we can simplify the data generating procedure to obtain right-censored, length-biased data under the semiparametric models. We first generated independent pairs of (Ai, Ti) from A ~ U(0, τ) and T ~ fU(t|z), and we kept the pairs with Ai < Ti, which are the length-biased samples, denoted by Ti = Ai + Vi. For simplicity of notation, we use fU(t) instead of fU(t|z) in this section. Thus, the length-biased sample follows the conditional distribution,


where Pr(T = t, A < T) = Pr(T = t, A < t) = fU(t)t/τ, for t < τ, fU(t) for tτ, and




To obtain length-biased data, we should choose τ larger than the upper bound of T, so that P(T = t|A < T) = 0 for t > τ. For example, if T is the length-biased outcome for age of death, then we shall let τ > 100 years. Under both proportional hazards and proportional odds models with a baseline Gamma survival function (i.e. H(t)=2log(t)), ττfU(t)dt=τ(1FU(τ)) approximates to zero if τ → ∞. Therefore, while the probability mass at tail (tτ) is negligible, for t < τ


The censoring variables measured from the examination time (Ci) are generated from uniform distributions for various censoring percentages. The censoring indicator is obtained by δi = I(TiCi + Ai). We considered regression models with two independent covariates. Z1 is generated from a Bernoullli variable and Z2 from a uniform variable on (0,1). Each scenario comprised 1000 runs. Cohort sizes of 200 or 400 were used.

3.2 Transformation Models

First, we considered the underlying population distribution of T to follow a proportional odds model. The average bias, 95% coverage probability, and mean squared errors (MSEs) are summarized in Table 1. We found that the biases decreased with the increase of cohort size, and the MSEs slightly increased with the percentage of censoring. In general, the coverage probabilities were reasonably close to the nominal level but tended to be slightly larger than .95, because the mean standard errors obtained from Γ^T1Σ^TΓ^T1 were somewhat overestimated for the empirical ones, especially with small right censoring percentage.

Table 1
Simulation results under the proportional odds model: Estimated point estimates, coverage probabilities and MSE

We generated unbiased data from another important special case, the proportional hazards model. We also assumed that the baseline cumulative hazard function was t2, so that the covariate specific survival function followed a Weibull distribution. We compared the estimators obtained from the proposed method with those from the naive Cox proportional hazards model when ignoring length-biased sampling. The results are summarized in Table 2. When β = (0, 0), the estimators from the proposed method and the usual Cox model are comparable and virtually unbiased with appropriate coverage probabilities. Under the null hypothesis, it is clear that the length-biased samples and the population samples both follow the proportional hazards model. In contrast, when β ≠ 0, the length-biased samples will not follow the proportional hazards model, if the population samples follow the proportional hazards model. Therefore, the estimators obtained from the naive Cox model can be substantially biased (overestimated in this case) compared with those from the proposed method. We found that the 95% coverage probabilities varied from 28% to 84%.

Table 2
Simulation results under the proportional hazards model: Biases and coverage probabilities for the proposed method (LB) with the naive Cox model

3.3 AFT Models

Next we evaluated the performance of the proposed estimators for the AFT models and compared them with the estimators from the following naive estimating equation ignoring length-biased sampling,


Note that UA2 is an asymptotic unbiased estimating equation for traditional survival data under the AFT models. Here, the distribution of the random error, εA, was assumed to be Uniform (−0.5, 0.5). Table 3 summarizes the performance of estimators with and without adjusting the length-biased sampling. The proposed estimators achieved outstanding accuracy: the empirical biases were less than 1% and the empirical coverage probabilities of the 95% confidence intervals were quite consistent with the nominal level for the scenarios investigated.

Table 3
Simulation results under accelerated failure time model: Biases and coverage probabilities for the proposed method (Eq.UA) and the naive method (Eq.UA2)

By the naive estimating equation UA2, the estimators for the intercept were substantially biased (overestimated) and the associated inference were inaccurate, especially when censoring percentage was larger than 10%. For settings with null covariate effects, both estimation procedures led to virtually unbiased estimators for covariate coefficients with appropriate coverage probabilities. In the presence of non-zero covariate effects, the average biases of the estimated covariate effects by the estimating equation UA2 increased and the associated coverage probabilities decreased with increasing degree of censoring.

3.4 Censoring Times Dependent on Covariate

We have also performed additional simulation studies to evaluate the robustness of the estimating equations to mis-specification of the censoring model used. We simulated survival times from the model described in Table 2 (proportional hazards) and Table 3 (AFT), but the censoring times were dependent on one of the two covariates, Z2. The choice of covariate coefficient for Z2, γ reflected the degree of dependence between the censoring time and the covariate. In the lower panels of Table 2 and Table 3, we list both biases and empirical coverage probabilities, while ignoring the censoring dependent on Z2 in the corresponding estimating equations. With a total sample size of 200, the results show that the estimation procedures are reasonably robust to the dependent censoring in general, though the biases slightly increase when the dependence to Z2 gets stronger. Thus, if there is no strong indication for dependent censoring, we may use the overall nonparametric estimator for the censoring distribution, as shown by our empirical study.

4 A Real Data Example

To illustrate the proposed methods, we applied our methods to analyze a large prevalent cohort study with time to death data for subjects diagnosed with dementia. Dementia is a progressive degenerative medical condition, and is one of the leading causes of all deaths in the United States and Canada. The Canadian Study of Health and Aging (CSHA) study was a multicenter epidemiologic study, in which more than 14,000 subjects who were 65 years or older were randomly chosen to receive an invitation for a health survey throughout Canada, and a total of 10,263 subjects agreed to participate (Wolfson et al, 2001). The participants were then screened for dementia starting in 1991, and 1132 people were identified as having the disease.

All dementia patients had been followed until 1996 and their dates of dementia onset were ascertained from their medical records, and their dates of death or last follow-up were recorded prospectively from the time of screening. The relevant data included three dementia categories (probable Alzheimer's disease, possible Alzheimer's disease and vascular dementia), approximate date of onset, date of screening for dementia, date of death or censoring and death indicator variable. Excluding subjects with missing date of disease onset or missing classification of dementia sub-type, there were a total of 818 patients remained. We used data collected as part of the CSHA to examine if the survival distributions were di erent among three sub-types of dementia diagnosis. Given prevalent cases ascertained cross-sectionally, the stationarity assumption, which is defined as that the incidence of dementia did not change over the period of the study, was validated in Addona and Wolfson (2006). It is clear that patients with worse prognosis of dementia were more likely to die before the study recruitment, therefore the remaining cohort of patients was a collection of length-biased samples.

Among them, 393 patients had probable Alzheimer's disease, 252 had possible Alzheimer's disease and 173 had vascular dementia. At the end of this study, 638 out of 818 patients died and the others were right censored. Using the type of probable Alzheimer's disease as the baseline, we defined two indicator variables for the other two subtypes of dementias. We assess whether the subtype of dementia at diagnosis affects time to death. Since there was no evidence of dependence between the censoring time measured from the study recruitment and the subtype of dementias, it is appropriate to use the estimating equation (4) or equation (7) to make inferences about regression coeffcients. Moreover, given the sampling mechanism for this study the residual censoring time is independent of the residual survival time, and the truncation time A.

We analyzed the above data set using the proportional hazards model, the proportional odds model, and the accelerated failure time model. Applying the proposed methods for length-biased data, the estimated covariate effects of two subtypes of dementias and their standard errors are listed in Table 4. Under the proportional hazards model, the estimated covariate coefficient (s.e) is .241 (.211) for vascular dementia at diagnosis, and -.110 (.201) for possible Alzeimer's disease relative to the probable Alzheimer's disease (with q = 1). The results under the proportional odds model and the accelerated failure time model with length-biased adjustment lead to conclusions similar to those obtained under the proportional hazards model. The data show that the diagnosis of three subtypes of dementia had little difference in long-term survival using the methods that adjusted for length-biased sampling, which was consistent with the nonparametric survival estimators provided in Wolfson et al (2001).

Table 4
Dementia example: Regression coefficient estimates and standard errors (SE)

We also analyzed the the same data set using the estimating equations proposed in Cheng et al (1995) under the proportional hazards and proportional odds models and the estimating equation UA2 under the accelerated failure time model for traditional right-censored data (without considering length-biased sampling). And the results are listed in Table 4 under the column of “Cheng” or “Eq. UA2”. Similar to the results from the simulation studies, under the proportional hazards and proportional odds models, the covariate effects could be overestimated using the “naive” approaches under the same model assumptions (Cheng et al., 1995), when the length-biased sampling is ignored. Comparing the results from the naive methods, we found a weaker relationship between subtype of dementias at diagnosis and the elapsed time to death under the bias-adjusted models (smaller Z-values), though all methods and models did not yield statistically significant difference among the subtypes of dementia. Under the accelerated failure time model, the difference between the estimated intercepts with and without length-biased adjustment indicates that the use of an approach ignoring the length-biased sampling may lead to a substantial overall underestimation of the deleterious effects of dementia. Specifically, when ignoring the length-biased sampling, the estimated median survival duration was 8.88 years (95% CI: 8.43-9.34) for probable Alzheimer's disease, 9.00 years (95% CI: 8.13-9.56 ) for vascular dementia, and 9.40 years (95% CI: 8.75-10.09 ) for possible Alzheimer's disease. When adjusting for length-biased sampling, the estimated median survival duration was 3.71 years (95% CI: 2.95-4.67) for probable Alzheimer's disease, 3.44 years (95% CI: 3.02-3.92) for vascular dementia, and 4.22 years (95% CI: 3.49-5.10) for possible Alzheimer's Disease.

5 Discussion

We have proposed the inference methods to evaluate covariate effects on time-to-event data with right censoring under the semiparametric transformation models and the semiparametric accelerated failure time models, when the observed data are subject to length-biased sampling. Under both models, we impose the semiparametric model structure on the population sample, which is often of primary interest, instead of on the length-biased sample. The interpretation of the regression coefficients in the population model is straightforward. Under the transformation models, one can take advantage of the stochastic ordering preserved for length-biased times and unbiased times, thereby construct the estimating equations using observed length-biased times. The estimation procedures can be easily implemented and applied for various models under the class of transformation models, which include important special cases such as proportional hazards and proportional odds models. In contrast to the pairwise comparison of the data in the estimating equations under the transformation models, the proposed estimating equation approach under the AFT model is based on the least-squares principle. It is appealing for practical applications because of the elegant closed form of the solution for the estimator of α, which can be obtained easily without using an iterative numerical algorithm.

When directly using the Kaplan-Meier estimator, ŜC(t), as an inverse weight, the stability of the weight function at the tail is of practical concern. Our weight function, in contrast, is much more robust especially regarding its behavior at the tail, because the weight is an inverse of the integral of the Kaplan-Meier estimator, w^(t)=0tS^C(u)du. The potential instability for the estimating equation caused by SC(t) → 0 under heavy censoring is moderated by integrating the estimated survival distribution over the observed failure times, because the area under the survival curve, 0tSC(x)dx, would not approximate to zero under either heavy or light censoring. Some theoretical comparison of the asymptotic efficiency between estimating equations using ŜC(t) and ŵ(t) as inverse weights is provided in the supplementary technical report. One common challenge in the use of Kaplan-Meier estimator is the undefined nonparametric estimator beyond the largest observed censoring time. One possible solution is to use a parametric model fitted with the data up to the largest observed censoring time, and then to extrapolate the curve for the censoring distribution. However, from our empirical studies, the impact of using the imperfect survival distribution estimator (extrapolated from the last observed censoring time) seems to be limited for the moderately and heavily censored C.

Theoretically, we can find the optimal weight function q in the estimating equations to achieve the maximum efficiency for the estimation of regression coefficients. Because the optimal weight function consists of the true value of the regression coefficient vector, the use of such an optimal weight is less feasible. As in the literature (e.g., Cheng et al 1995; Fine 1999), q = 1 is a practical and reasonable choice in applications. An alternative approach is to use an initial consistent estimator of regression coefficients for the optimal weight function and calculate a single Newton-like update of the solution to the estimating equation (Gray, 2000).

One potential challenge with the proposed semiparametric models is the inclusion of time-varying covariates for length-biased data, similar to that for traditional survival data in Cheng et al (1995). It is possible to extend the maximum likelihood approaches of Zeng and Lin (2007a, 2007b), which allow time varying covariates under the transformation models or the AFT models for traditional survival data, to length-biased data. It is of interest to estimate the distribution of the baseline function, thus predicting unbiased covariate-specific survival distribution from the linear transformation model with length-biased data. However, a substantial research effort would be required to establish the large sample properties and computational algorithms in this more complicated setting. Specifically, modern empirical process approaches are required to prove the tightness of the estimated quantities involving t. It is our intention in future research to develop methods for the purpose of prediction.

The issues of biased sampling discussed in this work extend beyond length-biased data. The importance of correctly modeling the risk factors in a cohort with selection bias has been discussed in recent literature (Begg, 2002), in which Begg pointed out the over-representation of risk factors such as genetic abnormalities for breast cancer in population-based case proband studies, because adjustments are not made under the size-biased sampling scheme. Although this paper and Begg's work share a common concern regarding the potential bias in estimating covariate effects using naive inference methods under biased sampling (length-biased sampling is a special type of biased sampling mechanism), our data structure with time-to-event outcome is different from the one with a binary outcome under size-biased sampling discussed by Begg (2002), and the proposed method for length-biased data with right-censoring may not be directly applicable to estimation with the size-biased sampling schemes. New method for correcting the bias caused by size-biased sampling is needed.

Supplementary Material

supplementary file


This work was supported in part by the U.S. National Institutes of Health. We thank the editor, the associate editor and three referees for their constructive suggestions that improved the article. We also thank Professor Masoud Asgharian and investigators of the Canadian Study of Health and Aging (CHSA) for providing us the dementia data from CHSA. The data reported in the example were collected as part of the CHSA. The core study was funded by the Seniors’ Independence Research Program, through the National Health Research and Development Program of Health Canada (Project no.6606-3954-MC(S)). Additional funding was provided by Pfizer Canada Incorporated through the Medical Research Council/Pharmaceutical Manufacturers Association of Canada Health Activity Program, NHRDP Project 6603-1417-302(R), Bayer Incorporated, and the British Columbia Health Research Foundation Projects 38 (93-2) and 34 (96-1). The study was coordinated through the University of Ottawa and the Division of Aging and Seniors, Health Canada.


Consistency of β^

Under regularity conditions (a)-(c) and (e), similar to the work of Cheng et al. (1995), we can prove there is a unique solution to the equation UT(β) = 0 if the weights in UT(β) are positive. With probability one, the quantity, n2UTT(β)(ββ0), converges to


where F is the distribution function of Z and Z12 = Z1Z2. The above quantity is nonnegative, because ζ(.) is a decreasing function. Thus, β^ is consistent, as long as the weight, q(.) is positive.

Asymptotic distribution of UT(β0)

We derive the asymptotic properties of UT(β0) when the censoring distribution is independent of the covariate vector Z. The uniform consistency of the Kaplan-Meier estimator, ŜC(t), over [0, t0] (e.g., Gill 1982) entails the uniform consistency of ŵ(t), i.e.,


because suptt00t(S^C(u)SC(u))dusuptt00tS^C(u)SC(u)du0 in probability given condition (c). Recall that


Note that condition Pr(δ = 1) > 0 assures the existence of the estimating equation UT(β) in (2.4). Using arguments similar to those in Cheng et al (1995) or Fine (1999), together with the uniform consistency of ŵ(t) to w(t), it is easy to show that


Following from a martingale integral representation for n(w^(t)w(t)) by Pepe and Fleming (1989, 1991), we can re-express n(w^(t)w(t)) as a martingale integral via integration by parts


where hi(t)=I(tYi)[tYiSC(u)du]w(Yi), π(t)=SC(t)SV(t),Mk(t)=I(YkAkt,Δk=0)0tI(YkAku)dΛC(u) is the martingale for the residual censoring variable, and ΛC(u) is the corresponding cumulative hazard function.

To simplify the notation, let


By inserting (10), aij(β0), and b(t) into (9), we see that n−3/2UT(β0) can be asymptotically decomposed to a U-statistic Ũ(β0) and a martingale integral V(β0),




and V(β0)=n12k=1n0b(t)π(t)dMk(t). Note that the martingale integral V(β0) can also be written in the form of a U-statistic, so that


Given the above multivariate U-statistic representation, weak convergence of n−3/2UT(β0) follows from the standard distribution theory (e.g., Wei and Johnson 1985; Serfling 1980, pp. 192). Both Ũ(β0) and V(β0) are naturally tight because they do not involve t. The asymptotic distribution of UT(β0) with discrete covariates can be derived similarly.

Asymptotic distribution of α^

By Taylor series expansion,


where Γn(α0) is the first derivative of UA(α0) and 1nΓn(α0) converges in probability to the Hessian matrix of the UA(α0), ΓA. By the uniform consistency of ŵ(t) to w(t), it is easy to show


Define D(t)=E[{q(Z)δZI(Ys)tYSC(u)du(logYZTα0)}{w2(Y)}]. Then by inserting (10) into (11), we have


Hence, under regularity conditions (c)-(e), n−1/2UA(α0) is asymptotically normally distributed by the Central Limit Theorem. This, combined with an application of Slutsky's theorem, implies that n(α^α0) converges weakly to a normal distribution with mean zero and variance-covariance matrix ΓA1ΣAΓA1, in which ΣA is the asymptotic variance-covariance matrix of n−1/2UA(α0).

Contributor Information

Yu Shen, Department of Biostatistics M. D. Anderson Cancer Center The University of Texas, Houston, TX 77030 ; gro.nosrednadm@nehsy Phone: 713-794-4159, Fax: 713-563-4242.

Jing Ning, Department of Biostatistics M. D. Anderson Cancer Center The University of Texas, Houston, TX 77030.

Jing Qin, Biostatistics Research Branch National Institute of Allergy and Infectious Diseases Bethesda, MD 20892.


  • Addona V, Wolfson DB. A formal test for the stationarity of the incidence rate using data from a prevalent cohort study with follow-up. Lifetime Data Anal. 2006;12:267–284. [PubMed]
  • Asgharian M. Comment on “Goodness-of-fit tests for parametric models based on biased samples” (2002V30 p475-490) The Canadian J. Statist. 2003;31:349–350.
  • Asgharian M, M'lan CM, Wolfson DB. Length-biased sampling with right censoring: an unconditional approach. J Am. Statist. Assoc. 2002;97:201–209.
  • Asgharian M, Wolfson DB. Asymptotic behavior of the unconditional npmle of the length-biased survivor function from right censored prevalent cohort data. Ann. Statist. 2005;33:2109–2131.
  • Asgharian M, Wolfson DB, Zhang X. Checking stationarity of the incidence rate using prevalent cohort survival data. Statist. Med. 2006;25:1751–1767. [PubMed]
  • Begg CB. On the use of familial aggregation in population-based case probands for calculating penetrance (with editorial) J Natl Cancer Inst. 2002;94:1221–1226. [PubMed]
  • Buckley J, James I. Linear Regression with Censored Data. Biometrika. 1979;66:429–36.
  • Cheng SC, Wei LJ, Ying Z. Analysis of transformation models with censored data. Biometrika. 1995;82:835–845.
  • Cox DR, Miller HD. The theory of stochastic processes. Chapman and Hall Ltd; London; New York: 1977.
  • Cox DR, Oakes D. Analysis of Survival Data. Chapman and Hall; London: 1984.
  • Dabrowska DM, Doksum KA. Estimation and testing in a two-sample generalized odds-rate model. J Am. Statist. Assoc. 1988;83:744–749.
  • Fine JP. Analysing competing risks data with transformation models. J. R. Statist. Soc. B. 1999;61:817–830.
  • Gill RD. Censoring and Stochastic Integrals. Mathematisch Centrum; Amsterdam: 1982. Mathematical Centre Tract No. 124.
  • Gill RD, Vardi Y, Wellner JA. Large sample theory of empirical distributions in biased sampling models. Ann Statist. 1988;16:1069–1112.
  • Gray RJ. Estimation of regression parameters and the hazard function in transformed linear survival models. Biometrics. 2000;56:571–576. [PubMed]
  • Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. Wiley; New York: 2002.
  • Lagakos SW, Barraj LM, De Gruttola V. Nonparametric analysis of truncated survival data, with application to AIDS. Biometrika. 1988;75:515–523.
  • Lai TL, Ying Z. Large Sample Theory of a Modified Buckley-James Estimator for Regression Analysis with Censored Data. Annals of Statistics. 1991;19:1370–1402.
  • Lin DY, Ying Z. Semiparametric Inference for the Accelerated Life Model with Time-dependent Covariates. Journal of Statistical Planning and Inference. 1995;44:47–63.
  • Pepe M, Fleming TR. Weighted Kaplan-Meier statistics: A class of distance tests for censored survival data. Biometrics. 1989;45:497–507. [PubMed]
  • Pepe M, Fleming TR. Weighted Kaplan-Meier statistics: Large sample and optimality considerations. J. R. Statist. Soc. B. 1991;53:341–352.
  • Prentice RL. Linear Rank Tests with Right Censored Data. Biometrika. 1978;65:167–79.
  • Ritov Y. Estimation in a Linear Regression Model with Censored Data. Annals of Statistics. 1990;18:303–28.
  • Serfling RJ. Approximation theorems of mathematical statistics. John Wiley & Sons; New York; Chichester: 1981.
  • Sansgiry P, Akman O. Transformations of the lognormal distribution as a selection model. American Statist. 2000;54:307–309.
  • Simon R. Length-biased sampling in etiologic studies. Am J Epidemiol. 1980;111:444–452. [PubMed]
  • Tsiatis AA. Estimating Regression Parameters Using Linear Rank Tests for Censored Data. Annals of Statistics. 1990;18:354–372.
  • Turnbull BW. The empirical distribution function with arbitrarily grouped, censored and truncated data. J. R. Statist. Soc. B. 1976;38:290–295.
  • Vardi Y. Nonparametric estimation in the presence of length bias. Ann. Statist. 1982;10:616–620.
  • Vardi Y. Empirical distributions in selection bias models (Com: p204-205) Ann. Statist. 1985;13:178–203.
  • Vardi Y. Multiplicative censoring, renewal processes, deconvolution and decreasing density: Nonparametric estimation. Biometrika. 1989;76:751–761.
  • Wang MC. Nonparametric estimation from cross-sectional survival data. J Am. Statist. Assoc. 1991;86:130–143.
  • Wang MC. Hazard regression analysis for length-biased data. Biometrika. 1996;83:343–354.
  • Wei LJ, Johnson WE. Combing dependent tests with incomplete repeated measurements. Biometrika. 1985;72:359–64.
  • Wolfson C, Wolfson DB, Asgharian M, M'Lan CE, Ostbye T, Rockwood K, Hogan DB, the Clinical Progression of Dementia Study Group A reevaluation of the duration of survival after the onset of dementia. New Engl. J. Med. 2001;344:1111–1116. [PubMed]
  • Zelen M, Feinleib M. On the theory of screening for chronic diseases. Biometrika. 1969;56:601–614.
  • Zelen M. Forward and backward recurrence times and length biased sampling : Age specific models. Lifetime Data Anal. 2004;10:325–34. [PubMed]
  • Zeng D, Lin DY. Maximum likelihood estimation in semiparametric regression models with censored data. J. R. Statist. Soc. B. 2007a;69:507–564.
  • Zeng D, Lin DY. E cient Estimation for the Accelerated Failure Time Model. J Am. Statist. Assoc. 2007b;102:1387–1396.