

J Am Stat Assoc. Author manuscript; available in PMC 2010 November 4.

Published in final edited form as:

J Am Stat Assoc. 2009 September 1; 104(487): 1192–1202.

doi: 10.1198/jasa.2009.tm08614

PMCID: PMC2972554

NIHMSID: NIHMS248000

Yu Shen, Department of Biostatistics, M. D. Anderson Cancer Center, The University of Texas, Houston, TX 77030; Email: yshen@mdanderson.org; Phone: 713-794-4159; Fax: 713-563-4242.


Right-censored time-to-event data are often observed from a cohort of prevalent cases that are subject to length-biased sampling. Informative right censoring of data from the prevalent cohort within the population often makes it difficult to model risk factors on the unbiased failure times for the general population, because the observed failure times are length biased. In this paper, we consider two classes of flexible semiparametric models: the transformation models and the accelerated failure time models, to assess covariate effects on the population failure times by modeling the length-biased times. We develop unbiased estimating equation approaches to obtain the consistent estimators of the regression coefficients. Large sample properties for the estimators are derived. The methods are confirmed through simulations and illustrated by application to data from a study of a prevalent cohort of dementia patients.

Length-biased data are often encountered in observational studies, when the observed samples are not randomly selected from the population of interest but with probability proportional to their length (Cox and Miller, 1965; Zelen and Feinleib, 1969; Simon, 1980; Vardi, 1982, 1985, 1989; Sansgiry and Akman, 2000; Zelen, 2004). The data set considered is often from a cross-sectional cohort of subjects diagnosed to have the disease at the time of examination, which is then followed for the occurrence of a subsequent event.

Two examples of such data are seen in a study of dementia and subsequent death, and a study of the natural history of lung cancer. In the first example, about 10,000 Canadians over the age of 65 were recruited and screened for prevalence of dementia. Investigators recorded the approximate date of onset of dementia and the subsequent time of death or censoring for individuals within the study population who were found to have dementia (Asgharian and Wolfson, 2005). The second example involves a prevalent cohort of elderly lung cancer patients. Individuals are sampled from the Surveillance, Epidemiology, and End Results (SEER) database. The sampling (or recruitment) times can be considered to follow a uniform distribution independent of failure times; only those individuals who had been diagnosed with lung cancer before the recruitment time and who were still alive at that time are eligible for inclusion. The data on each subject in the prevalent cohort include an initiating event (e.g., diagnosis of lung cancer) and a failure event (e.g., death) for subjects who had been sampled at an intermediate time. A common feature of data from a prevalent cohort is that the time measured from the first (or initiating) event to the terminal event is subject to potential left truncation and right censoring. Length-biased sampling occurs because the “observed” time intervals from initiation to failure within the prevalent cohort tend to be longer than those arising from the underlying distribution of the general population: individuals diagnosed with the disease have to survive to the examination or sampling time.

Extensive literature has focused on estimating the unbiased distribution given length-biased sampling, using methods conditional on the observed truncation times (Turnbull, 1976; Lagakos et al., 1988; Wang 1991) or an unconditional approach (Vardi, 1982, 1985; Gill et al., 1988). The latter approach often assumes that the initiation times follow a stationary Poisson process (known as the stationarity assumption), which is satisfied in many practical applications. Under the stationarity assumption, Asgharian et al. (2002) and Asgharian and Wolfson (2005) provided an unconditional nonparametric maximum likelihood estimation (NMLE) for the unbiased survival function in the presence of right censoring, and derived the asymptotic properties of the NMLE under that setting. Of note, a majority of the related literature has focused on one-sample estimation of the unbiased survival distribution given length-biased data. Only limited literature is available that describes modeling risk factors on the distribution of the underlying population. A seminal work by Wang (1996) described a semiparametric proportional hazards model to assess risk factors on length-biased data using a bias-adjusted risk set in constructing the pseudo-likelihood. However, a constraint not allowing data to be right censored had to be imposed due to the dependent censoring mechanism with the observed length-biased data. This subtle feature for informative censoring induced by the sampling scheme has been avoided by not allowing right censoring (Vardi, 1982, 1985; Wang, 1996) or has simply been ignored, as noted by Asgharian and Wolfson (2005).

In this article, we fill the gap by providing semiparametric models (transformation and accelerated failure time models) to examine risk factors on the population failure time via the observed length-biased data, in the presence of informative right censoring. We use the term “length-biased” data for left-truncated and right-censored data under the stationarity assumption. The transformation models and the accelerated failure time models have been extensively investigated for traditional right-censored survival data (e.g., Prentice 1978; Buckley and James 1979; Dabrowska and Doksum, 1988; Ritov 1990; Tsiatis 1990; Lai and Ying 1991; Cheng et al., 1995; Fine, 1999; Lin and Ying 1995; Kalbfleisch and Prentice 2002). However, the literature does not indicate that these models have been used for length-biased data. We estimate the covariate coefficients through constructed unbiased estimating equations, with an inverse weight chosen to adjust for the bias induced by the length-biased data and dependent censoring. We present the estimation procedures and the large sample properties of the estimators in Section 2. Under some mild regularity conditions, we show that the resulting estimators are consistent and asymptotically normal. In Section 3, we summarize the simulation results, and we include an application of the proposed method to the dementia study in Section 4. We offer a discussion in Section 5.

Assume $\tilde{T}$ to be the time measured from the initiating event to failure within the population of interest, *A* to be the time of examination for the disease (measured from the initiating event), *V* to be the time measured from examination to failure, and *C* to be the time from examination to censoring. With length-biased sampling, one can only observe *T* among those with $\tilde{T}$ > *A*, where *T* = *A* + *V* is the observed failure time. Here, *A* is also referred to as the truncation variable (sampling time or backward recurrence time) and *V* is the residual survival time (or forward recurrence time). See the following diagram describing the observed dementia data.

Let *Y _{i}* = min(

Recall that the observed failure time data (*T*_{1}, · · · , *T _{n}*) are a biased subset for population sample . The unbiased density function for ,

$$g\left(t\right)=\frac{t{f}_{U}\left(t\right)}{\mu},\phantom{\rule{1em}{0ex}}\mu ={\int}_{0}^{\infty}u{f}_{U}\left(u\right)du.$$
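As a numerical illustration of this length-biased density, the following minimal sketch assumes an exponential population distribution (an assumption made only for this example), for which *g* is a Gamma density with shape 2. Because $E_g(1/T)=1/\mu$, weighting each length-biased observation by 1/*T* recovers the population mean, whereas the raw sample mean is inflated. The function name `harmonic_mean_estimate` is hypothetical.

```python
import numpy as np

def harmonic_mean_estimate(t):
    """Estimate the population mean mu from length-biased times t.

    Under g(t) = t f_U(t) / mu we have E_g[1/T] = 1/mu, so
    n / sum(1/T_i) is a simple moment estimator of mu.
    """
    t = np.asarray(t, dtype=float)
    return t.size / np.sum(1.0 / t)

rng = np.random.default_rng(0)
mu = 3.0  # population mean of the assumed exponential f_U

# The length-biased version of Exponential(mean mu) is Gamma(shape=2, scale=mu).
t_biased = rng.gamma(shape=2.0, scale=mu, size=200_000)

naive_mean = t_biased.mean()               # near 2 * mu: biased upward
mu_hat = harmonic_mean_estimate(t_biased)  # near mu after inverse weighting
```

The same inverse-weighting idea, with 1/*T* later replaced by a weight that also accounts for censoring, underlies the estimating equations developed below.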

Given covariates, ** Z** =

$$g(t\mid \mathit{z})=\frac{t{f}_{U}(t\mid \mathit{z})}{{\int}_{0}^{\infty}u{f}_{U}(u\mid \mathit{z})du}:=\frac{t{f}_{U}(t\mid \mathit{z})}{\mu \left(\mathit{z}\right)},$$

(1)

where *g*(*t*|** z**) and

Under the transformation models, it is assumed that the true underlying failure time is linearly related to the covariates with various specified error distributions (Cheng, Wei and Ying, 1995). Specifically,

$$H\left(\stackrel{~}{T}\right)=-{\mathit{Z}}^{T}\beta +{\epsilon}_{T},$$

where *H* is an unknown increasing function, *ε _{T}* has a known density function and

We first consider a special case for the length-biased data without right censoring. For the unbiased variable $\tilde{T}$, we have the following conditional expectation given covariates,

$$\begin{array}{cc}\hfill E\left[I({\stackrel{~}{T}}_{i}\ge {\stackrel{~}{T}}_{j})\mid {\mathit{Z}}_{i},{\mathit{Z}}_{j}\right]\phantom{\rule{thickmathspace}{0ex}}& =\phantom{\rule{thickmathspace}{0ex}}P[H\left({\stackrel{~}{T}}_{i}\right)\ge H\left({\stackrel{~}{T}}_{j}\right)\mid {\mathit{Z}}_{i},{\mathit{Z}}_{j}]\hfill \\ \hfill & =\phantom{\rule{thickmathspace}{0ex}}P({\epsilon}_{Ti}-{\epsilon}_{Tj}\ge {\mathit{Z}}_{ij}^{T}\beta ):=\xi \left({\mathit{Z}}_{ij}^{T}\beta \right),\hfill \end{array}$$

where the data vectors for the *i*th and *j*th subjects are independent, * Z_{ij}* =

$$\xi \left({\mathit{Z}}_{ij}^{T}\beta \right)=\int \int I(t\ge s){f}_{U}(t\mid {\mathit{Z}}_{i}){f}_{U}(s\mid {\mathit{Z}}_{j})dsdt.$$

(2)

Using an approach similar to that of Cheng et al. (1995), we construct an unbiased estimating function for ** β**, given the observed length-biased data

$${U}_{1}\left(\beta \right)=\sum _{i<j}q\left({\mathit{Z}}_{ij}\right){\mathit{Z}}_{ij}\left\{\frac{I({T}_{i}\ge {T}_{j})-\xi \left({\mathit{Z}}_{ij}^{T}\beta \right)}{{T}_{i}{T}_{j}}\right\},$$

where *q* is a positive, scalar weight function. The estimating equation is unbiased because

$$E\left[\phantom{\mid}\frac{\{I({T}_{i}\ge {T}_{j})-\xi \left({\mathit{Z}}_{ij}^{T}\beta \right)\}}{{T}_{i}{T}_{j}}\mid {\mathit{Z}}_{i},{\mathit{Z}}_{j}\right]=\int \int \frac{\{I(t\ge s)-\xi \left({\mathit{Z}}_{ij}^{T}\beta \right)\}ts{f}_{U}(t\mid {\mathit{Z}}_{i}){f}_{U}(s\mid {\mathit{Z}}_{j})dsdt}{ts\mu \left({\mathit{Z}}_{i}\right)\mu \left({\mathit{Z}}_{j}\right)}=0$$

by using (1) and (2), when there is no right censoring for the length-biased data.
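The uncensored estimating function $U_1(\beta)$ can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: it takes $Z_{ij}=Z_i-Z_j$, and it uses the proportional hazards special case in which the error is standard extreme value, so that the difference of two independent errors is standard logistic and $\xi(s)=1/(1+e^{s})$. The names `xi_ph` and `u1` are hypothetical.

```python
import numpy as np

def xi_ph(s):
    """xi(s) = P(eps_i - eps_j >= s) for standard extreme value errors.

    Assumption for this sketch: the proportional hazards special case,
    where eps_i - eps_j follows the standard logistic distribution.
    """
    return 1.0 / (1.0 + np.exp(s))

def u1(beta, t, z, q=None):
    """Evaluate U_1(beta) for uncensored length-biased times.

    t : (n,) observed length-biased failure times
    z : (n, p) covariates; Z_ij is taken to be Z_i - Z_j (assumed)
    q : optional callable returning a positive scalar weight for Z_ij
    """
    t = np.asarray(t, dtype=float)
    z = np.asarray(z, dtype=float).reshape(len(t), -1)
    beta = np.atleast_1d(np.asarray(beta, dtype=float))
    total = np.zeros(z.shape[1])
    for i in range(len(t)):
        for j in range(i + 1, len(t)):
            z_ij = z[i] - z[j]
            weight = 1.0 if q is None else q(z_ij)
            resid = float(t[i] >= t[j]) - xi_ph(z_ij @ beta)
            # Dividing by T_i * T_j removes the length bias of each pair.
            total += weight * z_ij * resid / (t[i] * t[j])
    return total

# Two subjects: Z_12 = 1, I(T_1 >= T_2) = 1, xi(0) = 0.5,
# so U_1(0) = 1 * (1 - 0.5) / (2 * 1) = 0.25.
value = u1(beta=[0.0], t=[2.0, 1.0], z=[[1.0], [0.0]])
```

The double loop is quadratic in the sample size, which is adequate for a sketch; a vectorized version would be preferable for larger cohorts.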

When the observed failure time *T* from length-biased sampling is subject to potential right censoring, there are two difficulties. First, the indicators {*T _{i}* ≥

The primary purpose of this section is to derive unbiased estimating functions for covariates ** β** when the length-biased data are subject to dependent right censoring. Under the stationarity assumption, the joint distribution of (

$${f}_{A,V}(a,v\mid \mathit{Z}=\mathit{z})={f}_{U}(a+v\mid \mathit{z})I(a>0,v>0)/\mu \left(\mathit{z}\right)$$

(3)

as shown in earlier work (Zelen, 2004; Asgharian and Wolfson, 2005). We first assume that the censoring variable *C* is independent of the covariates. By extending the results of Asgharian and Wolfson (2005), we have the covariate-specific joint probability to observe a failure at *y* in the length-biased sampling, since the residual censoring time *C* is assumed to be independent of (*A, V*) given covariate ** Z** =

$$\mathrm{Pr}(Y\in (y,y+dy),A\in (a,a+da),C\ge y-a\mid \mathit{z})={f}_{A,V}(a,y-a\mid \mathit{z}){S}_{C}(y-a)dady,$$

where *S _{C}*(

$$\mathrm{Pr}(Y\in (t,t+dt),\delta =1\mid \mathit{z})={\int}_{0}^{t}{f}_{A,V}(t-v,v\mid \mathit{z}){S}_{C}\left(v\right)dvdt=\frac{{f}_{U}(t\mid \mathit{z})w\left(t\right)dt}{\mu \left(\mathit{z}\right)},$$

where $w\left(t\right)={\int}_{0}^{t}{S}_{C}\left(v\right)dv$. Given the above equations, the following conditional expectation for the observed pairwise length-biased failure times is zero, since

$$\begin{array}{cc}\hfill E\left[\phantom{\mid}{\delta}_{i}{\delta}_{j}\frac{I({Y}_{i}\ge {Y}_{j})-\xi \left({\mathit{Z}}_{ij}^{T}\beta \right)}{w\left({Y}_{i}\right)w\left({Y}_{j}\right)}\mid {\mathit{Z}}_{i},{\mathit{Z}}_{j}\right]& =\phantom{\rule{thickmathspace}{0ex}}\int \int \left[w\left(t\right)w\left(s\right){f}_{U}(t\mid {\mathit{Z}}_{i}){f}_{U}(s\mid {\mathit{Z}}_{j})\frac{I(t\ge s)-\xi \left({\mathit{Z}}_{ij}^{T}\beta \right)}{w\left(t\right)w\left(s\right)\mu \left({\mathit{Z}}_{i}\right)\mu \left({\mathit{Z}}_{j}\right)}\right]dtds\hfill \\ \hfill & =\phantom{\rule{thickmathspace}{0ex}}\frac{\int \int \{I(t\ge s)-\xi \left({\mathit{Z}}_{ij}^{T}\beta \right)\}{f}_{U}(t\mid {\mathit{Z}}_{i}){f}_{U}(s\mid {\mathit{Z}}_{j})dsdt}{\mu \left({\mathit{Z}}_{i}\right)\mu \left({\mathit{Z}}_{j}\right)}=0.\hfill \end{array}$$

The last equation holds by (2). If the censoring distribution is known, the estimating equation can then be constructed as

$${\stackrel{~}{\mathit{U}}}_{T}\left(\beta \right)=\sum _{i,j}q\left({\mathit{Z}}_{ij}\right){\mathit{Z}}_{ij}{\delta}_{i}{\delta}_{j}\left[\frac{I({Y}_{i}\ge {Y}_{j})-\xi \left({\mathit{Z}}_{ij}^{T}\beta \right)}{w\left({Y}_{i}\right)w\left({Y}_{j}\right)}\right].$$

When the censoring distribution is unknown, it is natural to replace it by its consistent Kaplan-Meier estimator. Thus, an asymptotically unbiased estimating equation is

$${\mathit{U}}_{T}\left(\beta \right)=\sum _{i,j}q\left({\mathit{Z}}_{ij}\right){\mathit{Z}}_{ij}{\delta}_{i}{\delta}_{j}\left[\frac{I({Y}_{i}\ge {Y}_{j})-\xi \left({\mathit{Z}}_{ij}^{T}\beta \right)}{\widehat{w}\left({Y}_{i}\right)\widehat{w}\left({Y}_{j}\right)}\right],$$

(4)

where $\widehat{w}\left(t\right)={\int}_{0}^{t}{\widehat{S}}_{C}\left(v\right)dv$, and *Ŝ _{C}* is the Kaplan-Meier estimator for censoring variable

- (a) *ξ*(.) < ∞ is a twice continuously differentiable function;
- (b) ***Z*** is a *p* × 1 vector of bounded covariates, not contained in a (*p* − 1)-dimensional hyperplane;
- (c) sup[*t* : *Pr*(*V* > *t*) > 0] ≥ sup[*t* : *Pr*(*C* > *t*) > 0] = *t*_{0}, and *Pr*(*δ* = 1) > 0;
- (d) ${\int}_{0}^{{t}_{0}}[{\left\{{\int}_{t}^{{t}_{0}}{S}_{C}\left(u\right)du\right\}}^{2}/\left\{{S}_{C}^{2}\left(t\right){S}_{V}\left(t\right)\right\}]d{S}_{C}\left(t\right)<\infty $, where *S*_{V}(*t*) is the survival function for the residual failure time;
- (e) ${\Gamma}_{T}\equiv {\text{lim}}_{n\to \infty}\left\{{n}^{-2}{\sum}_{i<j}q\left({\mathit{Z}}_{ij}\right){\delta}_{i}{\delta}_{j}{\mathit{Z}}_{ij}^{\otimes 2}\frac{{\xi}^{\prime}\left({\mathit{Z}}_{ij}^{T}\widehat{\beta}\right)}{\widehat{w}\left({Y}_{i}\right)\widehat{w}\left({Y}_{j}\right)}\right\}$ is nonsingular.

The estimating equations (4) have, asymptotically, a unique solution ** β** if the weight

Theorem 1. *Under regularity conditions (a)-(d)*,

$${n}^{-3/2}{\mathit{U}}_{T}\left({\beta}_{0}\right)\to N(0,{\Sigma}_{T}),$$

*in distribution as n* → ∞, *where t*_{0 }*is a finite time point defined in (c)*.

The derivation of the weak convergence of *n*^{−3/2}* U_{T}*(

$${\widehat{\Sigma}}_{T}=\frac{1}{{n}^{3}}\sum _{i<j<k}\left\{{\widehat{\eta}}_{ij}\left(\widehat{\beta}\right){\widehat{\eta}}_{ik}^{T}\left(\widehat{\beta}\right)+{\widehat{\eta}}_{ik}\left(\widehat{\beta}\right){\widehat{\eta}}_{ij}^{T}\left(\widehat{\beta}\right)+{\widehat{\eta}}_{ik}\left(\widehat{\beta}\right){\widehat{\eta}}_{jk}^{T}\left(\widehat{\beta}\right)+{\widehat{\eta}}_{jk}\left(\widehat{\beta}\right){\widehat{\eta}}_{ik}^{T}\left(\widehat{\beta}\right)\right\},$$

where

$$\begin{array}{c}\hfill {\widehat{\eta}}_{ij}\left(\widehat{\beta}\right)=q\left({\mathit{Z}}_{ij}\right){\mathit{Z}}_{ij}{\widehat{a}}_{ij}\left(\widehat{\beta}\right)+{\int}_{0}^{\infty}\frac{\widehat{\mathit{b}}\left(t\right)}{\widehat{\pi}\left(t\right)}\left[d{\widehat{M}}_{i}\left(t\right)+d{\widehat{M}}_{j}\left(t\right)\right],\hfill \\ \hfill {\widehat{M}}_{i}\left(t\right)=I({Y}_{i}-{A}_{i}\le t,{\Delta}_{i}=0)-{\int}_{0}^{t}I({Y}_{i}-{A}_{i}\ge u)d{\widehat{\Lambda}}_{C}\left(u\right),\hfill \end{array}$$

and ${\widehat{\Lambda}}_{C}\left(t\right)$ is the Nelson-Aalen estimator for the cumulative hazard function of the censoring variable $C$, $\widehat{\pi}\left(t\right)={\sum}_{i=1}^{n}I({Y}_{i}-{A}_{i}\ge t)/n$,

$$\widehat{\mathit{b}}\left(t\right)=\underset{n\to \infty}{\text{lim}}\frac{1}{{n}^{2}}\sum _{i<j}^{n}q\left({\mathit{Z}}_{ij}\right){\mathit{Z}}_{ij}{\widehat{a}}_{ij}\left(\widehat{\beta}\right)\left\{{\widehat{h}}_{i}\left(t\right)+{\widehat{h}}_{j}\left(t\right)\right\},$$

$${\widehat{a}}_{ij}\left(\widehat{\beta}\right)=\frac{{\delta}_{i}{\delta}_{j}\left[I({Y}_{i}\ge {Y}_{j})-\xi \left({\mathit{Z}}_{ij}^{T}\widehat{\beta}\right)\right]}{\widehat{w}\left({Y}_{i}\right)\widehat{w}\left({Y}_{j}\right)}.$$

Note that regularity condition (c) ensures that the support of the censoring variable *C* is no longer than the support of *V*, so that $\widehat{\mathit{b}}\left(t\right)/\widehat{\pi}\left(t\right)$ is well defined for any *t* ∈ [0, *t*_{0}]. Most observational studies and clinical trials have limited follow-up, so this condition is easily satisfied.

By the Taylor series expansion of ${\mathit{U}}_{T}\left(\widehat{\beta}\right)$ around ${\beta}_{0}$, $\sqrt{n}(\widehat{\beta}-{\beta}_{0})$ is asymptotically equivalent to ${n}^{-3/2}{\Gamma}_{T}^{-1}{\mathit{U}}_{T}\left({\beta}_{0}\right)$ by the Slutsky theorem and the delta method. Thus, under the regularity conditions (a)-(e), the distribution of $\sqrt{n}(\widehat{\beta}-{\beta}_{0})$ is approximated by a normal distribution as

$$\sqrt{n}(\widehat{\beta}-{\beta}_{0})\to N(0,{\Gamma}_{T}^{-1}{\Sigma}_{T}{\Gamma}_{T}^{-1}),$$

where ${\Gamma}_{T}$ is the Hessian matrix of the estimating function (4) and can be consistently estimated by

$${\widehat{\Gamma}}_{T}=\left\{{n}^{-2}\sum _{i<j}q\left({\mathit{Z}}_{ij}\right){\delta}_{i}{\delta}_{j}{\mathit{Z}}_{ij}^{\otimes 2}\frac{{\xi}^{\prime}\left({\mathit{Z}}_{ij}^{T}\widehat{\beta}\right)}{\widehat{w}\left({Y}_{i}\right)\widehat{w}\left({Y}_{j}\right)}\right\}.$$

Provided that the covariate vectors are not all on a hyperplane, the identifiability of ** β** is guaranteed given conditions (c) and (e).
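The weight $\widehat{w}$ and the estimating function (4) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the Kaplan-Meier estimator is applied to the residual times $Y_i-A_i$ with censoring treated as the event, $q\equiv 1$, $Z_{ij}=Z_i-Z_j$, ties receive no special handling, and the function *ξ* is supplied by the user. The names `km_censoring`, `w_hat`, and `u_t` are hypothetical.

```python
import numpy as np

def km_censoring(resid_time, delta):
    """Kaplan-Meier estimate of S_C from residual times Y - A.

    Censoring is the 'event' here, so jumps occur where delta == 0.
    Returns the distinct jump times and S_C just after each jump.
    """
    r = np.asarray(resid_time, dtype=float)
    cens = np.asarray(delta) == 0
    times = np.unique(r[cens])
    surv = np.empty(times.size)
    s = 1.0
    for k, tk in enumerate(times):
        at_risk = np.sum(r >= tk)
        d = np.sum((r == tk) & cens)
        s *= 1.0 - d / at_risk
        surv[k] = s
    return times, surv

def w_hat(t, times, surv):
    """Integrate the step function S_C from 0 to each value in t."""
    knots = np.concatenate(([0.0], times))      # left end of each step
    heights = np.concatenate(([1.0], surv))     # S_C on each step
    right = np.concatenate((times, [np.inf]))   # right end of each step
    t = np.atleast_1d(np.asarray(t, dtype=float))
    overlap = np.clip(np.minimum(t[:, None], right[None, :]) - knots[None, :],
                      0.0, None)
    return overlap @ heights

def u_t(beta, y, a, delta, z, xi):
    """Evaluate U_T(beta) in equation (4) with q = 1, over all pairs (i, j)."""
    y = np.asarray(y, dtype=float)
    delta = np.asarray(delta, dtype=float)
    z = np.asarray(z, dtype=float).reshape(len(y), -1)
    times, surv = km_censoring(y - np.asarray(a, dtype=float), delta)
    w = w_hat(y, times, surv)
    z_ij = z[:, None, :] - z[None, :, :]                 # shape (n, n, p)
    ind = (y[:, None] >= y[None, :]).astype(float)
    resid = ind - xi(z_ij @ np.atleast_1d(beta))         # shape (n, n)
    pair_w = np.outer(delta / w, delta / w)              # failures only
    return np.einsum("ij,ijp->p", pair_w * resid, z_ij)
```

A root of `u_t` in `beta` could then be located with a generic solver such as Newton's method; that step is omitted here.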

The accelerated failure time (AFT) model relates the logarithm of the failure time linearly to the covariates (Cox and Oakes 1984; Kalbfleisch and Prentice 2002),

$$\text{log}\phantom{\rule{thinmathspace}{0ex}}\stackrel{~}{T}={\mathit{Z}}^{T}\alpha +{\epsilon}_{A},$$

(5)

where *ε _{A}* has an unknown distribution function with mean zero. In contrast to the transformation models, the distribution of

Because of the dependent right censoring induced by length-biased sampling, unconventional approaches that adjust for the informative censoring are required under the AFT models. We assume that the censoring time is independent of the covariates. Based on the joint distribution of (*A, Y*) and *C* conditional on covariates ** Z**, we can derive the expectation of the following quantity to be zero,

$$\begin{array}{cc}\hfill E\left\{\phantom{\mid}\frac{\delta (\text{log}\phantom{\rule{thinmathspace}{0ex}}Y-{\mathit{Z}}^{T}\alpha )}{w\left(Y\right)}\mid \mathit{Z}=\mathit{z}\right\}& =\phantom{\rule{thickmathspace}{0ex}}E\left\{{\int}_{0}^{\infty}{\int}_{0}^{\infty}P(Y=y,A=a,\delta =1\mid \mathit{Z}=\mathit{z})\frac{(\text{log}\phantom{\rule{thinmathspace}{0ex}}y-{\mathit{z}}^{T}\alpha )}{w\left(y\right)}dady\right\}\hfill \\ \hfill & =\phantom{\rule{thickmathspace}{0ex}}E\left\{\frac{1}{\mu \left(\mathit{z}\right)}{\int}_{0}^{\infty}{\int}_{0}^{y}{f}_{U}(y\mid \mathit{Z}=\mathit{z}){S}_{C}(y-a)\frac{(\text{log}\phantom{\rule{thinmathspace}{0ex}}y-{\mathit{z}}^{T}\alpha )}{w\left(y\right)}dady\right\}\hfill \\ \hfill & =\phantom{\rule{thickmathspace}{0ex}}E\left\{\frac{1}{\mu \left(\mathit{z}\right)}{\int}_{0}^{\infty}{f}_{U}(y\mid \mathit{Z})(\text{log}\phantom{\rule{thinmathspace}{0ex}}y-{\mathit{z}}^{T}\alpha )dy\right\}=0.\hfill \end{array}$$

Accordingly, an unbiased estimating equation for ** α** can be constructed as

$${\stackrel{~}{U}}_{A}\left(\alpha \right)=\sum _{i=1}^{n}q\left({\mathit{Z}}_{i}\right){\delta}_{i}{\mathit{Z}}_{i}\frac{(\text{log}\phantom{\rule{thinmathspace}{0ex}}{Y}_{i}-{\mathit{Z}}_{i}^{T}\alpha )}{w\left({Y}_{i}\right)}=0,$$

(6)

where *q* is a positive, scalar weight function. By replacing the unknown quantity, *w*(*Y _{i}*), by its consistent estimator

$${\mathit{U}}_{A}\left(\alpha \right)=\sum _{i=1}^{n}q\left({\mathit{Z}}_{i}\right){\delta}_{i}{\mathit{Z}}_{i}\frac{(\text{log}\phantom{\rule{thinmathspace}{0ex}}{Y}_{i}-{\mathit{Z}}_{i}^{T}\alpha )}{\widehat{w}\left({Y}_{i}\right)}=0.$$

(7)

Of note, the above estimating equation has a closed-form solution for ** α**,

$$\widehat{\alpha}={\left\{\sum _{i=1}^{n}\frac{q\left({\mathit{Z}}_{i}\right){\delta}_{i}{\mathit{Z}}_{i}{\mathit{Z}}_{i}^{T}}{\widehat{w}\left({Y}_{i}\right)}\right\}}^{-1}\sum _{i=1}^{n}\frac{q\left({\mathit{Z}}_{i}\right){\delta}_{i}{\mathit{Z}}_{i}\phantom{\rule{thinmathspace}{0ex}}\text{log}\phantom{\rule{thinmathspace}{0ex}}{Y}_{i}}{\widehat{w}\left({Y}_{i}\right)}.$$
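The closed-form solution is a weighted least-squares fit of log *Y* on ***Z*** that uses only the observed failures, each weighted by $q(Z_i)/\widehat{w}(Y_i)$. The sketch below assumes the values $\widehat{w}(Y_i)$ have already been computed and are passed in as an array; the function name `aft_closed_form` and its argument layout are hypothetical.

```python
import numpy as np

def aft_closed_form(y, delta, z, w, q=None):
    """Closed-form solution of the weighted estimating equation (7).

    y     : (n,) observed times Y_i
    delta : (n,) failure indicators (1 = failure observed)
    z     : (n, p) covariates (include a column of ones for an intercept)
    w     : (n,) values of w_hat(Y_i), assumed computed beforehand
    q     : optional (n,) positive weights q(Z_i); defaults to 1
    """
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=float).reshape(len(y), -1)
    weights = np.asarray(delta, dtype=float) / np.asarray(w, dtype=float)
    if q is not None:
        weights = weights * np.asarray(q, dtype=float)
    zw = z * weights[:, None]
    lhs = zw.T @ z          # sum of q * delta * Z Z^T / w_hat
    rhs = zw.T @ np.log(y)  # sum of q * delta * Z log(Y) / w_hat
    return np.linalg.solve(lhs, rhs)
```

With no censoring and constant weights, the estimator reduces to ordinary least squares on the log scale, which provides a quick sanity check of an implementation.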

Let *α*_{0} be the true value of the regression coefficient vector. We impose the following regularity conditions besides conditions (b)-(d) for a rigorous justification of the asymptotic properties of $\widehat{\alpha}$:

- (f) $\text{det}\left(E{[\left\{\delta \mathit{Z}(\text{log}\phantom{\rule{thickmathspace}{0ex}}Y-{\mathit{Z}}^{T}{\alpha}_{0})\right\}/\left\{w\left(Y\right)\right\}]}^{\otimes 2}\right)<\infty $;
- (g) $\text{det}\left({\int}_{0}^{{t}_{0}}{\mathit{D}}^{\otimes 2}\left(s\right)/\left\{{S}_{C}^{2}\left(s\right)\right\}d{S}_{C}\left(s\right)\right)<\infty $, where $\mathit{D}\left(t\right)=E\left[\left\{q\left(\mathit{Z}\right)\delta \mathit{Z}I(Y\ge t){\int}_{t}^{Y}{S}_{C}\left(u\right)du(\text{log}\phantom{\rule{thickmathspace}{0ex}}Y-{\mathit{Z}}^{T}{\alpha}_{0})\right\}/\left\{{w}^{2}\left(Y\right)\right\}\right]$;
- (h) ${\Gamma}_{A}\equiv -{\text{lim}}_{n\to \infty}\left\{\frac{1}{n}{\sum}_{i=1}^{n}\frac{q\left({\mathit{Z}}_{i}\right){\delta}_{i}{\mathit{Z}}_{i}^{\otimes 2}}{\widehat{w}\left({Y}_{i}\right)}\right\}$ is nonsingular.

We show the consistency of $\widehat{\alpha}$ in the supplementary technical report by an argument similar to that for the consistency of $\widehat{\beta}$ under the transformation models. The asymptotic normality of $\widehat{\alpha}$ is stated next.

Theorem 2. *Under regularity conditions (b)-(d), and (f)-(h)*, $\sqrt{n}(\widehat{\alpha}-{\alpha}_{0})$ converges weakly to a normal distribution with mean zero and variance-covariance matrix ${\Gamma}_{A}^{-1}{\Sigma}_{A}{\Gamma}_{A}^{-1}$, *in which Γ_{A} is the Hessian matrix of the U_{A}*(

A detailed proof for Theorem 2 is provided in the Appendix. Essentially, by Taylor series expansion,

$${n}^{-1/2}{\mathit{U}}_{A}\left(\widehat{\alpha}\right)={n}^{-1/2}{\mathit{U}}_{A}\left({\alpha}_{0}\right)-\frac{1}{n}{\Gamma}_{n}\left({\alpha}_{0}\right)\sqrt{n}(\widehat{\alpha}-{\alpha}_{0})+{o}_{p}\left(1\right),$$

where **Γ*** _{n}*(

$${n}^{-1/2}{\mathit{U}}_{A}\left({\alpha}_{0}\right)-{n}^{-1/2}{\stackrel{~}{\mathit{U}}}_{A}\left({\alpha}_{0}\right)={n}^{-1/2}\sum _{i=1}^{n}{\int}_{0}^{\infty}\frac{\mathit{D}\left(s\right)}{{S}_{C}\left(s\right){S}_{V}\left(s\right)}d{M}_{i}\left(s\right)+{o}_{p}\left(1\right).$$

This representation, together with the consistency of ${\scriptstyle \frac{1}{n}}{\Gamma}_{n}\left({\alpha}_{0}\right)$ and an application of the classical central limit theorem and Slutsky's theorem, implies that $\sqrt{n}(\widehat{\alpha}-{\alpha}_{0})$ converges in distribution to a zero-mean normal distribution.

The Hessian matrix **Γ*** _{A}* and the variance-covariance matrix

$$\begin{array}{c}\hfill {\widehat{\Gamma}}_{A}=-\frac{1}{n}\sum _{i=1}^{n}\frac{q\left({\mathit{Z}}_{i}\right){\delta}_{i}{\mathit{Z}}_{i}^{\otimes 2}}{\widehat{w}\left({Y}_{i}\right)},\hfill \\ \hfill {\widehat{\Sigma}}_{A}=\frac{1}{n}\sum _{i=1}^{n}{\left\{q\left({\mathit{Z}}_{i}\right){\delta}_{i}{\mathit{Z}}_{i}\frac{\left(\text{log}\phantom{\rule{thinmathspace}{0ex}}{Y}_{i}-{\mathit{Z}}_{i}^{T}\widehat{\alpha}\right)}{\widehat{w}\left({Y}_{i}\right)}+{\int}_{0}^{\infty}\frac{\widehat{\mathit{D}}\left(t\right)d{\widehat{M}}_{i}\left(t\right)}{\sum _{k=1}^{n}I({Y}_{k}-{A}_{k}\ge t)/n}\right\}}^{\otimes 2},\hfill \end{array}$$

respectively, where

$$\begin{array}{c}\hfill \widehat{\mathit{D}}\left(t\right)=\frac{1}{n}\sum _{i=1}^{n}q\left({\mathit{Z}}_{i}\right)I(t\le {Y}_{i}){\delta}_{i}{\mathit{Z}}_{i}{\int}_{t}^{{Y}_{i}}{\widehat{S}}_{C}\left(u\right)du\frac{\left(\text{log}\phantom{\rule{thinmathspace}{0ex}}{Y}_{i}-{\mathit{Z}}_{i}^{T}\widehat{\alpha}\right)}{{\widehat{w}}^{2}\left({Y}_{i}\right)},\hfill \\ \hfill {\widehat{M}}_{i}\left(t\right)=I({Y}_{i}\le t,{\Delta}_{i}=0)-{\int}_{0}^{t}I({Y}_{i}\ge u)d{\widehat{\Lambda}}_{C}\left(u\right),\hfill \end{array}$$

and ${\widehat{\Lambda}}_{C}\left(u\right)$ is the Nelson-Aalen estimator for the cumulative hazard function of the censoring times.

We have restricted our attention to the setting in which the censoring distribution is independent of the covariates for both model structures. However, it is not conceptually difficult to generalize derivations to the setting with a covariate-dependent censoring distribution. Without loss of generality, we assume that the covariate ** Z** can be discretized to a finite number of possible values for the covariate-specific censoring distribution if the censoring variable is not independent of

$${\mathit{U}}_{T}^{\star}\left(\beta \right)=\sum _{i,j}q\left({\mathit{Z}}_{ij}\right){\mathit{Z}}_{ij}{\delta}_{i}{\delta}_{j}\left[\frac{I({Y}_{i}\ge {Y}_{j})-\xi \left({\mathit{Z}}_{ij}^{T}\beta \right)}{\widehat{w}({Y}_{i}\mid {\mathit{Z}}_{i})\widehat{w}({Y}_{j}\mid {\mathit{Z}}_{j})}\right],$$

(8)

where $\widehat{w}(t\mid {\mathit{Z}}_{l})={\int}_{0}^{t}{\widehat{S}}_{C}(u\mid {\mathit{Z}}_{l})du$, and *Ŝ _{C}*(

Another note is that neither the transformation model assumption nor the accelerated failure time model assumption is, in general, invariant between population data and length-biased data. For example, the proportional hazards model assumption for the population samples would not lead to the same model assumption for the length-biased samples. The exception is the trivial case in which the risk factors have no impact on the failure times; there the model assumptions may be satisfied by both the population sample and the length-biased sample. Another special parametric case arises when *H*(.) is a log-transformation and *ε* follows the extreme value distribution: the covariate-specific density functions for $\tilde{T}$ and *T* both follow the Gamma distribution, but with different parameters.

Generating length-biased failure time data is often difficult from a renewal process. Using the following approach, we can simplify the data generating procedure to obtain right-censored, length-biased data under the semiparametric models. We first generated independent pairs of (*A _{i}*,

$$\mathrm{Pr}(T=t\mid A<T)=\mathrm{Pr}(T=t,A<T)/\mathrm{Pr}(A<T)$$

where Pr(*T* = *t*, *A* < *T*) = Pr(*T* = *t*, *A* < *t*) = *f _{U}*(

$$\mathrm{Pr}(A<T)={\int}_{0}^{\infty}\mathrm{Pr}(T=t,A<t)dt={\int}_{0}^{\tau}{f}_{U}\left(t\right)tdt/\tau +{\int}_{\tau}^{\infty}{f}_{U}\left(t\right)dt.$$

Thus,

$$\mathrm{Pr}(T=t\mid A<T)=\{\begin{array}{cc}\frac{t{f}_{U}\left(t\right)}{{\int}_{0}^{\tau}{f}_{U}\left(t\right)tdt+\tau {\int}_{\tau}^{\infty}{f}_{U}\left(t\right)dt}\hfill & t<\tau \hfill \\ \frac{\tau {f}_{U}\left(t\right)}{{\int}_{0}^{\tau}{f}_{U}\left(t\right)tdt+\tau {\int}_{\tau}^{\infty}{f}_{U}\left(t\right)dt}\hfill & t\ge \tau .\hfill \end{array}\phantom{\}}$$

To obtain length-biased data, we should choose *τ* larger than the upper bound of *T*, so that *P*(*T* = *t*|*A* < *T*) = 0 for *t* > *τ*. For example, if *T* is the length-biased outcome for age at death, then we let *τ* > 100 years. Under both proportional hazards and proportional odds models with a baseline Gamma survival function (i.e., $H\left(t\right)=2\phantom{\rule{thickmathspace}{0ex}}\text{log}\left(t\right)$), $\tau {\int}_{\tau}^{\infty}{f}_{U}\left(t\right)dt=\tau (1-{F}_{U}\left(\tau \right))$ approaches zero as *τ* → ∞. Therefore, when the probability mass in the tail (*t* ≥ *τ*) is negligible, for *t* < *τ*

$$\mathrm{Pr}(T=t\mid A<T)\approx \frac{t{f}_{U}\left(t\right)}{{\int}_{0}^{\infty}t{f}_{U}\left(t\right)dt}.$$
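The truncation step above can be sketched in code. This is a minimal illustration under stated assumptions: truncation times are drawn as *A* ~ Uniform(0, *τ*), which is consistent with the expression for Pr(*A* < *T*) above; residual censoring times are Uniform(0, `cens_upper`); the observed time is taken as *Y* = *A* + min(*V*, *C*) with failure indicator *δ* = *I*(*V* ≤ *C*). The function `simulate_length_biased` and its parameters are hypothetical names introduced for the example.

```python
import numpy as np

def simulate_length_biased(n, draw_population, tau, cens_upper, rng):
    """Draw n right-censored, length-biased observations.

    draw_population(size, rng) returns draws of the unbiased failure time.
    Truncation times A are Uniform(0, tau); a pair is kept only if A < T.
    Residual censoring C ~ Uniform(0, cens_upper) is an assumed choice.
    Returns (a, y, delta): truncation time, observed time, failure indicator.
    """
    a_keep, t_keep = [], []
    kept = 0
    while kept < n:
        t = draw_population(10 * n, rng)
        a = rng.uniform(0.0, tau, size=t.size)
        ok = a < t  # left truncation: the subject must survive to sampling
        a_keep.append(a[ok])
        t_keep.append(t[ok])
        kept += int(ok.sum())
    a = np.concatenate(a_keep)[:n]
    t = np.concatenate(t_keep)[:n]
    v = t - a                                # residual survival time
    c = rng.uniform(0.0, cens_upper, size=n)
    y = a + np.minimum(v, c)                 # observed time since onset
    delta = (v <= c).astype(int)             # 1 if the failure is observed
    return a, y, delta
```

Choosing `tau` well beyond the bulk of the population distribution keeps the tail mass negligible, at the cost of a low acceptance rate of roughly E($\tilde{T}$)/*τ*.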

The censoring variables measured from the examination time (*C _{i}*) are generated from uniform distributions for various censoring percentages. The censoring indicator is obtained by

First, we considered the underlying population distribution of $\tilde{T}$ to follow a proportional odds model. The average bias, 95% coverage probability, and mean squared errors (MSEs) are summarized in Table 1. We found that the biases decreased as the cohort size increased, and the MSEs increased slightly with the percentage of censoring. In general, the coverage probabilities were reasonably close to the nominal level but tended to be slightly larger than .95, because the mean standard errors obtained from ${\widehat{\Gamma}}_{T}^{-1}{\widehat{\Sigma}}_{T}{\widehat{\Gamma}}_{T}^{-1}$ somewhat overestimated the empirical ones, especially with a small right-censoring percentage.

Table 1. Simulation results under the proportional odds model: point estimates, coverage probabilities, and MSEs

We generated unbiased data from another important special case, the proportional hazards model. We also assumed that the baseline cumulative hazard function was *t*^{2}, so that the covariate specific survival function followed a Weibull distribution. We compared the estimators obtained from the proposed method with those from the naive Cox proportional hazards model when ignoring length-biased sampling. The results are summarized in Table 2. When ** β** = (0, 0), the estimators from the proposed method and the usual Cox model are comparable and virtually unbiased with appropriate coverage probabilities. Under the null hypothesis, it is clear that the length-biased samples and the population samples both follow the proportional hazards model. In contrast, when

Next we evaluated the performance of the proposed estimators for the AFT models and compared them with the estimators from the following naive estimating equation ignoring length-biased sampling,

$${\mathit{U}}_{A2}\left(\alpha \right)=\sum _{i=1}^{n}{\delta}_{i}{\mathit{Z}}_{i}\frac{(\text{log}\phantom{\rule{thinmathspace}{0ex}}{Y}_{i}-{\mathit{Z}}_{i}^{T}\alpha )}{{\widehat{S}}_{C}\left({Y}_{i}\right)}.$$

Note that **U**_{A2} is an asymptotically unbiased estimating equation for traditional survival data under the AFT models. Here, the distribution of the random error, *ε _{A}*, was assumed to be Uniform (−0.5, 0.5). Table 3 summarizes the performance of the estimators with and without adjusting for the length-biased sampling. The proposed estimators achieved outstanding accuracy: the empirical biases were less than 1% and the empirical coverage probabilities of the 95% confidence intervals were quite consistent with the nominal level for the scenarios investigated.

Table 3. Simulation results under the accelerated failure time model: biases and coverage probabilities for the proposed method (Eq. *U*_{A}) and the naive method (Eq. *U*_{A2})

With the naive estimating equation *U*_{A2}, the estimators for the intercept were substantially biased (overestimated) and the associated inference was inaccurate, especially when the censoring percentage was larger than 10%. For settings with null covariate effects, both estimation procedures led to virtually unbiased estimators for the covariate coefficients with appropriate coverage probabilities. In the presence of non-zero covariate effects, the average biases of the covariate effects estimated by *U*_{A2} increased and the associated coverage probabilities decreased with an increasing degree of censoring.

We also performed additional simulation studies to evaluate the robustness of the estimating equations to misspecification of the censoring model. We simulated survival times from the models described in Table 2 (proportional hazards) and Table 3 (AFT), but the censoring times were dependent on one of the two covariates, *Z*_{2}. The choice of the covariate coefficient for *Z*_{2}, *γ*, reflected the degree of dependence between the censoring time and the covariate. In the lower panels of Table 2 and Table 3, we list both biases and empirical coverage probabilities obtained while ignoring the dependence of censoring on *Z*_{2} in the corresponding estimating equations. With a total sample size of 200, the results show that the estimation procedures are reasonably robust to the dependent censoring in general, though the biases increase slightly when the dependence on *Z*_{2} gets stronger. Thus, if there is no strong indication of dependent censoring, we may use the overall nonparametric estimator for the censoring distribution, as shown by our empirical study.

To illustrate the proposed methods, we applied them to a large prevalent cohort study with time-to-death data for subjects diagnosed with dementia. Dementia is a progressive degenerative medical condition, and is one of the leading causes of death in the United States and Canada. The Canadian Study of Health and Aging (CSHA) was a multicenter epidemiologic study, in which more than 14,000 subjects who were 65 years or older were randomly chosen throughout Canada to receive an invitation for a health survey, and a total of 10,263 subjects agreed to participate (Wolfson et al., 2001). The participants were then screened for dementia starting in 1991, and 1132 people were identified as having the disease.

All dementia patients were followed until 1996. Their dates of dementia onset were ascertained from their medical records, and their dates of death or last follow-up were recorded prospectively from the time of screening. The relevant data included three dementia categories (probable Alzheimer's disease, possible Alzheimer's disease and vascular dementia), approximate date of onset, date of screening for dementia, date of death or censoring, and a death indicator variable. After excluding subjects with a missing date of disease onset or a missing classification of dementia subtype, a total of 818 patients remained. We used data collected as part of the CSHA to examine whether the survival distributions differed among the three subtypes of dementia diagnosis. Given prevalent cases ascertained cross-sectionally, the stationarity assumption, namely that the incidence of dementia did not change over the period of the study, was validated in Addona and Wolfson (2006). Because patients with a worse prognosis of dementia were more likely to die before study recruitment, the remaining cohort of patients was a collection of length-biased samples.

Among them, 393 patients had probable Alzheimer's disease, 252 had possible Alzheimer's disease and 173 had vascular dementia. At the end of the study, 638 of the 818 patients had died and the others were right censored. Using probable Alzheimer's disease as the baseline, we defined two indicator variables for the other two subtypes of dementia. We assessed whether the subtype of dementia at diagnosis affects time to death. Since there was no evidence of dependence between the censoring time measured from study recruitment and the subtype of dementia, it is appropriate to use estimating equation (4) or equation (7) to make inferences about the regression coefficients. Moreover, given the sampling mechanism for this study, the residual censoring time is independent of the residual survival time and the truncation time *A*.

We analyzed the above data set using the proportional hazards model, the proportional odds model, and the accelerated failure time model. Applying the proposed methods for length-biased data, the estimated covariate effects of the two subtypes of dementia and their standard errors are listed in Table 4. Under the proportional hazards model, the estimated covariate coefficient (s.e.) is .241 (.211) for vascular dementia at diagnosis, and -.110 (.201) for possible Alzheimer's disease, relative to probable Alzheimer's disease (with *q* = 1). The results under the proportional odds model and the accelerated failure time model with length-biased adjustment lead to conclusions similar to those obtained under the proportional hazards model. Using the methods that adjust for length-biased sampling, the data show little difference in long-term survival among the three subtypes of dementia, which is consistent with the nonparametric survival estimators provided in Wolfson et al. (2001).
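As a quick check of what these estimates imply, the Wald statistic (estimate divided by standard error) can be computed from the proportional hazards figures quoted above. The sketch below assumes a normal approximation and two-sided p-values; the function name `wald` is ours and the resulting p-values are not figures reported in Table 4.

```python
import math


def wald(estimate, se):
    """Return the Wald Z-value and its two-sided normal p-value."""
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2.0))   # equals 2 * (1 - Phi(|z|))
    return z, p


z_vasc, p_vasc = wald(0.241, 0.211)     # vascular dementia vs probable AD
z_poss, p_poss = wald(-0.110, 0.201)    # possible AD vs probable AD
print(f"vascular: Z = {z_vasc:.2f}, p = {p_vasc:.2f}")
print(f"possible: Z = {z_poss:.2f}, p = {p_poss:.2f}")
```

Both Z-values are well inside the usual ±1.96 bounds, in line with the conclusion that the subtypes show little difference in survival.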

We also analyzed the same data set using the estimating equations proposed in Cheng et al. (1995) under the proportional hazards and proportional odds models, and the estimating equation *U*_{A2} under the accelerated failure time model, for traditional right-censored data (without considering length-biased sampling). The results are listed in Table 4 under the columns “Cheng” and “Eq. *U*_{A2}”. Similar to the results from the simulation studies, under the proportional hazards and proportional odds models, the covariate effects could be overestimated using the “naive” approaches under the same model assumptions (Cheng et al., 1995) when the length-biased sampling is ignored. Compared with the results from the naive methods, we found a weaker relationship between subtype of dementia at diagnosis and the elapsed time to death under the bias-adjusted models (smaller *Z*-values), though no method or model yielded a statistically significant difference among the subtypes of dementia. Under the accelerated failure time model, the difference between the estimated intercepts with and without length-biased adjustment indicates that the use of an approach ignoring the length-biased sampling may lead to a substantial overall underestimation of the deleterious effects of dementia. Specifically, when ignoring the length-biased sampling, the estimated median survival duration was 8.88 years (95% CI: 8.43-9.34) for probable Alzheimer's disease, 9.00 years (95% CI: 8.13-9.56) for vascular dementia, and 9.40 years (95% CI: 8.75-10.09) for possible Alzheimer's disease. When adjusting for length-biased sampling, the estimated median survival duration was 3.71 years (95% CI: 2.95-4.67) for probable Alzheimer's disease, 3.44 years (95% CI: 3.02-3.92) for vascular dementia, and 4.22 years (95% CI: 3.49-5.10) for possible Alzheimer's disease.
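To put the size of this discrepancy in perspective, the ratio of the naive to the adjusted median can be computed directly from the six point estimates quoted above. This is plain arithmetic on the reported medians; the dictionary names are ours.

```python
# Estimated median survival (years) quoted in the text.
naive_median = {"probable AD": 8.88, "vascular": 9.00, "possible AD": 9.40}
adjusted_median = {"probable AD": 3.71, "vascular": 3.44, "possible AD": 4.22}

ratio = {k: naive_median[k] / adjusted_median[k] for k in naive_median}
for subtype, r in ratio.items():
    print(f"{subtype}: naive / adjusted = {r:.2f}")
```

For every subtype the naive median is more than twice the adjusted one.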

We have proposed inference methods to evaluate covariate effects on time-to-event data with right censoring under the semiparametric transformation models and the semiparametric accelerated failure time models, when the observed data are subject to length-biased sampling. Under both models, we impose the semiparametric model structure on the population sample, which is often of primary interest, instead of on the length-biased sample. The interpretation of the regression coefficients in the population model is straightforward. Under the transformation models, one can take advantage of the stochastic ordering preserved between length-biased times and unbiased times, and thereby construct the estimating equations using the observed length-biased times. The estimation procedures can be easily implemented and applied to various models in the class of transformation models, which includes important special cases such as the proportional hazards and proportional odds models. In contrast to the pairwise comparison of the data in the estimating equations under the transformation models, the proposed estimating equation approach under the AFT model is based on the least-squares principle. It is appealing for practical applications because the estimator of ** α** has a closed-form solution, which can be obtained easily without an iterative numerical algorithm.
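To make the closed form concrete, assume that **U**_{A} has the weighted least-squares form that appears in the appendix expansion, with a scalar weight *q* and the estimated weight $\widehat{w}$. Because this equation is linear in ** α**, setting it to zero and solving gives

$$\sum_{i=1}^{n} q(\mathit{Z}_i)\,\delta_i\,\mathit{Z}_i\,\frac{\log Y_i - \mathit{Z}_i^{T}\alpha}{\widehat{w}(Y_i)} = 0 \quad\Longrightarrow\quad \widehat{\alpha} = \left\{\sum_{i=1}^{n} \frac{q(\mathit{Z}_i)\,\delta_i\,\mathit{Z}_i\mathit{Z}_i^{T}}{\widehat{w}(Y_i)}\right\}^{-1} \sum_{i=1}^{n} \frac{q(\mathit{Z}_i)\,\delta_i\,\mathit{Z}_i \log Y_i}{\widehat{w}(Y_i)},$$

provided the matrix in braces is invertible. This is a weighted least-squares fit of log *Y* on ** Z** among the uncensored subjects, which is why no iteration is needed.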

When directly using the Kaplan-Meier estimator, *Ŝ _{C}*(

Theoretically, we can find the optimal weight function *q* in the estimating equations to achieve the maximum efficiency for the estimation of regression coefficients. Because the optimal weight function consists of the true value of the regression coefficient vector, the use of such an optimal weight is less feasible. As in the literature (e.g., Cheng et al 1995; Fine 1999), *q* = 1 is a practical and reasonable choice in applications. An alternative approach is to use an initial consistent estimator of regression coefficients for the optimal weight function and calculate a single Newton-like update of the solution to the estimating equation (Gray, 2000).
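In generic terms, such a one-step update can be written as follows. This is a sketch of the standard one-step Newton form under our own notation, not necessarily the exact procedure of Gray (2000): $\widehat{\beta}^{(0)}$ denotes the initial consistent estimator (for example, the solution with *q* = 1) and $\widehat{q}$ denotes the optimal weight evaluated at $\widehat{\beta}^{(0)}$.

$$\widehat{\beta}^{(1)} = \widehat{\beta}^{(0)} - \left\{\left.\frac{\partial \mathit{U}_T(\beta;\widehat{q})}{\partial \beta^{T}}\right|_{\beta=\widehat{\beta}^{(0)}}\right\}^{-1} \mathit{U}_T\!\left(\widehat{\beta}^{(0)};\widehat{q}\right)$$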

One potential challenge with the proposed semiparametric models is the inclusion of time-varying covariates for length-biased data, similar to that for traditional survival data in Cheng et al (1995). It is possible to extend the maximum likelihood approaches of Zeng and Lin (2007a, 2007b), which allow time varying covariates under the transformation models or the AFT models for traditional survival data, to length-biased data. It is of interest to estimate the distribution of the baseline function, thus predicting unbiased covariate-specific survival distribution from the linear transformation model with length-biased data. However, a substantial research effort would be required to establish the large sample properties and computational algorithms in this more complicated setting. Specifically, modern empirical process approaches are required to prove the tightness of the estimated quantities involving *t*. It is our intention in future research to develop methods for the purpose of prediction.

The issues of biased sampling discussed in this work extend beyond length-biased data. The importance of correctly modeling the risk factors in a cohort with *selection bias* has been discussed in recent literature (Begg, 2002), in which Begg pointed out the over-representation of risk factors such as genetic abnormalities for breast cancer in population-based case proband studies, because adjustments are not made under the size-biased sampling scheme. Although this paper and Begg's work share a common concern regarding the potential bias in estimating covariate effects using naive inference methods under biased sampling (length-biased sampling is a special type of biased sampling mechanism), our data structure with a time-to-event outcome is different from the one with a binary outcome under *size-biased* sampling discussed by Begg (2002), and the proposed method for length-biased data with right censoring may not be directly applicable to estimation under size-biased sampling schemes. New methods for correcting the bias caused by size-biased sampling are needed.

This work was supported in part by the U.S. National Institutes of Health. We thank the editor, the associate editor and three referees for their constructive suggestions that improved the article. We also thank Professor Masoud Asgharian and the investigators of the Canadian Study of Health and Aging (CSHA) for providing us with the dementia data from the CSHA. The data reported in the example were collected as part of the CSHA. The core study was funded by the Seniors’ Independence Research Program, through the National Health Research and Development Program of Health Canada (Project no.6606-3954-MC(S)). Additional funding was provided by Pfizer Canada Incorporated through the Medical Research Council/Pharmaceutical Manufacturers Association of Canada Health Activity Program, NHRDP Project 6603-1417-302(R), Bayer Incorporated, and the British Columbia Health Research Foundation Projects 38 (93-2) and 34 (96-1). The study was coordinated through the University of Ottawa and the Division of Aging and Seniors, Health Canada.

Under regularity conditions (a)-(c) and (e), similar to the work of Cheng et al. (1995), we can prove there is a unique solution to the equation * U_{T}*(

$${\int}_{{\mathit{z}}_{1},{\mathit{z}}_{2}}q\left({\mathit{z}}_{12}\right)({\mathit{z}}_{12}^{T}\beta -{\mathit{z}}_{12}^{T}{\beta}_{0})\frac{\{\xi \left({\mathit{z}}_{12}^{T}{\beta}_{0}\right)-\xi \left({\mathit{z}}_{12}^{T}\beta \right)\}}{\mu \left({\mathit{z}}_{1}\right)\mu \left({\mathit{z}}_{2}\right)}dF\left({\mathit{z}}_{1}\right)dF\left({\mathit{z}}_{2}\right),$$

where *F* is the distribution function of ** Z** and

We derive the asymptotic properties of * U_{T}*(

$$\underset{t\le {t}_{0}}{\text{sup}}\mid \widehat{w}\left(t\right)-w\left(t\right)\mid \to 0,$$

because ${\text{sup}}_{t\le {t}_{0}}\mid {\int}_{0}^{t}({\widehat{S}}_{C}\left(u\right)-{S}_{C}\left(u\right))du\mid \le {\text{sup}}_{t\le {t}_{0}}{\int}_{0}^{t}\mid {\widehat{S}}_{C}\left(u\right)-{S}_{C}\left(u\right)\mid du\to 0$ in probability given condition (c). Recall that

$$\frac{\mathit{U}_T(\beta_0)}{n^{3/2}} = \frac{1}{n^{3/2}}\sum_{i,j} q(\mathit{Z}_{ij})\,\mathit{Z}_{ij}\,\delta_i\delta_j\left[\frac{I(Y_j \ge Y_i) - \xi(\mathit{Z}_{ij}^{T}\beta_0)}{\widehat{w}(Y_i)\,\widehat{w}(Y_j)}\right].$$

Note that condition *Pr*(*δ* = 1) > 0 assures the existence of the estimating equation * U_{T}*(

$$\frac{\mathit{U}_T(\beta_0)}{n^{3/2}} = \frac{1}{n^{3/2}}\sum_{i,j} q(\mathit{Z}_{ij})\,\mathit{Z}_{ij}\,\delta_i\delta_j\left[\frac{I(Y_j \ge Y_i) - \xi(\mathit{Z}_{ij}^{T}\beta_0)}{w(Y_i)\,w(Y_j)}\right] \times \left\{\frac{w(Y_i) - \widehat{w}(Y_i)}{w(Y_i)} + \frac{w(Y_j) - \widehat{w}(Y_j)}{w(Y_j)} + 1\right\} + o_p(1). \tag{9}$$

Following from a martingale integral representation for $\sqrt{n}(\widehat{w}\left(t\right)-w\left(t\right))$ by Pepe and Fleming (1989, 1991), we can re-express $\sqrt{n}(\widehat{w}\left(t\right)-w\left(t\right))$ as a martingale integral via integration by parts

$$\begin{aligned}
\sqrt{n}\,(w(Y_i) - \widehat{w}(Y_i)) &= n^{-1/2}\sum_{k=1}^{n}\int_{0}^{Y_i}\left[\int_{t}^{Y_i} S_C(u)\,du\right]\frac{dM_k(t)}{\pi(t)} + o_p(1), \\
\sqrt{n}\,\frac{w(Y_i) - \widehat{w}(Y_i)}{w(Y_i)} &= n^{-1/2}\sum_{k=1}^{n}\int_{0}^{\infty}\frac{h_i(t)}{\pi(t)}\,dM_k(t) + o_p(1),
\end{aligned} \tag{10}$$

where $h_i(t) = I(t \le Y_i)\left[\int_{t}^{Y_i} S_C(u)\,du\right]/w(Y_i)$, $\pi(t) = S_C(t)S_V(t)$, $M_k(t) = I(Y_k - A_k \le t, \Delta_k = 0) - \int_{0}^{t} I(Y_k - A_k \ge u)\,d\Lambda_C(u)$ is the martingale for the residual censoring variable, and Λ* _{C}*(

To simplify the notation, let

$$\begin{array}{c}\hfill {a}_{ij}\left({\beta}_{0}\right)=\frac{{\delta}_{i}{\delta}_{j}[I({Y}_{i}\ge {Y}_{j})-\xi \left({\mathit{Z}}_{ij}^{T}{\beta}_{0}\right)]}{w\left({Y}_{i}\right)w\left({Y}_{j}\right)},\hfill \\ \hfill \mathit{b}\left(t\right)=\underset{n\to \infty}{\text{lim}}\frac{1}{{n}^{2}}\sum _{i<j}^{n}q\left({\mathit{Z}}_{ij}\right){\mathit{Z}}_{ij}{a}_{ij}\left({\beta}_{0}\right)\{{h}_{i}\left(t\right)+{h}_{j}\left(t\right)\}.\hfill \end{array}$$

By inserting (10), *a _{ij}*(

$$\begin{aligned}
n^{-3/2}\,\mathit{U}_T(\beta_0) &= n^{-3/2}\sum_{i<j} q(\mathit{Z}_{ij})\,\mathit{Z}_{ij}\,a_{ij}(\beta_0) + n^{-1/2}\sum_{k=1}^{n}\int_{0}^{\infty}\frac{\mathit{b}(t)}{\pi(t)}\,dM_k(t) + o_p(1) \\
&\equiv \widetilde{\mathit{U}}(\beta_0) + \mathit{V}(\beta_0) + o_p(1),
\end{aligned}$$

where

$$\widetilde{\mathit{U}}(\beta_0) = \sum_{i,j} q(\mathit{Z}_{ij})\,\mathit{Z}_{ij}\,\delta_i\delta_j\left[\frac{I(Y_i \ge Y_j) - \xi(\mathit{Z}_{ij}^{T}\beta_0)}{w(Y_i)\,w(Y_j)}\right]$$

and $\mathit{V}(\beta_0) = n^{-1/2}\sum_{k=1}^{n}\int_{0}^{\infty}\frac{\mathit{b}(t)}{\pi(t)}\,dM_k(t)$. Note that the martingale integral ** V**(

$$n^{-3/2}\,\mathit{U}_T(\beta_0) = n^{-3/2}\sum_{i<j}\left\{q(\mathit{Z}_{ij})\,\mathit{Z}_{ij}\,a_{ij}(\beta_0) + \int_{0}^{\infty}\frac{\mathit{b}(t)}{\pi(t)}\,[dM_i(t) + dM_j(t)]\right\} + o_p(1).$$

Given the above multivariate U-statistic representation, weak convergence of *n*^{−3/2}* U_{T}*(

By Taylor series expansion,

$$\frac{1}{\sqrt{n}}{\mathit{U}}_{A}\left(\widehat{\alpha}\right)=\frac{1}{\sqrt{n}}{\mathit{U}}_{A}\left({\alpha}_{0}\right)-\frac{1}{n}{\Gamma}_{n}\left({\alpha}_{0}\right)\sqrt{n}(\widehat{\alpha}-{\alpha}_{0})+{o}_{p}\left(1\right),$$

where **Γ*** _{n}*(

$$\frac{1}{\sqrt{n}}\,\mathit{U}_A(\alpha_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} q(\mathit{Z}_i)\,\delta_i\,\mathit{Z}_i\,\frac{(\log Y_i - \mathit{Z}_i^{T}\alpha_0)}{w(Y_i)}\left\{1 + \frac{w(Y_i) - \widehat{w}(Y_i)}{w(Y_i)}\right\} + o_p(1). \tag{11}$$

Define $\mathit{D}(t) = E\left[\left\{q(\mathit{Z})\,\delta\,\mathit{Z}\,I(Y \ge t)\int_{t}^{Y} S_C(u)\,du\,(\log Y - \mathit{Z}^{T}\alpha_0)\right\}/\left\{w^{2}(Y)\right\}\right]$. Then by inserting (10) into (11), we have

$$\frac{1}{\sqrt{n}}\,\mathit{U}_A(\alpha_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\left\{q(\mathit{Z}_i)\,\delta_i\,\mathit{Z}_i\,\frac{(\log Y_i - \mathit{Z}_i^{T}\alpha_0)}{w(Y_i)} + \int_{0}^{\infty}\frac{\mathit{D}(t)\,dM_i(t)}{\pi(t)}\right\} + o_p(1).$$

Hence, under regularity conditions (c)-(e), *n*^{−1/2}* U_{A}*(

Yu Shen, Department of Biostatistics M. D. Anderson Cancer Center The University of Texas, Houston, TX 77030 ; Email: gro.nosrednadm@nehsy Phone: 713-794-4159, Fax: 713-563-4242.

Jing Ning, Department of Biostatistics M. D. Anderson Cancer Center The University of Texas, Houston, TX 77030.

Jing Qin, Biostatistics Research Branch National Institute of Allergy and Infectious Diseases Bethesda, MD 20892.

- Addona V, Wolfson DB. A formal test for the stationarity of the incidence rate using data from a prevalent cohort study with follow-up. Lifetime Data Anal. 2006;12:267–284.
- Asgharian M. Comment on “Goodness-of-fit tests for parametric models based on biased samples” (2002V30 p475-490) The Canadian J. Statist. 2003;31:349–350.
- Asgharian M, M'lan CM, Wolfson DB. Length-biased sampling with right censoring: an unconditional approach. J Am. Statist. Assoc. 2002;97:201–209.
- Asgharian M, Wolfson DB. Asymptotic behavior of the unconditional npmle of the length-biased survivor function from right censored prevalent cohort data. Ann. Statist. 2005;33:2109–2131.
- Asgharian M, Wolfson DB, Zhang X. Checking stationarity of the incidence rate using prevalent cohort survival data. Statist. Med. 2006;25:1751–1767.
- Begg CB. On the use of familial aggregation in population-based case probands for calculating penetrance (with editorial) J Natl Cancer Inst. 2002;94:1221–1226.
- Buckley J, James I. Linear Regression with Censored Data. Biometrika. 1979;66:429–36.
- Cheng SC, Wei LJ, Ying Z. Analysis of transformation models with censored data. Biometrika. 1995;82:835–845.
- Cox DR, Miller HD. The theory of stochastic processes. Chapman and Hall Ltd; London; New York: 1977.
- Cox DR, Oakes D. Analysis of Survival Data. Chapman and Hall; London: 1984.
- Dabrowska DM, Doksum KA. Estimation and testing in a two-sample generalized odds-rate model. J Am. Statist. Assoc. 1988;83:744–749.
- Fine JP. Analysing competing risks data with transformation models. J. R. Statist. Soc. B. 1999;61:817–830.
- Gill RD. Censoring and Stochastic Integrals. Mathematisch Centrum; Amsterdam: 1982. Mathematical Centre Tract No. 124.
- Gill RD, Vardi Y, Wellner JA. Large sample theory of empirical distributions in biased sampling models. Ann Statist. 1988;16:1069–1112.
- Gray RJ. Estimation of regression parameters and the hazard function in transformed linear survival models. Biometrics. 2000;56:571–576.
- Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. Wiley; New York: 2002.
- Lagakos SW, Barraj LM, De Gruttola V. Nonparametric analysis of truncated survival data, with application to AIDS. Biometrika. 1988;75:515–523.
- Lai TL, Ying Z. Large Sample Theory of a Modified Buckley-James Estimator for Regression Analysis with Censored Data. Annals of Statistics. 1991;19:1370–1402.
- Lin DY, Ying Z. Semiparametric Inference for the Accelerated Life Model with Time-dependent Covariates. Journal of Statistical Planning and Inference. 1995;44:47–63.
- Pepe M, Fleming TR. Weighted Kaplan-Meier statistics: A class of distance tests for censored survival data. Biometrics. 1989;45:497–507.
- Pepe M, Fleming TR. Weighted Kaplan-Meier statistics: Large sample and optimality considerations. J. R. Statist. Soc. B. 1991;53:341–352.
- Prentice RL. Linear Rank Tests with Right Censored Data. Biometrika. 1978;65:167–79.
- Ritov Y. Estimation in a Linear Regression Model with Censored Data. Annals of Statistics. 1990;18:303–28.
- Serfling RJ. Approximation theorems of mathematical statistics. John Wiley & Sons; New York; Chichester: 1981.
- Sansgiry P, Akman O. Transformations of the lognormal distribution as a selection model. American Statist. 2000;54:307–309.
- Simon R. Length-biased sampling in etiologic studies. Am J Epidemiol. 1980;111:444–452.
- Tsiatis AA. Estimating Regression Parameters Using Linear Rank Tests for Censored Data. Annals of Statistics. 1990;18:354–372.
- Turnbull BW. The empirical distribution function with arbitrarily grouped, censored and truncated data. J. R. Statist. Soc. B. 1976;38:290–295.
- Vardi Y. Nonparametric estimation in the presence of length bias. Ann. Statist. 1982;10:616–620.
- Vardi Y. Empirical distributions in selection bias models (Com: p204-205) Ann. Statist. 1985;13:178–203.
- Vardi Y. Multiplicative censoring, renewal processes, deconvolution and decreasing density: Nonparametric estimation. Biometrika. 1989;76:751–761.
- Wang MC. Nonparametric estimation from cross-sectional survival data. J Am. Statist. Assoc. 1991;86:130–143.
- Wang MC. Hazard regression analysis for length-biased data. Biometrika. 1996;83:343–354.
- Wei LJ, Johnson WE. Combining dependent tests with incomplete repeated measurements. Biometrika. 1985;72:359–64.
- Wolfson C, Wolfson DB, Asgharian M, M'Lan CE, Ostbye T, Rockwood K, Hogan DB, the Clinical Progression of Dementia Study Group. A reevaluation of the duration of survival after the onset of dementia. New Engl. J. Med. 2001;344:1111–1116.
- Zelen M, Feinleib M. On the theory of screening for chronic diseases. Biometrika. 1969;56:601–614.
- Zelen M. Forward and backward recurrence times and length biased sampling: Age specific models. Lifetime Data Anal. 2004;10:325–34.
- Zeng D, Lin DY. Maximum likelihood estimation in semiparametric regression models with censored data. J. R. Statist. Soc. B. 2007a;69:507–564.
- Zeng D, Lin DY. Efficient Estimation for the Accelerated Failure Time Model. J Am. Statist. Assoc. 2007b;102:1387–1396.
