Stat Med. Author manuscript; available in PMC 2016 July 20.

Published in final edited form as:

PMCID: PMC4457675

NIHMSID: NIHMS674633

Tianle Chen, Biogen Idec, Cambridge, MA;

Tianle Chen: tianle.chen@biogenidec.com; Yanyuan Ma: ma@stat.tamu.edu; Yuanjia Wang: yuanjia.wang@columbia.edu

We propose a simple approach to predicting the cumulative risk of disease that accommodates predictors with time-varying effects and outcomes subject to censoring. We use a nonparametric function for the coefficient of the time-varying effect and handle censoring through self-consistency equations that redistribute the probability mass of censored outcomes to the right. The computational procedure is convenient and can be implemented with standard software. We prove large sample properties of the proposed estimator and evaluate its finite sample performance through simulation studies. We apply the method to estimate the cumulative risk of developing Huntington’s disease (HD) in subjects with a huntingtin gene mutation using data from a large collaborative HD study, and illustrate an inverse relationship between the age-at-onset of HD and the length of cytosine-adenine-guanine (CAG) repeats in the huntingtin gene.

In many biomedical studies, the research goal is to predict the age-specific cumulative risk of onset of a disease from a set of covariates. For example, Huntington’s disease (HD) is a progressive neurodegenerative disorder caused by the expansion of cytosine-adenine-guanine (CAG) trinucleotide repeats in the huntingtin gene [1]. The genetic model for HD is dominant [2], and there is an inverse relationship between the age-at-onset of HD and the CAG repeats length: the greater the CAG expansion, the earlier the age-at-onset of the disease. Accurate prediction of a subject’s age-at-onset of HD from CAG repeats and other covariates is useful to assess an individual’s risk of developing HD based on available genetic mutation testing results when providing genetic counseling. Such estimates are also useful when designing a clinical trial. For instance, estimating the age-at-onset distribution of HD from a subject’s CAG repeats and other baseline information can be used to recruit patients who are close to the onset of disease to improve efficiency of a therapeutic trial. Improving existing estimation of HD risk is one of the research goals in the Cooperative Huntington’s Observational Research Trial (COHORT) study which includes 42 sites [3, 4].

Age-at-onset of disease information is usually subject to right censoring due to termination of the study, patient loss to follow-up, or death of a subject. Popular regression models for censored outcomes include the Cox proportional hazards model [5], which relates the hazard rate at a particular age to covariates. Although it is possible to obtain the cumulative risk function in this model, the regression coefficients of the covariates are interpreted through the hazard function. Instead of working through a hazard function, a more appealing approach is to model the cumulative disease risk directly, since predicting disease onset is our primary goal and the hazard function is not of interest. Another motivation for avoiding the proportional hazards model and working with the cumulative distribution function directly is that the proportional hazards assumption may not be satisfied in certain applications. For example, Langbehn et al. [6] reported that the proportional hazards assumption does not hold with HD data, and proposed a parametric model for the cumulative risk function involving six parameters through a logistic transformation of pr(*T*_{i} ≤ *t* | *X*_{i}):

$$\text{logit}\{\text{pr}({T}_{i}\le t\mid {X}_{i})\}=\{t-\mu ({X}_{i};\alpha )\}/s({X}_{i};\gamma ),$$

where *μ*(*X _{i}*,

In this work, we consider a varying-coefficient proportional odds model

$$\text{logit}\{\text{pr}({T}_{i}\le t\mid {X}_{i})\}={\beta}_{0}(t)+{\beta}_{1}(t){X}_{i}.$$

(1)
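As a small numerical illustration of model (1), the sketch below evaluates the cumulative risk at a single age and the odds ratio for a one-unit difference in the covariate. The function names and the coefficient values are hypothetical and chosen only for illustration; they are not estimates from this work.

```python
import math

def cumulative_risk(beta0_t, beta1_t, x):
    """pr(T <= t | X = x) under model (1), given beta_0(t) and beta_1(t) at one age t."""
    eta = beta0_t + beta1_t * x
    return 1.0 / (1.0 + math.exp(-eta))

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)

# Hypothetical coefficient values at a single age t (not taken from the paper).
beta0_t, beta1_t = 0.0, math.log(2.0)

risk_x0 = cumulative_risk(beta0_t, beta1_t, x=0.0)
risk_x1 = cumulative_risk(beta0_t, beta1_t, x=1.0)

# The odds ratio for a one-unit difference in x equals exp(beta1_t).
odds_ratio = odds(risk_x1) / odds(risk_x0)
```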

To provide flexibility and protect against misspecification, *β*_{0}(*t*) and *β*_{1}(*t*) are left as unknown nonparametric functions. The interpretation of *β*_{1}(*t*) is then directly related to the cumulative risk of disease, since exp{*β*_{1}(*t*)} is the odds ratio of experiencing disease onset by age *t* for subjects with one unit difference in *X*. Since pr(*T _{i}* ≤

An extension to model (1) is a nonparametric varying-coefficient model of the cumulative risk using a logistic link

$$\text{logit}\{\text{pr}({T}_{i}\le t\mid {X}_{i})\}={\beta}_{0}(t)+{c}_{0}({X}_{i})+{\beta}_{1}(t){c}_{1}({X}_{i}),$$

(2)

where *c*_{0}(*x*) and *c*_{1}(*x*) are known parametric functions of covariates. Note that when *c*_{1}(*x*) = 1/*s*(*x*, *γ*), *c*_{0}(*x*) = −*μ*(*x*, *α*)/*s*(*x*, *γ*), *β*_{0}(*t*) = 0, and *β*_{1}(*t*) = *t*, model (2) reduces to that in Langbehn et al. [6].
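The reduction can be checked by direct substitution of these choices into the right-hand side of model (2):

$$\beta_{0}(t)+c_{0}(X_{i})+\beta_{1}(t)c_{1}(X_{i})=0-\frac{\mu (X_{i};\alpha )}{s(X_{i};\gamma )}+\frac{t}{s(X_{i};\gamma )}=\frac{t-\mu (X_{i};\alpha )}{s(X_{i};\gamma )},$$

which matches the right-hand side of the model of Langbehn et al. [6].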

In the literature, Jung [9] directly modeled the survival function using a regression model at a fixed time point without considering temporal effects. There are a number of other works on extending the proportional hazards or proportional odds model to account for temporal covariate effects or time-varying covariates. Peng and Huang [10] proposed an alternative extension of the Cox proportional hazards model to account for a nonparametric temporal effect of a covariate. The procedure involves solving a series of estimating equations sequentially. In contrast, our method is proposed for a proportional odds model with a nonparametric time-varying effect. Chen et al. [11] proposed methods to extend the transformation models considered in, for example, Zeng and Lin [12], to account for external time-varying covariates.

Here, we take a completely different approach that does not involve counting process and with straightforward and simple computational algorithm. When there is no censoring, to estimate the cumulative risk function at a time point *t*_{0} given a covariate, e.g., pr(*T _{i}* ≤

In this work, to estimate ** β**(

For the purpose of illustration, we mainly focus on the varying coefficient model (1). Extension to the more general model (2) is discussed in Section 4.

First we investigate estimation at a fixed time point *t*_{0} when the outcome is not subject to censoring. Let ** β**(

$$\sum _{i=1}^{n}m({X}_{i},{T}_{i};{t}_{0},{\mathit{\beta}}_{{t}_{0}})=0,$$

where *m*(*X _{i}, T_{i}*,

$$\varphi ({X}_{i},{T}_{i};{t}_{0},{\mathit{\beta}}_{{t}_{0}})=A({X}_{i};{\mathit{\beta}}_{{t}_{0}})\phantom{\rule{0.16667em}{0ex}}\{I({T}_{i}\le {t}_{0})-\mu ({X}_{i};{\mathit{\beta}}_{{t}_{0}})\}\phantom{\rule{0.16667em}{0ex}}{Z}_{i},$$

where
$A({X}_{i};{\mathit{\beta}}_{{t}_{0}})={\left(E[\mu ({X}_{i};{\mathit{\beta}}_{{t}_{0}})\{1-\mu ({X}_{i};{\mathit{\beta}}_{{t}_{0}})\}{Z}_{i}{Z}_{i}^{T}]\right)}^{-1}$. We fit a logistic regression of *I*(*T _{i}* ≤

When a subject is right censored (i.e., *T _{i}* >

$${S}_{n}(\mathit{\beta})={n}^{-1}\sum _{i=1}^{n}\frac{I({T}_{i}\le {C}_{i})m({X}_{i},{T}_{i};{t}_{0},\mathit{\beta})}{G({T}_{i})}=0,$$

where *G*(·) is the survival function for the censoring times *C _{i}*. Estimating

$$\sum _{i=1}^{n}\frac{I({T}_{i}\le {C}_{i})m({X}_{i},{T}_{i};{t}_{0},\mathit{\beta})}{\widehat{G}({T}_{i})}=0.$$

(3)

This process is repeated for *t*_{0} on a grid (*u*_{1}*,* ···*, u _{M}*). Alternatively, one can let the grid points include only uncensored observations, which is equivalent to creating the grid.
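The IPW step in (3) can be sketched as a weighted logistic regression at a fixed *t*_{0}. The sketch below is an illustration under stated assumptions rather than the authors' code: it takes *m* to be the logistic score {*I*(*T* ≤ *t*_{0}) − *μ*}*Z* with *Z* = (1, *X*)^T, estimates *G* by Kaplan-Meier assuming no tied observed times, and evaluates it just before each observed time. The names `weighted_logistic`, `censoring_survival_left`, and `ipw_fit` are hypothetical.

```python
import numpy as np

def weighted_logistic(Z, y, w, n_iter=50, tol=1e-10):
    """Solve sum_i w_i {y_i - mu_i(beta)} Z_i = 0 by Newton-Raphson."""
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(Z @ beta)))
        score = Z.T @ (w * (y - mu))
        info = (Z * (w * mu * (1.0 - mu))[:, None]).T @ Z
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def censoring_survival_left(y_obs, delta):
    """Kaplan-Meier estimate of G(t-) = pr(C >= t) at each observed time.

    Assumes no tied observed times (an assumption of this sketch).
    """
    order = np.argsort(y_obs)
    censored = 1.0 - delta[order]            # censoring is the "event" here
    n = len(y_obs)
    at_risk = n - np.arange(n)               # risk set size at each sorted time
    surv = np.cumprod(1.0 - censored / at_risk)
    surv_left = np.concatenate(([1.0], surv[:-1]))   # left limits G(t-)
    out = np.empty(n)
    out[order] = surv_left
    return out

def ipw_fit(x, y_obs, delta, t0):
    """IPW estimate of (beta_0(t0), beta_1(t0)) from observed times and indicators."""
    Z = np.column_stack([np.ones_like(x), x])
    G = censoring_survival_left(y_obs, delta)
    w = delta / G                            # censored subjects get weight 0
    y = (y_obs <= t0).astype(float)          # only matters where delta_i = 1
    return weighted_logistic(Z, y, w)

# Toy data (made up for illustration): binary covariate, no censoring.
x = np.array([0., 0., 0., 0., 1., 1., 1., 1.])
y_obs = np.array([1.0, 3.1, 3.2, 3.3, 1.1, 1.2, 1.3, 3.4])
delta = np.ones(8)
beta_t0 = ipw_fit(x, y_obs, delta, t0=2.0)
```

Calling `ipw_fit` over a grid of *t*_{0} values gives the pointwise estimates described above. With no censoring, as in the toy data, the weights are all one and the fit reduces to an ordinary logistic regression.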

Here we propose a new type of estimator that re-distributes weights to the right for ambiguous subjects based on self-consistency equations similar to Efron [13] and Wang and Wang [15]. Let *O _{i}* = {

$${S}_{n}({\mathit{\beta}}_{{t}_{0}},\mathit{\beta})={n}^{-1}\sum _{i=1}^{n}s({O}_{i};{t}_{0},{\mathit{\beta}}_{{t}_{0}},\mathit{\beta})=0,$$

(4)

where $s({O}_{i};{t}_{0},{\mathit{\beta}}_{{t}_{0}},\mathit{\beta})=w\{{O}_{i};{t}_{0},\mathit{\beta}(\cdot)\}m({X}_{i},{T}_{i}\wedge {C}_{i};{t}_{0},{\mathit{\beta}}_{{t}_{0}})+[1-w\{{O}_{i};{t}_{0},\mathit{\beta}(\cdot)\}]m({X}_{i},+\infty ;{t}_{0},{\mathit{\beta}}_{{t}_{0}}),$ and

$$w\{{O}_{i};{t}_{0},\mathit{\beta}(\cdot)\}=\left\{\begin{array}{ll}1,\hfill & {\mathrm{\Delta}}_{i}=1\phantom{\rule{0.16667em}{0ex}}\text{or}\phantom{\rule{0.16667em}{0ex}}({\mathrm{\Delta}}_{i}=0\phantom{\rule{0.16667em}{0ex}}\text{and}\phantom{\rule{0.16667em}{0ex}}{C}_{i}\ge {t}_{0})\hfill \\ \frac{F({t}_{0}\mid {X}_{i})-F({C}_{i}\mid {X}_{i})}{1-F({C}_{i}\mid {X}_{i})},\hfill & {\mathrm{\Delta}}_{i}=0\phantom{\rule{0.16667em}{0ex}}\text{and}\phantom{\rule{0.16667em}{0ex}}{C}_{i}<{t}_{0}.\hfill \end{array}\right.$$

(5)

Here *F*(*t*|*x*) = *μ*{*x*, ** β**(

To gain insights on the weights, note that subjects with observed *I*(*T _{i}* ≤

$$E\{I({T}_{i}\le {t}_{0})\mid {T}_{i}>{C}_{i},{X}_{i}\}=\frac{F({t}_{0}\mid {X}_{i})-F({C}_{i}\mid {X}_{i})}{1-F({C}_{i}\mid {X}_{i})}.$$
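This identity follows from the definition of conditional probability. As a sketch, conditioning also on the censoring time and assuming (for this illustration) that *T*_{i} and *C*_{i} are independent given *X*_{i}, for *C*_{i} < *t*_{0} we have

$$E\{I({T}_{i}\le {t}_{0})\mid {T}_{i}>{C}_{i},{C}_{i},{X}_{i}\}=\frac{\text{pr}({C}_{i}<{T}_{i}\le {t}_{0}\mid {C}_{i},{X}_{i})}{\text{pr}({T}_{i}>{C}_{i}\mid {C}_{i},{X}_{i})}=\frac{F({t}_{0}\mid {X}_{i})-F({C}_{i}\mid {X}_{i})}{1-F({C}_{i}\mid {X}_{i})}.$$

For example, with the made-up values *F*(*C*_{i}|*X*_{i}) = 0.2 and *F*(*t*_{0}|*X*_{i}) = 0.6, the weight is (0.6 − 0.2)/(1 − 0.2) = 0.5.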

Treating (*X _{i}, C_{i}*) as pseudo-observations for censored subjects with censoring time less than

In practice, the weights *w*{*O _{i}*,

$${S}_{n}(O;{t}_{0},{\mathit{\beta}}_{{t}_{0}},\stackrel{\sim}{\mathit{\beta}}(\cdot))={n}^{-1}\sum _{i=1}^{n}s({O}_{i};{t}_{0},{\mathit{\beta}}_{{t}_{0}},\stackrel{\sim}{\mathit{\beta}}(\cdot))=0.$$

(6)

It is extremely easy to implement this weighting scheme. Without loss of generality, assume the first *n*_{0} subjects have unobserved outcomes *I*(*T _{i}* ≤

$$[w\{{O}_{1};{t}_{0},\stackrel{\sim}{\mathit{\beta}}(\cdot)\},\cdots ,w\{{O}_{n};{t}_{0},\stackrel{\sim}{\mathit{\beta}}(\cdot)\},1-w\{{O}_{1};{t}_{0},\stackrel{\sim}{\mathit{\beta}}(\cdot)\},\cdots ,1-w\{{O}_{{n}_{0}};{t}_{0},\stackrel{\sim}{\mathit{\beta}}(\cdot)\}].$$

Then ${\mathit{\beta}}_{{t}_{0}}$ is estimated by a weighted logistic regression. The weights $w\{O;{t}_{0},\stackrel{\sim}{\mathit{\beta}}(\cdot)\}$ extract information at multiple time points simultaneously, and thus pool information across time points to estimate the distribution function at *t*_{0}.
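The augmented-data construction can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: *m* is taken to be the logistic score with *Z* = (1, *X*)^T, and the initial estimate enters through a user-supplied function `F_init(t, x)` that returns the estimated pr(*T* ≤ *t* | *X* = *x*). The names `weighted_logistic`, `rew_weights`, `rew_fit`, and `F_init` are hypothetical.

```python
import numpy as np

def weighted_logistic(Z, y, w, n_iter=50, tol=1e-10):
    """Solve sum_i w_i {y_i - mu_i(beta)} Z_i = 0 by Newton-Raphson."""
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(Z @ beta)))
        score = Z.T @ (w * (y - mu))
        info = (Z * (w * mu * (1.0 - mu))[:, None]).T @ Z
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def rew_weights(x, y_obs, delta, t0, F_init):
    """Weights as in (5): 1 unless the subject is censored before t0."""
    ambiguous = (delta == 0) & (y_obs < t0)
    F_c = F_init(y_obs, x)                       # F(C_i | X_i) for censored subjects
    F_t0 = F_init(np.full_like(y_obs, t0), x)    # F(t0 | X_i)
    w = np.ones_like(y_obs, dtype=float)
    w[ambiguous] = (F_t0[ambiguous] - F_c[ambiguous]) / (1.0 - F_c[ambiguous])
    return w, ambiguous

def rew_fit(x, y_obs, delta, t0, F_init):
    """Weighted logistic regression on the augmented data set."""
    w, ambiguous = rew_weights(x, y_obs, delta, t0, F_init)
    # First n rows: observed outcome, or outcome 1 with weight w if ambiguous.
    y_main = (((delta == 1) & (y_obs <= t0)) | ambiguous).astype(float)
    # Extra n0 rows: ambiguous subjects again, outcome 0 with weight 1 - w.
    x_aug = np.concatenate([x, x[ambiguous]])
    y_aug = np.concatenate([y_main, np.zeros(ambiguous.sum())])
    w_aug = np.concatenate([w, 1.0 - w[ambiguous]])
    Z_aug = np.column_stack([np.ones_like(x_aug), x_aug])
    return weighted_logistic(Z_aug, y_aug, w_aug)

# Toy data (made up for illustration) with a binary covariate.
x = np.array([0., 0., 0., 0., 1., 1., 1., 1.])
y_obs = np.array([1.0, 3.0, 1.0, 3.0, 1.0, 1.5, 3.0, 1.0])
delta = np.array([1, 1, 0, 0, 1, 1, 1, 0])
F_init = lambda t, x: 1.0 - np.exp(-t)   # hypothetical initial CDF, free of x
beta_t0 = rew_fit(x, y_obs, delta, t0=2.0, F_init=F_init)
```

In practice `F_init` could be built from the initial fits on the grid; how the fitted values are interpolated between grid points is a design choice that this sketch leaves open.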

To show consistency and asymptotic normality of the estimator of ***β***(*t*) at fixed *t* obtained from (6), we will need the following technical conditions:

- A1 Assume that ***β***(*t*) is right continuous with left-hand limits (cadlag) componentwise.
- A2 Assume that *t* ∈ [*a*, *b*] with *b* < ∞, and that there exist subjects with *P*(min(*C*_{i}, *T*_{i}) > *b*) > 0. Also assume ***β***(*t*) is uniformly bounded on [*a*, *b*] componentwise, that is, sup_{*t*∈[*a*,*b*]} |***β***(*t*)| ≤ *c* < ∞ componentwise.
- A3 Assume that the covariates *X*_{i} are not degenerate, i.e., pr(*X*_{i} = *x*_{0}) ≠ 1, and are bounded in probability, i.e., pr(|*X*_{i}| < *c*) = 1.
- A4 Assume that the censoring times are bounded, i.e., pr(*C*_{i} < *c*) = 1.
- A5 Assume that $E({Z}_{i}{Z}_{i}^{T}\exp\{{Z}_{i}^{T}\mathit{\beta}(t)\}/{[1+\exp\{{Z}_{i}^{T}\mathit{\beta}(t)\}]}^{2})$ is positive definite.

The conditions A1–A2 control the size of the parameter space. The condition A2 states that one can only estimate distribution function in the time range where there are still subjects with positive probability of being at risk. The conditions A3–A4 exclude some degenerate cases. The condition A5 ensures a unique solution to the estimating equation. For the simplicity of notation we let ** θ** =

Assume that {*O _{i}*,

The proof of this theorem uses the semiparametric asymptotic results developed in Newey [18] and Chen et al. [19].

Since the final estimator involves estimates **(·) in the entire range of ***T _{i}*, uniform consistency of the initial estimator is required. The next theorem establishes the asymptotic normality of

Under the assumptions of Theorem 1, as *n* → ∞,

$$\sqrt{n}(\widehat{\mathit{\theta}}-\mathit{\theta})\to N(0,{A}^{-1}V{A}^{-1})$$

in distribution, where
$A=E[\mu ({X}_{i};\mathit{\theta})\{1-\mu ({X}_{i};\mathit{\theta})\}{Z}_{i}{Z}_{i}^{T}]$, *V* = *cov*{*s*(*O _{i}*,

$$\xi ({T}_{i};{t}_{0},\mathit{\theta},\mathit{\beta})={\int}_{0}^{{t}_{0}}g(u)\int h(x){zz}^{T}[F({t}_{0}\mid x)\{1-F({t}_{0}\mid x)\}\psi (x,{T}_{i};{t}_{0},\mathit{\theta})-F(u\mid x)\{1-F({t}_{0}\mid x)\}\psi \{x,{T}_{i};u,\mathit{\beta}(u)\}]\phantom{\rule{0.16667em}{0ex}}\mathit{dxdu},$$

*g*(*u*) is the density function for *C _{i}*,

The proof of this theorem is in the appendix and it also uses the results in Newey [18].

In this section, we provide Monte Carlo results on simulation experiments and application of the method to a real world study.

To study the finite sample performance of the proposed estimator, we ran two sets of simulation studies. In each set, the true survival times were generated from the model (1) with *β*_{0}(*t*) = *β*_{00} + *β*_{01}log(*t*), *β*_{1}(*t*) = *β*_{10} + *β*_{11}log(*t*) and (*β*_{00}*, β*_{01}*, β*_{10}*, β*_{11})* ^{T}* = (−80, 21.5,−1.4, 0.7)

Table 1. Mean estimated CDFs by IPW and REW estimators, their mean estimated SEs, empirical SEs, empirical MSEs, and 95% coverages (all in 100 fold scale). *n* = 1000, 400 simulations, CAG=42 and 46.

Table 2. Mean estimated CDFs by IPW and REW estimators, their mean estimated SEs, empirical SEs, empirical MSEs, and 95% coverages (all in 100 fold scale). *n* = 2000, 400 simulations, CAG=42 and 46.

We compared two types of estimators. The first is the initial inverse probability weighted (IPW) estimator of ***β***(*t*) from (3) and the second is the proposed redistribution-to-the-right weighted (REW) estimator from (6). Since the theoretical variance estimator involves integrations and unknown quantities that are difficult to compute, we used the bootstrap to obtain the estimated standard errors of the two estimators in each simulation repetition. The empirical standard errors based on the 400 repetitions and the empirical MSEs are also summarized in tables 1 and 2, where we present the estimated distribution functions obtained from the two estimators at various ages and CAG values (42 and 46). Both the IPW and REW estimators have small finite sample biases. The mean estimated standard errors and empirical standard errors are close to each other over most of the age range, with some exceptions in the extreme tail area. The empirical standard error of REW is smaller than that of IPW, especially at older ages. For example, the efficiency gain of REW over IPW is 10% at age 50 for CAG=42 and *n* = 1000. The coverage of the 95% confidence interval is close to the nominal level when age is below 60 for both IPW and REW. At age 60, since censoring is heavier, the coverages of both IPW and REW are lower than the nominal level, with REW performing slightly better. The mean estimated standard error of the IPW estimator at age 55 and CAG=42 in table 1 is very large because the IPW estimator has unstable weights in that age range due to a turning point in the distribution function. This is a limitation of the IPW estimator, and the increased variance when the weights are unstable has been reported in the literature [20]. The issue is alleviated when the sample size is increased, as seen in table 2. In addition, the proposed estimator does not suffer from the high variability of the IPW initial estimator.
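The bootstrap standard errors can be sketched generically. The sketch assumes a nonparametric bootstrap that resamples subjects with replacement; the text does not specify the resampling scheme or the number of replicates, so both are assumptions here. The names `bootstrap_se` and `prop_by_2` are hypothetical, and the placeholder estimator stands in for an IPW or REW fit.

```python
import numpy as np

def bootstrap_se(data, estimator, n_boot=200, seed=0):
    """Standard error of estimator(data) from resampling rows with replacement."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)       # resample subject indices
        estimates.append(estimator(data[idx]))
    return np.std(np.asarray(estimates), axis=0, ddof=1)

# Toy data (made up): columns are covariate, observed time, event indicator.
data = np.array([
    [0., 1.0, 1.],
    [0., 3.1, 1.],
    [1., 1.1, 1.],
    [1., 3.4, 0.],
    [1., 1.2, 1.],
    [0., 2.5, 0.],
])

# Placeholder estimator: proportion with an observed event by age 2.
prop_by_2 = lambda d: np.mean((d[:, 1] <= 2.0) & (d[:, 2] == 1))
se = bootstrap_se(data, prop_by_2)
```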

We present the true and the mean estimated cumulative distribution functions (CDFs) obtained from the REW estimator and their empirical 95% CI at various CAGs in figure 1. The estimated curves coincide with the true curves in most cases. When CAG=42 and *n* = 1000, there appears to be a small bias in the tail area, for example, at *t* = 65 (bias=0.0051, empirical SE=0.0015). However, this bias is within the variability range, which may be explained by the higher censoring rate within this range for subjects with CAG=42 (about 45%). When we increased the sample size to *n* = 2000, the bias decreased to almost zero.

Figure 1. True and REW CDF curves evaluated at CAG=50, 48, 46, 44, 42 (left to right). The true and mean estimated curves are indistinguishable for most cases. *n* = 1000 (top) and *n* = 2000 (bottom), 400 simulations.

In simulation setting B (table 3), we kept the same setting as in A, but increased the censoring rate to 45% and the number of simulations to 2000. Due to the computational burden, we did not run the bootstrap in each simulation repetition, so the mean estimated SEs and the coverage probabilities are not available. Only the empirical SEs and MSEs are reported in table 3. The results are similar to those in tables 1 and 2: the empirical SEs and MSEs of REW are consistently smaller than those of IPW.

Table 3. Mean estimated CDFs by IPW and REW estimators, their empirical SEs, and empirical MSEs (all in 100 fold scale). *n* = 1000 or 2000, 2000 simulations, CAG=42 and 46.

In addition to the above estimators, we also investigated a smoothed REW estimator, where the estimates of ***β***(*t*) were smoothed across the range of *t* subject to a monotone constraint using a Generalized Pooled-Adjacent-Violators Algorithm [21]. The mean estimated cumulative distribution functions and empirical standard errors are almost identical to those of the non-smoothed estimator. The maximum absolute difference in the mean of the two estimators averaged across simulations was very small. Therefore we omit the results of the smoothed estimator here.
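A monotone constraint of this kind can be illustrated with the basic pool-adjacent-violators algorithm. The sketch below is a simplified stand-in, not the generalized algorithm of [21]: it is applied directly to estimated CDF values on a time grid rather than to the coefficient estimates, and the function name `pava_increasing` and the input values are hypothetical.

```python
def pava_increasing(values, weights=None):
    """Weighted least-squares fit of a nondecreasing sequence to `values`."""
    if weights is None:
        weights = [1.0] * len(values)
    blocks = []   # each block: [weighted mean, total weight, number of points]
    for v, w in zip(values, weights):
        blocks.append([float(v), float(w), 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, n1 + n2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted

# Made-up pointwise CDF estimates on an increasing age grid.
cdf_raw = [0.10, 0.30, 0.20, 0.40]
cdf_monotone = pava_increasing(cdf_raw)   # the violating pair is pooled to its mean
```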

As introduced in Section 1, despite identification of the causative gene for HD, there is currently no effective treatment that delays HD onset or stops disease progression. To improve the care of HD patients and inform the development of effective treatment, a large genetic epidemiological study on HD, the Cooperative Huntington’s Observational Research Trial (COHORT), was started in 1996. This study is organized by 42 Huntington Study Group research centers in North America and Australia [3, 4]. Participants in COHORT underwent a clinical evaluation in which blood samples were genotyped for the huntingtin gene mutation and their CAG repeats lengths were obtained. Modeling the inverse association between the CAG repeats length and age-at-onset of HD accurately is important.

In this section, we fit the COHORT data by the model (1) where we do not assume a parametric form of *β*_{0}(*t*) or *β*_{1}(*t*) and the censoring distribution *G*(·). In our analysis, information on CAG repeats length, age at the time of evaluation, and age at diagnosis of HD onset (if a subject had been diagnosed) were available for 1151 subjects recruited in COHORT. In the study, both HD affected carriers and pre-symptomatic carriers (24%) were included. Their ages-at-first-motor-symptom were also recorded. Among 1151 subjects, 876 (76%) subjects had experienced HD motor sign onset and the average age of the diagnosis was 44 years of age. There were 280 (24%) participants who did not develop HD by the end of study and were treated as censored. All the participants were alive at the baseline in order to participate in the study, and none of them died without HD during the follow up years. Censoring was assumed to be independent of HD diagnosis.

To estimate the distribution of age-at-onset of HD given a subject’s CAG repeats length, we fit three estimators: IPW, REW, and the Kaplan-Meier (KM) estimator using only subjects with a particular CAG repeats length at a time. Figure 2 presents the estimated CDFs at various CAG values. The results show a positive correlation between the onset probability and the CAG repeats, that is, the cumulative risk of HD onset by a given age increases with increasing number of CAG repeats. Subjects with longer CAG repeats have a higher probability of developing HD by a certain age, which is consistent with the literature [6]. We summarize numerical results of estimated CDFs at a few CAGs and ages in table 4. As a comparison, we see that IPW and REW provide point estimates of CDFs similar to those from KM using only subjects with the same CAG values. However, the standard errors of REW at different ages and CAGs are smaller than those of both KM and IPW, suggesting an efficiency gain. For example, at CAG=42 and age 50, the standard error of the cumulative risk estimated by IPW is 18% larger than that of REW, and that of KM is 40% larger than that of REW. The post-hoc smoothing of the estimates of ***β***(*t*) leads to a CDF close to the non-smoothed CDF and is therefore not reported here. We also modeled the survival function for the censoring times *G*(·) based on CAG repeats using a Cox model. The estimates are identical to those in table 4 up to the third decimal place and are therefore not reported here.

Figure 2. Estimated CDF curves (KM, IPW and REW) on COHORT proband data (*n* = 1151) evaluated at CAG=50, 48, 46, 44, 42 (left to right).

We propose methods to estimate the cumulative disease risk from a known mutation (also referred to as the penetrance function in genetic epidemiology) under a nonparametric varying-coefficient model. For most complex diseases, predicting the age-at-onset of a disease from genetic markers such as single-nucleotide polymorphisms continues to be a challenging issue [22]. The proposed method uses a pseudo-logistic regression and redistributes the probability mass at the censored outcomes to the right. The procedure has desirable numerical and asymptotic properties and is easy to implement. Although we focused on assessing the effect of CAG repeats on HD onset, it is easy to include other covariates with time-invariant effects through a backfitting procedure for models such as

$$\text{logit}\{\text{pr}({T}_{i}\le t\mid {X}_{i})\}={\beta}_{0}(t)+{\beta}_{1}(t){X}_{i}+{\gamma}^{T}{Z}_{i}$$

or model (2). The proposed methods have computational advantages compared to, for example, Peng and Huang [10]. In addition to the logistic link as discussed here, the developed methods can be adapted to transformation models with a known link function.

Satten and Datta [23] showed an equivalence between IPW-based and self-consistency-equation-based methods for the Kaplan-Meier estimator in a purely nonparametric model. It is less clear whether such an equivalence still holds for our model (1), which is equivalent to a proportional odds model with nonparametric time-varying coefficients. This may be worth future exploration. In some applications, investigators may be interested in testing the distribution function at more than one time point or in building confidence bands. We proposed a procedure to test a distribution function at a sequence of pre-specified time points simultaneously in Ma and Wang [24], which can be adapted here. The construction of simultaneous confidence bands may rely on theoretical properties of suprema of Gaussian processes (e.g., Fine et al. [25]). However, such confidence bands may be conservative and the details are beyond the scope of this work.

Lastly, in practice it may not be easy to correctly specify a biologically meaningful parametric form for *c*_{0}(*X _{i}*) and

Wang and Ma’s research is supported by NIH grants NS073671-01, NS082062 and NSF grants DMS-1000354, DMS-1206693.

We show consistency by Lemma 5.2 in Newey [18]. We need to show uniform consistency of the initial IPW estimator, i.e., sup_{t}_{[}_{a,b}_{]} |**(***t*) − ** β**(

$$\widehat{\mathit{\beta}}(t)-\mathit{\beta}(t)=\frac{1}{n}\sum _{i=1}^{n}\psi \{{X}_{i},{T}_{i};t,\mathit{\beta}(t)\}+{o}_{p}({n}^{-1/2}),$$

(7)

where

$$\psi \{{X}_{i},{T}_{i};t,\mathit{\beta}(t)\}=\varphi \{{X}_{i},{T}_{i};t,\mathit{\beta}(t)\}-\sum _{i=1}^{n}\int \frac{[\varphi \{{X}_{i},{T}_{i};t,\mathit{\beta}(t)\}-\mathcal{B}(\varphi ,u)]{dM}_{i}^{c}(u)}{G(u)},$$

(*ϕ, u*) = *E*[*ϕ*{*X _{i}, T_{i}*,

Now we check the second term in *ψ*{*X _{i}*,

$$\begin{array}{l}\mathcal{B}(\varphi ,u)=E[\varphi \{{X}_{i},{T}_{i};t,\mathit{\beta}(t)\}\mid {X}_{i},{T}_{i}\ge u]\\ =\frac{E[\varphi \{{X}_{i},{T}_{i};t,\mathit{\beta}(t)\}I({T}_{i}\ge u)\mid {X}_{i}]}{\text{pr}({T}_{i}\ge u\mid {X}_{i})}\\ =\frac{{\int}_{u}^{\infty}A\{{X}_{i};\mathit{\beta}(t)\}[I(s\le t)-\mu \{{X}_{i};\mathit{\beta}(t)\}]{Z}_{i}dF(s\mid {X}_{i})}{1-F(u\mid {X}_{i})}\\ =\frac{A\{{X}_{i};\mathit{\beta}(t)\}[F(t\mid {X}_{i})-F(u\mid {X}_{i})-\mu \{{X}_{i};\mathit{\beta}(t)\}\{1-F(u\mid {X}_{i})\}]{Z}_{i}}{1-F(u\mid {X}_{i})}.\end{array}$$

Therefore

$$\begin{array}{l}\sum _{i=1}^{n}\int \frac{[\varphi \{{X}_{i},{T}_{i};t,\mathit{\beta}(t)\}-\mathcal{B}(\varphi ,u)]{dM}_{i}^{c}(u)}{G(u)}\\ =\sum _{i=1}^{n}\frac{(1-{\delta}_{i})}{G({C}_{i})}\left\{\varphi \{{X}_{i},{T}_{i};t,\mathit{\beta}(t)\}-\frac{A\{{X}_{i};\mathit{\beta}(t)\}[F(t\mid {X}_{i})-F({C}_{i}\mid {X}_{i})-\mu \{{X}_{i};\mathit{\beta}(t)\}\{1-F({C}_{i}\mid {X}_{i})\}]{Z}_{i}}{1-F({C}_{i}\mid {X}_{i})}\right\}.\end{array}$$

(8)

Under condition A4, *G*(*C _{i}*) > 0. Under model (2) and conditions A1, A2, the above term indexed by

$$\underset{t\in [a,b]}{sup}\left|{n}^{-1}\sum _{i=1}^{n}\psi \{{X}_{i},{T}_{i};t,\mathit{\beta}(t)\}-E[\psi \{{X}_{i},{T}_{i};t,\mathit{\beta}(t)\}]\right|\to 0.$$

Since *E*[*ψ*{*X _{i}*,

$$\underset{t\in [a,b]}{sup}\mid \widehat{\mathit{\beta}}(t)-\mathit{\beta}(t)\mid \to 0.$$

Now we verify assumptions 5.4 and 5.5 in Newey [18]. In what follows, we use ** θ** and

$$\begin{array}{l}s({O}_{i};{t}_{0},\stackrel{\sim}{\mathit{\theta}},\stackrel{\sim}{\mathit{\beta}})-s({O}_{i};{t}_{0},\stackrel{\sim}{\mathit{\theta}},\mathit{\beta})=I({T}_{i}>{C}_{i})I({C}_{i}<{t}_{0})\{w({O}_{i};{t}_{0},\stackrel{\sim}{\mathit{\beta}})-w({O}_{i};{t}_{0},\mathit{\beta})\}{Z}_{i}\\ =I({T}_{i}>{C}_{i})I({C}_{i}<{t}_{0}){Z}_{i}\phantom{\rule{0.16667em}{0ex}}\left[\frac{\mu \{{X}_{i};\stackrel{\sim}{\mathit{\beta}}({t}_{0})\}-\mu \{{X}_{i};\stackrel{\sim}{\mathit{\beta}}({C}_{i})\}}{1-\mu \{{X}_{i};\stackrel{\sim}{\mathit{\beta}}({C}_{i})\}}-\frac{\mu \{{X}_{i};\mathit{\beta}({t}_{0})\}-\mu \{{X}_{i};\mathit{\beta}({C}_{i})\}}{1-\mu \{{X}_{i};\mathit{\beta}({C}_{i})\}}\right]\\ =I({T}_{i}>{C}_{i})I({C}_{i}<{t}_{0}){Z}_{i}{Z}_{i}^{T}\left(\frac{\mu \{{X}_{i};\check{\mathit{\beta}}({t}_{0})\}[1-\mu \{{X}_{i};\check{\mathit{\beta}}({t}_{0})\}]}{1-\mu \{{X}_{i};\check{\mathit{\beta}}({C}_{i})\}}\{\stackrel{\sim}{\mathit{\beta}}({t}_{0})-\mathit{\beta}({t}_{0})\}-\frac{\mu \{{X}_{i};\check{\mathit{\beta}}({C}_{i})\}[1-\mu \{{X}_{i};\check{\mathit{\beta}}({t}_{0})\}]}{1-\mu \{{X}_{i};\check{\mathit{\beta}}({C}_{i})\}}\{\stackrel{\sim}{\mathit{\beta}}({C}_{i})-\mathit{\beta}({C}_{i})\}\right),\end{array}$$

(9)

where ** β̌**(

$$\Vert s({O}_{i};{t}_{0},\mathit{\theta},\stackrel{\sim}{\mathit{\beta}})-s({O}_{i};{t}_{0},\mathit{\theta},\mathit{\beta})\Vert \phantom{\rule{0.16667em}{0ex}}\le b({O}_{i})\Vert \stackrel{\sim}{\mathit{\beta}}-\mathit{\beta}\Vert .$$

By condition A5, the assumption 5.5 in Newey (1994) is satisfied. Finally, by Lemma 5.2 of Newey [18], we have ** = **** θ** +

We show the asymptotic normality of $\widehat{\mathit{\theta}}$ by Lemma 5.3 of Newey [18]. For assumption 5.1(i), note again

$$s({O}_{i};{t}_{0},\mathit{\theta},\stackrel{\sim}{\mathit{\beta}})-s({O}_{i};{t}_{0},\mathit{\theta},\mathit{\beta})=I({T}_{i}>{C}_{i})I({C}_{i}<{t}_{0})\{w({O}_{i};{t}_{0},\stackrel{\sim}{\mathit{\beta}})-w({O}_{i};{t}_{0},\mathit{\beta})\}{Z}_{i}.$$

We now compute a pathwise derivative of *w*(*O _{i}*;

$$\begin{array}{l}{lim}_{\epsilon \to 0}\frac{1}{\epsilon}\left\{\frac{F({t}_{0}\mid {X}_{i};{\mathit{\beta}}_{\epsilon})-F({C}_{i}\mid {X}_{i};{\mathit{\beta}}_{\epsilon})}{1-F({C}_{i}\mid {X}_{i};{\mathit{\beta}}_{\epsilon})}-\frac{F({t}_{0}\mid {X}_{i})-F({C}_{i}\mid {X}_{i})}{1-F({C}_{i}\mid {X}_{i})}\right\}\\ =\frac{F({t}_{0}\mid {X}_{i})\{1-F({t}_{0}\mid {X}_{i})\}{Z}_{i}^{T}}{1-F({C}_{i}\mid {X}_{i})}\{\stackrel{\sim}{\mathit{\beta}}({t}_{0})-\mathit{\beta}({t}_{0})\}-\frac{F({C}_{i}\mid {X}_{i})\{1-F({t}_{0}\mid {X}_{i})\}{Z}_{i}^{T}}{1-F({C}_{i}\mid {X}_{i})}\{\stackrel{\sim}{\mathit{\beta}}({C}_{i})-\mathit{\beta}({C}_{i})\}.\end{array}$$

(10)

Let

$$D({O}_{i};\stackrel{\sim}{\mathit{\beta}}-\mathit{\beta})=I({T}_{i}>{C}_{i})I({C}_{i}<{t}_{0}){Z}_{i}{Z}_{i}^{T}\phantom{\rule{0.16667em}{0ex}}[\frac{F({t}_{0}\mid {X}_{i})\{1-F({t}_{0}\mid {X}_{i})\}}{1-F({C}_{i}\mid {X}_{i})}\{\stackrel{\sim}{\mathit{\beta}}({t}_{0})-\mathit{\beta}({t}_{0})\}-\frac{F({C}_{i}\mid {X}_{i})\{1-F({t}_{0}\mid {X}_{i})\}}{1-F({C}_{i}\mid {X}_{i})}\{\stackrel{\sim}{\mathit{\beta}}({C}_{i})-\mathit{\beta}({C}_{i})\}].$$

(11)

From (9), we can verify

$$\begin{array}{l}s({O}_{i};{t}_{0},\mathit{\theta},\stackrel{\sim}{\mathit{\beta}})-s({O}_{i};{t}_{0},\mathit{\theta},\mathit{\beta})-D({O}_{i};\stackrel{\sim}{\mathit{\beta}}-\mathit{\beta})\\ =I({T}_{i}>{C}_{i})I({C}_{i}<{t}_{0}){Z}_{i}{Z}_{i}^{T}\left(\frac{\mu \{{X}_{i};\check{\mathit{\beta}}({t}_{0})\}[1-\mu \{{X}_{i};\check{\mathit{\beta}}({t}_{0})\}]}{1-\mu \{{X}_{i};\check{\mathit{\beta}}({C}_{i})\}}\{\stackrel{\sim}{\mathit{\beta}}({t}_{0})-\mathit{\beta}({t}_{0})\}\right.\\ \left.-\frac{\mu \{{X}_{i};\check{\mathit{\beta}}({C}_{i})\}[1-\mu \{{X}_{i};\check{\mathit{\beta}}({t}_{0})\}]}{1-\mu \{{X}_{i};\check{\mathit{\beta}}({C}_{i})\}}\{\stackrel{\sim}{\mathit{\beta}}({C}_{i})-\mathit{\beta}({C}_{i})\}\right)-D({O}_{i};\stackrel{\sim}{\mathit{\beta}}-\mathit{\beta})\\ =I({T}_{i}>{C}_{i})I({C}_{i}<{t}_{0}){Z}_{i}{Z}_{i}^{T}\\ \times \left(\left[\frac{\mu \{{X}_{i};\check{\mathit{\beta}}({t}_{0})\}[1-\mu \{{X}_{i};\check{\mathit{\beta}}({t}_{0})\}]}{1-\mu \{{X}_{i};\check{\mathit{\beta}}({C}_{i})\}}-\frac{F({t}_{0}\mid {X}_{i})\{1-F({t}_{0}\mid {X}_{i})\}}{1-F({C}_{i}\mid {X}_{i})}\right]\{\stackrel{\sim}{\mathit{\beta}}({t}_{0})-\mathit{\beta}({t}_{0})\}\right.\\ \left.-\left[\frac{\mu \{{X}_{i};\check{\mathit{\beta}}({C}_{i})\}[1-\mu \{{X}_{i};\check{\mathit{\beta}}({t}_{0})\}]}{1-\mu \{{X}_{i};\check{\mathit{\beta}}({C}_{i})\}}-\frac{F({C}_{i}\mid {X}_{i})\{1-F({t}_{0}\mid {X}_{i})\}}{1-F({C}_{i}\mid {X}_{i})}\right]\{\stackrel{\sim}{\mathit{\beta}}({C}_{i})-\mathit{\beta}({C}_{i})\}\right),\end{array}$$

where again ** β̌**(

$$\Vert s({O}_{i};{t}_{0},\mathit{\theta},\stackrel{\sim}{\mathit{\beta}})-s({O}_{i};{t}_{0},\mathit{\theta},\mathit{\beta})-D({O}_{i};\stackrel{\sim}{\mathit{\beta}}-\mathit{\beta})\Vert \phantom{\rule{0.16667em}{0ex}}\le b({O}_{i}){\Vert \stackrel{\sim}{\mathit{\beta}}-\mathit{\beta}\Vert}^{2}.$$

For (ii) in assumption 5.1, we need to show that the convergence rate of the IPW estimator is at least *n*^{1/4}. Let
denote all cadlag functions uniformly bounded on [*a*, *b*]. By adapting the proof in the previous item, we know that {*ψ*{*X _{i}*,

We now prove assumption 5.2 (stochastic equicontinuity). Note

$$\begin{array}{l}\int D(o;\stackrel{\sim}{\mathit{\beta}}-\mathit{\beta})\mathit{dGdH}\\ ={\int}_{0}^{{t}_{0}}g(u)\int h(x)\phantom{\rule{0.16667em}{0ex}}[\frac{F({t}_{0}\mid x)\{1-F({t}_{0}\mid x)\}{zz}^{T}\{\stackrel{\sim}{\mathit{\beta}}({t}_{0})-\mathit{\beta}({t}_{0})\}}{1-F(u\mid x)}\{1-F(u\mid x)\}\\ -\frac{F(u\mid x)\{1-F({t}_{0}\mid x)\}{zz}^{T}\{\stackrel{\sim}{\mathit{\beta}}(u)-\mathit{\beta}(u)\}}{1-F(u\mid x)}\{1-F(u\mid x)\}]\phantom{\rule{0.16667em}{0ex}}\mathit{dxdu}\\ ={\int}_{0}^{{t}_{0}}g(u)\int h(x){zz}^{T}[F({t}_{0}\mid x)\{1-F({t}_{0}\mid x)\}\{\stackrel{\sim}{\mathit{\beta}}({t}_{0})-\mathit{\beta}({t}_{0})\}\\ -F(u\mid x)\{1-F({t}_{0}\mid x)\}\{\stackrel{\sim}{\mathit{\beta}}(u)-\mathit{\beta}(u)\}]\phantom{\rule{0.16667em}{0ex}}\mathit{dxdu}.\end{array}$$

A sufficient condition for stochastic equicontinuity is provided in Chen et al. [19], Remark 2. To be specific, we need to show for *δ _{n}* =

$$\underset{\Vert \stackrel{\sim}{\mathit{\beta}}-\mathit{\beta}\Vert \le {\delta}_{n}}{sup}\Vert \frac{1}{n}\sum _{i=1}^{n}D({O}_{i},\stackrel{\sim}{\mathit{\beta}}-\mathit{\beta})-\int D(o,\stackrel{\sim}{\mathit{\beta}}-\mathit{\beta})\mathit{dGdH}\Vert ={o}_{p}({n}^{-1/2}).$$

This can be proved by showing the process {*D*(*O _{i}*,

A sufficient condition for assumption 5.3 in Newey [18] is

$$\sqrt{n}\int D(o;\widehat{\mathit{\beta}}-\mathit{\beta})\mathit{dGdH}-\sum _{i=1}^{n}\alpha ({O}_{i})/\sqrt{n}\to 0,$$

for some *α*(·) (p. 1366 of [18]). Using the expansion (7) for $\widehat{\mathit{\beta}}(t)$, we obtain

$$\int D(o;\widehat{\mathit{\beta}}-\mathit{\beta})\mathit{dGdH}=\frac{1}{n}\sum _{i=1}^{n}\xi ({T}_{i};{t}_{0},\mathit{\theta},\mathit{\beta})+{o}_{p}({n}^{-1/2}),$$

where

$$\xi ({T}_{i};{t}_{0},\mathit{\theta},\mathit{\beta})={\int}_{0}^{{t}_{0}}g(u)\int h(x){zz}^{T}[F({t}_{0}\mid x)\{1-F({t}_{0}\mid x)\}\psi (x,{T}_{i};{t}_{0},\mathit{\theta})-F(u\mid x)\{1-F({t}_{0}\mid x)\}\psi \{x,{T}_{i};u,\mathit{\beta}(u)\}]\phantom{\rule{0.16667em}{0ex}}\mathit{dxdu}.$$

Therefore assumption 5.3 holds.
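
To spell out this last step, one reading of the argument takes $\alpha({O}_{i})=\xi({T}_{i};{t}_{0},\mathit{\theta},\mathit{\beta})$. This identification is our interpretation of the expansion above and is not stated explicitly in the text. Multiplying the expansion by $\sqrt{n}$ then gives

$$\sqrt{n}\int D(o;\widehat{\mathit{\beta}}-\mathit{\beta})\mathit{dGdH}-\frac{1}{\sqrt{n}}\sum _{i=1}^{n}\xi ({T}_{i};{t}_{0},\mathit{\theta},\mathit{\beta})=\sqrt{n}\,{o}_{p}({n}^{-1/2})={o}_{p}(1),$$

which is the sufficient condition displayed above.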

For assumption 5.6, it is straightforward that (i) and (ii) are satisfied. We have

$$A=E\left\{\frac{\partial s({O}_{i};{t}_{0},\mathit{\theta},\mathit{\beta})}{\partial \mathit{\theta}}\right\}=E[\mu ({X}_{i};\mathit{\theta})\{1-\mu ({X}_{i};\mathit{\theta})\}{Z}_{i}{Z}_{i}^{T}],$$

which is nonsingular under assumption A5. It is easy to see that (iv) holds. For (v), since $\frac{\partial s({O}_{i};{t}_{0},\mathit{\theta},\mathit{\beta})}{\partial \mathit{\theta}}$ is continuous in $\mathit{\theta}$, assumption 5.4 (i) holds for $\frac{\partial s({O}_{i};{t}_{0},\mathit{\theta},\mathit{\beta})}{\partial \mathit{\theta}}$. Assumption 5.4 (ii) holds for $\frac{\partial s({O}_{i};{t}_{0},\mathit{\theta},\mathit{\beta})}{\partial \mathit{\theta}}$ since it does not depend on $\mathit{\beta}$.

By Lemma 5.3 of Newey [18], we obtain

$$\sqrt{n}(\widehat{\mathit{\theta}}-\mathit{\theta})\to N(0,{A}^{-1}V{A}^{-1}).$$
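
As a rough numerical illustration of the sandwich form $A^{-1}VA^{-1}$, the sketch below computes a plug-in covariance estimate. It rests on several assumptions that are not taken from the text: $\mu(X;\mathit{\theta})$ is given the logistic form $\operatorname{expit}(Z^{T}\mathit{\theta})$ (consistent with the expression for $A$ above, but assumed here), $V$ is estimated by the empirical covariance of per-subject influence terms, and the names `sandwich_covariance` and `influence` are hypothetical. In the toy usage the influence terms are simply $(Y_i-\mu_i)Z_i$ from fully observed binary data, which omits the adjustment for estimating $\mathit{\beta}$.

```python
import numpy as np


def expit(v):
    """Logistic function 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + np.exp(-v))


def sandwich_covariance(Z, theta, influence):
    """Plug-in estimate of A^{-1} V A^{-1} / n.

    Z         : (n, p) covariate matrix.
    theta     : (p,) parameter estimate.
    influence : (n, p) per-subject influence terms (hypothetical input);
                their empirical covariance stands in for V.
    """
    n = Z.shape[0]
    mu = expit(Z @ theta)                      # assumed logistic mean
    w = mu * (1.0 - mu)
    A_hat = (Z * w[:, None]).T @ Z / n         # sample analogue of E[mu(1-mu) Z Z^T]
    centered = influence - influence.mean(axis=0)
    V_hat = centered.T @ centered / n          # empirical covariance
    A_inv = np.linalg.inv(A_hat)
    return A_inv @ V_hat @ A_inv.T / n


# Toy usage on simulated, fully observed binary outcomes (illustration only).
rng = np.random.default_rng(0)
n = 500
Z_sim = np.column_stack([np.ones(n), rng.normal(size=n)])
theta_sim = np.array([-0.5, 1.0])
mu_sim = expit(Z_sim @ theta_sim)
Y_sim = rng.binomial(1, mu_sim)
infl_sim = (Y_sim - mu_sim)[:, None] * Z_sim   # placeholder influence terms
cov_sim = sandwich_covariance(Z_sim, theta_sim, infl_sim)
se_sim = np.sqrt(np.diag(cov_sim))             # approximate standard errors
```

Dividing by $n$ inside the function converts the asymptotic variance of $\sqrt{n}(\widehat{\mathit{\theta}}-\mathit{\theta})$ into an approximate covariance for $\widehat{\mathit{\theta}}$ itself.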

Tianle Chen, Biogen Idec, Cambridge, MA.

Yanyuan Ma, Department of Statistics, Texas A&M University.

Yuanjia Wang, Department of Biostatistics, Mailman School of Public Health, Columbia University.

1. Huntington’s Disease Collaborative Research Group. A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington’s disease chromosomes. Cell. 1993;72:971–983.

2. Lee JM, Ramos EM, Lee JH, Gillis T, Mysore JS, Hayden MR, Warby SC, Morrison P, Nance M, Ross CA, et al. CAG repeat expansion in Huntington disease determines age at onset in a fully dominant fashion. Neurology. 2012;78(10):690–695.

3. Kieburtz K. Huntington Study Group. The unified Huntington’s disease rating scale: reliability and consistency. Movement Disorders. 1996;11:136–142.

4. Dorsey ER, Beck C, Adams M. Huntington Study Group. TREND-HD communicating clinical trial results to research participants. Archives of Neurology. 2008;65(12):1590–1595.

5. Cox DR. Regression models and life-tables. Journal of the Royal Statistical Society, Series B. 1972;34(2):187–220.

6. Langbehn DR, Brinkman RR, Falush D, Paulsen JS, Hayden MR. A new model for prediction of the age of onset and penetrance for Huntington’s disease based on CAG length. Clinical Genetics. 2004;65:267–277.

7. Chen T, Wang Y, Ma Y, Marder K, Langbehn DR. Predicting disease onset from mutation status using proband and family data with applications to Huntington’s disease. Journal of Probability and Statistics. 2012:Article ID 375935. doi: 10.1155/2012/375935.

8. Langbehn DR, Hayden MR, Paulsen JS. the PREDICT-HD Investigators of the Huntington Study Group. CAG-repeat length and the age of onset in Huntington disease (HD): A review and validation study of statistical approaches. American Journal of Medical Genetics Part B. 2009;153B:397–408.

9. Jung SH. Regression analysis for long-term survival rate. Biometrika. 1996;83:227–232.

10. Peng L, Huang Y. Survival analysis with temporal covariate effects. Biometrika. 2007;94:719–733.

11. Chen YQ, Hu N, Cheng SC, Musoke P, Zhao LP. Estimating regression parameters in an extended proportional odds model. Journal of the American Statistical Association. 2012;107:318–330.

12. Zeng D, Lin DY. Maximum likelihood estimation in semiparametric regression models with censored data. Journal of the Royal Statistical Society, Series B. 2007;69:1–30.

13. Efron B. The two sample problem with censored data. Fifth Berkeley Symposium on Mathematical Statistics and Probability; 1967; p. 4.

14. Portnoy S. Censored regression quantiles. Journal of the American Statistical Association. 2003;98(464):1001–1012.

15. Wang HJ, Wang L. Locally weighted censored quantile regression. Journal of the American Statistical Association. 2009;104(487):1117–1128.

16. Ma Y, Wei Y. Analysis on censored quantile residual life model via spline smoothing. Statistica Sinica. 2012;22:47–68.

17. Bang H, Tsiatis AA. Estimating medical costs with censored data. Biometrika. 2000;87(2):329–343.

18. Newey WK. The asymptotic variance of semiparametric estimators. Econometrica. 1994;62:1349–1382.

19. Chen X, Linton O, Van Keilegom I. Estimation of semiparametric models when the criterion function is not smooth. Econometrica. 2003;71(5):1591–1608.

20. Khan S, Tamer E. Irregular identification, support conditions, and inverse weight estimation. Econometrica. 2010;78(6):2021–2042.

21. de Leeuw J, Hornik K, Mair P. Isotone optimization in R: Pool-adjacent-violators algorithm (PAVA) and active set methods. Journal of Statistical Software. 2009;32:1–24.

22. Kang J, Cho J, Zhao H. Practical issues in building risk-predicting models for complex diseases. Journal of Biopharmaceutical Statistics. 2010;20(2):415–440.

23. Satten GA, Datta S. The Kaplan–Meier estimator as an inverse-probability-of-censoring weighted average. The American Statistician. 2001;55(3):207–210.

24. Ma Y, Wang Y. Estimating disease distribution functions from censored mixture data. Journal of the Royal Statistical Society, Series C. 2014;63:1–23.

25. Fine J, Yan J, Kosorok MR. Temporal process regression. Biometrika. 2004;91:683–703.

26. Wang Y, Garcia TP, Ma Y. Nonparametric estimation for censored mixture data with application to the Cooperative Huntington’s Observational Research Trial. Journal of the American Statistical Association. 2012;107(500):1324–1338.
