Stat Med. Author manuscript; available in PMC 2016 July 20.
Published in final edited form as:
Stat Med. 2015 July 20; 34(16): 2427–2443.
Published online 2015 April 6. doi:  10.1002/sim.6499
PMCID: PMC4457675
NIHMSID: NIHMS674633

Predicting Cumulative Risk of Disease Onset by Re-distributing Weights

Abstract

We propose a simple approach to predicting the cumulative risk of disease that accommodates predictors with time-varying effects and outcomes subject to censoring. We use nonparametric functions for the time-varying coefficients and handle censoring through self-consistency equations that redistribute the probability mass of censored outcomes to the right. The computational procedure is extremely convenient and can be implemented with standard software. We prove large sample properties of the proposed estimator and evaluate its finite sample performance through simulation studies. We apply the method to estimate the cumulative risk of developing Huntington’s disease (HD) for subjects with the huntingtin gene mutation using data from a large collaborative HD study, and illustrate the inverse relationship between the age-at-onset of HD and the length of cytosine-adenine-guanine (CAG) repeats in the huntingtin gene.

Keywords: Proportional odds model, Self-consistency equation, Varying-coefficient model, Huntington’s disease

1 Introduction

In many biomedical studies, the research goal is to predict the age-specific cumulative risk of onset of a disease from a set of covariates. For example, Huntington’s disease (HD) is a progressive neurodegenerative disorder caused by the expansion of cytosine-adenine-guanine (CAG) trinucleotide repeats in the huntingtin gene [1]. The genetic model for HD is dominant [2], and there is an inverse relationship between the age-at-onset of HD and the CAG repeats length: the greater the CAG expansion, the earlier the age-at-onset of the disease. Accurately predicting a subject’s age-at-onset of HD from CAG repeats and other covariates is useful for assessing an individual’s risk of developing HD from available genetic mutation testing results when providing genetic counseling. Such estimates are also useful when designing a clinical trial. For instance, estimating the age-at-onset distribution of HD from a subject’s CAG repeats and other baseline information can be used to recruit patients who are close to the onset of disease to improve efficiency of a therapeutic trial. Improving existing estimation of HD risk is one of the research goals in the Cooperative Huntington’s Observational Research Trial (COHORT) study, which includes 42 sites [3, 4].

Age-at-onset of disease information is usually subject to right censoring due to termination of the study, loss to follow-up, or death of a subject. Popular regression models for censored outcomes include the Cox proportional hazards model [5], which relates the hazard rate at a particular age to covariates. Although the cumulative risk function can be obtained from this model, the regression coefficients are interpreted through the hazard function. Instead of working through a hazard function, a more appealing approach is to model the cumulative disease risk directly, since predicting disease onset is our primary goal and the hazard function is not of interest. Another motivation to avoid the proportional hazards model and to work with the cumulative distribution function directly is that the proportional hazards assumption may not be satisfied in certain applications. For example, Langbehn et al. [6] reported that the proportional hazards assumption does not hold for HD data, and proposed a parametric model for the cumulative risk function involving six parameters through a logistic transformation of pr(Ti ≤ t | Xi), where Ti denotes the age-at-onset of HD and Xi denotes the covariate of interest (i.e., CAG repeats length). Specifically, their model is

logit{pr(Ti ≤ t | Xi)} = {t − μ(Xi; α)}/s(Xi; γ),

where μ(Xi; α) and s(Xi; γ) are exponential functions with parameters α and γ. In Langbehn et al. [6] these functions are μ(x; α) = α1 + exp(α2 − α3x) and s(x; γ) = γ1 + exp(γ2 − γ3x). The correlation between the estimated parameters may be high, causing numerical difficulties in practice. We fitted Langbehn’s model to the COHORT data and reported the results in an earlier paper [7]. Other ad-hoc parametric models have also been proposed for HD data, with no consensus on the best model to use [8].

In this work, we consider a varying-coefficient proportional odds model

logit{pr(Ti ≤ t | Xi)} = β0(t) + β1(t)Xi.
(1)

To provide flexibility and protect against misspecification, β0(t) and β1(t) are left as unknown nonparametric functions. The interpretation of β1(t) is then directly related to the cumulative risk of disease, since exp{β1(t)} is the odds ratio of experiencing disease onset by age t for subjects with one unit difference in X. Since pr(Ti ≤ t | Xi) is a cumulative distribution function, β0(t) and β0(t) + β1(t)Xi are constrained to be non-decreasing functions of t. In applications where the Xi’s are positive, we require β1(t) to be non-decreasing as well. When β0(t) and β1(t) take some parametric form in t, model (1) reduces to a standard proportional odds model.
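As a concrete illustration, evaluating the cumulative risk under model (1) takes only a few lines. The sketch below reuses the log-time coefficient functions from the simulation study of Section 3.1; the choice of the natural logarithm is our assumption, since the log base is not stated there.

```python
import numpy as np

def expit(u):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-u))

def cumulative_risk(t, x, beta0, beta1):
    """Model (1): pr(T <= t | X = x) = expit{beta0(t) + beta1(t) * x}."""
    return expit(beta0(t) + beta1(t) * x)

# Coefficient functions borrowed from the simulation study in Section 3.1
# (natural log assumed; the log base is not stated in the text).
beta0 = lambda t: -80.0 + 21.5 * np.log(t)
beta1 = lambda t: -1.4 + 0.7 * np.log(t)

risk_50 = cumulative_risk(50.0, 42, beta0, beta1)  # risk of onset by age 50, CAG = 42
```

Because β0(t) and β0(t) + β1(t)x are non-decreasing in t here, the returned values trace out a proper conditional distribution function.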

An extension to model (1) is a nonparametric varying-coefficient model of the cumulative risk using a logistic link

logit{pr(Ti ≤ t | Xi)} = β0(t) + c0(Xi) + β1(t)c1(Xi),
(2)

where c0(x) and c1(x) are known parametric functions of the covariates. Note that when c1(x) = 1/s(x; γ), c0(x) = −μ(x; α)/s(x; γ), β0(t) = 0, and β1(t) = t, model (2) reduces to that of Langbehn et al. [6].

In the literature, Jung [9] directly modeled the survival function using a regression model at a fixed time point without considering temporal effects. There are a number of other works extending proportional hazards or proportional odds models to account for temporal covariate effects or time-varying covariates. Peng and Huang [10] proposed an alternative extension of the Cox proportional hazards model to account for a nonparametric temporal effect of a covariate. Their procedure involves solving a series of estimating equations sequentially. In contrast, our method is proposed for a proportional odds model with a nonparametric time-varying effect. Chen et al. [11] proposed methods to extend the transformation models considered in, for example, Zeng and Lin [12], to account for external time-varying covariates.

Here, we take a completely different approach that does not involve counting processes and has a straightforward, simple computational algorithm. When there is no censoring, to estimate the cumulative risk function at a time point t0 given a covariate, e.g., pr(Ti ≤ t0 | Xi), a straightforward analysis is to fit a logistic regression of I(Ti ≤ t0) on the covariates Xi. When the outcome is subject to censoring, I(Ti ≤ t0) may not be observed for some of the censored subjects. Let Ci denote the censoring time. Efron [13] proposed a nonparametric estimator of a survival function that re-distributes the conditional masses for the censored subjects, pr(Ti > Ci | Ci), equally to all the non-censored observations above Ci, where the common weight for these subjects depends on the number of at-risk subjects at Ci. Portnoy [14] and Wang and Wang [15] used similar ideas to fit a quantile regression with covariates Xi, where the conditional point masses pr(Ti > Ci | Ci, Xi) for censored subjects are re-distributed to the right. For quantile regression, the estimator only depends on the signs of the residuals, and thus the point masses for censored subjects are re-distributed to +∞. Since covariates are involved, the conditional masses to be estimated depend on the covariates and the unknown distribution function.

In this work, to estimate β(t0) from (1) or (2), we fit a pseudo-logistic regression of I(Ti ≤ t0), redistributing weights to the right to account for censoring. We apply the procedure to estimate the coefficient function at the distinct uncensored event times, and smooth the coefficient functions across the entire support of the event times when necessary. This type of smoothing was found to be equivalent to applying local kernel smoothing directly [16]. The proposed computational procedure is extremely easy to implement and can be handled by standard software. We investigate the asymptotic properties of the proposed estimator to show consistency and normality, and conduct simulation studies to examine its finite sample performance. The proposed methods are applied to estimate the cumulative risk of developing HD for subjects with the huntingtin gene mutation using the COHORT data, and illustrate a positive relationship between the cumulative risk of HD and the length of CAG repeats in the huntingtin gene. We compare the estimates under model (1) with fully nonparametric Kaplan-Meier estimates using subjects with the same CAG values and find consistent results.

2 Methods

For the purpose of illustration, we mainly focus on the varying coefficient model (1). Extension to the more general model (2) is discussed in Section 4.

2.1 Uncensored data

First we investigate estimation at a fixed time point t0 when the outcome is not subject to censoring. Let β(t) = {β0(t), β1(t)}^T, let β_t0 = β(t0) denote β(·) evaluated at t0, and let Zi = (1, Xi)^T. When there is no censoring, the likelihood for {I(Ti ≤ t0), Xi, i = 1, ···, n} under a logistic link takes the standard form, ∏_i exp{I(Ti ≤ t0)Zi^T β_t0}/[1 + exp{Zi^T β_t0}]. To estimate β_t0, we solve the estimating equation

∑_{i=1}^n m(Xi, Ti; t0, β_t0) = 0,

where m(Xi, Ti; t0, β_t0) = {I(Ti ≤ t0) − μ(Xi; β_t0)}Zi, and μ(Xi; β_t0) = exp{Zi^T β_t0}/[1 + exp{Zi^T β_t0}]. The influence function for the estimate β̂_t0 is

ϕ(Xi, Ti; t0, β_t0) = A(Xi; β_t0){I(Ti ≤ t0) − μ(Xi; β_t0)}Zi,

where A(Xi; β_t0) = (E[μ(Xi; β_t0){1 − μ(Xi; β_t0)}Zi Zi^T])^{−1}. We fit a logistic regression of I(Ti ≤ t0) on Xi and repeat this process while varying t0 over all distinct values of the observed Ti’s. One can then smooth the estimates as a function of t0 [16] subject to the monotonicity constraint. An alternative is to fit a nonparametric regression (for example, using splines) treating I(Ti ≤ t) as generalized outcomes. This method was shown to have performance similar to the post-hoc smoothing above [16], but is more difficult to implement under the monotonicity constraint; therefore we do not explore it further here.
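The uncensored procedure is thus a sequence of ordinary logistic fits over a grid of time points. A minimal sketch follows, with a small Newton-Raphson solver and hypothetical simulated data (no dataset accompanies this section, so names and distributions are ours):

```python
import numpy as np

def fit_logistic(y, X, weights=None, n_iter=25):
    """Newton-Raphson solver for (weighted) logistic regression."""
    n, p = X.shape
    w = np.ones(n) if weights is None else weights
    beta = np.zeros(p)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (w * (y - mu))
        hess = (X * (w * mu * (1.0 - mu))[:, None]).T @ X
        beta += np.linalg.solve(hess, grad)
    return beta

# hypothetical uncensored data for illustration
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0.0, 2.0, n)
T = rng.exponential(np.exp(1.0 - 0.5 * x))   # uncensored onset times
Z = np.column_stack([np.ones(n), x])         # Z_i = (1, X_i)^T as in the text

# beta(t0) is estimated by a logistic regression of I(T <= t0) on Z,
# repeated over a grid of t0 values (here three quantiles of T)
grid = np.quantile(T, [0.25, 0.5, 0.75])
beta_hat = {t0: fit_logistic((T <= t0).astype(float), Z) for t0 in grid}
```

In practice the grid would run over all distinct observed event times, and the resulting curves β̂0(t0), β̂1(t0) would then be smoothed under the monotonicity constraint as described above.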

2.2 Censored data

When a subject is right censored (i.e., Ti > Ci) and Ci ≥ t0, we still observe I(Ti ≤ t0) = 0. Ambiguity occurs when a subject is censored and Ci < t0. One type of estimator for β(·) can be obtained by the inverse probability of censoring weighting (IPW) proposed in Bang and Tsiatis [17], which weights subjects having an event by the inverse of their probabilities of not being censored. Specifically, we can obtain the IPW estimator by solving the estimating equation

S_n(β) = n^{−1} ∑_{i=1}^n I(Ti ≤ Ci) m(Xi, Ti; t0, β)/G(Ti) = 0,

where G(·) is the survival function of the censoring times Ci. Estimating G(·) by the Kaplan-Meier estimator of the censoring process, the estimating equation for β(·) is

∑_{i=1}^n I(Ti ≤ Ci) m(Xi, Ti; t0, β)/Ĝ(Ti) = 0.
(3)

This process is repeated for t0 on a grid (u1, ···, uM). Alternatively, one can let the grid points be the distinct uncensored event times only.
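A sketch of the IPW step in (3), with Ĝ the Kaplan-Meier estimator of the censoring distribution; the helper names and the simulated data are ours, not from the paper:

```python
import numpy as np

def fit_wlogit(y, X, w, n_iter=25):
    """Newton-Raphson solver for a weighted logistic score equation."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        beta += np.linalg.solve((X * (w * mu * (1.0 - mu))[:, None]).T @ X,
                                X.T @ (w * (y - mu)))
    return beta

def km_censoring(time, delta):
    """Kaplan-Meier estimate of the censoring survival function G
    (a censoring 'event' is delta == 0); returns a step function."""
    order = np.argsort(time)
    t, d = time[order], delta[order]
    at_risk = np.arange(len(t), 0, -1)
    surv = np.cumprod(1.0 - (d == 0).astype(float) / at_risk)
    def G(u):
        idx = np.searchsorted(t, u, side="right") - 1
        return np.where(idx < 0, 1.0, surv[np.maximum(idx, 0)])
    return G

def ipw_beta(t0, time, delta, Z):
    """IPW estimate of beta(t0): weighted logistic regression of I(T <= t0)
    over the uncensored subjects, with weights 1 / G_hat(T_i)."""
    G = km_censoring(time, delta)
    keep = delta == 1
    w = 1.0 / np.maximum(G(time[keep]), 1e-8)
    return fit_wlogit((time[keep] <= t0).astype(float), Z[keep], w)

# hypothetical censored data for illustration
rng = np.random.default_rng(1)
n = 800
x = rng.uniform(0.0, 1.0, n)
T, C = rng.exponential(1.0, n), rng.exponential(2.0, n)
time, delta = np.minimum(T, C), (T <= C).astype(int)
Z = np.column_stack([np.ones(n), x])
b = ipw_beta(np.median(time), time, delta, Z)
```

As noted below for the simulations, these inverse-probability weights can become unstable when Ĝ(Ti) is small, which motivates the redistribution estimator that follows.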

Here we propose a new type of estimator that re-distributes weights to the right for ambiguous subjects based on self-consistency equations similar to Efron [13] and Wang and Wang [15]. Let Oi = {Xi, Ti ∧ Ci, Δi ≡ I(Ti ≤ Ci)} denote the ith observation. We solve the following weighted estimating equation

S_n(β_t0, β) = n^{−1} ∑_{i=1}^n s(Oi; t0, β_t0, β) = 0,
(4)

where s(Oi; t0, β_t0, β) = w{Oi; t0, β(·)} m(Xi, Ti ∧ Ci; t0, β_t0) + [1 − w{Oi; t0, β(·)}] m(Xi, +∞; t0, β_t0), and

w{Oi; t0, β(·)} = 1, if Δi = 1 or (Δi = 0 and Ci ≥ t0);
w{Oi; t0, β(·)} = {F(t0|Xi) − F(Ci|Xi)}/{1 − F(Ci|Xi)}, if Δi = 0 and Ci < t0.
(5)

Here F(t|x) = μ{x; β(t)} is the conditional distribution function of Ti given Xi introduced in model (1), and the weight for the ith subject depends on β(·) evaluated at t0 and Ci.

To gain insight into the weights, note that subjects with observed I(Ti ≤ t0) receive a weight of one for their contributions to the estimating equation. Subjects with missing I(Ti ≤ t0) have conditional probability masses

E{I(Ti ≤ t0) | Ti > Ci, Xi} = {F(t0|Xi) − F(Ci|Xi)}/{1 − F(Ci|Xi)}.

Treating (Xi, Ci) as pseudo-observations for censored subjects with censoring time less than t0, they receive weights w{Oi; t0, β(·)} = pr(Ti ≤ t0 | Ti > Ci, Ci, Xi). We re-distribute their complementary weights 1 − w{Oi; t0, β(·)} = pr(Ti > t0 | Ti > Ci, Ci, Xi) to the right. Since the outcomes are binary variables, the complementary masses 1 − w{Oi; t0, β(·)} for the pseudo-observations (Xi, Ci) can be re-distributed to any point greater than Ci, and the contribution does not depend on which specific point above Ci is chosen (see also Portnoy [14] and Wang and Wang [15]). Thus, any point above Ci contributes the same information to the estimating equation. Without loss of generality, we re-distribute the complementary mass to +∞, and the contribution of these observations to the estimating equation is m(Xi, +∞; t0, β_t0) = −μ(Xi; β_t0)Zi.

In practice, the weights w{Oi; t0, β(·)} in (5) depend on the unknown distribution function F(·|X), which needs to be estimated. We substitute β(·) with the IPW estimator, denoted β̃(·), to obtain the weights w{Oi; t0, β̃(·)} to be redistributed. The redistribution-to-the-right weighted (REW) estimator β̂_t0 then solves the weighted estimating equation

S_n(O; t0, β_t0, β̃(·)) = n^{−1} ∑_{i=1}^n s(Oi; t0, β_t0, β̃(·)) = 0.
(6)

It is extremely easy to implement this weighting scheme. Without loss of generality, assume the first n0 subjects have unobserved outcomes I(Ti ≤ t0). Create pseudo-observations Õ1 = (X1, +∞, Δ1), ···, Õ_n0 = (X_n0, +∞, Δ_n0). Append all pseudo-observations to the original observations to obtain (O1, ···, On, Õ1, ···, Õ_n0) with weights

[w{O1; t0, β̃(·)}, ···, w{On; t0, β̃(·)}, 1 − w{O1; t0, β̃(·)}, ···, 1 − w{O_n0; t0, β̃(·)}].

Then β̂_t0 is obtained by a weighted logistic regression. The weights w{O; t0, β̃(·)} extract information at multiple time points simultaneously, and thus pool information across time points to estimate the distribution function at t0.
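The weighting scheme above can be sketched as follows. Here `beta_tilde` stands for the initial estimate of β(·) (e.g., the IPW fit); in the illustration at the bottom it is replaced by a known function, and all data and names are hypothetical:

```python
import numpy as np

def fit_wlogit(y, X, w, n_iter=25):
    """Newton-Raphson solver for a weighted logistic score equation."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        beta += np.linalg.solve((X * (w * mu * (1.0 - mu))[:, None]).T @ X,
                                X.T @ (w * (y - mu)))
    return beta

def rew_beta(t0, time, delta, Z, beta_tilde):
    """Redistribution-to-the-right (REW) estimate of beta(t0).
    beta_tilde: callable t -> initial coefficient estimate, used only
    to compute the redistribution weights in (5)."""
    expit = lambda u: 1.0 / (1.0 + np.exp(-u))
    F = lambda t, z: expit(z @ beta_tilde(t))     # F(t | x) under the initial fit
    n = len(time)
    y = (time <= t0).astype(float)                # I(T <= t0) where unambiguous
    w = np.ones(n)
    ambiguous = (delta == 0) & (time < t0)        # censored before t0
    for i in np.flatnonzero(ambiguous):
        Ft0, Fc = F(t0, Z[i]), F(time[i], Z[i])
        w[i] = (Ft0 - Fc) / max(1.0 - Fc, 1e-8)   # pr(T <= t0 | T > C, X)
        y[i] = 1.0                                # mass w: event by t0, since C_i < t0
    # complementary mass 1 - w is redistributed to +infinity, i.e. I(T <= t0) = 0
    y_aug = np.concatenate([y, np.zeros(ambiguous.sum())])
    Z_aug = np.vstack([Z, Z[ambiguous]])
    w_aug = np.concatenate([w, 1.0 - w[ambiguous]])
    return fit_wlogit(y_aug, Z_aug, w_aug)

# illustration: T ~ Exp(1), with the true marginal CDF standing in for the
# initial estimate (logit{1 - e^{-t}} = log(e^t - 1)); no covariate effect
rng = np.random.default_rng(2)
n = 600
x = rng.uniform(0.0, 1.0, n)
T, C = rng.exponential(1.0, n), rng.exponential(2.0, n)
time, delta = np.minimum(T, C), (T <= C).astype(int)
Z = np.column_stack([np.ones(n), x])
beta_tilde = lambda t: np.array([np.log(np.expm1(t)), 0.0])
b_rew = rew_beta(1.0, time, delta, Z, beta_tilde)
```

Each ambiguous subject thus enters the fit twice: once at (Xi, Ci) with outcome 1 and weight w, and once as a pseudo-observation at +∞ with outcome 0 and weight 1 − w, exactly as in the augmented weight vector above.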

2.3 Asymptotic properties

To show consistency and asymptotic normality of β̂(t) at a fixed t obtained from (6), we need the following technical conditions:

  • A1
    Assume that β(t) is right continuous with left-hand limits (cadlag) componentwise.
  • A2
    Assume that t ∈ [a, b] with b < ∞, and that there exist subjects with P(min(Ci, Ti) > b) > 0. Also assume β(t) is uniformly bounded on [a, b] componentwise, that is, sup_{t∈[a,b]} |β(t)| ≤ c < ∞ componentwise.
  • A3
    Assume that the covariates Xi are not degenerate, i.e., pr(Xi = x0) ≠ 1 for any x0, and are bounded, i.e., pr(|Xi| < c) = 1.
  • A4
    Assume that the censoring times are bounded, i.e., pr(Ci < c) = 1.
  • A5
    Assume that E(Zi Zi^T exp{Zi^T β(t)}/[1 + exp{Zi^T β(t)}]^2) is positive definite.

Conditions A1–A2 control the size of the parameter space. Condition A2 states that one can only estimate the distribution function in the time range where subjects have positive probability of remaining at risk. Conditions A3–A4 exclude some degenerate cases. Condition A5 ensures a unique solution to the estimating equation. For simplicity of notation we let θ = β(t0) denote β(·) evaluated at t0 in this subsection. The following theorem establishes the consistency of the estimator θ̂.

Theorem 1

Assume that {Oi, i = 1, ···, n} are i.i.d. random samples, and that Ti and Ci are independent given Xi. Then under model (2) and assumptions A1–A5, θ̂ → θ in probability as n → ∞ for any t0 ∈ (a, b).

The proof of this theorem uses the semiparametric asymptotic results developed in Newey [18] and Chen et al. [19].

Since the final estimator involves the initial estimates β̃(·) over the entire range of Ti, uniform consistency of the initial estimator is required. The next theorem establishes the asymptotic normality of θ̂.

Theorem 2

Under the assumptions of Theorem 1, as n → ∞,

√n(θ̂ − θ) → N(0, A^{−1} V A^{−1})

in distribution, where A = E[μ(Xi; θ){1 − μ(Xi; θ)}Zi Zi^T], V = cov{s(Oi; t0, θ, β) + ξ(Ti; t0, θ, β)},

ξ(Ti; t0, θ, β) = ∫_0^{t0} ∫ g(u) h(x) z z^T [F(t0|x){1 − F(t0|x)} ψ(x, Ti; t0, θ) − F(u|x){1 − F(t0|x)} ψ{x, Ti; u, β(u)}] dx du,

g(u) is the density function of Ci, h(x) is the density function of Xi, z = (1, x)^T, and ψ{x, Ti; u, β(u)} is defined in the appendix.

The proof of this theorem is in the appendix and it also uses the results in Newey [18].

3 Numeric results

In this section, we provide Monte Carlo results from simulation experiments and an application of the method to a real-world study.

3.1 Simulation studies

To study the finite sample performance of the proposed estimator, we ran two sets of simulation studies. In each set, the true survival times were generated from model (1) with β0(t) = β00 + β01 log(t), β1(t) = β10 + β11 log(t), and (β00, β01, β10, β11)^T = (−80, 21.5, −1.4, 0.7)^T. The parameters were designed so that the cumulative risk functions resemble the fit from the COHORT data in Section 3.2. In simulation setting A (Tables 1 and 2), we simulated Xi from a multinomial distribution with support on the integer values between 41 and 50, representing CAG repeats. CAG values 41–45 were simulated with equal probability, each twice the probability of each of the CAG values 46–50. The censoring times were generated from a Beta distribution such that the overall censoring rate is about 35%. We simulated two sample sizes, n = 1000 and n = 2000, since the real data has a sample size of 1151.
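The survival-time generation in setting A can be written down by inverting the conditional CDF of model (1): given U ~ Unif(0, 1), logit(U) = β0(T) + β1(T)X is linear in log T and can be solved in closed form. The sketch below assumes the natural logarithm (the base is not stated in the text):

```python
import numpy as np

def simulate_setting_a(n, rng):
    """Draw (X, T) as in simulation setting A: X is a CAG value in 41..50,
    with CAG 41-45 each twice as likely as CAG 46-50, and T solves
    logit(U) = beta0(T) + beta1(T) * X for U ~ Unif(0, 1), where
    beta0(t) = -80 + 21.5 log t and beta1(t) = -1.4 + 0.7 log t."""
    b00, b01, b10, b11 = -80.0, 21.5, -1.4, 0.7
    cag = np.arange(41, 51)
    p = np.r_[np.full(5, 2.0), np.full(5, 1.0)]
    p /= p.sum()
    X = rng.choice(cag, size=n, p=p)
    U = rng.uniform(size=n)
    # logit(U) = b00 + b01*log T + (b10 + b11*log T)*X  =>  solve for log T
    log_t = (np.log(U / (1.0 - U)) - b00 - b10 * X) / (b01 + b11 * X)
    return X, np.exp(log_t)

rng = np.random.default_rng(0)
X, T = simulate_setting_a(20000, rng)
```

Censoring times (here from a Beta distribution tuned to the 35% censoring rate) would then be drawn independently, and the observed data set to (T ∧ C, Δ).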

Table 1
Mean estimated CDFs by IPW and REW estimators, their mean estimated SEs, empirical SEs, empirical MSEs, and 95% coverages (all in 100 fold scale). n = 1000, 400 simulations, CAG=42 and 46.
Table 2
Mean estimated CDFs by IPW and REW estimators, their mean estimated SEs, empirical SEs, empirical MSEs, and 95% coverages (all in 100 fold scale). n = 2000, 400 simulations, CAG=42 and 46.

We compared two types of estimators. The first is the initial inverse probability weighted estimator (IPW) β̃(t) from (3), and the second is the proposed redistribution-to-the-right weighted estimator (REW) β̂(t) from (6). Since the theoretical variance estimator involves integrations and unknown quantities that are difficult to compute, we used the bootstrap to obtain the mean estimated standard errors of the two estimators in each simulation repetition. The empirical standard errors based on the total of 400 repetitions and the empirical MSEs are also summarized in Tables 1 and 2, where we present the estimated distribution functions obtained from the two estimators at various ages and CAG values (42 and 46). It can be seen that both the IPW and REW estimators have small finite sample biases. The mean estimated standard errors and empirical standard errors are close to each other over most of the age range, with some exceptions in the extreme tail area. The empirical standard error of REW is smaller than that of IPW, especially at older ages. For example, the efficiency gain of REW over IPW is 10% at age 50 for CAG=42 and n = 1000. The coverage of the 95% confidence interval is close to the nominal level when age is below 60 for both IPW and REW. At age 60, since censoring is heavier, the coverages of both IPW and REW are lower than the nominal level, with REW performing slightly better. The mean estimated standard error of the IPW estimator at age 55 and CAG=42 in Table 1 is very large because the IPW estimator has unstable weights in that age range due to a turning point in the distribution function. This is a limitation of the IPW estimator, and the increased variance when the weights are unstable has been reported in the literature [20]. The issue is alleviated when the sample size is increased, as seen in Table 2. In addition, the proposed estimator does not suffer from the high variability of the IPW initial estimator.

We present the true and the mean estimated cumulative distribution functions (CDFs) obtained from the REW estimator and their empirical 95% CIs at various CAG values in Figure 1. The estimated curves coincide with the true curves in most cases. When CAG=42 and n = 1000, there appears to be a small bias in the tail area, for example, at t = 65 (bias=0.0051, empirical SE=0.0015). However, this bias is within the variability range, and may be explained by the higher censoring rate in this range for subjects with CAG=42 (about 45%). When we increased the sample size to n = 2000, the bias decreased to almost zero.

Figure 1
True and REW CDF curves evaluated at CAG=50, 48, 46, 44, 42 (left to right). The true and mean estimated curves are indistinguishable for most cases. n = 1000 (top) and n = 2000 (bottom), 400 simulations.

In simulation setting B (Table 3), we kept the same setting as in A, but increased the censoring rate to 45% and the number of simulations to 2000. Due to the computational burden we did not bootstrap each simulation repetition to obtain the mean estimated SEs, MSEs, and coverage probabilities. Only the empirical SEs and MSEs are reported in Table 3. The results are similar to those in Tables 1 and 2: the empirical SEs and MSEs of REW are consistently smaller than those of IPW.

Table 3
Mean estimated CDFs by IPW and REW estimators, their empirical SEs, and empirical MSEs (all in 100 fold scale). n = 1000 or 2000, 2000 simulations, CAG=42 and 46.

In addition to the above estimators, we also investigated a smoothed REW estimator, where β̂(t) was smoothed across the range of t subject to the monotonicity constraint using a Generalized Pooled-Adjacent-Violators Algorithm [21]. The mean estimated cumulative distribution functions and empirical standard errors are almost identical to those of the non-smoothed estimator; the maximum absolute difference in the means of the two estimators, averaged across simulations, was very small. We therefore omit the results of the smoothed estimator here.
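A basic pool-adjacent-violators routine for such monotone smoothing is shown below. It is a simple stand-in for the Generalized Pooled-Adjacent-Violators Algorithm of [21], which we have not reproduced exactly:

```python
import numpy as np

def pava(y, w=None):
    """Pool-adjacent-violators: isotonic (non-decreasing) weighted
    least-squares fit to the sequence y."""
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    # each block holds [weighted mean, total weight, count]
    blocks = [[y[0], w[0], 1]]
    for yi, wi in zip(y[1:], w[1:]):
        blocks.append([yi, wi, 1])
        # merge adjacent blocks while they violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m1, w1, c1 = blocks.pop()
            m0, w0, c0 = blocks.pop()
            wt = w0 + w1
            blocks.append([(w0 * m0 + w1 * m1) / wt, wt, c0 + c1])
    return np.concatenate([np.full(c, m) for m, _, c in blocks])

# e.g. smooth an estimated coefficient curve across the event-time grid
smoothed = pava(np.array([1.0, 3.0, 2.0]))   # -> [1.0, 2.5, 2.5]
```

In the setting of this section, such a routine would be applied componentwise to the REW estimates β̂(t) across the grid of distinct event times before forming the CDF.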

3.2 Application to COHORT data

As introduced in Section 1, despite identification of the causative gene for HD, there is currently no effective treatment that delays HD onset or stops disease progression. To improve the care of HD patients and inform the development of effective treatments, a large genetic epidemiological study of HD, the Cooperative Huntington’s Observational Research Trial (COHORT), was started in 1996. The study is organized by 42 Huntington Study Group research centers in North America and Australia [3, 4]. Participants in COHORT underwent a clinical evaluation in which blood samples were genotyped for the huntingtin gene mutation and their CAG repeats lengths were obtained. Accurately modeling the inverse association between the CAG repeats length and the age-at-onset of HD is important.

In this section, we fit the COHORT data with model (1), where we do not assume a parametric form for β0(t), β1(t), or the censoring distribution G(·). In our analysis, information on CAG repeats length, age at the time of evaluation, and age at diagnosis of HD onset (if a subject had been diagnosed) was available for 1151 subjects recruited in COHORT. The study included both HD-affected carriers and pre-symptomatic carriers (24%). Their ages at first motor symptom were also recorded. Among the 1151 subjects, 876 (76%) had experienced HD motor sign onset, and the average age at diagnosis was 44 years. There were 280 (24%) participants who did not develop HD by the end of the study and were treated as censored. All participants were alive at baseline in order to participate in the study, and none of them died without HD during the follow-up years. Censoring was assumed to be independent of HD diagnosis.

To estimate the distribution of age-at-onset of HD given a subject’s CAG repeats length, we computed three estimators: IPW, REW, and the Kaplan-Meier (KM) estimator using only subjects with a particular CAG repeats length at a time. Figure 2 presents the estimated CDFs at various CAG values. The results show a positive correlation between the onset probability and the CAG repeats: the cumulative risk of HD onset by a given age increases with increasing number of CAG repeats. Subjects with longer CAG repeats have a higher probability of developing HD by a given age, which is consistent with the literature [6]. We summarize numerical results of the estimated CDFs at a few CAG values and ages in Table 4. As a comparison, IPW and REW provide point estimates of the CDFs similar to KM applied to subjects with the same CAG values. However, the standard errors of REW at different ages and CAG values are smaller than those of both KM and IPW, suggesting an efficiency gain. For example, at CAG=42 and age 50, the standard error of the cumulative risk estimated by IPW is 18% larger than that of REW, and that of KM is 40% larger. The post-hoc smoothing of β̂(t) leads to a CDF close to the non-smoothed CDF and is therefore not reported here. We also modeled the survival function of the censoring times G(·) as a function of CAG repeats using a Cox model. The resulting estimates are identical to those in Table 4 up to the third decimal place and are therefore also not reported.

Figure 2
Estimated CDF curves (KM, IPW and REW) on COHORT proband data (n = 1151) evaluated at CAG=50, 48, 46, 44, 42 (left to right).
Table 4
COHORT data: Estimated CDFs by KM, IPW and REW estimators and their estimated SEs at CAG=42, 44, 46 and 48 (all in 100 fold scale).

4 Discussion

We propose methods to estimate the cumulative disease risk from a known mutation (also referred to as the penetrance function in genetic epidemiology) using a nonparametric varying-coefficient model. For most complex diseases, predicting the age-at-onset of a disease from genetic markers such as single-nucleotide polymorphisms continues to be a challenging issue [22]. The proposed method uses a pseudo-logistic regression and redistributes the probability mass of the censored outcomes to the right. The procedure has desirable numerical and asymptotic properties and is extremely easy to implement. Although we focused on assessing the effect of CAG repeats on HD onset, it is easy to include other covariates with time-invariant effects through a backfitting procedure for models such as

logit{pr(Ti ≤ t | Xi)} = β0(t) + β1(t)Xi + γ^T Zi,

or model (2). The proposed methods have computational advantages compared to, for example, Peng and Huang [10]. In addition to the logistic link as discussed here, the developed methods can be adapted to transformation models with a known link function.

Satten and Datta [23] showed an equivalence between IPW-based and self-consistency-equation-based derivations of the Kaplan-Meier estimator in a purely nonparametric model. It is less clear whether such an equivalence still holds for our model (1), which is equivalent to a proportional odds model with nonparametric time-varying coefficients. This may be worth future exploration. In some applications, investigators may be interested in testing the distribution function at more than one time point or in building confidence bands. We proposed a procedure to test a distribution function at a sequence of pre-specified time points simultaneously in Ma and Wang [24], which can be adapted here. The construction of simultaneous confidence bands may rely on theoretical properties of suprema of Gaussian processes (e.g., Fine et al. [25]). However, such confidence bands may be conservative, and the details are beyond the scope of this work.

Lastly, in practice it may not be easy to correctly specify a biologically meaningful parametric form for c0(Xi) and c1(Xi) in model (2). In these situations, using a two-dimensional nonparametric function of Xi and t may be helpful. To help decide between a parametric model (e.g., Langbehn et al. [6]) and a semiparametric or nonparametric model, a goodness-of-fit statistic is useful. For example, one may consider a test statistic based on the supremum norm, such as sup_t |F1(t) − F0(t)|, where F1(t) is fitted under a nonparametric or semiparametric model and F0(t) is fitted under a parametric model. The critical value or p-value of the test can be computed empirically based on simulations under the null hypothesis.

Acknowledgments

Wang and Ma’s research is supported by NIH grants NS073671-01 and NS082062 and NSF grants DMS-1000354 and DMS-1206693.

Appendix

A.1 Proof of Theorem 1

We show consistency using Lemma 5.2 in Newey [18]. We need to show uniform consistency of the initial IPW estimator, i.e., sup_{t∈[a,b]} |β̃(t) − β(t)| = op(1), and verify assumptions 5.4 and 5.5 in Newey [18]. We first show uniform consistency of the initial IPW estimator. Wang et al. [26] showed that the IPW estimator can be expanded as

β̃(t) − β(t) = n^{−1} ∑_{i=1}^n ψ{Xi, Ti; t, β(t)} + op(n^{−1/2}),
(7)

where

ψ{Xi, Ti; t, β(t)} = ϕ{Xi, Ti; t, β(t)} − ∫ [ϕ{Xi, Ti; t, β(t)} − B(ϕ, u)] dMi^c(u)/G(u),

where B(ϕ, u) = E[ϕ{Xi, Ti; t, β(t)} | Ti ≥ u, Xi], and dMi^c(u) is the martingale of the censoring process. To show uniform consistency, we need to show that the set {ψ{Xi, Ti; t, β(t)} : t ∈ [a, b]} is a Glivenko-Cantelli class. Note that ϕ{Xi, Ti; t, β(t)} = A{Xi; β(t)}[I(Ti ≤ t) − μ{Xi; β(t)}]Zi. Indicator functions are cadlag processes that are bounded in total variation and belong to the Vapnik-Chervonenkis class. Thus they are bounded in uniform entropy integral with a square-integrable envelope. It follows that they belong to a Donsker class, and hence are Glivenko-Cantelli. In addition, μ{Xi; β(t)} is Lipschitz continuous. By assumption A1, β(t) is a cadlag process and therefore also bounded in uniform entropy integral. Since Lipschitz continuous functions of classes that are bounded in uniform entropy integral and pointwise measurable are themselves bounded in uniform entropy integral and pointwise measurable, {μ{Xi; β(t)} : t ∈ [a, b]} is Glivenko-Cantelli. From A{Xi; β(t)} = {E(μ{Xi; β(t)}[1 − μ{Xi; β(t)}]Zi Zi^T)}^{−1}, under assumption A5, A{Xi; β(t)} is bounded from below and above by positive constants component-wise and is bounded in uniform entropy integral, and is therefore Glivenko-Cantelli. Lastly, since Xi is bounded and products of classes with bounded uniform entropy integral also have bounded uniform entropy integral, {ϕ{Xi, Ti; t, β(t)} : t ∈ [a, b]} is Glivenko-Cantelli.

Now we check the second term in ψ{Xi, Ti; t, β(t)}. Note that

B(ϕ, u) = E[ϕ{Xi, Ti; t, β(t)} | Xi, Ti ≥ u]
= E[ϕ{Xi, Ti; t, β(t)} I(Ti ≥ u) | Xi] / E{I(Ti ≥ u) | Xi}
= ∫_u^∞ A{Xi; β(t)}[I(s ≤ t) − μ{Xi; β(t)}]Zi dF(s|Xi) / {1 − F(u|Xi)}
= A{Xi; β(t)}[F(t|Xi) − F(u|Xi) − μ{Xi; β(t)}{1 − F(u|Xi)}]Zi / {1 − F(u|Xi)}.

Therefore

∑_{i=1}^n ∫ [ϕ{Xi, Ti; t, β(t)} − B(ϕ, u)] dMi^c(u)/G(u)
= ∑_{i=1}^n (1 − δi)/G(Ci) {ϕ{Xi, Ti; t, β(t)} − A{Xi; β(t)}[F(t|Xi) − F(Ci|Xi) − μ{Xi; β(t)}{1 − F(Ci|Xi)}]Zi / {1 − F(Ci|Xi)}}.
(8)

Under condition A4, G(Ci) > 0. Under model (2) and conditions A1 and A2, the above term indexed by t is also Glivenko-Cantelli. This proves that {ψ{Xi, Ti; t, β(t)} : t ∈ [a, b]} is Glivenko-Cantelli. It follows that

sup_{t∈[a,b]} |n^{−1} ∑_{i=1}^n ψ{Xi, Ti; t, β(t)} − E[ψ{Xi, Ti; t, β(t)}]| → 0.

Since E[ψ{Xi, Ti; t, β(t)}] = 0, we have shown the uniform consistency of the IPW estimator,

sup_{t∈[a,b]} |β̃(t) − β(t)| → 0 in probability.

Now we verify assumptions 5.4 and 5.5 in Newey [18]. In what follows, we use θ and β(·) to denote the true parameter values and use θ̃ and β̃(·) to denote other values different from the truth. For assumption 5.4 (i), it is straightforward to see that s(Oi; t0, θ̃, β) is continuous in θ̃ and is bounded under assumptions A1, A3, and A4. For assumption 5.4 (ii), note

s(Oi; t0, θ, β̃) − s(Oi; t0, θ, β)
= I(Ti > Ci) I(Ci < t0) {w(Oi; t0, β̃) − w(Oi; t0, β)} Zi
= I(Ti > Ci) I(Ci < t0) Zi [ (μ{Xi; β̃(t0)} − μ{Xi; β̃(Ci)})/(1 − μ{Xi; β̃(Ci)}) − (μ{Xi; β(t0)} − μ{Xi; β(Ci)})/(1 − μ{Xi; β(Ci)}) ]
= I(Ti > Ci) I(Ci < t0) Zi Zi^T ( [μ{Xi; β̌(t0)}(1 − μ{Xi; β̌(t0)})/(1 − μ{Xi; β̌(Ci)})] {β̃(t0) − β(t0)} − [μ{Xi; β̌(Ci)}(1 − μ{Xi; β̌(t0)})/(1 − μ{Xi; β̌(Ci)})] {β̃(Ci) − β(Ci)} ),
(9)

where β̌(u) is on the line segment between β̃(u) and β(u). Here the last equality is obtained by taking a pathwise derivative with respect to β; see also (10). Since 0 < μ{x; β̌(u)} < 1 for u ∈ [a, b], it follows that there exists b(Oi) such that, component-wise,

\[
\| s(O_i; t_0, \theta, \tilde\beta) - s(O_i; t_0, \theta, \beta) \| \le b(O_i) \, \| \tilde\beta - \beta \|.
\]

By condition A5, assumption 5.5 in Newey [18] is satisfied. Finally, by Lemma 5.2 of Newey [18], we have θ̂ = θ + op(1).

A.2 Proof of Theorem 2

We show the asymptotic normality of θ̂ by Lemma 5.3 of Newey [18]. For assumption 5.1(i), note again

\[
s(O_i; t_0, \theta, \tilde\beta) - s(O_i; t_0, \theta, \beta) = I(T_i > C_i) I(C_i < t_0) \{ w(O_i; t_0, \tilde\beta) - w(O_i; t_0, \beta) \} Z_i.
\]

We now compute the pathwise derivative of w(Oi; t0, β) with respect to β, evaluated at the true β in the direction β̃ − β. Let βε(u) = β(u) + ε{β̃(u) − β(u)}. We can verify that

\[
\begin{aligned}
&\lim_{\varepsilon \to 0} \frac{1}{\varepsilon} \left\{ \frac{F(t_0 \mid X_i; \beta_\varepsilon) - F(C_i \mid X_i; \beta_\varepsilon)}{1 - F(C_i \mid X_i; \beta_\varepsilon)} - \frac{F(t_0 \mid X_i) - F(C_i \mid X_i)}{1 - F(C_i \mid X_i)} \right\} \\
&\quad= \frac{F(t_0 \mid X_i)\{1 - F(t_0 \mid X_i)\}\, Z_i^T}{1 - F(C_i \mid X_i)} \{\tilde\beta(t_0) - \beta(t_0)\} - \frac{F(C_i \mid X_i)\{1 - F(t_0 \mid X_i)\}\, Z_i^T}{1 - F(C_i \mid X_i)} \{\tilde\beta(C_i) - \beta(C_i)\}.
\end{aligned}
\]
(10)
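For completeness, (10) follows from the quotient rule together with the logistic derivative ∂F(u | Xi; βε)/∂ε at ε = 0, namely Ḟ(u) = F(u | Xi){1 − F(u | Xi)}Zi^T{β̃(u) − β(u)} (a spelled-out intermediate step, with the shorthand Ḟ ours):

\[
\begin{aligned}
\left.\frac{\partial}{\partial \varepsilon} \left\{ \frac{F_\varepsilon(t_0 \mid X_i) - F_\varepsilon(C_i \mid X_i)}{1 - F_\varepsilon(C_i \mid X_i)} \right\}\right|_{\varepsilon = 0}
&= \frac{\dot F(t_0) - \dot F(C_i)}{1 - F(C_i \mid X_i)} + \frac{\{F(t_0 \mid X_i) - F(C_i \mid X_i)\}\, \dot F(C_i)}{\{1 - F(C_i \mid X_i)\}^2} \\
&= \frac{\dot F(t_0)}{1 - F(C_i \mid X_i)} - \frac{\{1 - F(t_0 \mid X_i)\}\, \dot F(C_i)}{\{1 - F(C_i \mid X_i)\}^2}.
\end{aligned}
\]

Substituting Ḟ(t0) and Ḟ(C_i), and cancelling one factor of 1 − F(Ci | Xi) in the second term, gives the right-hand side of (10).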

Let

\[
D(O_i; \tilde\beta - \beta) = I(T_i > C_i) I(C_i < t_0) Z_i Z_i^T \left[ \frac{F(t_0 \mid X_i)\{1 - F(t_0 \mid X_i)\}}{1 - F(C_i \mid X_i)} \{\tilde\beta(t_0) - \beta(t_0)\} - \frac{F(C_i \mid X_i)\{1 - F(t_0 \mid X_i)\}}{1 - F(C_i \mid X_i)} \{\tilde\beta(C_i) - \beta(C_i)\} \right].
\]
(11)

From (9), we can verify

\[
\begin{aligned}
&s(O_i; t_0, \theta, \tilde\beta) - s(O_i; t_0, \theta, \beta) - D(O_i; \tilde\beta - \beta) \\
&\quad= I(T_i > C_i) I(C_i < t_0) Z_i Z_i^T \left( \frac{\mu\{X_i; \check\beta(t_0)\}[1 - \mu\{X_i; \check\beta(t_0)\}]}{1 - \mu\{X_i; \check\beta(C_i)\}} \{\tilde\beta(t_0) - \beta(t_0)\} \right. \\
&\qquad\qquad \left. - \frac{\mu\{X_i; \check\beta(C_i)\}[1 - \mu\{X_i; \check\beta(t_0)\}]}{1 - \mu\{X_i; \check\beta(C_i)\}} \{\tilde\beta(C_i) - \beta(C_i)\} \right) - D(O_i; \tilde\beta - \beta) \\
&\quad= I(T_i > C_i) I(C_i < t_0) Z_i Z_i^T \left( \left[ \frac{\mu\{X_i; \check\beta(t_0)\}[1 - \mu\{X_i; \check\beta(t_0)\}]}{1 - \mu\{X_i; \check\beta(C_i)\}} - \frac{F(t_0 \mid X_i)\{1 - F(t_0 \mid X_i)\}}{1 - F(C_i \mid X_i)} \right] \{\tilde\beta(t_0) - \beta(t_0)\} \right. \\
&\qquad\qquad \left. - \left[ \frac{\mu\{X_i; \check\beta(C_i)\}[1 - \mu\{X_i; \check\beta(t_0)\}]}{1 - \mu\{X_i; \check\beta(C_i)\}} - \frac{F(C_i \mid X_i)\{1 - F(t_0 \mid X_i)\}}{1 - F(C_i \mid X_i)} \right] \{\tilde\beta(C_i) - \beta(C_i)\} \right),
\end{aligned}
\]

where again β̌(u) is on the line segment between β̃(u) and β(u). It is now easy to see that

\[
\| s(O_i; t_0, \theta, \tilde\beta) - s(O_i; t_0, \theta, \beta) - D(O_i; \tilde\beta - \beta) \| \le b(O_i) \, \| \tilde\beta - \beta \|^2.
\]

For (ii) in assumption 5.1, we need to show that the convergence rate of the IPW estimator β̂ is at least n^{1/4}. Let 𝓕 denote the class of all cadlag functions uniformly bounded on [a, b]. By adapting the proof in the previous item, we know that {ψ{Xi, Ti; t, β(t)} : t ∈ [a, b], β ∈ 𝓕} belongs to a Donsker class. Therefore √n{β̂(·) − β(·)} converges weakly to a Gaussian process, and this assumption is satisfied.

We now prove assumption 5.2 (stochastic equicontinuity). Note

\[
\begin{aligned}
\int D(o; \tilde\beta - \beta)\, dG\, dH
&= \int_0^{t_0}\!\!\int g(u) h(x) \left[ \frac{F(t_0 \mid x)\{1 - F(t_0 \mid x)\}\, z z^T \{\tilde\beta(t_0) - \beta(t_0)\}}{1 - F(u \mid x)} \{1 - F(u \mid x)\} \right. \\
&\qquad\qquad \left. - \frac{F(u \mid x)\{1 - F(t_0 \mid x)\}\, z z^T \{\tilde\beta(u) - \beta(u)\}}{1 - F(u \mid x)} \{1 - F(u \mid x)\} \right] dx\, du \\
&= \int_0^{t_0}\!\!\int g(u) h(x)\, z z^T \left[ F(t_0 \mid x)\{1 - F(t_0 \mid x)\} \{\tilde\beta(t_0) - \beta(t_0)\} - F(u \mid x)\{1 - F(t_0 \mid x)\} \{\tilde\beta(u) - \beta(u)\} \right] dx\, du.
\end{aligned}
\]

A sufficient condition for stochastic equicontinuity is provided in Remark 2 of Chen et al. [19]. Specifically, we need to show that, for δn = op(1),

\[
\sup_{\|\tilde\beta - \beta\| \le \delta_n} \left\| \frac{1}{n} \sum_{i=1}^n D(O_i; \tilde\beta - \beta) - \int D(o; \tilde\beta - \beta)\, dG\, dH \right\| = o_p(n^{-1/2}).
\]

This can be proved by showing that the process {D(Oi; β̃ − β) : t ∈ [a, b], β̃ − β ∈ 𝓕} belongs to a Donsker class, where 𝓕 denotes the class of all cadlag functions uniformly bounded on [a, b]. Given the form of D(Oi; β̃ − β) in (11), and again adapting the proof in item 4, this holds under conditions A1–A5.

A sufficient condition for assumption 5.3 in Newey [18] is

\[
\sqrt{n} \int D(o; \hat\beta - \beta)\, dG\, dH - \frac{1}{\sqrt{n}} \sum_{i=1}^n \alpha(O_i) \xrightarrow{p} 0,
\]

for some α(·) (p. 1366 of [18]). Using the expansion (7) for β̂(t), we obtain

\[
\int D(o; \hat\beta - \beta)\, dG\, dH = \frac{1}{n} \sum_{i=1}^n \xi(T_i; t_0, \theta, \beta) + o_p(n^{-1/2}),
\]

where

\[
\xi(T_i; t_0, \theta, \beta) = \int_0^{t_0}\!\!\int g(u) h(x)\, z z^T \left[ F(t_0 \mid x)\{1 - F(t_0 \mid x)\}\, \psi\{x, T_i; t_0, \theta\} - F(u \mid x)\{1 - F(t_0 \mid x)\}\, \psi\{x, T_i; u, \beta(u)\} \right] dx\, du.
\]

Therefore assumption 5.3 holds.

For assumption 5.6, it is straightforward that (i) and (ii) are satisfied. We have

\[
A = E\left\{ \frac{\partial s(O_i; t_0, \theta, \beta)}{\partial \theta} \right\} = E\left[ \mu(X_i; \theta)\{1 - \mu(X_i; \theta)\}\, Z_i Z_i^T \right],
\]

which is nonsingular under assumption A5. It is easy to see that (iv) holds. For (v), since ∂s(Oi; t0, θ, β)/∂θ is continuous in θ, assumption 5.4(i) holds for ∂s(Oi; t0, θ, β)/∂θ. Assumption 5.4(ii) also holds for ∂s(Oi; t0, θ, β)/∂θ, since it does not depend on β.

By Lemma 5.3 of Newey [18], we obtain

\[
\sqrt{n}\, (\hat\theta - \theta) \to N(0, A^{-1} V A^{-1}).
\]
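In practice the limiting covariance A^{-1}VA^{-1} is estimated by plugging in empirical analogues: Â from the derivative matrix and V̂ from the outer products of the estimated per-subject score contributions. A minimal numpy sketch of this sandwich computation (the function and variable names are ours, not from the paper):

```python
import numpy as np

def sandwich_variance(scores, A):
    """Sandwich covariance estimate A^{-1} V A^{-1}.

    scores : (n, p) array whose i-th row is the estimated score
             contribution s(O_i; t_0, theta-hat, beta-hat).
    A      : (p, p) estimated derivative matrix E[ds/dtheta].
    """
    n = scores.shape[0]
    V = scores.T @ scores / n          # empirical V = E[s s^T]
    A_inv = np.linalg.inv(A)
    return A_inv @ V @ A_inv           # divide by n for Var(theta-hat)
```

For instance, with p = 1, score contributions (1, −1, 2, −2) and A = 2, we get V̂ = 2.5 and a sandwich value of 0.625; dividing by n then gives the variance estimate for θ̂ itself.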

Contributor Information

Tianle Chen, Biogen Idec, Cambridge, MA.

Yanyuan Ma, Department of Statistics, Texas A&M University.

Yuanjia Wang, Department of Biostatistics, Mailman School of Public Health, Columbia University.

References

1. Huntington’s Disease Collaborative Research Group. A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington’s disease chromosomes. Cell. 1993;72:971–983.
2. Lee JM, Ramos EM, Lee JH, Gillis T, Mysore JS, Hayden MR, Warby SC, Morrison P, Nance M, Ross CA, et al. CAG repeat expansion in Huntington disease determines age at onset in a fully dominant fashion. Neurology. 2012;78(10):690–695.
3. Kieburtz K; Huntington Study Group. The Unified Huntington’s Disease Rating Scale: reliability and consistency. Movement Disorders. 1996;11:136–142.
4. Dorsey ER, Beck C, Adams M; Huntington Study Group TREND-HD Investigators. Communicating clinical trial results to research participants. Archives of Neurology. 2008;65(12):1590–1595.
5. Cox DR. Regression models and life-tables. Journal of the Royal Statistical Society, Series B. 1972;34(2):187–220.
6. Langbehn DR, Brinkman RR, Falush D, Paulsen JS, Hayden MR. A new model for prediction of the age of onset and penetrance for Huntington’s disease based on CAG length. Clinical Genetics. 2004;65:267–277.
7. Chen T, Wang Y, Ma Y, Marder K, Langbehn DR. Predicting disease onset from mutation status using proband and family data with applications to Huntington’s disease. Journal of Probability and Statistics. 2012: Article ID 375935. doi:10.1155/2012/375935.
8. Langbehn DR, Hayden MR, Paulsen JS; the PREDICT-HD Investigators of the Huntington Study Group. CAG-repeat length and the age of onset in Huntington disease (HD): a review and validation study of statistical approaches. American Journal of Medical Genetics Part B. 2009;153B:397–408.
9. Jung SH. Regression analysis for long-term survival rate. Biometrika. 1996;83:227–232.
10. Peng L, Huang Y. Survival analysis with temporal covariate effects. Biometrika. 2007;94:719–733.
11. Chen YQ, Hu N, Cheng SC, Musoke P, Zhao LP. Estimating regression parameters in an extended proportional odds model. Journal of the American Statistical Association. 2012;107:318–330.
12. Zeng D, Lin DY. Maximum likelihood estimation in semiparametric regression models with censored data. Journal of the Royal Statistical Society, Series B. 2007;69:1–30.
13. Efron B. The two sample problem with censored data. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; 1967; vol. 4.
14. Portnoy S. Censored regression quantiles. Journal of the American Statistical Association. 2003;98(464):1001–1012.
15. Wang HJ, Wang L. Locally weighted censored quantile regression. Journal of the American Statistical Association. 2009;104(487):1117–1128.
16. Ma Y, Wei Y. Analysis on censored quantile residual life model via spline smoothing. Statistica Sinica. 2012;22:47–68.
17. Bang H, Tsiatis AA. Estimating medical costs with censored data. Biometrika. 2000;87(2):329–343.
18. Newey WK. The asymptotic variance of semiparametric estimators. Econometrica. 1994;62:1349–1382.
19. Chen X, Linton O, Van Keilegom I. Estimation of semiparametric models when the criterion function is not smooth. Econometrica. 2003;71(5):1591–1608.
20. Khan S, Tamer E. Irregular identification, support conditions, and inverse weight estimation. Econometrica. 2010;78(6):2021–2042.
21. de Leeuw J, Hornik K, Mair P. Isotone optimization in R: pool-adjacent-violators algorithm (PAVA) and active set methods. Journal of Statistical Software. 2009;32:1–24.
22. Kang J, Cho J, Zhao H. Practical issues in building risk-predicting models for complex diseases. Journal of Biopharmaceutical Statistics. 2010;20(2):415–440.
23. Satten GA, Datta S. The Kaplan–Meier estimator as an inverse-probability-of-censoring weighted average. The American Statistician. 2001;55(3):207–210.
24. Ma Y, Wang Y. Estimating disease distribution functions from censored mixture data. Journal of the Royal Statistical Society, Series C. 2014;63:1–23.
25. Fine J, Yan J, Kosorok MR. Temporal process regression. Biometrika. 2004;91:683–703.
26. Wang Y, Garcia TP, Ma Y. Nonparametric estimation for censored mixture data with application to the Cooperative Huntington’s Observational Research Trial. Journal of the American Statistical Association. 2012;107(500):1324–1338.