Search tips
Search criteria 


Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
Ann Inst Stat Math. Author manuscript; available in PMC 2013 April 1.
Published in final edited form as:
Ann Inst Stat Math. 2012 April 1; 64(2): 415–438.
doi:  10.1007/s10463-010-0317-2
PMCID: PMC3259712

Hazard Function Estimation with Cause-of-Death Data Missing at Random


Hazard function estimation is an important part of survival analysis. Interest often centers on estimating the hazard function associated with a particular cause of death. We propose three nonparametric kernel estimators for the hazard function, all of which are appropriate when death times are subject to random censorship and censoring indicators can be missing at random. Specifically, we present a regression surrogate estimator, an imputation estimator, and an inverse probability weighted estimator. All three estimators are uniformly strongly consistent and asymptotically normal. We derive asymptotic representations of the mean squared error and the mean integrated squared error for these estimators and we discuss a data-driven bandwidth selection method. A simulation study, conducted to assess finite sample behavior, demonstrates that the proposed hazard estimators perform relatively well. We illustrate our methods with an analysis of some vascular disease data.

Keywords: Imputation estimator, Inverse probability weighted estimator, Kernel estimator, Regression surrogate estimator

1 Introduction

A common feature of survival data is the presence of right censored observations. Censoring can occur, for example, if individuals withdraw from a study before dying or if a study ends before all subjects have died. Additionally, when multiple causes of death are operating, the time to death from one cause can be censored by a death from a different cause. For instance, in a clinical trial one might distinguish between deaths attributable to the disease of interest and deaths due to all other causes. Without loss of generality, we focus on a particular cause of death and treat all other causes as censoring mechanisms with respect to the death time of interest.

Let T and C(0) be random variables representing the time to death from the cause of interest and the time to usual (right) censoring, respectively. Let T(1), T(2), ···, T(r) be the times to death from all other causes. In our problem, T may be censored by C(0), T(1), ···, T(r−1) or T(r). Let C = min(C(0), T(1), ···, T(r)), where C denotes the censoring random variable. We assume that T and C are independent and we observe X = min(T, C) and δ = I(TC), where I(·) is the indicator function. Let F, G, and L be the cumulative distribution functions for T, C, and X, respectively. Finally, let λ(t) = limε→0+ P (tT < t + ε|Tt)/ε be the hazard function for T.

Censored survival time problems frequently are characterized in terms of hazard functions, and thus the estimation of λ(t) has received much attention. Suppose the data consist of n independent and identically distributed pairs {(Xi, δi): i = 1, …, n}. One type of nonparametric hazard estimation is based on kernel smoothers of the form:


where Ri is the rank of Xi, K(·) is a kernel function, and hn is a sequence of bandwidths. Clearly, λn(t) is a convolution of the kernel function and the nonparametric cumulative hazard estimator of Nelson (1972). This class of estimators has been investigated by several authors, including Blum and Susarla (1980), Tanner (1983), Ramlau-Hansen (1983), Tanner and Wong (1983), Regina and John (1985), Diehl and Stute (1988), and Wang (1999).

This paper addresses the problem in which cause of death is unknown for a subset of individuals, and thus some of the censoring indicators are missing. For example, van der Laan and McKeague (1998) describe epidemiological studies in which death certificates were missing for some people, mainly due to emigration or inconclusive hospital case notes and autopsy results. They point out that it can be impossible to determine whether death was due to the cause of interest in these cases. Missing causes of death also arise in carcinogenicity experiments. In some studies only a subset of animals are examined for tumors to cut costs; occasionally tissues autolyze or are cannibalized by cage mates before a necropsy can be performed; and pathologists are not always able to determine each tumor’s role in causing death. For example, Kalbfleisch and Prentice (1980) provide data on mice who died from leukemia, other known (non-leukemia) causes, or unknown causes; and Dinse (1986) presents data on mice whose status of nonrenal vascular disease at death was classified as absent, incidental, fatal, or unknown. This last data set is analyzed in Section 6.

The general problem of analyzing censored survival data with missing cause-of-death data (or missing censoring indicators) has received much attention. Dinse (1982) derived the nonparametric maximum likelihood estimator of the survival function in this situation; see, also, the estimators of Dinse (1986), Lo (1991), McKeague and Subramanian (1998), van der Laan and McKeague (1998), and Subramanian (2004, 2006). Other authors have considered hypothesis testing and regression modeling. Goetghebeur and Ryan (1990) derived a modified log rank test to compare survival rates in two groups; Dewanji (1992) suggested an improvement to that approach; and Goetghebeur and Ryan (1995) extended their earlier results to the proportional hazards regression model. Tsiatis et al. (2002) used multiple imputation methods to evaluate treatment differences in survival. Recently, Gao and Tsiatis (2005) developed a semiparametric procedure to estimate regression coefficients in a linear transformation competing risks model. Klein and Moeschberger (2003, Chapter 6) point out the importance of kernel estimation for hazard functions in the presence of censored data. In this paper, we concentrate on nonparametrically estimating the hazard function, λ(t), by extending well-known kernel smoothing methods to allow for missing data.

Suppose that X is always observed, but the censoring indicator δ is missing for some subjects. Define a missingness indicator ξ which is 1 if δ is observed and is 0 otherwise. Therefore, we observe either {X, δ, ξ = 1} or {X, ξ = 0}. Throughout this paper, we assume that δ is missing at random (MAR), which implies that ξ and δ are conditionally independent given X: P(ξ = 1|X, δ) = P(ξ = 1|X). The MAR assumption is common in statistical analyses involving missing data and is reasonable in many practical situations; see, for example, Little and Rubin (1987, Chapter 1).

When some censoring indicators are missing, the hazard estimator in (1) cannot be applied directly. One simple solution is to use only the complete cases, {X, δ, ξ = 1}, and to ignore all subjects with missing indicators, {X, ξ = 0}. However, the resulting complete case (CC) estimator is highly inefficient if there is a significant degree of missingness; see, e.g., van der Laan and McKeague (1998). Also, the CC estimator is consistent and unbiased only when the censoring indicators are missing completely at random (MCAR), which is a special case of MAR where ξ is independent of both X and δ: P(ξ = 1|X, δ) = P(ξ = 1); see, e.g., Tsiatis et al. (2002).

Imputation has become a popular method for handling missing data; see, for example, Rubin (1987), Lipsitz et al. (1998), Robins and Wang (2000), and Wang and Rao (2002). The popularity of this approach stems largely from the fact that once the missing values are imputed, standard techniques for analysing complete data can be readily applied. The inverse probability weighted procedure is also widely used in missing data situations; see, for example, Robins and Rotnitzky (1992), Robins et al. (1994), and Zhao et al. (1996). These two approaches are usually applied to regression problems with missing responses or covariates, but here we adapt them to handle missing censoring indicators.

This paper develops three kernel estimators for the hazard function: a regression surrogate estimator, an imputation estimator, and an inverse probability weighted estimator. The regression surrogate estimator is based on a particular expression for λ(t), and the imputation estimator and inverse probability weighted estimator are motivated by the regression surrogate estimator. All three estimators are of the form given in (1), except that some or all of the δi values are replaced by other quantities. The regression surrogate estimator replaces every δi, known or unknown, by an estimator of the conditional expectation of δi given Xi. The imputation estimator replaces only the unknown values of δi by estimators of their conditional expectations. The inverse probability weighted estimator replaces each unknown δi by its estimated conditional expectation and each known δi by a weighted sum of δi and its estimated conditional expectation. Existing nonparametric methods for estimating hazard functions assume complete censoring information, while nonparametric methods that allow missing censoring indicators focus on the simpler problem of estimating survival functions rather than hazard functions. The main contributions of this paper are the development of nonparametric approaches for estimating hazard functions with missing censoring indicators and the theoretical results derived for the proposed estimators.

The paper is organized as follows. Section 2 defines three nonparametric hazard estimators. Section 3 shows that these estimators are uniformly strongly consistent and asymptotically normal under the MAR assumption, and derives asymptotic representations for the mean squared error (MSE) and mean integrated squared error (MISE). Section 4 gives a data-driven bandwidth selection procedure. Section 5 reports simulation results for evaluating finite sample performance; Section 6 illustrates our methods by applying them to data from an animal experiment; and Section 7 provides a few concluding remarks. Finally, the main results are proved in the Appendices.

2 Estimation

The hazard function of interest, λ(t), can be expressed as


where f(t) = dF(t)/dt and L1(t) = P (Xt, δ = 1). As noted by Dikta (1998), we can write L1(t)=0tm(s)dL(s), where m(s) = P (δ = 1|X = s) is the conditional expectation of the censoring indicator given the observation time, and thus (2) yields


Given (3), we use kernel smoothing to define a regression surrogate estimator of λ(t):


where mn(s)=i=1nξiδiW(sXibn)/i=1nξiW(sXibn),Ln(s)=n1i=1nI(Xis), and s− is the time just before s. Here mn(s) is the Nadaraya-Watson kernel regression estimator of m(s), where W(·) is a kernel function and bn is a bandwidth sequence.

As n[1 − Ln(Xi −)] = nRi + 1, the estimator in (4) can be rewritten as


If no censoring indicators are missing, the regression surrogate estimator in (5) reduces to a pre-smoothed Nelson-Aalen type estimator (Cao et al., 2005; Cao and Jacome, 2004; Jacome et al., 2008). Similarly, the basic kernel estimator in (1), which is appropriate when none of the censoring indicators are missing, coincides with the regression surrogate estimator in (5) if every δi is replaced by mn(Xi), an estimator of the conditional expectation of δi given Xi. Intuitively, however, it seems reasonable to replace only the missing censoring indicators with estimators of their conditional expectations. If we use this logic and only replace δi by mn(Xi) in (1) if ξi = 0, we obtain our imputation estimator of λ(t):


Finally, define π(x) = P(ξ = 1|X = x), plus its Nadaraya-Watson kernel regression estimator πn(x)=i=1nξiΩ(xXiγn)/i=1nΩ(xXiγn), where Ω(·) is a kernel function and γn is a bandwidth sequence. Our inverse probability weighted estimator of λ(t) is


In this case, δi in (1) is replaced by mn(Xi) if ξi = 0 and by a weighted average of δi and mn(Xi) if ξi = 1, where each weight is inversely proportional to πn(Xi), which is an estimator of the conditional expectation of ξi given Xi.

Just like other nonparametric kernel methods, our methods are robust to the choice of kernel functions. Some common kernel functions include the uniform, biweight, and Epanechnikov kernel functions. Bandwidth selection for hn and robustness of our methods with respect to bn and γn are discussed in subsequent sections.

All three of our hazard estimators use a nonparametric estimator for m(x). One natural alternative is a semiparametric approach that assumes a specific parametric model, say m(x|θ), where θ is a finite-dimensional parameter and m(·|·) is a known function. Another alternative for handling missing data is the use of multiple imputation. Conditional on any observation time Xi for which the censoring indicator is missing (ξi = 0), generate ν independent Bernoulli random variables, say { δij:j = 1, …, ν}, each of which is 1 with probability mn(Xi). An individual imputation estimator, say λn,j(t), can be obtained by replacing each missing δi with δij in (1) for j = 1, …, ν. As this is equivalent to replacing mn(Xi) with δij in (6), we could define two other estimators by performing the same replacement in (5) and (7). In any of these three cases, the multiple imputation estimator is the average: ν1j=1νλn,j(t). We plan to investigate these two approaches in a separate paper.

3 Asymptotic Properties

This section discusses asymptotic properties of the estimators proposed in Section 2. Let [lambda with circumflex]n(t) denote any one of the estimators in (5) – (7). The following theorem, which is proved in Appendix A, establishes the uniform strong consistency of [lambda with circumflex]n(t).

Theorem 1

Under the assumptions given in Appendix A, we have


where 0 < τ < τL and τL = inf{t: L(t) = 1}.

The next theorem, which is proved in Appendix B, establishes the asymptotic normality of [lambda with circumflex]n(t).

Theorem 2

Under the assumptions given in Appendix B, we have


for any fixed 0< t < τL, where λ(k)(t) is the kth derivative of λ(t).

The asymptotic variance, σ2(t), in the above theorem is


where l(t) = dL(t)/dt. A consistent estimator of σ2(t), say σ^n2(t), can be obtained by replacing λ(t), L(t), π(t), m(t) and l(t) in (8) by estimators [lambda with circumflex]n(t), Ln(t), πn(t), mn(t) and ln(t), where ln(t) is the kernel density estimator ln(t)=(nhn)1i=1nK(tXihn). It is easy to see that the asymptotic variance reduces to that of the standard kernel hazard function estimator λn(t) in (1) if there are no missing censoring indicators, in which case π(t) = 1.

The previous theorem implies that the asymptotically optimal bandwidth for a fixed value of t, which minimizes the asymptotic mean squared error, is


This result also can be obtained by applying part (i) of the following theorem, which establishes the asymptotic MSE and MISE representations, as derived in Appendix C.

Theorem 3

Under the assumptions given in Appendices B and C:

  1. We have for any fixed 0 < t < τL:
  2. If ∫ λ(t)w(t)/[1 − L(t)] dt < ∞ and ∫ l(t)w(t)/[1 − L(t)]2 dt < ∞, we have:
    where w(t) is a weight function used to eliminate endpoint effects. A typical weight function is w(t) = 1 for t in some interval, say [τL, τU], and w(t) = 0 otherwise.

As a consequence of Theorem 3(ii), the asymptotically optimal bandwidth that minimizes the asymptotic mean integrated squared error is


Obviously, hopt depends on the unknowns σ2(t) and λ(k)(t). One simple and direct way to obtain an estimator of hopt, say ĥopt, is to substitute consistent estimators of σ2(t) and λ(k)(t), say σ^n2(t) and λ^n(k)(t), into (9). Clearly, ĥopt defines a consistent estimator of hopt, but we need another bandwidth selection procedure because ĥopt depends on bandwidths through σ^n2(t) and λ^n(k)(t). Further research is required to assess the effect of bandwidth selection on ĥopt and to investigate its asymptotic optimality in the context of ĥopt asymptotically minimizing the integrated weighted squared error. Theoretical research on this problem is beyond the scope of this paper, but the next section discusses one data-driven approach to selecting bandwidths. We conclude this section with the following remark.

Remark 1

Theorem 3 shows that the choice of bandwidths bn and γn does not affect the first-order term of the mean squared error, though it might affect higher order terms. Consequently, the selection of bn and γn is not critical to the estimator [gamma with circumflex]n(t), a result which is also verified in our simulation study. Thus, in the next section, we consider the selection of hn only.

4 Data-Driven Bandwidth Selection

Cross-validation techniques based on least squares regression have been applied to density estimation for censored data by Marron and Padgett (1987). These techniques were extended to hazard estimation by Sarda and Vieu (1991) and Patil (1993a) for uncensored data, and by Patil (1993b) and González-Manteiga et al. (1996) for censored data. We further extend the least squares cross-validation approach to the case of hazard estimation for censored data with missing censoring indicators.

In our situation, the least squares cross-validated bandwidth is the minimizer of


where [gamma with circumflex]n(t) is one of the estimators in (5) – (7) and λ^n(i)(t) is the “leave-one-out” version of that estimator. Here Qn(Xi, δi, ξi) denotes mn(Xi), ξiδi + (1 − ξi)mn(Xi), or ξiδi/πn(Xi)+[1 − ξi/πn(Xi)]mn(Xi), respectively, according to whether [lambda with circumflex]n(t) represents [lambda with circumflex]n,S(t), [lambda with circumflex]n,I(t), or [lambda with circumflex]n,W (t). Define the cross-validated bandwidth hopt,n to be the minimizer of the score function in (10). Similar to Patil (1993a), one can establish asymptotic optimality, in the sense that


where An external file that holds a picture, illustration, etc.
Object name is nihms250291ig1.jpg is a set of bandwidths satisfying certain regularity conditions and ISEw is the integrated weighted squared error: ISEw(h) = [[lambda with circumflex]n(t) − λ(t)]2w(t) dt.

5 Simulation Study

We conducted a simulation study to compare the finite sample properties of our estimators with those of the complete case estimator, say [lambda with circumflex]CC(t), and the standard estimator λn(t) in (1). Recall that [lambda with circumflex]CC(t) is obtained by applying formula (1) to the subset of the data for which we observe δ. In practice we cannot compute λn(t) if any of the censoring indicators are missing, but each δ is known in a simulation and thus we use λn(t) as a “gold standard” for our comparisons. We now report the results of our simulation, conducted using Fortran and R programs.

The Weibull distribution is very flexible and is often used to analyze lifetime data. Thus, we generated the failure time T and censoring time C from a Weibull distribution with shape parameter τ and scale parameter η, denoted by W(τ, η). Given T and C, we defined X = min(T, C) and δ = I(TC) for each subject. We fixed (τ, η) = (3, 1) for T, and we specified (τ, η) = (2, 1.96) for C to obtain a 20% censoring rate, (τ, η) = (2, 1.25) for C to obtain a 40% censoring rate, and (τ, η) = (2, 0.74) for C to obtain a 70% censoring rate. We used the logistic model π(x) = [1 + exp(−θ1θ2x)]−1 to classify some of the censoring indicators as missing. Given X = x, the missingness indicator ξ was set to 1 with probability π(x); otherwise ξ was set to 0 (and δ was treated as missing). We denoted π(x) by π1(x) or π2(x) when the corresponding average missingness rate was approximately 20% or 40%, respectively. When the censoring rate (CR) was 20%, we set (θ1, θ2) to (0.7, 0.87) for π1(x) and (0.32, 0.1) for π2(x). Similarly, for CR = 40%, we set (θ1, θ2) to (0.7, 0.98) for π1(x) and (0.33, 0.1) for π2(x); and for CR = 70%, we set (θ1, θ2) to (0.7, 1.28) for π1(x) and (0.33, 0.13) for π2(x). We generated 1000 samples of size n = 30, 60, and 120 for each choice of CR and π (x). We used the biweight kernel function K(u)=1516(1u2)2 if |u| ≤ 1 and K(u) = 0 otherwise, and the uniform kernel functions W(u)=Ω(u)=12 if |u| ≤ 1 and W(u) = Ω(u) = 0 otherwise. Finally, we took w(t) = 1 if t [set membership] [0, 2], and 0 otherwise.

First, we investigated how the bandwidth hn obtained via least squares cross-validation varied with bandwidths bn and λn. Note that [lambda with circumflex]n,W (t) depends on both bn and γn, whereas [lambda with circumflex]n,S(t) and [lambda with circumflex]n,I(t) depend only on bn, and [lambda with circumflex]CC(t) and λn(t) do not depend on either bn or γn. A cross-validation bandwidth hopt,n, the minimizer of CV (hn) in (10), was calculated for each pair (bn, γn) in a 50 × 50 grid on the (0, 1] × (0, 1] plane. For every combination of n, CR, and π(x), this was repeated for all estimators in each of the 1000 samples; then the corresponding 1000 values of hopt,n were averaged. Figure 1a shows the results for [lambda with circumflex]n,W (t) when n = 60, CR = 20%, and π(x) = π2(x). The average values of hopt,n ranged from 0.6616 to 0.6899, indicating that bn and γn had little effect on the optimal choice of bandwidth hn for [lambda with circumflex]n,W (t). Similarly, the optimal choice of hn for [lambda with circumflex]n,S(t) and [lambda with circumflex]n,I(t) did not vary much with bn. These results were consistent across different values of n, CR, and π(x).

Figure 1Figure 1
Figure 1a. CV-Optimal bandwidth hopt,n against bandwidths bn and γn. The hopt,n surface is the average hopt,n over 1000 replicate samples of size n = 60, with a censoring rate of CR = 20% and a missingness rate given by π2(x). These results ...

Alternatively, an optimal bandwidth hn can be obtained by minimizing the mean integrated squared error; refer to the left part of the equation in Theorem 3(ii) and notice that the true hazard function is known to be λ(t) = 3t2 in our study. We also investigated how this MISE-bandwidth hn varied with respect to (bn, γn), each pair of which came from a 50 × 50 grid on the (0, 1] × (0, 1] plane. For every combination of n, CR, and π(x), this was repeated for all estimators in each of the 1000 samples; then the corresponding 1000 values of hopt,n were averaged. Figure 1b shows the results for [lambda with circumflex]n,W (t) when n = 60, CR = 20%, and π(x) = π2(x). The average values of hopt,n ranged from 0.6642 to 0.6727, indicating that bn and γn had little effect on the optimal choice of bandwidth hn for [lambda with circumflex]n,W (t).

Second, we studied the effects of bn and γn on MISE. We set hn equal to the theoretically MISE-optimal bandwidth corresponding to (bn, γn) = (n−1/3, n−1/3). For each combination of n, CR, and π(x), we generated 1000 samples and calculated the MISE from 1000 simulated values of the five estimators for each pair (bn, γn) from a 50 × 50 grid on the (0, 1] × (0, 1] plane. Figure 2 shows the results for [lambda with circumflex]n,W (t) when n = 60, CR = 40%, π(x) = π2(x), and hMISE = 0.6704. The MISE values in Figure 2 ranged from 1.5763 to 1.8274, which shows that the choices of bn and γn were not critical for [lambda with circumflex]n,W (t). Similarly, the choice of bn in our simulation had little impact for [lambda with circumflex]n,S(t) and [lambda with circumflex]n,I(t). These results support Remark 1.

Figure 2
Mean integrated squared error as a function of bandwidths bn and γn. The MISE surface is calculated from 1000 simulated values of [lambda with circumflex]n,W (t), with n = 60, a censoring rate of CR = 20%, a missingness rate given by π2(x), ...

Next, Table 1 gives the MISE values for all estimators and every combination of n, CR, and π(x). We used (bn,γn)=(n13,n13), and we set hn equal to the MISE-optimal bandwidth. In most cases, regardless of the choice of n, CR, and π(x), the MISE values for the three proposed estimators were fairly similar to each other and were usually close to the MISE for the gold standard, λn(t). As expected, the MISE values decreased (i.e., performance improved) as the sample size increased and as the censoring percentage decreased. Except for λn(t), which always used the complete censoring information, the MISE values also decreased as the missingness rate decreased (as expected). The MISE values for λn(t) differed slightly from π1(x) to π2(x) simply as a result of random fluctuation in the data generated in the two sets of 1000 samples. In every situation, all three of our proposed estimators performed better than the complete case estimator, as illustrated by their smaller MISE values.

Table 1
Mean integrated squared error (MISE) by sample size (n), censoring rate (CR), and missingness rate (π) for five hazard estimators.

Finally, plots of the true hazard rate and average curves associated with all five estimators are presented in Figure 3 for samples of size n = 60. In each of the four subplots, which correspond to the four combinations of π(x) and the lower two censoring rates, the dotted line shows the true hazard rate λ(t) = 3t2. The other curves are time-specific averages of the 1000 hazard estimates corresponding to λn(t), [lambda with circumflex]n,S(t), [lambda with circumflex]n,I(t), [lambda with circumflex]n,W (t), and [lambda with circumflex]CC(t). Overall, the complete case estimator is the worst of the five estimators, as should be expected when a substantial amount of data is ignored. The proposed estimators lead to average curves that are nearly identical to each other and to the gold standard λn(t). It is very interesting that the so-called “gold standard” is not much better than the proposed estimators with missing cause-of-death. As pointed out by one of the referees, one reason could be the improvement attributable to the “presmoothing” already observed in the literature; see, e.g., Cao and Jácome (2004) and Cao et al. (2005). As the amount of censoring increases, the proposed estimators diverge from the true curve at the largest times, but they are not nearly as biased as the complete case estimator. Also, the average curves for the proposed estimators are not affected by an increase in the amount of missingness, but the bias of the complete case estimator increases appreciably in these situations.

Figure 3
True hazard rate and average curves for five hazard estimators. The true hazard function, λ(t) = 3t2, is shown by the wide-dot line. The other curves are time-specific averages of estimates based on 1000 replicate samples of size n = 60. We depict ...

6 Vascular Disease Application

This section illustrates our methods by applying them to some data from an animal experiment. These data were previously analyzed by Dinse (1986), who reported the survival time and disease status at death for 58 female mice. At necropsy, each mouse was examined for nonrenal vascular disease (NRVD). Survival was measured in days and NRVD status at death was classified as absent, incidental, unknown, or fatal. An occurrence of NRVD was considered incidental if it was present but not responsible for death and fatal if the mouse died as a direct or indirect result of its disease. In some cases, NRVD was found to be present, but its role in causing death was unknown.

We applied our methods to estimate the hazard function for death due to NRVD, say λ(t), among the subset of mice with the disease. Time to death (X) was known for all mice. We used the same kernel functions and cross-validation bandwidth selection method as in Section 5. Also, we used w(t) = I(X(1) < t < X(n)), where X(1) = min{X1, X2, ···, Xn} and X(n) = max{X1, X2, ···, Xn}. Of the 33 mice who died with NRVD present, 8 died from their disease (δ = 1), 19 died from other known causes (δ = 0), and 6 had an unknown cause of death (δ missing). Thus, 18% of the mice had a missing censoring indicator, and among mice with a known cause of death, 70% of the death times were censored with respect to the cause of interest. Figure 4 displays estimates of λ(t) based on the proposed kernel estimators and the complete case estimator. All four curves are bell shaped and peak between 750 and 800 days. The three proposed hazard estimates are nearly identical to each other, whereas the complete case estimate is smaller, which is consistent with the simulation results plotted in Figure 3.

Figure 4
Hazard estimates for death due to nonrenal vascular disease (NRVD) for 33 female mice with NRVD. We depict [lambda with circumflex]n,S(t), [lambda with circumflex]n,I(t), [lambda with circumflex]n,W (t), and [lambda with circumflex]CC(t) by dotted, dot-dash, dashed, and solid ...

7 Concluding Remarks

Theoretically, all three proposed estimators have the same asymptotic representation (see Lemma 1 in Appendix B) and hence they all have the same asymptotic normal distribution (see Theorem 2). This asymptotic equivalence is similar to that obtained by Cheng (1994) for the marginal average estimator and the imputation estimator in the nonparametric regression context, and to that obtained by Wang et al. (2004) for the marginal average estimator, the regression imputation estimator, and the marginal propensity score weighted estimator in the semiparametric regression context. Furthermore, it is shown that our estimators have the same asymptotic MSE and MISE representations (see Theorem 3). In small samples, however, simulation results show that [lambda with circumflex]n,S(t) and [lambda with circumflex]n,I(t) have slightly smaller MISE values than [lambda with circumflex]n,W (t). In addition, unlike [lambda with circumflex]n,W (t), neither [lambda with circumflex]n,S(t) nor [lambda with circumflex]n,I(t) depends on estimating π(x). Hence, for these reasons, one may prefer [lambda with circumflex]n,S(t) and [lambda with circumflex]n,I(t) to [lambda with circumflex]W (t).

On the other hand, the inverse probability weighted approach enjoys the so-called “double robustness” property (see Scharfstein et al., 1999). That is, if m(x) and π(x) are specified by parametric models m(x|θ) and π(x|β), respectively, the corresponding weighted estimator is consistent as long as one of the two models is specified correctly, where θ and β are finite dimensional parameters. This property implies that the weighted estimator is consistent if either m(x) or π(x) is estimated nonparametrically and the other is specified to be a known function, regardless of whether the specification is correct or not. However, the efficiency of the weighted estimator depends on the bias between the specified model and the true one; the larger the bias, the larger the loss of efficiency. The regression surrogate and imputation methods do not share this property.


This research was supported by National Science Fund for Distinguished Young Scholars, National Natural Science Foundation of China (10671198), National Science Fund for Creative Research Groups, a grant from the Research Grants Council of the Hong Kong, China (HKU 7050/06P), a grant from Key Lab of Random Complex Structures and Data Science, CAS and the Intramural Research Program of the NIH, National Institute of Environmental Health Sciences (Z01-ES-102685). The authors thank Shyamal Peddada and Norman Kaplan for their comments and suggestions.

Appendix A: Proof of Theorem 1

We begin by making the following assumptions:

(A.): m(·) and π(·) are uniformly continuous functions.

(A.): l(·) and λ(·) are continuous functions.

(A.K): K(·) is a probability density kernel function with bounded support and bounded variation.

(A.WΩ): W(·) and Ω(·) are bounded kernel functions with bounded support.

( hn → 0 and hn1n12loglogn0.

( bn → 0 and nbn → ∞.

(A.γn): γn → 0 and n → ∞.


First we prove that Theorem 1 is true for [lambda with circumflex]n,I(t). Similar arguments prove that Theorem 1 is also true for [lambda with circumflex]n,S(t) and [lambda with circumflex]n,W (t). Initially we write


As n[1 − Ln(Xi−)] = nRi + 1, the first term in (11) can be rewritten as


Based on the results of Wang (1999), by (A.), (A.K) and ( we have for the first term in (12):


The second term in (12) can be rewritten as


Define the following shorthand notation: Ln1(t)=1ni=1nI[Xit,δi=1],Ln1(t)=1ni=1nI[Xit,ξi=1],Ln11(t)=1ni=1nI[Xit,δi=1,ξi=1]L 1(t) = P(Xt, ξ = 1) and L11(t) = P(Xt, δ = 1, ξ = 1). The first term in (14) can be written


As dL1(s) = m(s)dL(s) by definition, and dL11(s) = m(s)dL1(s) under the MAR assumption, this same term also can be rewritten as


Denote by L*(·) the distribution or sub-distribution function L(·), L1(·), L1(·), or L11(·); and denote by Ln(·) the corresponding empirical function Ln(·), Ln1(·), Ln1(·), or Ln11(·). Using the law of the iterated logarithm for empirical distribution and sub-distribution functions: sup0s<Ln(s)L(s)=O(n12loglogn), a.s., integration by parts, and the bounded variation of K, it follows that


With respect to Tn12[2](t), similar to (16) it can be shown that sup0tτTn12[2](t)a.s.0. This together with (16) proves that sup0tτTn12(t)a.s.0, and thus, together with (13), we have sup0tτTn1(t)a.s.0. To prove Theorem 1, it remains to prove that sup0tτTn2(t)a.s.0. First, we rewrite Tn2(t) as


Then, by (A.), (A.), (A.WΩ) and (, we have


This proves Theorem 1 for [lambda with circumflex]n,I(t) and thus also for [lambda with circumflex]n,S(t). By similar arguments, together with the conditions listed in Appendix A and the fact that sup0sτπn(s)π(s)a.s.0, we can also prove that Theorem 1 is true for [lambda with circumflex]n,W (t).

Appendix B: Proof of Theorem 2

We assume that the following conditions are true:

(B.mlπλ): m(·), l(·), π(·), and λ(·) have bounded derivatives of order k > 1.

(B.K): K(·) is a continuous kernel function of order k > 1 with bounded support.

(B.WΩ): W(·) and Ω(·) are bounded kernel functions of order k > 1 with bounded support.

( nhn → ∞, nhn2k+1=O(1).

(B.hnbn): bn/hn → 0.

(B.hnγn): γn/hn → 0.

(B.hnbnγn): nbnγn/hn → ∞, nhnbn2k0,nhnγn2k0,hnbnγn120,hnbn12γn0.

Remark 2

Conditions (, (B.hnbn), (B.hnγn), and (B.hnbnγn) are clearly satisfied for k = 2 and hn=O(n15),bn=O(n13), and γn=O(n13).

Lemma 1

If [lambda with circumflex]n(t) denotes any one of the proposed estimators [lambda with circumflex]n,S(t), [lambda with circumflex]n,I(t), or [lambda with circumflex]n,W (t), then under the above assumptions, we have


where fn(t)=1hnK(tshn)dLn1(s).


(a) First we prove that Lemma 1 is true for [lambda with circumflex]n,S(t). We can write


The first term in (18) can be rewritten as


where Fn(t) is the Kaplan-Meier product-limit estimator (Kaplan and Meier, 1958). Based on the results of Diehl and Stute (1998), we have


Under conditions (B.mlπλ) and (B.K), the second term in (19) can be written


As n[1 − Ln(Xi−)] = nRi + 1 and supsLn(s)L(s)a.s.0, the second term in (18) can be written as


and under conditions (B.mlπλ) and (B.WΩ), the third term in (18) is


Under conditions (B.K), (B.WΩ), and (B.hnbn), the first term in (23) is


Under conditions (B.mlπλ), (B.K), and (B.WΩ), and following steps similar to those in the proof of equation (17) in Wang and Rao (2002), we can show that


Taken together, equations (23), (24), and (25) prove


and equations (18)(22) and (26) prove Lemma 1 for [lambda with circumflex]n,S(t).

(b) Next we prove that Lemma 1 is true for [lambda with circumflex]n,I(t). We start by noting that the first term in (12) equals the first term in (18):


Based on equation (14), conditions (B.K) and (B.mlπλ), and the MAR assumption, it is straightforward to prove


Under equations (12), (19), (20), (21), (27) and (28), it follows that


Similar to equation (26), we can show that


Together equations (11), (29), and (30) prove that Lemma 1 is true for [lambda with circumflex]n,I(t).

c) Finally, we prove that Lemma 1 is true for [lambda with circumflex]n,W (t). First note that


Under conditions (B.K), (B.mlπλ), (B.WΩ), (B.hnbn), and (B.hnbnγn), we have


The second term in (31) can be rewritten as


Using arguments similar to those used to derive (26), we can prove that


The third term in (31) can be rewritten as


Equations (31)(34) together prove that Lemma 1 is true for [lambda with circumflex]n,W (t).

Proof of Theorem 2

By the Lyapounov central limit theorem, we have








Under the MAR assumption, we can prove


Equations (35)(37), together with Lemma 1, prove that Theorem 2 is true.

Appendix C: Proof of Theorem 3

We start by making the following assumption:

(C.π): infx π(x) > 0.

First we prove that Theorem 3 is true for [lambda with circumflex]n,I(t). Similar proofs can be constructed for [lambda with circumflex]n,S(t) and [lambda with circumflex]n,W (t). Recall the definitions of Tn11(t), Tn12(t), and Tn2(t) in (11) and (12). Note that Tn11(t) = λn(t) − λ(t), where λn(t) is the standard kernel hazard function estimator given in (1). By Wang (1999), we have


Under the MAR assumption, we have


which leads to the following result:


For Tn2(t), we have




Under conditions (B.K), (B.WΩ) and (C.π), and the fact that bn/hn → 0, it follows that


Equations (11), (12), and (38)(41) together prove Theorem 3 for [lambda with circumflex]n,I(t). The proofs for [lambda with circumflex]n,S(t) and [lambda with circumflex]n,W (t) follow along similar lines.


  • Blum JR, Susarla V. Maximal deviation theory of density and failure rate function estimates based on censored data. In: Krishniah PR, editor. Multivariate analysis. New York: North-Holland; 1980. pp. 213–222.
  • Cao R, Jácome MA. Presmoothed kernel density estimator for censored data. Journal of Nonparametric Statistics. 2004;16:289–309.
  • Cao R, López-de-Ullibarri I, Janssen P, Veraverbeke N. Presmoothed Kaplan-Meier and Nelson-Aalen estimators. Journal of Nonparametric Statistics. 2005;17:31–56.
  • Cheng PE. Nonparametric estimation of mean functionals with data missing at random. Journal of the American Statistical Association. 1994;89:81–87.
  • Dewanji A. A note on a test for competing risks with missing failure type. Biometrika. 1992;79:855–857.
  • Diehl S, Stute W. Kernel density and hazard function estimation in the presence of censoring. Journal of Multivariate Analysis. 1988;25:299–310.
  • Dikta G. On semiparametric random censorship models. Journal of Statistical Planning and Inference. 1998;66:253–279.
  • Dinse GE. Nonparametric estimation for partially-complete time and type of failure data. Biometrics. 1982;38:417–431. [PubMed]
  • Dinse GE. Nonparametric prevalence and mortality estimators for animal experiments with incomplete cause-of-death data. Journal of the American Statistical Association. 1986;81:328–336.
  • Gao GZ, Tsaitis AA. Semiparametric estimators for the regression coefficients in the linear transformation competing risks model with missing cause of failing. Biometrika. 2005;92:875–891.
  • Goetghebeur EJ, Ryan LM. A modified log rank test for competing risks with missing failure type. Biometrika. 1990;77:207–211.
  • Goetghebeur EJ, Ryan LM. Analysis of competing risks survival data when some failure types are missing. Biometrika. 1995;82:821–833.
  • González-Manteiga W, Cao R, Marron JS. Bootstrap selection of the smoothing parameter in nonparametric hazard rate estimation. Journal of the American Statistical Association. 1996;91:1130–1140.
  • Jacome MA, Gijbels I, Cao R. Comparison of presmoothing methods in kernel density estimation under censoring. Computational Statistics. 2008;23:381–406.
  • Kalbfleisch JD, Prentice RL. The statistical analysis of failure time data. New York: John Wiley & Sons; 1980.
  • Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. Journal of the American Statistical Association. 1958;53:457–481.
  • Klein JP, Moeschberger ML. Survival Analysis. New York: Springer; 2003.
  • Lipsitz SR, Zhao LP, Molenberghs G. A semiparametric method of multiple imputation. Journal of the Royal Statistical Society, Series B. 1998;60:127–144.
  • Little RJA, Rubin DB. Statistical analysis with missing data. New York: John Wiley & Sons; 1987.
  • Lo S-H. Estimating a survival function with incomplete cause-of-death data. Journal of Multivariate Analysis. 1991;39:217–235.
  • Marron JS, Padgett JW. Asymptotically optimal bandwidth selection for kernel density estimators from randomly right-censored samples. The Annals of Statistics. 1987;15:1520–1535.
  • McKeague IW, Subramanian S. Product-limit estimators and Cox regression with missing censoring information. Scandinavian Journal of Statistics. 1998;25:589–601.
  • Nelson W. Theory and applications of hazard plotting for censored failure data. Technometrics. 1972;14:945–966.
  • Patil PN. Bandwidth choice for nonparametric hazard rate estimation. Journal of Statistical Planning and Inference. 1993a;35:15–30.
  • Patil PN. On the least squares cross-validation bandwidth in hazard rate estimation. The Annals of Statistics. 1993b;21:1792–1810.
  • Ranlau-Hansen H. Smoothing counting process intensities by means of kernel functions. The Annals of Statistics. 1983;11:453–466.
  • Regina YC, John VR. A histogram estimator of the hazard rate with censored data. The Annals of Statistics. 1985;13:592–605.
  • Robins JM, Rotnitzky A. Recovery of information and adjustment for dependent censoring using surrogate markers. In: Jewell N, Dietz K, Farewell V, editors. AIDS epidemiology - methodological issues. Boston: Birkhäuser; 1992. pp. 297–331.
  • Robins JM, Rotnitzky A, Zhao LP. Estimation of Regression Coefficients when Some Regressors Are Not Always Observed. Journal of the American Statistical Association. 1994;89:846–866.
  • Robins JM, Wang N. Inference for imputation estimators. Biometrika. 2000;87:113–124.
  • Rubin DB. Multiple imputation for nonresponse in surveys. New York: Wiley; 1987.
  • Sarda P, Vieu P. Smoothing parameter selection in hazard estimation. Statistics & Probability Letters. 1991;11:429–434.
  • Scharfstein DO, Rotnitzky A, Robins J. Adjusting for nonignorable drop-out using semiparametric nonresponse models (with discussion) Journal of the American Statistical Association. 1999;94:1096–1146.
  • Subramanian S. Asymptotically efficient estimation of a survival function in the missing censoring indicator model. Journal of Nonparametric Statistics. 2004;16:797–817.
  • Subramanian S. Survival analysis for the missing censoring indicator model using kernel density estimation techniques. Statistical Methodology. 2006;3:125–136. [PMC free article] [PubMed]
  • Tanner MA. A note on the variable kernel estimator of the hazard function from randomly censored data. The Annals of Statistics. 1983;11:994–998.
  • Tanner MA, Wong WH. The estimation of the hazard function from randomly censored data by the kernel method. The Annals of Statistics. 1983;11:989–993.
  • Tsiatis AA, Davidian M, McNeney B. Multiple imputation methods for testing treatment differences in survival distributions with missing cause of failure. Biometrika. 2002;89:238–244.
  • van der Laan MJ, McKeague IW. Efficient estimation from right-censored data when failure indicators are missing at random. The Annals of Statistics. 1998;26:164–182.
  • Wang QH. Some bounds for the error of an estimator of the hazard function with censored data. Statistics & Probability Letters. 1999;44:319–326.
  • Wang QH, Linton O, Härdle W. Semiparametric regression analysis with missing response at random. Journal of the American Statistical Association. 2004;99:334–345.
  • Wang QH, Rao JNK. Empirical likelihood-based inference under imputation for missing response data. The Annals of Statistics. 2002;30:896–924.
  • Zhao LP, Lipsitz SR, Lew D. Regression analysis with missing covariate data using estimating equations. Biometrics. 1996;52:1165–1182. [PubMed]