A common feature of survival data is the presence of right censored observations. Censoring can occur, for example, if individuals withdraw from a study before dying or if a study ends before all subjects have died. Additionally, when multiple causes of death are operating, the time to death from one cause can be censored by a death from a different cause. For instance, in a clinical trial one might distinguish between deaths attributable to the disease of interest and deaths due to all other causes. Without loss of generality, we focus on a particular cause of death and treat all other causes as censoring mechanisms with respect to the death time of interest.

Let *T* and *C*^{(0)} be random variables representing the time to death from the cause of interest and the time to usual (right) censoring, respectively. Let *T*^{(1)}, *T*^{(2)}, …, *T*^{(r)} be the times to death from all other causes. In our problem, *T* may be censored by any of *C*^{(0)}, *T*^{(1)}, …, *T*^{(r)}. Let *C* = min(*C*^{(0)}, *T*^{(1)}, …, *T*^{(r)}) denote the censoring random variable. We assume that *T* and *C* are independent, and we observe *X* = min(*T*, *C*) and *δ* = *I*(*T* ≤ *C*), where *I*(·) is the indicator function. Let *F*, *G*, and *L* be the cumulative distribution functions of *T*, *C*, and *X*, respectively. Finally, let *λ*(*t*) = lim_{ε→0+} *P*(*t* ≤ *T* < *t* + *ε* | *T* ≥ *t*)/*ε* be the hazard function of *T*.
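To make the observation scheme concrete, the following minimal sketch simulates pairs (*X*, *δ*) from latent times. The exponential distributions, the rate values, and the function name `simulate_observed` are illustrative assumptions and are not taken from the paper.

```python
import random


def simulate_observed(n, rate_t=1.0, rate_c0=0.3, other_rates=(0.2, 0.1), seed=0):
    """Simulate n pairs (X, delta) with X = min(T, C) and delta = I(T <= C).

    Assumption: all latent times are independent exponentials. The rates are
    arbitrary illustrative values.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        t = rng.expovariate(rate_t)            # death from the cause of interest
        c0 = rng.expovariate(rate_c0)          # usual right censoring, C^(0)
        others = [rng.expovariate(r) for r in other_rates]  # T^(1), ..., T^(r)
        c = min([c0] + others)                 # C = min(C^(0), T^(1), ..., T^(r))
        x = min(t, c)                          # observed time
        delta = 1 if t <= c else 0             # 1 if death was from the cause of interest
        data.append((x, delta))
    return data
```

With these default rates the probability that *δ* = 1 is 1.0/1.6 = 0.625, which gives a quick sanity check on simulated output.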

Censored survival time problems are frequently characterized in terms of hazard functions, and thus the estimation of *λ*(*t*) has received much attention. Suppose the data consist of *n* independent and identically distributed pairs {(*X*_{i}, *δ*_{i}): *i* = 1, …, *n*}. One type of nonparametric hazard estimation is based on kernel smoothers of the form

$$\lambda_n(t) = \frac{1}{h_n} \sum_{i=1}^{n} K\!\left(\frac{t - X_i}{h_n}\right) \frac{\delta_i}{n - R_i + 1}, \qquad (1)$$

where *R*_{i} is the rank of *X*_{i}, *K*(·) is a kernel function, and *h*_{n} is a sequence of bandwidths. Clearly, *λ*_{n}(*t*) is a convolution of the kernel function and the nonparametric cumulative hazard estimator of Nelson (1972). This class of estimators has been investigated by several authors, including Blum and Susarla (1980), Tanner (1983), Ramlau-Hansen (1983), Tanner and Wong (1983), Regina and John (1985), Diehl and Stute (1988), and Wang (1999).
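As an illustration of this class of estimators, the sketch below evaluates a kernel-smoothed hazard at a single point, assuming the familiar form *λ*_{n}(*t*) = *h*_{n}^{−1} Σ_{i} *K*((*t* − *X*_{i})/*h*_{n}) *δ*_{i}/(*n* − *R*_{i} + 1). The Epanechnikov kernel, the function names, and the assumption of no tied observation times are choices made here for illustration only.

```python
def epanechnikov(u):
    """Epanechnikov kernel on [-1, 1] (an illustrative choice of K)."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def kernel_hazard(t, xs, deltas, h, kernel=epanechnikov):
    """Kernel hazard estimate at t from observed times xs and indicators deltas.

    Each Nelson-type increment delta_i / (n - R_i + 1) is smoothed by the
    kernel, where R_i is the rank of X_i. Assumes no ties among xs.
    """
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])   # indices sorted by time
    total = 0.0
    for rank, i in enumerate(order, start=1):        # rank plays the role of R_i
        at_risk = n - rank + 1                       # subjects with X >= X_i
        total += kernel((t - xs[i]) / h) * deltas[i] / at_risk
    return total / h
```

For example, `kernel_hazard(2.5, [1, 2, 3, 4], [1, 0, 1, 1], h=1.0)` uses only the observations at 2 and 3, since the others fall outside the kernel window; the censored observation at 2 contributes nothing, so only the observation at 3 adds to the sum.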

This paper addresses the problem in which cause of death is unknown for a subset of individuals, and thus some of the censoring indicators are missing. For example, van der Laan and McKeague (1998) describe epidemiological studies in which death certificates were missing for some people, mainly due to emigration or inconclusive hospital case notes and autopsy results. They point out that it can be impossible to determine whether death was due to the cause of interest in these cases. Missing causes of death also arise in carcinogenicity experiments. In some studies only a subset of animals are examined for tumors to cut costs; occasionally tissues autolyze or are cannibalized by cage mates before a necropsy can be performed; and pathologists are not always able to determine each tumor’s role in causing death. For example, Kalbfleisch and Prentice (1980) provide data on mice that died from leukemia, other known (non-leukemia) causes, or unknown causes; and Dinse (1986) presents data on mice whose status of nonrenal vascular disease at death was classified as absent, incidental, fatal, or unknown. This last data set is analyzed in Section 6.

The general problem of analyzing censored survival data with missing cause-of-death data (or missing censoring indicators) has received much attention. Dinse (1982) derived the nonparametric maximum likelihood estimator of the survival function in this situation; see also the estimators of Dinse (1986), Lo (1991), McKeague and Subramanian (1998), van der Laan and McKeague (1998), and Subramanian (2004, 2006). Other authors have considered hypothesis testing and regression modeling. Goetghebeur and Ryan (1990) derived a modified log rank test to compare survival rates in two groups; Dewanji (1992) suggested an improvement to that approach; and Goetghebeur and Ryan (1995) extended their earlier results to the proportional hazards regression model. Tsiatis *et al.* (2002) used multiple imputation methods to evaluate treatment differences in survival. Recently, Gao and Tsiatis (2005) developed a semiparametric procedure to estimate regression coefficients in a linear transformation competing risks model.

Klein and Moeschberger (2003, Chapter 6) point out the importance of kernel estimation for hazard functions in the presence of censored data. In this paper, we concentrate on nonparametrically estimating the hazard function, *λ*(*t*), by extending well-known kernel smoothing methods to allow for missing data.

Suppose that *X* is always observed, but the censoring indicator *δ* is missing for some subjects. Define a missingness indicator *ξ* which is 1 if *δ* is observed and is 0 otherwise. Therefore, we observe either {*X*, *δ*, *ξ* = 1} or {*X*, *ξ* = 0}. Throughout this paper, we assume that *δ* is missing at random (MAR), which implies that *ξ* and *δ* are conditionally independent given *X*: *P*(*ξ* = 1 | *X*, *δ*) = *P*(*ξ* = 1 | *X*). The MAR assumption is common in statistical analyses involving missing data and is reasonable in many practical situations; see, for example, Little and Rubin (1987, Chapter 1).
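The MAR mechanism can be illustrated by a small simulation in which the probability of observing *δ* depends on *X* alone. The logistic form of *P*(*ξ* = 1 | *X*), the coefficient values, and the function name `mask_indicators` are assumptions made for this sketch; the paper does not specify a parametric missingness model.

```python
import math
import random


def mask_indicators(data, a=1.0, b=-0.5, seed=0):
    """Turn pairs (X, delta) into triples (X, delta or None, xi) under MAR.

    Assumption: P(xi = 1 | X = x) = 1 / (1 + exp(-(a + b * x))), a hypothetical
    logistic model. The draw of xi never looks at delta, so xi and delta are
    conditionally independent given X.
    """
    rng = random.Random(seed)
    out = []
    for x, delta in data:
        p_obs = 1.0 / (1.0 + math.exp(-(a + b * x)))  # depends on x only
        xi = 1 if rng.random() < p_obs else 0
        out.append((x, delta if xi == 1 else None, xi))  # hide delta when xi = 0
    return out
```

Setting `b=0` makes the observation probability constant, which corresponds to the missing completely at random special case.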

When some censoring indicators are missing, the hazard estimator in (1) cannot be applied directly. One simple solution is to use only the complete cases, {*X*, *δ*, *ξ* = 1}, and to ignore all subjects with missing indicators, {*X*, *ξ* = 0}. However, the resulting complete case (CC) estimator is highly inefficient if there is a significant degree of missingness; see, e.g., van der Laan and McKeague (1998). Also, the CC estimator is consistent and unbiased only when the censoring indicators are missing completely at random (MCAR), which is a special case of MAR where *ξ* is independent of both *X* and *δ*: *P*(*ξ* = 1 | *X*, *δ*) = *P*(*ξ* = 1); see, e.g., Tsiatis *et al.* (2002).

Imputation has become a popular method for handling missing data; see, for example, Rubin (1987), Lipsitz *et al.* (1998), Robins and Wang (2000), and Wang and Rao (2002). The popularity of this approach stems largely from the fact that once the missing values are imputed, standard techniques for analysing complete data can be readily applied. The inverse probability weighted procedure is also widely used in missing data situations; see, for example, Robins and Rotnitzky (1992), Robins *et al.* (1994), and Zhao *et al.* (1996). These two approaches are usually applied to regression problems with missing responses or covariates, but here we adapt them to handle missing censoring indicators.

This paper develops three kernel estimators for the hazard function: a regression surrogate estimator, an imputation estimator, and an inverse probability weighted estimator. The regression surrogate estimator is based on a particular expression for *λ*(*t*), and the imputation estimator and inverse probability weighted estimator are motivated by the regression surrogate estimator. All three estimators are of the form given in (1), except that some or all of the *δ*_{i} values are replaced by other quantities. The regression surrogate estimator replaces every *δ*_{i}, known or unknown, by an estimator of the conditional expectation of *δ*_{i} given *X*_{i}. The imputation estimator replaces only the unknown values of *δ*_{i} by estimators of their conditional expectations. The inverse probability weighted estimator replaces each unknown *δ*_{i} by its estimated conditional expectation and each known *δ*_{i} by a weighted sum of *δ*_{i} and its estimated conditional expectation. Existing nonparametric methods for estimating hazard functions assume complete censoring information, while nonparametric methods that allow missing censoring indicators focus on the simpler problem of estimating survival functions rather than hazard functions. The main contributions of this paper are the development of nonparametric approaches for estimating hazard functions with missing censoring indicators and the theoretical results derived for the proposed estimators.
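The three replacement schemes can be sketched as follows. This is a rough illustration rather than the estimators defined in Section 2: it assumes Nadaraya-Watson smoothers with a Gaussian kernel for both *m*(*x*) = *E*(*δ* | *X* = *x*), fitted on the complete cases, and *π*(*x*) = *P*(*ξ* = 1 | *X* = *x*), and it assumes the usual augmented weighting *δ*_{i}/*π̂*(*X*_{i}) + (1 − 1/*π̂*(*X*_{i})) *m̂*(*X*_{i}) for known indicators. All function names and the bandwidth `b` are hypothetical.

```python
import math


def gaussian(u):
    """Gaussian kernel, unnormalized (the constant cancels in the ratio below)."""
    return math.exp(-0.5 * u * u)


def nw_smooth(x0, xs, ys, b):
    """Nadaraya-Watson estimate of E[y | x = x0] with bandwidth b."""
    num = den = 0.0
    for x, y in zip(xs, ys):
        w = gaussian((x0 - x) / b)
        num += w * y
        den += w
    return num / den if den > 0.0 else 0.0


def surrogate_indicators(xs, deltas, xis, b):
    """Return (regression surrogate, imputation, IPW) substitutes for delta_i.

    deltas[i] may be None when xis[i] == 0. The smoothers and the IPW weights
    are illustrative assumptions, not the paper's definitions.
    """
    cc_x = [x for x, xi in zip(xs, xis) if xi == 1]       # complete-case times
    cc_d = [d for d, xi in zip(deltas, xis) if xi == 1]   # complete-case indicators
    reg, imp, ipw = [], [], []
    for x, d, xi in zip(xs, deltas, xis):
        m = nw_smooth(x, cc_x, cc_d, b)   # estimate of E[delta | X = x] under MAR
        reg.append(m)                     # surrogate: replace every delta_i
        if xi == 1:
            p = nw_smooth(x, xs, xis, b)  # estimate of P(xi = 1 | X = x), > 0 here
            imp.append(float(d))          # imputation: keep known delta_i
            ipw.append(d / p + (1.0 - 1.0 / p) * m)
        else:
            imp.append(m)                 # imputation: fill in unknown delta_i
            ipw.append(m)                 # IPW: unknown delta_i also becomes m
    return reg, imp, ipw
```

Each returned list can then be used in place of the *δ*_{i} values in a kernel estimator of the form (1). When no indicators are missing, the estimated observation probability equals 1 and both the imputation and inverse probability weighted substitutes reduce to the original *δ*_{i}.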

The paper is organized as follows. Section 2 defines three nonparametric hazard estimators. Section 3 shows that these estimators are uniformly strongly consistent and asymptotically normal under the MAR assumption, and derives asymptotic representations for the mean squared error (MSE) and mean integrated squared error (MISE). Section 4 gives a data-driven bandwidth selection procedure. Section 5 reports simulation results for evaluating finite sample performance; Section 6 illustrates our methods by applying them to data from an animal experiment; and Section 7 provides a few concluding remarks. Finally, the main results are proved in the Appendices.