Ann Appl Stat. Author manuscript; available in PMC Feb 23, 2011.
Published in final edited form as:
Ann Appl Stat. Sep 1, 2010; 4(3): 1602–1620.
doi:  10.1214/09-AOAS319
PMCID: PMC3043377
NIHMSID: NIHMS267885
BACKWARD ESTIMATION OF STOCHASTIC PROCESSES WITH FAILURE EVENTS AS TIME ORIGINS1
Kwun Chuen Gary Chan and Mei-Cheng Wang
Abstract

Stochastic processes often exhibit sudden systematic changes in pattern a short time before certain failure events. Examples include an increase in medical costs before death and a decrease in CD4 counts before AIDS diagnosis. To study such terminal behavior of stochastic processes, a natural and direct approach is to align the processes using failure events as time origins. This paper studies backward stochastic processes, which count time backward from failure events, and proposes one-sample nonparametric estimation of the mean of backward processes when follow-up is subject to left truncation and right censoring. We discuss the benefits of including prevalent cohort data, which enlarges the identifiable region, and establish large sample properties of the proposed estimator together with related extensions. A SEER–Medicare linked data set is used to illustrate the proposed methodologies.
Key words and phrases: Marked process, left truncation, prevalent cohort, recurrent event process, recurrent marker process, survival analysis
Stochastic processes such as recurrent events and repeated measurements are often collected in medical follow-up studies in addition to survival data. Examples include recurrent hospitalizations, medical cost processes, repeated quality of life measurements and CD4 counts. Such processes often exhibit characteristic terminal behavior during a short time before failure events. For example, medical costs tend to increase suddenly before death, quality of life deteriorates before death and CD4 counts decrease before AIDS diagnosis.
Conventional statistical methodologies mainly focus on stochastic processes counting forward from initial events that are observed for every individual; see Nelson (1988), Pepe and Cai (1993), Lawless and Nadeau (1995), Cook and Lawless (1997), Lin et al. (2000) and Wang, Qin and Chiang (2001), among others, on recurrent event processes, Lin (2000) on medical cost processes and Pawitan and Self (1993) on CD4 count processes. The conventional views of stochastic processes, however, are not designed to study the terminal behavior of processes. Consider medical cost as an example. Calculating the mean of cost processes for a population defined at an initial event would include both survivors and nonsurvivors at any fixed time after the initial event, and the increase in medical cost based on survivors’ cost measurements is offset by nonsurvivors, who do not contribute to the increase in medical cost after death. Unless the failure times are constant over a population, conventional forward processes do not serve the purpose of estimating the terminal behavior of stochastic processes.
In this paper we directly consider stochastic processes before failure events of interest by introducing backward processes, which start at failure events and count backward in time. By aligning the origins of the processes to failure events, the terminal behavior of stochastic processes can be naturally and directly studied through the backward processes. We will focus on one-sample nonparametric estimation of the mean of backward processes when the failure events are partially observed subject to left truncation and right censoring. Since failure events, and the processes right before failure events, may not be observed, statistical methods are needed to correct the bias induced by this missingness. The development of methods relies on a stochastic representation technique for a marked counting process generalizing that of Huang and Louis (1998), and the proposed estimator also generalizes a weighted estimator for left truncated and right censored data proposed by Gross and Lai (1996).
Throughout this paper we will consider medical costs as motivating examples. The SEER–Medicare linked data provide illustrative examples of medical cost process data collected in a left truncated and right censored follow-up sample. The Surveillance, Epidemiology and End Results (SEER)–Medicare linked data are population-based data for studying cancer epidemiology and the quality of cancer-related health services. The SEER–Medicare linked data consist of a linkage of two large population-based databases, SEER and Medicare. The SEER data contain information on cancer incidence diagnosed between 1973 and 2002. The Medicare data contain information on medical costs between 1986 and 2004. The linked data consist of cancer patients in the SEER data who were enrolled in Medicare during the study period of the Medicare data. Details of each data set and the linkage are discussed in Warren et al. (2002). Although the linkage criterion sounds simple, it creates a left truncated and right censored sample because the two data sets have different starting times. In the SEER–Medicare linked data, patients diagnosed with cancer before 1986 form a prevalent cohort, because only those patients who survived through 1986 were included. Patients diagnosed with cancer after 1986 form an incident cohort, because those patients were recruited at the onset of disease. Patients who survived through 2004 were considered censored. A prevalent cohort is typically a left truncated and right censored sample, and data from a combination of incident and prevalent cohorts are also left truncated and right censored.
This article is organized as follows. In Section 2 we will introduce backward processes to study the terminal behavior of stochastic processes and discuss the differences from conventional forward process models. The proposed methods for estimating the mean function of backward processes will be discussed in Section 3, together with identifiability problems associated with incomplete follow-up, large sample properties of the proposed estimators and a method for constructing confidence bands for mean functions. We will also discuss two related extensions of the proposed procedure in Section 4, one on distributional estimation of the backward processes and the other on estimation of derivatives of backward mean functions. Simulations and a real example analyzing a SEER–Medicare linked data set will be presented in Section 5. Section 6 will include several concluding remarks.
Let Y(t) be a stochastic process with bounded variation, where t is the time after an initial event, usually defined as the time of disease onset. We call {Y(t), t ≥ 0} a forward stochastic process, since the time index t in Y(t) starts at the initial event and moves forward with calendar time. On the other hand, a backward stochastic process is defined as V(u) = Y(T) − Y(T − u), 0 ≤ u ≤ T, where T is the time from the initial event to a failure event of interest, so that the time origin for V(u) is the failure event. In the medical cost example, Y(t) is the total medical cost within t time units after the initial event, and V(u) is the total medical cost during the last u time units of life. Figure 1 shows the trajectories of forward and backward cost processes for 3 uncensored individuals in the SEER–Medicare linked data.
FIG. 1. Trajectories of forward and backward cost processes for 3 uncensored individuals in the SEER–Medicare linked data. (a) Forward cost processes. Circles represent failure events. (b) Backward cost processes. Circles represent diagnoses of cancer.
In Figure 1 we can see an increase in medical cost a short period before death. To study this terminal behavior of medical cost processes, it is natural to align the processes to a different time origin, the failure event, as shown in Figure 1(b). Since the terminal behavior of stochastic processes usually occurs during a short time period before death, relevant scientific questions center on a rather short period τ0 before death, say, 6 months or 1 year; τ0 is a prespecified time period related to the scientific questions of interest. The backward stochastic process at τ0 time units before a failure event is only meaningfully defined for the subgroup of patients who survive at least τ0 time units, and the estimand of interest is E(V(u)|T ≥ τ0), for u ∈ [0, τ0]. However, due to limited study duration, only a conditional version μ_{τ0,τ1}(u) = E(V(u)|τ0 ≤ T < τ1) can be nonparametrically identified, where τ1 depends on the study design and data availability. τ1 can be taken as the maximum follow-up period, and the time period of interest τ0 is usually much shorter than τ1. We will further discuss the implications of incident and prevalent sampling on the identifiability constraints in Section 3.2.
To distinguish between processes with time origins at initial events and at failure events, throughout this paper t denotes a time index counting forward from initial events and u denotes a time index counting backward from failure events. The processes Y(t) and V(u) address different scientific questions and have different interpretations. Consider the medical cost example, where Y(t) measures medical cost from an initial event. Note that Y(t) does not increase after death, so that Y(t) = Y(T) for t ≥ T. The interpretation of the forward mean function E(Y(t)) is generally confounded with survival performance. For example, if there are two groups of patients with the same spending per unit time when alive but different survival distributions, the group with longer survival times will have a higher mean forward cumulative cost. There may also be crossovers between mean forward cost curves, because patients with severe disease tend to spend more near disease onset but die sooner than patients with less severe disease. We shall see such an example from the SEER–Medicare data set in Section 5.2. On the other hand, the time origin of a backward process V(u) is defined to be a failure event, and the backward mean function can be interpreted as the mean of stochastic processes before failure events. In the medical cost example, when a financial decision is the major concern (e.g., a decision made by an insurance company), discounted forward cost may be more relevant. The backward processes essentially answer different types of questions related to end-of-life cost, and there is currently considerable public health interest in comparing and evaluating palliative care. This work provides valid statistical methods for public health researchers interested in estimating end-of-life medical cost, together with other applications.
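As a concrete illustration of the definition V(u) = Y(T) − Y(T − u), the following minimal sketch (Python with numpy; the function name, data layout and the convention of excluding a cost incurred exactly at time T − u are choices of this sketch, not part of the paper) evaluates a backward cost process for one uncensored subject whose costs are recorded as (forward time, amount) pairs.

```python
import numpy as np

def backward_process(cost_times, cost_amounts, T, u_grid):
    """Evaluate V(u) = Y(T) - Y(T - u) on a grid of backward times u,
    given costs incurred at forward times cost_times (all <= T)."""
    cost_times = np.asarray(cost_times, dtype=float)
    cost_amounts = np.asarray(cost_amounts, dtype=float)

    def Y(t):
        # cumulative cost up to forward time t
        return cost_amounts[cost_times <= t].sum()

    return np.array([Y(T) - Y(T - u) for u in u_grid])

# toy example: costs at forward times 0.5, 2.0 and 2.8; death at T = 3
print(backward_process([0.5, 2.0, 2.8], [100.0, 250.0, 400.0], 3.0,
                       np.linspace(0.0, 1.0, 5)))
```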
3.1. Data structure
Let T be a failure time, C be a censoring time and W be a truncation time; (T, C, W) are defined relative to an initial event. The truncation time W is the time between the initial event and the time of recruitment. For incident cases, W = 0. For prevalent cases, W > 0 and survival data are observed only when T ≥ W, that is, the failure time is left truncated. Also, since censoring is only meaningfully defined for subjects who are eligible to be sampled, we assume that P(W ≤ C) = 1, as discussed in Wang (1991). Let X = min(T, C) and Δ = I(T ≤ C). In addition to observing the usual left truncated and right censored survival data (W, X, Δ), Y(t) is also observed from the time of recruitment to death or censoring. We assume an independent censoring and truncation condition in which {V(·), T} is independent of {W, C}. This assumption does not impose any dependence structure between the process V(·) and the failure time T. In fact, V(·) and T are allowed to be arbitrarily dependent under this assumption, which thus handles the case of informative failure events. The assumption is similar in nature to those imposed for nonparametric estimation of forward mean functions with informative terminal events; see, for example, Lawless and Nadeau (1995), Lin et al. (1997), Strawderman (2000) and Ghosh and Lin (2000). Let S(t) = P(T ≥ t) and G(t) = P(X ≥ t ≥ W | T ≥ W); by the independent censoring and truncation conditions, G(t) = S(t) · P(C ≥ t ≥ W)/β, where β = P(T ≥ W).
To estimate the mean of V(u) for u ∈ [0, τ0], we only need the following minimal data [Huang and Louis (1998)]:

{W_i, X_i, Δ_i, Δ_i V_i(u) for u ∈ [0, τ0]},   i = 1, …, n.
That is, in addition to the survival data, we only need backward process data to be available for individuals whose failure events are uncensored. For subjects in a prevalent cohort, backward process data may not be fully available for individuals who experience failure events within τ0 of recruitment. In this case, we may treat the recruitment time as τ0 after the actual recruitment date and take W + τ0 as a new truncation variable for the subjects in the prevalent cohort. This is equivalent to artificially truncating a small portion of the data, and it guarantees that V(u), u ∈ [0, τ0], is observable for all uncensored observations with T ≥ τ0 in the prevalent cohort.
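A minimal sketch of this adjustment, under the assumption that the observed data are stored as arrays (W, X, Δ); the function name and return convention are illustrative only:

```python
import numpy as np

def adjust_prevalent(W, X, tau0):
    """For prevalent cases (W > 0), replace the truncation time W by W + tau0
    and flag subjects who no longer satisfy the truncation condition X >= W;
    incident cases (W = 0) are left unchanged."""
    W, X = np.asarray(W, float), np.asarray(X, float)
    W_adj = np.where(W > 0, W + tau0, W)
    keep = X >= W_adj          # artificially truncate a small portion of the data
    return keep, W_adj

# toy usage with hypothetical values
keep, W_adj = adjust_prevalent(W=[0.0, 2.0, 5.0], X=[4.0, 2.4, 9.0], tau0=1.0)
print(keep, W_adj)            # [ True False  True] [0. 3. 6.]
```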
3.2. Identifiability
A backward stochastic process can be viewed as a marked process attached to a failure event. This is a generalization of the marked variables considered by Huang and Louis (1998), in which random variables are observed at failure events. Because of limited study duration, the marginal distribution of marked variables cannot be fully identified nonparametrically. The same applies to backward stochastic processes, because we do not have data on backward processes for subjects with survival time greater than τ1, the maximum support of the censoring time. In view of this identifiability problem, together with the fact that stochastic processes within a prespecified time period of interest τ0 before failure events are only meaningfully defined for the subgroup of individuals who survive at least τ0 time units, we confine ourselves to estimating a conditional version of the backward mean function, μ_{τ0,τ1}(u) = E(V(u)|τ0 ≤ T < τ1), for u ∈ [0, τ0]. If the maximum support of T is at most τ1, then E(V(u)|T ≥ τ0) can be estimated for u ∈ [0, τ0].
In an incident cohort, τ1 is usually the maximum follow-up duration, which is determined by study design. In a prevalent cohort, τ1 is the longest observation time, which is usually longer than the maximum follow-up time because subjects have already experienced the initial events before recruitment. An important implication of using prevalent cohort data is that it allows us to identify a larger portion, possibly all, of the right tail of the survival distribution. For example, Figure 2 shows the estimated survival probabilities for ovarian cancer patients in three different historic stages at diagnosis. For all three groups of patients, the full right tail of survival distributions can be identified when the full data set is considered, but not in the case when we only analyze the incident cohort. If we only analyze the incident cohort data, τ1 is 18 years, which is the maximum follow-up period for the incident cohort in the data set. When we include prevalent cohort data in the analysis, we can estimate E(V (u)|T ≥ τ0) nonparametrically.
FIG. 2. Estimates of survival probabilities for ovarian cancer patients in the SEER–Medicare data, using only incident cohort data (bold) and using data from both incident and prevalent cohorts (nonbold). Solid curves represent localized stage at diagnosis, …
3.3. Proposed estimator
We propose an estimator for the backward mean function μ_{τ0,τ1}(u) by using marked counting process arguments extending those of Huang and Louis (1998). Let N_i(t) = I(X_i ≤ t, Δ_i = 1), i = 1, …, n, be counting processes for observed failures, R_i(t) = I(X_i ≥ t ≥ W_i) be at-risk indicators, and

N_i^V(t, u) = V_i(u) I(X_i ≤ t, Δ_i = 1)

be marked counting processes for observed failures with random marker V_i(u). Define the averaged processes N̄^V(t, u) = n^{−1} Σ_{i=1}^n N_i^V(t, u) and R̄(t) = n^{−1} Σ_{i=1}^n R_i(t). Furthermore, let Λ_T(s) be the cumulative hazard function for T and Λ^V(t, u) = ∫_0^t E{V(u) | T = s} dΛ_T(s). Λ^V(t, u) can be interpreted as a hazard-weighted cumulative mean of backward processes, which is called the cumulative mark-specific hazard function in Huang and Louis (1998).
Note that

E{V(u) I(τ0 ≤ T < τ1)} = ∫_{τ0}^{τ1} E{V(u) | T = s} S(s) dΛ_T(s) = ∫_{τ0}^{τ1} S(s) dΛ^V(s, u).

If we have an estimate of Λ^V(s, u), denoted by Λ̂^V(s, u), then we can estimate E{V(u) I(τ0 ≤ T < τ1)} by ∫_{τ0}^{τ1} Ŝ(s) dΛ̂^V(s, u), where Ŝ(t) is the product limit estimate using left truncated and right censored data [Tsai, Jewell and Wang (1987), Lai and Ying (1991)]. Since
E{dN̄^V(s, u)} = G(s) dΛ^V(s, u) = E{R̄(s)} dΛ^V(s, u),

where the expectation is taken conditioning on T ≥ W, Λ^V(t, u) can be estimated by

Λ̂^V(t, u) = ∫_0^t dN̄^V(s, u)/R̄(s).   (3.1)
The backward mean function μ_{τ0,τ1}(u) can then be estimated by

μ̂_{τ0,τ1}(u) = ∫_{τ0}^{τ1} Ŝ(s) dΛ̂^V(s, u) / {Ŝ(τ0) − Ŝ(τ1)}.   (3.2)
More generally, for u ≤ t1 < t2 ≤ τ1 and u ∈ [0, τ0], the function μ_{t1,t2}(u) = E(V(u)|t1 ≤ T < t2) can be estimated by

μ̂_{t1,t2}(u) = ∫_{t1}^{t2} Ŝ(s) dΛ̂^V(s, u) / {Ŝ(t1) − Ŝ(t2)}.
The mean of V(u) can be estimated as long as T > u. However, if we estimate E(V(u)|u ≤ T < τ1), the subpopulation defined by the conditioning changes with the time index u, and the estimand loses the desirable interpretation of being a mean process for a fixed underlying population. Although conditioning on the fixed constant τ0 may discard information from part of the data, it defines a meaningful subpopulation so that the whole backward mean function can be studied for the same underlying population.
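The estimator (3.1)–(3.2) can be computed directly from the observed data. The following is a minimal numerical sketch, assuming the reconstructed formulas above; the function names, the use of the left limit Ŝ(X_i−) through a strict inequality in the product-limit estimate, and the brute-force loops are choices of this sketch rather than prescriptions of the paper.

```python
import numpy as np

def risk_set(s, W, X):
    """Rbar(s): proportion of subjects under observation at time s (W_i <= s <= X_i)."""
    return np.mean((W <= s) & (s <= X))

def surv_pl(t, W, X, delta):
    """Product-limit estimate of S(t) = P(T >= t) under left truncation and right
    censoring [Tsai, Jewell and Wang (1987)]: product over uncensored failure
    times s < t of {1 - d(s) / (n * Rbar(s))}."""
    W, X, delta = np.asarray(W, float), np.asarray(X, float), np.asarray(delta, int)
    n = len(X)
    S = 1.0
    for s in np.unique(X[(delta == 1) & (X < t)]):
        d = np.sum((X == s) & (delta == 1))
        S *= 1.0 - d / (n * risk_set(s, W, X))
    return S

def backward_mean(V_at_u, W, X, delta, t1, t2):
    """mu-hat_{t1,t2}(u) of (3.2): weighted sum of V_i(u) over uncensored failures
    in [t1, t2), with weights S-hat(X_i-) / {n Rbar(X_i)}, divided by the
    estimated probability S-hat(t1) - S-hat(t2).  V_at_u[i] holds V_i(u) at one
    fixed backward time u and is only used when delta[i] = 1."""
    W, X = np.asarray(W, float), np.asarray(X, float)
    delta, V_at_u = np.asarray(delta, int), np.asarray(V_at_u, float)
    n = len(X)
    num = sum(surv_pl(X[i], W, X, delta) * V_at_u[i] / (n * risk_set(X[i], W, X))
              for i in range(n) if delta[i] == 1 and t1 <= X[i] < t2)
    return num / (surv_pl(t1, W, X, delta) - surv_pl(t2, W, X, delta))
```

Evaluating backward_mean over a grid of u values (with V_at_u recomputed at each u) traces out the estimated backward mean function μ̂_{t1,t2}(·).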
The following theorem states the large sample properties of the proposed estimator.
THEOREM 3.1. Assume that E{V²(τ0)} < ∞ and that certain technical restrictions on the support of (T, C, W) hold [Wang (1991)]. For τ0 ≤ t1 < t2 ≤ τ1, μ̂_{t1,t2}(u) → μ_{t1,t2}(u) uniformly a.s. on [0, τ0]. Also,

n^{1/2}{μ̂_{t1,t2}(u) − μ_{t1,t2}(u)} = n^{−1/2} Σ_{i=1}^n ξ_i(u) + o_p(1),

where ξ_i(u) is defined in the Appendix, and the random sequence converges weakly to a Gaussian process with covariance function

C_{t1,t2}(u, v) = E{ξ_1(u) ξ_1(v)}.   (3.3)
From (3.3), C_{t1,t2}(u, v) can be consistently estimated by

Ĉ_{t1,t2}(u, v) = n^{−1} Σ_{i=1}^n ξ̂_i(u) ξ̂_i(v),   (3.4)

where ξ̂_i(u) denotes ξ_i(u) with the unknown quantities replaced by their sample estimates.
3.4. Construction of confidence bands
From the large sample results, we can construct pointwise confidence intervals of the form μ̂_{t1,t2}(u) ± n^{−1/2} z σ̂_{t1,t2}(u), where z is the standard normal critical value and σ̂_{t1,t2}(u) = {Ĉ_{t1,t2}(u, u)}^{1/2}. Since we are estimating mean functions of processes, it is also of interest to construct confidence bands for a given level of significance. We will replace z in the pointwise confidence interval by a larger value b to reach an appropriate simultaneous coverage probability. Although n^{1/2}{μ̂_{t1,t2}(u) − μ_{t1,t2}(u)} converges to a Gaussian process, the limiting process may not have independent increments because V(u) is an arbitrary process. Thus, it may not be possible to compute the exact asymptotic distribution. To construct confidence bands, we approximate the limiting process by a multiplier bootstrap method described as follows:
  • Generate random multipliers {G_i, i = 1, …, n} that are independent standard normal variables and independent of the data. Then compute

    W(u) = n^{−1/2} Σ_{i=1}^n G_i ξ̂_i(u).

  • Repeat step 1 until m versions of W(u) are obtained, denoted by {W_k(u), k = 1, …, m}.
  • Obtain b, the (100 − α)th percentile of max_{u ∈ [0, τ0]} |W_k(u)| over the m replications.
  • The confidence band for μ_{t1,t2}(u), u ∈ [0, τ0], can be calculated as μ̂_{t1,t2}(u) ± n^{−1/2} b σ̂_{t1,t2}(u).
The above method uses the simulated samples W(u) to approximate the distribution of the influence function of μ̂_{t1,t2}(u). The method is motivated by the construction of confidence bands for the survival function in a proportional hazards model proposed by Lin, Fleming and Wei (1994). By the permanence of the Donsker property [van der Vaart and Wellner (1996)], W(u) can be shown to converge to a Gaussian process. Also, conditional on the observed data, E{W(u)W(v)} equals the right-hand side of (3.4), since E(G_iG_j) = 0 for i ≠ j and E(G_i²) = 1. Hence, E{W(u)W(v)} converges almost surely to C_{t1,t2}(u, v), and W(u) has the same asymptotic distribution as n^{1/2}{μ̂_{t1,t2}(u) − μ_{t1,t2}(u)}.
In the medical cost example, μ_{t1,t2}(u) is nonnegative and it is more meaningful to construct confidence intervals or confidence bands that are always nonnegative. For this purpose, we consider a log transformation, and the confidence bands have the form μ̂_{t1,t2}(u) exp[±n^{−1/2} b* σ̂_{t1,t2}(u)/μ̂_{t1,t2}(u)]. The value b* can be found by the above algorithm with a slight modification in step 3, where b* is the (100 − α)th percentile of max_{u ∈ [0, τ0]} {|W_k(u)|/σ̂_{t1,t2}(u)}. The reason is that, by the functional delta method, n^{1/2}{μ_{t1,t2}(u)/σ_{t1,t2}(u)}{log μ̂_{t1,t2}(u) − log μ_{t1,t2}(u)} is asymptotically equivalent to {μ_{t1,t2}(u)/σ_{t1,t2}(u)} × {1/μ_{t1,t2}(u)} × ξ(u) = ξ(u)/σ_{t1,t2}(u), whose distribution is approximated by W(u)/σ̂_{t1,t2}(u).
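A minimal sketch of the band construction, assuming the estimated influence functions ξ̂_i(u) (whose explicit form is derived in the Appendix) have already been evaluated on a grid of u values; the function name and argument layout are illustrative:

```python
import numpy as np

def multiplier_band(mu_hat, xi_hat, alpha=0.05, m=1000, seed=None):
    """Simultaneous 1 - alpha confidence band by the multiplier bootstrap above.
    mu_hat : (p,) estimated backward mean on a grid of u values
    xi_hat : (n, p) estimated influence functions xi-hat_i(u) on the same grid
    Returns (lower, upper) arrays of length p."""
    rng = np.random.default_rng(seed)
    n, p = xi_hat.shape
    sigma_hat = np.sqrt(np.mean(xi_hat ** 2, axis=0))   # sqrt of C-hat(u, u)
    sup_stats = np.empty(m)
    for k in range(m):
        G = rng.standard_normal(n)                       # multipliers G_1, ..., G_n
        W_k = xi_hat.T @ G / np.sqrt(n)                  # W_k(u) on the grid
        sup_stats[k] = np.max(np.abs(W_k))
    b = np.quantile(sup_stats, 1.0 - alpha)              # the (1 - alpha) empirical quantile
    half_width = b * sigma_hat / np.sqrt(n)
    return mu_hat - half_width, mu_hat + half_width
```

The log-transformed band of the preceding paragraph is obtained by replacing |W_k(u)| with |W_k(u)|/σ̂(u) when taking the maximum and forming μ̂ exp{±n^{−1/2} b* σ̂/μ̂} instead.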
4.1. Distribution and percentile estimation
In many applications, including medical cost, the distribution of V(u) is not symmetric and is often highly skewed. In these cases, apart from estimating the mean function, one might also be interested in estimating percentile and distribution functions of V(u). Here we extend the marked process approach for mean estimation to estimate a joint distribution function, p_{τ0,τ1}(m, t, u) = P{V(u) ≤ m, T ≤ t | τ0 ≤ T < τ1}, where τ0 ≤ t < τ1. Let
N_i^{V,m}(t, u) = I{V_i(u) ≤ m} I(X_i ≤ t, Δ_i = 1)

and Λ̂^{V,m}(t, u) = ∫_0^t dN̄^{V,m}(s, u)/R̄(s), where N̄^{V,m}(t, u) = n^{−1} Σ_{i=1}^n N_i^{V,m}(t, u). Following the arguments in Section 3, p_{τ0,τ1}(m, t, u) can be estimated by

p̂_{τ0,τ1}(m, t, u) = ∫_{τ0}^{t} Ŝ(s) dΛ̂^{V,m}(s, u) / {Ŝ(τ0) − Ŝ(τ1)}.
This joint distribution estimate can be used for correlation analysis between V(u) and T. Similar to the estimator of the mean function, p̂_{τ0,τ1}(m, t, u) is a consistent estimate of p_{τ0,τ1}(m, t, u).
Next, we consider estimating a pointwise qth-percentile function m_q(u), which is defined by

m_q(u) = inf{m : P(V(u) ≤ m | τ0 ≤ T < τ1) ≥ q}

for u ∈ [0, τ0] and 0 < q < 1. To estimate m_q(u), consider the estimating function

φ_q(m, u) = p̂_{τ0,τ1}(m, τ1, u) − q.   (4.1)
It can be seen that φ_q(m_0, u) converges in probability to 0 for m_0 = m_q(u). Thus, a natural estimator of m_q(u) is the zero-crossing of φ_q(m, u). The existence of a solution is guaranteed because φ_q(m, u) is increasing in m, with lim_{m→−∞} φ_q(m, u) < 0 and lim_{m→∞} φ_q(m, u) > 0 for 0 < q < 1. The estimation of m_q(u) can be easily implemented in common statistical software by noting from (4.1) that this quantity can be estimated by a weighted empirical percentile of V(u) with weights equal to Ŝ(X_i) Δ_i I(τ0 ≤ X_i < τ1)/R̄(X_i).
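A minimal sketch of that weighted empirical percentile (the helper below takes the weights as given; the weights themselves can be formed from the product-limit and risk-set quantities of Section 3.3, as indicated in the comment):

```python
import numpy as np

def weighted_percentile(values, weights, q):
    """q-th weighted empirical percentile: the smallest value whose cumulative
    share of the total weight reaches q (0 < q < 1)."""
    values, weights = np.asarray(values, float), np.asarray(weights, float)
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cdf = np.cumsum(w) / np.sum(w)
    return v[np.searchsorted(cdf, q)]

# weights from Section 4.1 for uncensored subjects with tau0 <= X_i < tau1, e.g.
# w_i = surv_pl(X[i], W, X, delta) * delta[i] * (tau0 <= X[i] < tau1) / risk_set(X[i], W, X)
# (reusing surv_pl and risk_set from the sketch in Section 3.3); m_q(u) is then
# weighted_percentile(V_at_u, w, q) over the subjects with positive weight.
```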
4.2. Estimation of backward rate function
When the mean rate of change of stochastic processes before failure events is of scientific interest, one might want to estimate the associated quantity r(u) = dE{V(u) | T ≥ τ0}/du, the derivative of the backward mean function. In the medical cost example, r(u) is the mean rate of cost accrual per unit time at u time units before a failure event; r(u) is a measure of the instantaneous change in the backward stochastic process. We call r(u) the backward rate function.
Like the estimation of backward mean functions, we can only estimate nonparametrically a conditional version r_{τ0,τ1}(u) = dμ_{τ0,τ1}(u)/du. Similar to μ_{τ0,τ1}(u), we have the following relationship:

r_{τ0,τ1}(u) = ∫_{τ0}^{τ1} E{v(u) | T = s} S(s) dΛ_T(s) / {S(τ0) − S(τ1)},   (4.2)

where v(u) = dV(u)/du.
To estimate r_{τ0,τ1}(u), it suffices to estimate E{v(u) | T = s} at the jump points of Λ̂_T, which are the uncensored survival times. For each uncensored individual, v_i(u) can be estimated by

v̂_i(u) = ∫ h^{−1} k((u − s)/h) dV_i(s),

where k(·) is a kernel function satisfying ∫ k(s) ds = 1 and h > 0 is a bandwidth parameter, which can be chosen by minimizing an integrated mean squared error. The estimator v̂_i(u) is similar in nature to the estimator of the subject-specific rate of recurrent events proposed by Wang and Chiang (2002). Substituting the unknown quantities in (4.2) by their estimates, r_{τ0,τ1}(u) can be estimated by

r̂_{τ0,τ1}(u) = Σ_{i=1}^n Ŝ(X_i) Δ_i I(τ0 ≤ X_i < τ1) v̂_i(u)/R̄(X_i) / [n{Ŝ(τ0) − Ŝ(τ1)}].   (4.3)
The estimator (4.3) can also be derived as the convolution smoothing estimator of μ̂_{τ0,τ1}. It can be shown that

r̂_{τ0,τ1}(u) = ∫ h^{−1} k((u − s)/h) dμ̂_{τ0,τ1}(s).
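A minimal numerical sketch of the convolution-smoothing representation above, assuming μ̂_{τ0,τ1} has been evaluated on a fine grid of backward times; the Epanechnikov kernel and the Riemann-sum approximation of the integral are choices of this sketch:

```python
import numpy as np

def backward_rate(u_grid, mu_hat, u_eval, h):
    """Approximate r-hat(u) = integral of k_h(u - s) d mu-hat(s) by summing
    kernel weights against the increments of mu-hat over a grid.
    u_grid, mu_hat : increasing grid of backward times and mu-hat on that grid
    u_eval         : points at which the backward rate is evaluated
    h              : bandwidth"""
    u_grid, mu_hat = np.asarray(u_grid, float), np.asarray(mu_hat, float)
    k = lambda x: 0.75 * (1.0 - x ** 2) * (np.abs(x) <= 1.0)   # Epanechnikov kernel
    d_mu = np.diff(mu_hat)                                      # increments of mu-hat
    mid = 0.5 * (u_grid[1:] + u_grid[:-1])                      # midpoints of grid cells
    return np.array([np.sum(k((u - mid) / h) / h * d_mu) for u in u_eval])
```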
5.1. Simulations
Finite sample performance of the proposed estimator in Section 3 and the empirical coverage of pointwise confidence intervals and overall confidence bands are evaluated by simulations. Data are generated 2000 times in each simulation, and each simulated data set consists of 100 or 400 observations. The confidence bands are constructed by simulating 1000 sets of random multipliers in each simulated data set.
The simulation follows a data structure similar to the SEER–Medicare linked data. We generated the survival time T from a gamma distribution with shape 3 and rate 1; the truncation time W is 0 with probability one half and otherwise generated from a uniform(0, 20) distribution; and the censoring time is C = W + C′, where C′ is generated from a uniform(0, 8) distribution. The subset with truncation time W = 0 represents an incident cohort, and W > 0 a prevalent cohort, with untruncated observations satisfying T ≥ W. Conditioning on T, we generated two independent latent variables Z1 and Z2 from a gamma distribution with shape 3 and rate T. The latent variables are used to induce correlation between the survival time and the stochastic processes. For each subject, we generate a recurrent event process P(·) from a Poisson process with rate 4Z1, and at each occurrence of a recurrent event at u time units before the failure event, a variable Q(u) is generated from a gamma distribution with shape Z2 × {3 + 3 × I(u < 1/3)} and rate 1. The process of interest is V(u) = ∫_0^u Q(s) dP(s), the total of the Q values attached to recurrent events occurring within the last u time units of life. The generated data have the same structure as medical cost data, where P represents the counting process for recurrent hospitalizations, Q represents the medical cost incurred at a particular hospitalization and V(u) is the total medical cost in the last u time units of life. The recurrent event process, medical cost process and failure time are correlated through the latent variables; that is, medical cost processes are terminated by informative failure events. Our simulations generated a negative correlation between end-of-life cost and failure time, which also matches the SEER–Medicare linked data (see Section 5.2). Under this setting, we are interested in estimating E(V(u)|1 ≤ T < 20) for u ∈ [0, 1].
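A sketch of one way to generate a subject from this design (numpy's gamma parametrization uses shape and scale, so rate r corresponds to scale 1/r; placing the recurrent events uniformly given their Poisson count, and rejection sampling to impose the truncation condition, are assumptions of this sketch):

```python
import numpy as np

def simulate_subject(rng):
    """One observed subject from the simulation design described above."""
    while True:
        T = rng.gamma(3.0, 1.0)                          # survival time: shape 3, rate 1
        W = 0.0 if rng.random() < 0.5 else rng.uniform(0.0, 20.0)
        if T >= W:                                       # keep only untruncated subjects
            break
    C = W + rng.uniform(0.0, 8.0)                        # censoring time C = W + C'
    X, delta = min(T, C), int(T <= C)
    Z1, Z2 = rng.gamma(3.0, 1.0 / T, size=2)             # latent variables, rate T
    n_events = rng.poisson(4.0 * Z1 * T)                 # Poisson process with rate 4*Z1
    u_events = rng.uniform(0.0, T, size=n_events)        # backward times of the events
    shapes = Z2 * (3.0 + 3.0 * (u_events < 1.0 / 3.0))
    Q = rng.gamma(shapes, 1.0)                           # cost generated at each event
    V = lambda u: Q[u_events <= u].sum()                 # V(u): cost in last u units of life
    return W, X, delta, V

rng = np.random.default_rng(0)
sample = [simulate_subject(rng) for _ in range(100)]
```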
We compare the proposed estimator with naive complete-case estimators that have been used in the medical literature. Suppose one uses an unweighted sample mean based on observed deaths for the analysis; the direction of bias for this naive analysis will depend on whether longer survivors or shorter survivors are being oversampled. In an incident cohort, the naive analysis will generally oversample shorter survivors, because the naive data set is right truncated by discarding the right censored observations, so the estimated mean end-of-life cost will be biased upward in the simulation. In a prevalent cohort, the naive sample is subject to double truncation, but the effect of left truncation is more serious in the simulation and longer survivors are generally oversampled, so the estimated mean end-of-life cost will be biased downward. The simulation results are shown in Table 1 and match this reasoning. The proposed method can correct the bias caused by left truncation and right censoring. The unweighted complete-case estimator has been used, for example, in Chan et al. (1995) for studying the frequency of opportunistic infections among HIV-infected individuals before death.
TABLE 1. Summary of the simulation study: Comparisons among the proposed estimator and naive estimators based on unweighted complete case analysis from incident and prevalent cohorts, and evaluation of variance estimates and pointwise coverage probabilities of …
The small sample bias of the proposed estimator and an evaluation of the variance estimator are also shown in Table 1. We can see that the proposed estimator works well at practical sample sizes. We also studied the empirical coverage of the 95% confidence bands. Let t* = min{u : V_i(u) > 0 for some i}. Since μ_{t1,t2}(u) = 0 for u < t*, it is only meaningful to consider coverage probabilities for u ≥ t*. We considered the coverage of the confidence band for u ∈ [t*, 1]. The empirical coverage of the 95% confidence bands is 94% for both n = 100 and n = 400, which is close to the nominal value for practical sample sizes.
5.2. Data analysis
The proposed methods are illustrated by analyzing the SEER–Medicare linked data. We investigated end-of-life cost for ovarian cancer cases diagnosed at age 65 or older among Medicare enrollees. The total amount charged during hospitalization is considered as medical cost in the analysis; this includes charges not covered by Medicare. All medical expenditures are adjusted to January 2000 value by the medical care component of the consumer price index, available from the website of the U.S. Department of Labor (http://www.bls.gov/cpi/). We compare medical cost among individuals with different historic stages at diagnosis. There were 3766, 1400 and 15,104 subjects classified as localized, regional and distant stage at diagnosis, respectively. The estimates of the survival probabilities are shown in Figure 2.
First, we compare the estimated mean forward cost functions among the three historic stages. A mean forward cost function is estimated by

Ê{Y(t)} = ∫_0^t Ŝ(s) dȲ*(s),   where   dȲ*(s) = Σ_{i=1}^n R_i(s) dY_i(s) / Σ_{i=1}^n R_i(s),
which can be viewed as a limiting case of the estimator of Lin et al. (1997) with the partition size tending to zero. If left truncation is absent, the methods proposed by Bang and Tsiatis (2000), Strawderman (2000) and Zhao and Tian (2001) can also be extended to estimate forward mean functions. Figure 3 shows the estimates of the mean forward cost functions up to the thirtieth year after the initial diagnosis of cancer. Note that there is a crossover of the three curves around ten years after diagnosis. The ten-year estimated survival probabilities are 0.47, 0.25 and 0.07 (s.e.: 0.03, 0.06, 0.05) for patients diagnosed at localized, regional and distant stages, respectively. In the first ten years after diagnosis, cumulative cost reflects the severity of the cancer stage at diagnosis. Beyond the tenth year, the cumulative cost reflects the better chance of survival for the less severe stages of cancer. The conflicting nature of cost accumulation and survival complicates the analysis, and careful interpretation of the results is needed. Also, the forward cost functions cannot directly answer questions about end-of-life cost, because individuals have different survival times and the increase in medical cost before failure events at a given time after disease onset is offset by nonsurvivors, who do not contribute to any increase in medical cost after death.
FIG. 3. Estimates of the mean forward cost functions for ovarian cancer patients. Solid curve represents localized stage at diagnosis, dashed curve represents regional stage and dotted curve represents distant stage.
In the SEER–Medicare data analysis we observe a negative correlation between end-of-life cost and survival. Using the estimator of the joint distribution in Section 4.1, the estimated Pearson correlation coefficient between V(1) and T (conditional on T > τ0 = 1) is −0.65, −0.31 and −0.46 for localized, regional and distant stages, respectively. We compare the estimated one-year mean backward cost functions among the three historic stages, for individuals surviving at least one year after onset of disease. The results are shown in Figure 4. Unlike the mean forward cost functions, the estimated backward cost functions are very similar in shape for the three historic stage groups. The results show that there is a terminal increase in medical cost before death. The estimated final-year medical cost of a patient is $31,802, $31,752 and $38,377 (s.e.: $1229, $2205, $896) in January 2000 value for patients diagnosed at localized, regional and distant stages, respectively. The estimated medical cost for the last three months of life of an ovarian cancer patient is $16,365, $16,284 and $18,848 (s.e.: $692, $1236, $613) in January 2000 value for patients diagnosed at localized, regional and distant stages, respectively. Figure 5 shows the backward rate of cost accrual, which is the end-of-life cost per unit time before death. The bandwidths for the estimates were chosen to minimize an integrated mean squared error. The results agree with Figure 4 in showing a terminal increase in medical cost before death.
FIG. 4. Estimates of the mean backward cost functions for ovarian cancer patients. Solid curves represent the estimates. Dotted curves represent 95% simultaneous confidence bands. Dashed curves represent pointwise 95% confidence intervals.
FIG. 5. Estimates of the backward rate of cost accrual. Solid curve represents localized stage at diagnosis, dashed curve represents regional stage and dotted curve represents distant stage.
In this paper we proposed statistical methods for studying the terminal behavior of stochastic processes before failure events. In particular, we discussed nonparametric methods for estimating the mean function of backward stochastic processes under incident and prevalent sampling designs. We also discussed identifiability issues related to estimation with incomplete follow-up data. Under an incident sampling design, the right tail of the survival distribution may not be identified because of the limited study duration. Under a prevalent sampling design, the identifiable region for the survival distribution can be enlarged, and the costs associated with those longer-surviving individuals can be identified.
We used the SEER–Medicare data as an example throughout this paper. Although the SEER–Medicare data contain both incident and prevalent cohorts, our method can be applied to data only from an incident cohort or a prevalent cohort. The proposed methods only require the stochastic process data to be available in a certain time interval before a failure event. Thus, prevalent data can be used alone for the proposed methods even though we do not have data on the stochastic process before patient enrollment.
The backward estimation procedure proposed in this paper could serve as a building block for other analyses. For example, the ratio of end-of-life cost to lifetime cost can be computed by combining the proposed method with existing methods for analyzing lifetime cost.
Although we use medical cost as an example, the proposed methods are not limited to the study of medical cost; they can also be used to study the terminal behavior of other stochastic processes before failure events. Other applications include CD4 counts before AIDS diagnosis, frequency of hospitalizations before death and measurements of quality of life before death.
The main focus of this paper is one sample estimation of the backward mean function. The authors are extending the idea in this paper to regression models of backward mean functions and backward rate functions.
Acknowledgments
The content of this article is based on the first author’s Ph.D. dissertation conducted at Johns Hopkins University under the supervision of the second author. The authors would also like to thank Professor Norman Breslow for suggestions that improved the presentation of this paper.
APPENDIX: PROOF OF THEOREM 3.1
We apply empirical process theory to prove the asymptotic results. Since N̄^V(t, u) has bounded variation and E{N̄^V(t, u)} < ∞ for (t, u) ∈ [0, τ1] × [0, τ0], we can apply the uniform strong law of large numbers [Pollard (1990)] to show that N̄^V(t, u) converges a.s. uniformly to its expectation ∫_0^t G(s) dΛ^V(s, u). Also, R̄(s) converges a.s. uniformly to G(s) [Woodroofe (1985)]. By Lemma 1 of Lin et al. (2000),

Λ̂^V(t, u) → Λ^V(t, u)   a.s.

uniformly on [τ0, τ1] × [0, τ0]. Also, since Ŝ(t) and Λ̂^V(t, u) are uniformly consistent estimates of S(t) and Λ^V(t, u), uniform consistency of μ̂_{t1,t2}(u) also follows from Lemma 1 in Lin et al. (2000).
Defining equation M53 for t ≥ τ0 and u ∈ [0, τ0], we have
equation M54
and
equation M55
where
equation M56
Upon algebraic manipulation, ξi(u) = ξ1i(u) + ξ2i(u) +ξ3i(u) reduces to
equation M57
Since ξ_i(u) can be written as sums and products of monotone functions of u, {ξ_i(u)} forms a manageable sequence [Pollard (1990), Bilias, Gu and Ying (1997)]. The weak convergence of n^{−1/2} Σ_{i=1}^n ξ_i(u) then follows from the functional central limit theorem [Pollard (1990)].
Footnotes
1. Supported in part by National Institutes of Health Grant P01 CA 098252.
Contributor Information
Kwun Chuen Gary Chan, Department of Biostatistics, University of Washington, Seattle, Washington 98185, USA, kcgchan@u.washington.edu.
Mei-Cheng Wang, Department of Biostatistics, Johns Hopkins University, Baltimore, Maryland 21205, USA, mcwang@jhsph.edu.
  • Bang H, Tsiatis AA. Estimating medical costs with censored data. Biometrika. 2000;87:329–343. MR1782482.
  • Bilias Y, Gu M, Ying Z. Towards a general asymptotic theory for Cox model with staggered entry. Ann. Statist. 1997;25:662–682. MR1439318.
  • Chan ISF, Neaton JD, Saravolatz LD, Crane LR, Osterberger J. Frequencies of opportunistic diseases prior to death among HIV-infected persons. AIDS. 1995;9:1145–1151.
  • Cook RJ, Lawless JF. Marginal analysis of recurrent events and a terminating event. Stat. Med. 1997;16:911–924.
  • Ghosh D, Lin DY. Nonparametric analysis of recurrent events and death. Biometrics. 2000;56:554–562. MR1795021.
  • Gross ST, Lai TL. Nonparametric estimation and regression analysis with left-truncated and right-censored data. J. Amer. Statist. Assoc. 1996;91:1166–1180. MR1424616.
  • Huang Y, Louis TA. Nonparametric estimation of the joint distribution of survival time and mark variables. Biometrika. 1998;85:785–798. MR1666750.
  • Lai TL, Ying Z. Estimating a distribution function with truncated and censored data. Ann. Statist. 1991;19:417–442. MR1091860.
  • Lawless JF, Nadeau C. Some simple robust methods for the analysis of recurrent events. Technometrics. 1995;37:158–168. MR1333194.
  • Lin DY. Proportional means regression for censored medical costs. Biometrics. 2000;56:775–778.
  • Lin DY, Fleming TR, Wei LJ. Confidence bands for survival curves under the proportional hazards model. Biometrika. 1994;81:73–81. MR1279657.
  • Lin DY, Feuer EJ, Etzioni R, Wax Y. Estimating medical costs from incomplete follow-up data. Biometrics. 1997;53:419–434.
  • Lin DY, Wei LJ, Yang I, Ying Z. Semiparametric regression for the mean and rate functions of recurrent events. J. Roy. Statist. Soc. Ser. B Stat. Methodol. 2000;62:711–730. MR1796287.
  • Nelson W. Graphical analysis of system repair data. Journal of Quality Technology. 1988;20:24–35.
  • Pawitan Y, Self S. Modeling disease marker processes in AIDS. J. Amer. Statist. Assoc. 1993;88:719–726.
  • Pepe MS, Cai J. Some graphical displays and marginal regression analyses for recurrent failure times and time dependent covariates. J. Amer. Statist. Assoc. 1993;88:811–820.
  • Pollard D. Empirical Processes: Theory and Applications. Hayward, CA: IMS; 1990. MR1089429.
  • Strawderman RL. Estimating the mean of an increasing stochastic process at a censored stopping time. J. Amer. Statist. Assoc. 2000;95. MR1804243.
  • Tsai W-Y, Jewell NP, Wang M-C. A note on the product-limit estimator under right censoring and left truncation. Biometrika. 1987;74:883–886.
  • van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes: With Applications to Statistics. New York: Springer; 1996. MR1385671.
  • Wang M-C. Nonparametric estimation from cross-sectional survival data. J. Amer. Statist. Assoc. 1991;86:130–143. MR1137104.
  • Wang M-C, Chiang CT. Non-parametric methods for recurrent event data with informative and non-informative censorings. Stat. Med. 2002;21:445–456.
  • Wang M-C, Qin J, Chiang CT. Analyzing recurrent event data with informative censoring. J. Amer. Statist. Assoc. 2001;96:1057–1065. MR1947253.
  • Warren JL, Klabunde CN, Schrag D, Bach PB, Riley GF. Overview of the SEER–Medicare data: Content, research applications, and generalizability to the United States elderly population. Med. Care. 2002;40:3–18.
  • Woodroofe M. Estimating a distribution function with truncated data. Ann. Statist. 1985;13:163–177. MR0773160.
  • Zhao H, Tian L. On estimating medical cost and incremental cost-effectiveness ratios with censored data. Biometrics. 2001;57:1002–1008. MR1950417.