Article sections

- SUMMARY
- 1. INTRODUCTION
- 2. NOTATION AND REVIEW OF COMPETING RISK METHODS
- 3. AGE CONDITIONAL PROBABILITIES OF DEVELOPING CANCER ESTIMATED FROM CANCER REGISTRIES
- 4. EXAMPLES
- 5. SIMULATIONS
- 6. DISCUSSION
- REFERENCES


Stat Med. Author manuscript; available in PMC 2006 June 12.

Published in final edited form as:

PMCID: PMC1475950

NIHMSID: NIHMS4799

Michael P. Fay,^{1,}^{*}^{†} Ruth Pfeiffer,^{2} Kathleen A. Cronin,^{1} Chenxiong Le,^{3} and Eric J. Feuer^{1}

We propose an estimator of the probability of developing a disease in a given age range, conditional on never having developed the disease prior to the beginning of the age range. Our estimator improves on the one described by Wun, Merrill and Feuer (*Lifetime Data Analysis* 1998; **4**, 169–186) that is currently used by the U.S. National Cancer Institute for the SEER Cancer Statistics Review. Both estimators use cross-sectional disease rates and provide an interpretation of these rates in terms of the age-conditional probability of developing disease in a hypothetical cohort. The difficulty of this problem is that rates are not available per person-years alive and disease free, but only per person-years alive. Wun *et al.* used *ad hoc* methods to handle this problem which did not properly account for competing risks, did not provide a measure of variability, and only allowed age ranges using prespecified 5-year age intervals. Here we solve the problem under a unified competing risks framework, which allows the calculation of the age-conditional probabilities for any age range. We generalize gamma confidence intervals to apply to our new statistic. Although our new method provides estimates which are numerically similar to those of Wun *et al.*, this paper provides a comprehensive theoretical basis for estimation and inference about the age-conditional probability of developing a disease.

In this paper we use cross-sectional cancer rates and death rates to estimate lifetime and age-conditional probabilities of developing different types of cancer in a hypothetical cohort. If rates per person-years alive and cancer free are available then the estimation of these probabilities is a straightforward application of competing risk methodology. The difficulty is that many disease registries (including the National Cancer Institute's Surveillance, Epidemiology and End Results [SEER] cancer incidence data and National Center for Health Statistics [NCHS] mortality data that we use as our example) provide only rates per person-years alive. We show how to write the age-conditional probability of developing cancer as a function of the available rates, under a simple, standard assumption. In addition, we generalize the gamma confidence intervals developed for linear combinations of independent Poisson random variables [1], to apply to these more complex estimators.

Previous work on this problem is described in Wun *et al.* [2] and historical references may be found there. Wun *et al.* [2] did not fully account for competing risks in their model. Although they did use the theory of competing risks for some parts of their derivation (see equation (3) of Wun *et al.* [2]), some omissions were made in fully utilizing the competing risks framework. For example, in deriving the probability of developing cancer among the total population from the incidence rate, Wun *et al.* [2] used the formula for a failure time to a single event instead of the proper formula that accounts for competing risks (see equation (9) of Wun *et al.* [2]). This paper presents a new method for calculating the age-conditional probability of developing cancer which comprehensively accounts for competing risks. In Section 6 we compare our method with that of Wun *et al.* [2].

In Section 2 we review competing risk methods and in the process introduce our notation. In Section 3.1 we derive our estimator for the age-conditional probability of cancer. In Section 3.2 we provide methods to calculate confidence intervals. In Section 4 we apply the method to data examples. In Section 5 we explore the properties of our confidence interval estimator through simulation. A concluding discussion is presented in Section 6.

Consider first the standard competing risk problem (see, for example, Kalbfleisch and Prentice [3]). We observe *T*, the time until the first of several events, and *J*, an indicator of the type of event that occurred. In this paper, *T* is a random variable denoting the age at death and *J* takes one of two values: *J* = d means death from the event of interest (for example, breast cancer), and *J* = o means death from other causes. For ease of exposition, we use the term ‘cancer’ to denote the event of interest. The cause specific hazard function for *J* = *j* is

$$\lambda_j(a) = \lim_{\epsilon \to 0^+} \frac{\Pr(a \le T < a + \epsilon,\ J = j \mid T \ge a)}{\epsilon}, \qquad j = \mathrm{d}, \mathrm{o}.$$

Thus λ_{d}(*a*) is the rate of cancer deaths per person-years alive at age *a*, and λ_{o}(*a*) is the rate of other (that is, non-cancer) deaths per person-years alive at age *a*. The overall failure rate at age *a* is λ(*a*) = λ_{d}(*a*) + λ_{o}(*a*), and the overall survival function is $S(a) = \exp\{-\int_0^a \lambda(u)\,du\}$. The probability of dying from cause *j* in the age interval [*x*, *y*) given survival until just prior to *x* is

$$\Pr(x \le T < y,\ J = j \mid T \ge x) = \frac{\int_x^y \lambda_j(a)\,S(a)\,da}{S(x)},$$

where $S(x) = \Pr(T \ge x)$.

We also consider the statistically identical competing risks problem where *T** is the age at either first cancer or death before first cancer, and *J** is the indicator with *J** = c denoting that *T** is the age at first cancer and *J** = o denoting that *T** is the age at death if death occurs before the first cancer. The cause specific hazard functions are: $\lambda^*_c(a)$, the rate of first cancer per person-years alive *and cancer free* at age *a*, and $\lambda^*_o(a)$, the rate of deaths per person-years alive *and cancer free* at age *a*. Then, similar to above, the probability of getting a first cancer in the age interval [*x*, *y*) given alive and cancer free until just prior to *x* is

$$A(x, y) = \Pr(x \le T^* < y,\ J^* = \mathrm{c} \mid T^* \ge x) = \frac{\int_x^y \lambda^*_c(a)\,S^*(a)\,da}{S^*(x)}, \tag{1}$$

where $\lambda^*(a) = \lambda^*_c(a) + \lambda^*_o(a)$ and $S^*(a) = \exp\{-\int_0^a \lambda^*(u)\,du\}$.
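As a quick check on the structure of equation (1), it can help to work through the special case of constant hazards. This is an illustrative simplification, not an assumption made in this paper, and it takes equation (1) to have the usual competing risks form written on the first line below.

```latex
% Assumed form of equation (1), with \lambda^* = \lambda^*_c + \lambda^*_o
\begin{align*}
A(x, y) &= \frac{\int_x^y \lambda^*_c(a)\, S^*(a)\, da}{S^*(x)},
  \qquad S^*(a) = \exp\Bigl\{-\int_0^a \lambda^*(u)\, du\Bigr\}. \\
% Illustrative special case: both hazards constant in age
\intertext{If $\lambda^*_c(a) \equiv \lambda^*_c$ and $\lambda^*_o(a) \equiv \lambda^*_o$, then $S^*(a) = e^{-\lambda^* a}$ and}
A(x, y) &= \frac{\lambda^*_c \int_x^y e^{-\lambda^* a}\, da}{e^{-\lambda^* x}}
         = \frac{\lambda^*_c}{\lambda^*}\Bigl[1 - e^{-\lambda^* (y - x)}\Bigr].
\end{align*}
```

In this special case the probability depends on the age range only through its length, and as *y* → ∞ it tends to the share of the total hazard that is due to first cancer.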

We wish to estimate *A*(*x*, *y*) as given in equation (1), but we cannot directly obtain estimators of either $\lambda^*_c(a)$ or λ*(*a*), the rates of cancer and total deaths, respectively, per person-years alive and cancer free. However, we can directly estimate the following rates per person-years alive at age *a*: λ_{c}(*a*), the rate of first cancer incidence; λ_{d}(*a*), the rate of cancer deaths; and λ_{o}(*a*), the rate of other (that is, non-cancer) deaths. We assume that the rate of non-cancer deaths is the same for all people regardless of whether or not they have had a cancer, so that $\lambda^*_o(a) = \lambda_o(a)$. After making this assumption, we show in the following that we can rewrite *A*(*x*, *y*) in terms of the functions λ_{c}(·), λ_{d}(·) and λ_{o}(·).

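First consider the numerator of equation (1). Rewrite λ_{c}(*a*) as

$$\lambda_c(a) = \lambda^*_c(a)\,\frac{\Pr(T^* \ge a)}{\Pr(T \ge a)} = \lambda^*_c(a)\,\frac{S^*(a)}{S(a)}. \tag{2}$$

Using this equation we write the numerator of equation (1) as $\int_x^y \lambda^*_c(a)\,S^*(a)\,da = \int_x^y \lambda_c(a)\,S(a)\,da$.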

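Rewrite the denominator, $S^*(x) = S^*_c(x)\,S^*_o(x)$, where $S^*_j(x) = \exp\{-\int_0^x \lambda^*_j(u)\,du\}$ for *j* = c, o. Note that $S^*_j(x)$ does not have a survival function interpretation (see Kalbfleisch and Prentice, reference [3, p. 168]). Because we have assumed that $\lambda^*_o(a) = \lambda_o(a)$, we write $S^*_o(x) = \exp\{-\int_0^x \lambda_o(u)\,du\}$, and the only outstanding problem is finding an estimator of $S^*_c(x)$. In the definition of $S^*_c(x)$, we rewrite the expression for $\lambda^*_c(a)$ using equation (2), and we obtain the recursive equation

$$S^*_c(t) = \exp\left\{-\int_0^t \lambda_c(a)\,\frac{S(a)}{S^*(a)}\,da\right\}. \tag{3}$$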

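To solve this recursive equation, first let *S*(*t*) = *S*_{d}(*t*)*S*_{o}(*t*), where $S_j(t) = \exp\{-\int_0^t \lambda_j(u)\,du\}$ for *j* = d, o. Using the assumption that $\lambda^*_o(a) = \lambda_o(a)$, so that $S^*(a) = S^*_c(a)\,S_o(a)$, equation (3) becomes

$$S^*_c(t) = \exp\left\{-\int_0^t \lambda_c(a)\,\frac{S_d(a)}{S^*_c(a)}\,da\right\}.$$

Take log of both sides, then differentiate with respect to *t* to get

$$\frac{d \log S^*_c(t)}{dt} = \frac{1}{S^*_c(t)}\,\frac{dS^*_c(t)}{dt} = -\lambda_c(t)\,\frac{S_d(t)}{S^*_c(t)}.$$

If *T** is a continuous random variable then $S^*_c(0) = 1$ and $dS^*_c(t)/dt = -\lambda_c(t)\,S_d(t)$. Now integrate from 0 to *x* to obtain $S^*_c(x) = 1 - \int_0^x \lambda_c(a)\,S_d(a)\,da$, so that $S^*(x) = S_o(x)\{1 - \int_0^x \lambda_c(a)\,S_d(a)\,da\}$. Thus, under the assumption $\lambda^*_o(a) = \lambda_o(a)$, *A*(*x, y*) can be expressed as

$$A(x, y) = \frac{\int_x^y \lambda_c(a)\,S(a)\,da}{S_o(x)\left\{1 - \int_0^x \lambda_c(a)\,S_d(a)\,da\right\}}. \tag{4}$$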
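The expression for *A*(*x, y*) in terms of rates per person-years alive can be evaluated numerically. The following is a minimal sketch, assuming equation (4) has the form: numerator ∫ₓʸ λ_c(a)S(a)da, denominator S_o(x){1 − ∫₀ˣ λ_c(a)S_d(a)da}, which is what the derivation above leads to. The function name `age_conditional_probability`, the trapezoid rule, and the grid size are illustrative choices, not part of the paper or of DEVCAN.

```python
import math

def age_conditional_probability(lam_c, lam_d, lam_o, x, y, steps_per_year=200):
    """Trapezoid-rule evaluation of the assumed form of equation (4).

    lam_c, lam_d, lam_o: callables giving rates per person-year alive at age a.
    x, y: finite ages with 0 <= x < y, rounded to the integration grid.
    """
    h = 1.0 / steps_per_year
    ix = int(round(x * steps_per_year))   # grid index of age x
    iy = int(round(y * steps_per_year))   # grid index of age y
    cum_d = 0.0        # integral_0^a lam_d(u) du
    cum_o = 0.0        # integral_0^a lam_o(u) du
    int_c_sd = 0.0     # integral_0^x lam_c(a) S_d(a) da
    numer = 0.0        # integral_x^y lam_c(a) S(a) da
    s_o_x = 1.0        # S_o(x); stays 1 when x = 0
    prev_c_sd = lam_c(0.0)   # lam_c(a) S_d(a) at a = 0, where S_d(0) = 1
    prev_c_s = lam_c(0.0)    # lam_c(a) S(a) at a = 0, where S(0) = 1
    for i in range(1, iy + 1):
        a0, a1 = (i - 1) * h, i * h
        cum_d += 0.5 * h * (lam_d(a0) + lam_d(a1))
        cum_o += 0.5 * h * (lam_o(a0) + lam_o(a1))
        cur_c_sd = lam_c(a1) * math.exp(-cum_d)
        cur_c_s = lam_c(a1) * math.exp(-cum_d - cum_o)
        if i <= ix:
            int_c_sd += 0.5 * h * (prev_c_sd + cur_c_sd)   # piece of [0, x]
        else:
            numer += 0.5 * h * (prev_c_s + cur_c_s)        # piece of [x, y]
        if i == ix:
            s_o_x = math.exp(-cum_o)
        prev_c_sd, prev_c_s = cur_c_sd, cur_c_s
    return numer / (s_o_x * (1.0 - int_c_sd))

# Illustrative constant rates (made-up numbers, not SEER or NCHS data).
p_30_50 = age_conditional_probability(
    lambda a: 0.002, lambda a: 0.001, lambda a: 0.01, 30, 50)
```

With constant rates the integrals have closed forms, which gives a convenient check on the numerical result. For step-function rates the trapezoid rule introduces a small error at each jump, so a closed-form evaluation per age interval would be preferable in practice.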

To obtain an estimate of *A*(*x, y*) using SEER incidence data and NCHS mortality data, we first divide the possible ages into *k*+1 intervals, [*a*_{i}, *a*_{i+1}), where 0 = *a*_{0} < *a*_{1} < ⋯ < *a*_{k} < *a*_{k+1} = ∞, and choose a calendar interval, [*t*_{1}, *t*_{2}). We observe the number of first cancer incident cases (*c*_{i}), cancer deaths (*d*_{i}), and other deaths (*o*_{i}), occurring at ages in the interval [*a*_{i}, *a*_{i+1}) during the calendar time [*t*_{1}, *t*_{2}), for *i* = 0,…,*k*. Although the cancer incident cases and the deaths often come from the same population (see Table I), this is not necessary (see Table III). We also observe $n_{ji}$, which is (*t*_{2} − *t*_{1}) times the estimated number of people from the same population associated with event *j* (where *j* = c, d, or o) with ages in [*a*_{i}, *a*_{i+1}) at the midpoint, (*t*_{1} + *t*_{2})/2, of the interval [*t*_{1}, *t*_{2}), for *i* = 0,…,*k*. If *t*_{2} − *t*_{1} = 1, $n_{ji}$ corresponds to the midyear population with ages in [*a*_{i}, *a*_{i+1}).

We assume that the observed counts *c*_{i}, *d*_{i}, *o*_{i} are Poisson and the midinterval populations $n_{ci}$, $n_{di}$, $n_{oi}$ are fixed constants. For a motivation and discussion of this assumption see Brillinger [4] with discussion (see especially the discussion by Keiding). Assuming constant rates within age intervals, we estimate rates for ages *a* ∈ [*a*_{i}, *a*_{i+1}) by $\hat\lambda_c(a) = c_i / n_{ci}$, $\hat\lambda_d(a) = d_i / n_{di}$, and $\hat\lambda_o(a) = o_i / n_{oi}$. These estimators replace their associated functions in equation (4) to obtain our estimator of *A*(*x, y*). In Appendix A we show the estimator using summation notation.
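The piecewise-constant rate estimates described above can be sketched as a step function built from counts and person-years. The helper name `piecewise_rate` and the toy numbers are hypothetical; they are not taken from Table I.

```python
from bisect import bisect_right

def piecewise_rate(breaks, counts, person_years):
    """Step function a -> counts[i] / person_years[i] for a in [breaks[i], breaks[i+1]).

    breaks holds the lower end of each age interval, starting at 0; the last
    interval is open-ended, matching a_{k+1} = infinity in the text.
    """
    rates = [c / n for c, n in zip(counts, person_years)]

    def rate(a):
        if a < breaks[0]:
            raise ValueError("age below the first interval")
        i = bisect_right(breaks, a) - 1   # index of the interval containing a
        return rates[i]

    return rate

# Hypothetical toy data (NOT from Table I): age groups [0, 40), [40, 70), [70, inf).
breaks = [0, 40, 70]
cancer_cases = [10, 120, 200]
person_years = [100000, 80000, 40000]
lam_c_hat = piecewise_rate(breaks, cancer_cases, person_years)
```

The same helper can be applied to cancer deaths and other deaths, each with its own person-years, since the text notes that incidence and mortality data need not come from the same population.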

Because we are using cross-sectional data from finite populations to estimate hazard rates for a hypothetical cohort, these estimates may produce hazards that cannot possibly describe a real cohort. There are two types of these ‘impossible’ hazards. If no one in the oldest age group dies (that is, *d* _{k} = 0 and *o* _{k} = 0), then the resulting hazards describe an impossible cohort where the probability of living forever is non-zero. Another impossible cohort would result if the probability of dying of cancer by any age *a* is greater than the probability of getting cancer by that same age *a* (this is equivalent to ). These impossible cohorts would rarely occur in large populations.

In this section we modify the gamma confidence intervals, developed for linear combinations of independent Poisson random variables by Fay and Feuer [1], to create confidence intervals for *A*(*x, y*). First, we put all the Poisson counts into one (3*k* + 3) × 1 vector

$$\mathbf{z} = [z_1, \ldots, z_{3k+3}]' = [c_0, \ldots, c_k,\ d_0, \ldots, d_k,\ o_0, \ldots, o_k]'.$$

Associated with each *z* _{i} is a random variable *Z* _{i} which we assume has a Poisson distribution with mean μ_{i}. Let μ = [μ_{1},μ_{2},…,μ_{3k+3}]^{′}. In the previous notation . Emphasizing the dependence of *A*(*x, y*) on μ, we write *A*(*x, y*) = *A*(*x, y,* μ). Using this notation, our estimator is *A*(*x, y, z*). For ease of exposition we write

(5)

and

where diag(** z**) is a diagonal matrix with the values of

Alternatively, numerical derivatives can be used. Letting

with , leads to our variance estimate

Our generalization of the gamma intervals [1] is to use the Taylor expansion as the linear combination of independent Poisson random variables. The only complication is that the weights may be negative and depend on the Poisson values. This complication does not affect the lower confidence limit, though; the 100(1 − α) per cent lower confidence limit is given by where is the *p*th quantile of the gamma distribution with parameters (that is, with mean *A*(**z**) and variance

Define *z*_{(M)} to be the vector value of either or for such that *A*(*z*_{(M)}) is maximized. Then the upper confidence limit is where and . Note that if we let the population and the mean μ get larger by the same constant, say *N*, then the generalized gamma intervals approach the usual delta method intervals (see, for example, Lehmann [5]) as *N* → ∞ (see Appendix B). For small μ these generalized gamma intervals perform better and are calculated straightforwardly even when some *z* _{i} = 0, while the delta method requires modification whenever some *z* _{i} = 0 in order to prevent estimates of zero variance. For the delta method, the variances corresponding to the elements ** z** are estimated by replacing elements with
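The numerical-derivative route to a variance estimate mentioned in this section can be sketched as follows. This is a generic first-order (Taylor series) variance for a function of independent Poisson counts, with Var(Z_i) estimated by z_i. It is an assumption that this matches the form of the variance estimate used here, since the displayed formula is not reproduced in this text. The function name `poisson_delta_variance` and the step size are illustrative.

```python
def poisson_delta_variance(f, z, rel_step=1e-4):
    """First-order variance of f(z) for independent Poisson counts z.

    Uses central finite differences for the partial derivatives and the
    observed count z_i as the estimate of Var(Z_i), so a zero count
    contributes nothing to the sum.
    """
    var = 0.0
    for i, zi in enumerate(z):
        h = rel_step * max(zi, 1.0)       # step scaled to the size of the count
        up = list(z)
        dn = list(z)
        up[i] = zi + h
        dn[i] = zi - h                    # slightly negative when zi == 0
        deriv = (f(up) - f(dn)) / (2.0 * h)
        var += deriv ** 2 * zi
    return var

# Toy check with a linear combination of two counts (made-up weights).
weights = [1 / 1000, 2 / 500]
linear = lambda z: sum(w * zi for w, zi in zip(weights, z))
v_linear = poisson_delta_variance(linear, [50, 20])
```

The comment on zero counts reflects the point made above: a plain delta method assigns zero variance to a zero count unless it is modified, which is one motivation for the gamma intervals.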

Our examples use SEER cancer incidence data and NCHS mortality data associated with the corresponding SEER catchment areas (see Ries *et al.* [6]). We calculate our statistics for two types of cancer, invasive female breast cancer, one of the more common cancers, for all races from the expanded 11 SEER registries from *t* _{1} = 1 January 1996 until *t* _{2} = 31 December 1998, and acute lymphocytic leukaemia (ALL) for all races from the nine SEER registries active during *t* _{1} = 1 January 1990 until *t* _{2} = 31 December 1990. ALL was chosen because it is primarily a childhood cancer (see Table I) and provides an example which has high rates at young ages unlike many cancer sites, for example, breast cancer, which have increasing incidence for older age groups.

The raw data are listed in Table I, and *A*(*x, y, z*) with the associated 95 per cent confidence intervals for different values of

Table II. Estimated per cent developing cancer by age *y*, given no cancer before age *x* (with 95 per cent confidence intervals).

From Table II, the estimated probability of developing invasive breast cancer in one's lifetime is 0.1332 while the probability of developing breast cancer given alive and cancer free at 30 is 0.1348. It seems contradictory that by surviving from age 0 to age 30 without dying or getting breast cancer, a woman actually increases the probability of getting breast cancer in the remainder of her life. To gain more insight into this situation consider an example of a birth cohort of 100 females and assume that 12 will develop breast cancer over their lifetime. If by age 5 two of the girls have died of other causes and none have yet developed breast cancer, the risk of developing breast cancer after age 5 is 12/98 (> 12/100) in this cohort.
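The cohort arithmetic in this example can be written out directly. The variable names are illustrative; the numbers are those of the hypothetical birth cohort described above.

```python
from fractions import Fraction

cohort_at_birth = 100   # females in the hypothetical birth cohort
future_cases = 12       # will develop breast cancer at some point in life
early_deaths = 2        # died of other causes before age 5, none with cancer

risk_at_birth = Fraction(future_cases, cohort_at_birth)
risk_at_age_5 = Fraction(future_cases, cohort_at_birth - early_deaths)

# The same 12 future cases are spread over fewer survivors, so the
# conditional risk rises even though no new cases were added.
print(f"{float(risk_at_birth):.4f} {float(risk_at_age_5):.4f}")
```

Removing people who die of other causes shrinks the denominator without changing the numerator, which is why the age-conditional probability can exceed the lifetime probability.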

We tested the coverage probabilities of our method in three situations. For the first two situations (female breast cancer and ALL) we assumed that the rates were exactly equal to the rates derived from Table I except we added 0.5 to zero counts. Then we simulated 10000 data sets assuming independent Poisson distributions with means equal to those counts (with 0.5 added to zeros). For the third situation, we checked our method for extremely low counts; we used incidence rates of eye and orbit cancer in the nine SEER areas in 1990 (after adding 0.5 to the zero value), and rates of eye and orbit cancer deaths and other deaths from the entire U.S. in 1990, and simulated these rates applied to the Vietnamese population in the nine SEER areas in 1990. The raw data are presented in Table III. Thus for example the expected value for *c* _{1} is 5914 × (28/1817468) = 0.0911, and we have many expected count values that are much less than 1. Then we simulated 10000 data sets assuming independent Poisson distributions with means equal to those expected counts. We calculated a 95 per cent confidence interval for each simulation. In Table IV we list both *E* _{L}, the percentage of the lower confidence limits that are greater than the true value and *E* _{U}, the percentage of the upper confidence limits that are less than the true value, where the *true value* refers to the estimator calculated from the counts with 0.5 added to the zeros.
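The low-count situation can be illustrated with a short simulation sketch. It reproduces the expected count quoted above for *c*_{1} and draws Poisson counts with that mean. The sampler is Knuth's multiplication method, chosen here only because it needs no external libraries; the seed and the helper name `poisson_draw` are arbitrary, and the paper does not state which generator was used.

```python
import math
import random

def poisson_draw(mu, rng):
    """Draw one Poisson(mu) variate by Knuth's multiplication method.

    Adequate for small means such as the eye and orbit example; for large
    means exp(-mu) underflows and a different sampler is needed.
    """
    limit = math.exp(-mu)
    k = 0
    prod = rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

# Expected count for c_1 as given in the text: population times rate.
expected_c1 = 5914 * (28 / 1817468)

rng = random.Random(2002)                      # arbitrary seed for repeatability
draws = [poisson_draw(expected_c1, rng) for _ in range(10000)]
share_zero = draws.count(0) / len(draws)       # fraction of simulated zero counts
mean_draw = sum(draws) / len(draws)
```

With an expected count near 0.09, the large majority of simulated counts are zero, which is the regime in which the handling of zero counts matters most for any interval method.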

Table IV. Simulated error rates for 95 per cent confidence limits (ideal error rates are 2.5 per cent); 10000 simulations for each cancer/age range combination.

The situations with larger counts give better error rates, and the third situation with extremely low expected counts gives error rates that are very conservative. In each of the three situations the gamma intervals have error rates closer to the nominal 2.5 per cent than the delta method based confidence intervals, although there is essentially no difference in the first case. For ALL the asymmetric gamma confidence intervals produce more central confidence intervals (that is, the tails of the errors are more nearly equal) than the symmetric delta confidence intervals. For the eye and orbit situation, both methods perform very conservatively, but the gamma method is generally less conservative. In addition all the lower delta method confidence limits were less than 0, and most of the upper limits were greater than the gamma method upper limits. Thus, in all situations the gamma method performed better than the delta method.

In this paper we derived a new estimator of *A*(*x, y*), the probability of developing a first cancer during the age interval [*x*, *y*), conditioned on being alive and cancer free at age *x*. We assumed that λ_{c}(*a*) is constant within an interval and computed $\lambda^*_c(a)$, which is not constant. However, it may have been more realistic to assume that the hazards among the actual at-risk populations, $\lambda^*_c(a)$, are constant over the interval and to compute the non-constant λ_{c}(*a*). Unfortunately, this approach does not appear to be tractable.

We have generalized the gamma confidence intervals [1] to apply to our new statistic. Although these intervals appear conservative in cases with extremely low counts, we have shown that the delta method which adds 0.5 to zero counts in the estimation of the variance of the counts performs worse. A more general way of performing the delta method is to assume that variances associated with zero counts are equal to some constant, 0 < δ < 1. The problem is that there is no obvious choice of δ; we have arbitrarily chosen δ = 0.5 in this paper. Note that in the most extreme case, where all counts are zero, the generalized gamma interval gives a non-zero upper limit, while the delta method gives an upper limit that approaches zero as δ → 0. Other methods, such as parametric bootstrap confidence intervals, suffer from the same problem of having no satisfactory method for handling zero counts. In the simple case of linear combinations of independent Poisson variables, Fay and Feuer [1] discuss similar issues comparing the gamma intervals and the approximate bootstrap confidence (ABC) intervals.

Our new estimator happens to be numerically similar to the existing method of Wun *et al.* [2]. Because our approach is new and not simply a modification of Wun *et al.* [2] and because the notations are very different between the two approaches, we have relegated the full comparison between the two methods to a technical report [7]. In that report, we show through Taylor series approximations that the two methods are similar. In addition, using the new method we recalculated Table I-17 of the *Cancer Statistics Review*, *1973*–*1998* [6] which gives lifetime risk of developing cancer calculated for each of 30 different cancer categories on six subpopulations, and the new estimator differs by less than 2 per cent from that of Wun *et al.* [2] in every case (see [7]).

DEVCAN (Probability of DEVeloping CANcer) software [8] has been freely available to calculate the statistics of Wun *et al.* [2] and will be updated to calculate our new estimator in a future version. See http://srab.cancer.gov/DevCan/ for the most current version of the software.

We thank Chris Gunther and Alexander Korobkov for computing help.

We write our estimator of *A*(*x, y*) as *A*(*x, y,* **z**) (see Section 3.2 for motivation of this notation).

Let and for , and . For convenience we regroup the ages after inserting group delimiters at *x* and *y*. Let the new delimiters be where . We let

and similarly and . In this notation, *A*(*x, y*) = *A*(*b* _{i+1}, *b* _{j+2}), and we estimate it with

Because or may equal zero and may equal infinity, we let . These integrals are

where the case λ = 0 and is one of the ‘impossible’ hypothetical cohorts (see Section 3.1). Thus, we obtain

Fay and Feuer [1] stated that the gamma intervals approach the standard normal intervals if (using the application of this paper) *A*(** z**) goes to infinity in such a way that

where the result follows since (Johnson and Kotz, reference [9, p. 170]), and Φ^{−1} (*p*) is the *p*th quantile of the standard normal distribution. One can similarly show that the upper confidence limits approach the standard normal limits.

^{‡}This article is a US government work and is in the public domain in the U.S.A.

1. Fay MP, Feuer EJ. Confidence intervals for directly standardized rates: a method based on the gamma distribution. Statistics in Medicine. 1997;16:791–801.

2. Wun L-M, Merrill RM, Feuer EJ. Estimating lifetime and age-conditional probabilities of developing cancer. Lifetime Data Analysis. 1998;4:169–186.

3. Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. Wiley; New York: 1980. pp. 163–178.

4. Brillinger DR. The natural variability of vital rates and associated statistics (with Discussion). Biometrics. 1986;42:693–734.

5. Lehmann EL. Elements of Large-Sample Theory. Springer; New York: 1999.

6. Ries LAG, Eisner MP, Kosary CL, Hankey BF, Miller BA, Clegg L, Edwards BK, editors. SEER Cancer Statistics Review, 1973–1998. National Cancer Institute; Bethesda, MD: 2001. http://seer.cancer.gov/csr/1973_1998/ (accessed on 3 September 2002).

7. Fay MP, Pfeiffer R, Cronin KA, Le C, Feuer EJ. Comparison of two methods for calculating age-conditional probabilities of developing cancer. Technical report #2002–01. Statistical Research and Applications Branch, National Cancer Institute; 2002. http://srab.cancer.gov/reports (accessed on 3 September 2002).

8. National Cancer Institute and Information Management Services. *DEVCAN: Probability of DEVeloping CANcer software*, Version 4.1. National Cancer Institute and Information Management Services, Inc.; 2001. http://srab.cancer.gov/DevCan/ (accessed on 3 September 2002).

9. Johnson NL, Kotz S. Distributions in Statistics: Continuous Univariate Distributions-1. Wiley; New York: 1970.
