HHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal
Stat Med. Author manuscript; available in PMC 2006 June 12.
Published in final edited form as:
Stat Med. 2003 June 15; 22(11): 1837–1848.
doi:  10.1002/sim.1428
PMCID: PMC1475950

Age-conditional probabilities of developing cancer


We propose an estimator of the probability of developing a disease in a given age range, conditional on never having developed the disease prior to the beginning of the age range. Our estimator improves on the one described by Wun, Merrill and Feuer (Lifetime Data Analysis 1998; 4, 169–186) that is currently used by the U.S. National Cancer Institute for the SEER Cancer Statistics Review. Both estimators use cross-sectional disease rates and provide an interpretation of these rates in terms of the age-conditional probability of developing disease in a hypothetical cohort. The difficulty of this problem is that rates are not available per person-years alive and disease free, but only per person-years alive. Wun et al. used ad hoc methods to handle this problem, which did not properly account for competing risks, did not provide a measure of variability, and only allowed age ranges using prespecified 5-year age intervals. Here we solve the problem under a unified competing risks framework, which allows the calculation of the age-conditional probabilities for any age range. We generalize gamma confidence intervals to apply to our new statistic. Although our new method provides estimates which are numerically similar to those of Wun et al., this paper provides a comprehensive theoretical basis for estimation and inference about the age-conditional probability of developing a disease.

Keywords: competing risks, gamma confidence interval, hypothetical cohort, lifetime risk, surveillance, vital rates


In this paper we use cross-sectional cancer rates and death rates to estimate lifetime and age-conditional probabilities of developing different types of cancer in a hypothetical cohort. If rates per person-years alive and cancer free are available then the estimation of these probabilities is a straightforward application of competing risk methodology. The difficulty is that many disease registries (including the National Cancer Institute's Surveillance, Epidemiology and End Results [SEER] cancer incidence data and National Center for Health Statistics [NCHS] mortality data that we use as our example) provide only rates per person-years alive. We show how to write the age-conditional probability of developing cancer as a function of the available rates, under a simple, standard assumption. In addition, we generalize the gamma confidence intervals developed for linear combinations of independent Poisson random variables [1], to apply to these more complex estimators.

Previous work on this problem is described in Wun et al. [2] and historical references may be found there. Wun et al. [2] did not fully account for competing risks in their model. Although they did use the theory of competing risks for some parts of their derivation (see equation (3) of Wun et al. [2]), some omissions were made in fully utilizing the competing risks framework. For example, in deriving the probability of developing cancer among the total population from the incidence rate, Wun et al. [2] used the formula for a failure time to a single event instead of the proper formula that accounts for competing risks (see equation (9) of Wun et al. [2]). This paper presents a new method for calculating the age-conditional probability of developing cancer which comprehensively accounts for competing risks. In Section 6 we compare our method with that of Wun et al. [2].

In Section 2 we review competing risk methods and in the process introduce our notation. In Section 3.1 we derive our estimator for the age-conditional probability of cancer. In Section 3.2 we provide methods to calculate confidence intervals. In Section 4 we apply the method to data examples. In Section 5 we explore the properties of our confidence interval estimator through simulation. A concluding discussion is presented in Section 6.


Consider first the standard competing risk problem (see, for example, Kalbfleisch and Prentice [3]). We observe the time until one of several events, T, and an indicator of the type of event that occurred, J. In this paper, T is a random variable denoting the age at death and J has one of two values, J = d means death from the event of interest (for example, breast cancer), and J = o means death from other causes. For ease of exposition, we use the term ‘cancer’ to denote the event of interest. The cause-specific hazard function for J = j is

$$
\lambda_j(a) = \lim_{\varepsilon \to 0} \frac{\Pr[a \le T < a + \varepsilon,\ J = j \mid T \ge a]}{\varepsilon}.
$$

Thus $\lambda_d(a)$ is the rate of cancer deaths per person-years alive at age a, and $\lambda_o(a)$ is the rate of other (that is, non-cancer) deaths per person-years alive at age a. The overall failure rate at age a is $\lambda(a) = \lambda_d(a) + \lambda_o(a)$, and the overall survival function is $S(a) = \Pr[T > a] = \exp\left(-\int_0^a \lambda(u)\,du\right)$. The probability of dying from cause j in the age interval [x, y) given survival until just prior to x is

$$
\Pr[x \le T < y,\ J = j \mid T \ge x] = \frac{\int_x^y \lambda_j(u)\, S(u^-)\,du}{S(x^-)},
$$

where $S(a^-) = \lim_{\varepsilon \to 0} S(a - \varepsilon)$.

We also consider the statistically identical competing risks problem where T* is the age at either first cancer or death before first cancer, and J* is the indicator with J* = c denoting that T* is the age at first cancer and J* = o denoting that T* is the age at death if death occurs before the first cancer. The cause-specific hazard functions are: $\lambda^*_c(a)$, the rate of first cancer per person-years alive and cancer free at age a, and $\lambda^*_o(a)$, the rate of deaths per person-years alive and cancer free at age a. Then, similar to above, the probability of getting a first cancer in the age interval [x, y) given alive and cancer free until just prior to x is

$$
A(x, y) = \Pr[x \le T^* < y,\ J^* = c \mid T^* \ge x] = \frac{\int_x^y \lambda^*_c(u)\, S^*(u^-)\,du}{S^*(x^-)}, \tag{1}
$$

where $S^*(a) = \exp\left\{-\int_0^a \lambda^*(u)\,du\right\}$ and $\lambda^*(a) = \lambda^*_c(a) + \lambda^*_o(a)$.
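The cause-specific probability above can be evaluated in closed form when the hazards are constant within age intervals. The following is a minimal sketch under that assumption; the function name `prob_cause_in_interval` and its argument layout are illustrative and do not come from the paper.

```python
import math

def prob_cause_in_interval(breaks, lam_j, lam_other, x, y):
    """Pr[x <= T < y, J = j | T >= x] for piecewise-constant hazards.

    breaks    -- left end points of the age intervals, starting at 0;
                 the last interval is open-ended
    lam_j     -- cause-j hazard in each interval
    lam_other -- hazard for all other causes in each interval
    Assumes a positive total hazard in the last interval when y is infinite.
    """
    edges = list(breaks) + [math.inf]
    prob = 0.0
    surv = 1.0  # S(u) / S(x) at the start of the current piece
    for i, (rate_j, rate_o) in enumerate(zip(lam_j, lam_other)):
        lo, hi = max(edges[i], x), min(edges[i + 1], y)
        if lo >= hi:
            continue  # this age interval does not overlap [x, y)
        total = rate_j + rate_o
        width = hi - lo
        decay = math.exp(-total * width)
        # integral of exp(-total * (u - lo)) du over [lo, hi)
        piece = width if total == 0 else (1.0 - decay) / total
        prob += rate_j * surv * piece
        surv *= decay
    return prob
```

With a single open-ended interval and hazards 0.01 and 0.03, the lifetime probability of the first cause is 0.01/0.04 = 0.25, and the probabilities for the two causes over any age range add up to one minus the conditional survival probability.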


3.1. The estimator

We wish to estimate A(x, y) as given in equation (1), but we cannot directly obtain estimators of either $\lambda^*_c(a)$ or $\lambda^*(a)$, the rates of cancer and total deaths, respectively, per person-years alive and cancer free. However, we can directly estimate the following rates per person-years alive at age a: $\lambda_c(a)$, the rate of first cancer incidence; $\lambda_d(a)$, the rate of cancer deaths; and $\lambda_o(a)$, the rate of other (that is, non-cancer) deaths. We assume that the rate of non-cancer deaths is the same for all people regardless of whether or not they have had a cancer, so that $\lambda^*_o(a) = \lambda_o(a)$. After making this assumption, we show in the following that we can rewrite A(x, y) in terms of the functions $\lambda_c(\cdot)$, $\lambda_d(\cdot)$ and $\lambda_o(\cdot)$.

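First consider the numerator of equation (1). Rewrite $\lambda_c(a)$ as

$$
\lambda_c(a) = \lim_{\varepsilon \to 0} \frac{\Pr[a \le T^* < a + \varepsilon,\ J^* = c \mid T \ge a]}{\varepsilon}
= \lim_{\varepsilon \to 0} \frac{\Pr[a \le T^* < a + \varepsilon,\ J^* = c \mid T^* \ge a]}{\varepsilon} \cdot \frac{\Pr[T^* \ge a]}{\Pr[T \ge a]}
= \lambda^*_c(a)\, \frac{S^*(a^-)}{S(a^-)}. \tag{2}
$$

Using this equation we write the numerator of equation (1) as $\int_x^y \lambda_c(u)\, S(u^-)\,du$.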

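Rewrite the denominator, $S^*(a) = S^*_c(a)\, S^*_o(a)$, where $S^*_j(a) = \exp\left(-\int_0^a \lambda^*_j(u)\,du\right)$ for j = c, o. Note that $S^*_j(a)$ does not have a survival function interpretation (see Kalbfleisch and Prentice, reference [3, p. 168]). Because we have assumed that $\lambda^*_o(a) = \lambda_o(a)$, we write $S^*_o(a) = S_o(a) = \exp\left(-\int_0^a \lambda_o(u)\,du\right)$, and the only outstanding problem is finding an estimator of $S^*_c(a)$. In the definition of $S^*_c(a)$, we rewrite the expression for $\lambda^*_c(a)$ using equation (2), and we obtain the recursive equation

$$
S^*_c(a) = \exp\left\{-\int_0^a \lambda_c(u)\, \frac{S(u^-)}{S^*(u^-)}\,du\right\}. \tag{3}
$$

To solve this recursive equation, first let $S(t) = S_d(t)\, S_o(t)$, where $S_j(t) = \exp\left(-\int_0^t \lambda_j(u)\,du\right)$ for j = d, o. Using the assumption that $\lambda^*_o(a) = \lambda_o(a)$, equation (3) becomes

$$
S^*_c(t) = \exp\left\{-\int_0^t \lambda_c(u)\, \frac{S_d(u^-)}{S^*_c(u^-)}\,du\right\}.
$$

Take log of both sides, then differentiate with respect to t to get

$$
\frac{dS^*_c(t)/dt}{S^*_c(t)} = -\lambda_c(t)\, \frac{S_d(t^-)}{S^*_c(t^-)}.
$$

If T* is a continuous random variable then $S^*_c(t^-) = S^*_c(t)$ and $dS^*_c(t)/dt = -\lambda_c(t)\, S_d(t)$. Now integrate to obtain $S^*_c(a) - S^*_c(0) = -\int_0^a \lambda_c(t)\, S_d(t)\,dt$ and $S^*_c(0) = 1$, so that $S^*_c(a) = 1 - \int_0^a \lambda_c(u)\, S_d(u)\,du$. Thus, under the assumption $\lambda^*_o(a) = \lambda_o(a)$, A(x, y) can be expressed as

$$
A(x, y) = \frac{\int_x^y \lambda_c(u)\, S(u)\,du}{S_o(x)\left\{1 - \int_0^x \lambda_c(u)\, S_d(u)\,du\right\}}. \tag{4}
$$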
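The solution of the recursive equation can be checked numerically. The sketch below assumes constant rates $\lambda_c$ and $\lambda_d$ (an illustrative simplification, not the paper's data), compares the closed form $S^*_c(a) = 1 - \int_0^a \lambda_c S_d(u)\,du$ with a crude Euler integration of the differential equation, and also evaluates the implied hazard $\lambda^*_c(a) = \lambda_c(a) S_d(a)/S^*_c(a)$ that follows from equation (2) when $\lambda^*_o = \lambda_o$. All function names are hypothetical.

```python
import math

def s_c_star_closed_form(a, lam_c, lam_d):
    """S*_c(a) = 1 - int_0^a lam_c * S_d(u) du, with S_d(u) = exp(-lam_d * u)."""
    if lam_d == 0:
        return 1.0 - lam_c * a
    return 1.0 - lam_c * (1.0 - math.exp(-lam_d * a)) / lam_d

def s_c_star_numeric(a, lam_c, lam_d, steps=100_000):
    """Euler steps for d log S*_c(t) / dt = -lam_c * S_d(t) / S*_c(t)."""
    h = a / steps
    log_s = 0.0  # log S*_c(0) = log 1
    for i in range(steps):
        t = i * h
        log_s -= h * lam_c * math.exp(-lam_d * t) / math.exp(log_s)
    return math.exp(log_s)

def implied_cancer_free_hazard(a, lam_c, lam_d):
    """lam*_c(a) = lam_c * S_d(a) / S*_c(a) under the assumption lam*_o = lam_o."""
    return lam_c * math.exp(-lam_d * a) / s_c_star_closed_form(a, lam_c, lam_d)
```

For example, with the made-up rates $\lambda_c = 0.002$ and $\lambda_d = 0.001$ the two versions of $S^*_c(80)$ agree closely, and the implied rate per person-years alive and cancer free exceeds the rate per person-years alive.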


To obtain an estimate of A(x, y) using SEER incidence data and NCHS mortality data, we first divide the possible ages into k+1 intervals, $[a_i, a_{i+1})$, where $0 = a_0 < a_1 < \cdots < a_k < a_{k+1} = \infty$, and choose a calendar interval, $[t_1, t_2)$. We observe the number of first cancer incident cases ($c_i$), cancer deaths ($d_i$), and other deaths ($o_i$), occurring at ages in the interval $[a_i, a_{i+1})$ during the calendar time $[t_1, t_2)$, for i = 0,…,k. Although the cancer incident cases and the deaths often come from the same population (see Table I), this is not necessary (see Table III). We also observe $n_i^{(j)}$, which is $(t_2 - t_1)$ times the estimated number of people from the same population associated with event j (where j = c, d, or o) with ages in $[a_i, a_{i+1})$ at the midpoint, $(t_1 + t_2)/2$, of the interval $[t_1, t_2)$, for i = 0,…,k. If $t_2 - t_1 = 1$, $n_i^{(j)}$ corresponds to the midyear population with ages in $[a_i, a_{i+1})$.

Table I
Raw data.
Table III
Raw data, eye and orbit cancer, both sexes, 1990.

We assume that the observed counts $c_i$, $d_i$, $o_i$ are Poisson and the midinterval populations are fixed constants. For a motivation and discussion of this assumption see Brillinger [4] with discussion (see especially the discussion by Keiding). Assuming constant rates within age intervals, we estimate rates for ages $a \in [a_i, a_{i+1})$ by $\hat\lambda_c(a) = c_i / n_i^{(c)}$, $\hat\lambda_d(a) = d_i / n_i^{(d)}$, and $\hat\lambda_o(a) = o_i / n_i^{(o)} = \hat\lambda^*_o(a)$. These estimators replace their associated functions in equation (4) to obtain our estimator of A(x, y). In Appendix A we show the estimator using summation notation.
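As an illustration of how the plug-in estimator might be computed, here is a minimal sketch that substitutes the piecewise-constant rate estimates into equation (4) and evaluates the integrals in closed form. The function and argument names are hypothetical, and the code is not the DEVCAN implementation.

```python
import math

def phi(lam, width):
    """Integral of exp(-lam * u) du for u in [0, width); width may be inf."""
    if lam == 0:
        return width
    return (1.0 - math.exp(-lam * width)) / lam

def cum_hazard(rates, edges, age):
    """Integral of a piecewise-constant rate from 0 to a finite age."""
    total = 0.0
    for i, rate in enumerate(rates):
        width = min(edges[i + 1], age) - edges[i]
        if width > 0 and rate > 0:
            total += rate * width
    return total

def weighted_integral(lam_c, decay, edges, lo, hi):
    """Integral over [lo, hi) of lam_c(u) * exp(-cumulative decay hazard at u)."""
    total = 0.0
    for i, rate_c in enumerate(lam_c):
        a, b = max(edges[i], lo), min(edges[i + 1], hi)
        if a < b and rate_c > 0:
            start = math.exp(-cum_hazard(decay, edges, a))
            total += rate_c * start * phi(decay[i], b - a)
    return total

def age_conditional_probability(ages, c, d, o, n_c, n_d, n_o, x, y):
    """Plug-in estimate of A(x, y); ages holds a_0 = 0, a_1, ..., a_k."""
    edges = list(ages) + [math.inf]
    lam_c = [ci / ni for ci, ni in zip(c, n_c)]
    lam_d = [di / ni for di, ni in zip(d, n_d)]
    lam_o = [oi / ni for oi, ni in zip(o, n_o)]
    lam = [ld + lo_ for ld, lo_ in zip(lam_d, lam_o)]  # all-cause death rate
    numerator = weighted_integral(lam_c, lam, edges, x, y)
    surv_other = math.exp(-cum_hazard(lam_o, edges, x))
    cancer_free = 1.0 - weighted_integral(lam_c, lam_d, edges, 0.0, x)
    return numerator / (surv_other * cancer_free)
```

With a single open-ended age group and made-up rates $\hat\lambda_c = 0.001$, $\hat\lambda_d = 0.0005$ and $\hat\lambda_o = 0.0095$, the lifetime probability `age_conditional_probability([0], [10], [5], [95], [10000], [10000], [10000], 0, math.inf)` reduces to $\hat\lambda_c / (\hat\lambda_d + \hat\lambda_o) = 0.1$. If nobody dies in the oldest age group the open-ended integral diverges, which corresponds to one of the 'impossible' cohorts.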

Because we are using cross-sectional data from finite populations to estimate hazard rates for a hypothetical cohort, these estimates may produce hazards that cannot possibly describe a real cohort. There are two types of these ‘impossible’ hazards. If no one in the oldest age group dies (that is, $d_k = 0$ and $o_k = 0$), then the resulting hazards describe an impossible cohort where the probability of living forever is non-zero. Another impossible cohort would result if the probability of dying of cancer by any age a is greater than the probability of getting cancer by that same age a (this is equivalent to $\int_0^a \lambda_d(u)\,du > \int_0^a \lambda_c(u)\,du$). These impossible cohorts would rarely occur in large populations.
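A simple screen for the two kinds of 'impossible' hazards could look like the sketch below. It applies the cumulative-rate criterion stated above at the finite age-group boundaries only; the function name and the wording of the messages are illustrative assumptions.

```python
def impossible_cohort_flags(ages, lam_c, lam_d, lam_o):
    """Return a list of reasons why piecewise-constant rates describe an impossible cohort.

    ages  -- a_0 = 0, a_1, ..., a_k (left end points; the last group is open-ended)
    lam_* -- estimated rate in each age group
    """
    flags = []
    if lam_d[-1] == 0 and lam_o[-1] == 0:
        flags.append("no deaths in the oldest age group")
    cum_c = cum_d = 0.0
    # The cumulative rates are piecewise linear, so within the finite age
    # groups it is enough to compare them at the group boundaries.
    for i in range(len(ages) - 1):
        width = ages[i + 1] - ages[i]
        cum_c += lam_c[i] * width
        cum_d += lam_d[i] * width
        if cum_d > cum_c:
            flags.append(
                f"cumulative cancer death rate exceeds cumulative incidence rate by age {ages[i + 1]}"
            )
            break
    return flags
```

An empty list means neither condition was detected at the boundaries checked; the open-ended oldest age group is only screened for the absence of deaths.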

3.2. Confidence limits for A(x, y)

In this section we modify the gamma confidence intervals, developed for linear combinations of independent Poisson random variables by Fay and Feuer [1], to create confidence intervals for A(x, y). First, we put all the Poisson counts into one $(3k + 3) \times 1$ vector

$$
z = [z_1, z_2, \ldots, z_{3k+3}]' = [c_0, c_1, \ldots, c_k, d_0, \ldots, d_k, o_0, \ldots, o_k]'.
$$


Associated with each $z_i$ is a random variable $Z_i$ which we assume has a Poisson distribution with mean $\mu_i$. Let $\mu = [\mu_1, \mu_2, \ldots, \mu_{3k+3}]$. In the previous notation $\mu = [\lambda_c(a_0) n_0^{(c)}, \lambda_c(a_1) n_1^{(c)}, \ldots, \lambda_o(a_k) n_k^{(o)}]$. Emphasizing the dependence of A(x, y) on μ, we write A(x, y) = A(x, y, μ). Using this notation, our estimator is A(x, y, z). For ease of exposition we write A(x, y, z) = A(z) and A(x, y, μ) = A(μ), suppressing dependence on x and y. Using a Taylor series expansion

$$
A(Z) \approx A(\mu) + \sum_{\ell=1}^{3k+3} (Z_\ell - \mu_\ell) \left.\frac{\partial A(t)}{\partial t_\ell}\right|_{t=\mu},
$$

the variance of A(Z) may be estimated by

$$
\left\{\left.\frac{\partial A(t)}{\partial t}\right|_{t=z}\right\}' \mathrm{diag}(z) \left\{\left.\frac{\partial A(t)}{\partial t}\right|_{t=z}\right\},
$$




where diag(z) is a diagonal matrix with the values of z on the diagonal, representing an estimate of var(Z).

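Alternatively, numerical derivatives can be used. Letting

$$
\Delta_\ell A(z) = A(z_\ell^{(+)}) - A(z),
$$

with $z_\ell^{(+)} = [z_1, \ldots, z_{\ell-1}, 1 + z_\ell, z_{\ell+1}, \ldots, z_{3k+3}]$, leads to our variance estimate

$$
V(z) = \sum_{\ell=1}^{3k+3} z_\ell \left\{\Delta_\ell A(z)\right\}^2.
$$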


Our generalization of the gamma intervals [1] is to use the Taylor expansion as the linear combination of independent Poisson random variables. The only complication is that the weights may be negative and depend on the Poisson values. This complication does not affect the lower confidence limit, though; the 100(1 − α) per cent lower confidence limit is given by $L = G^{-1}_{\gamma,\beta}(\alpha/2)$, where $G^{-1}_{\gamma,\beta}(p)$ is the pth quantile of the gamma distribution with parameters $\gamma = A(z)^2 / V(z)$ and $\beta = V(z) / A(z)$ (that is, with mean A(z) and variance V(z)). However, for the upper limit the method has to be altered. When finding the maximum discrete increase in A(z), it is possible that this may occur with a decrease in one of the Poisson values. Let

$$
z_\ell^{(-)} = [z_1, \ldots, z_{\ell-1}, z_\ell - 1, z_{\ell+1}, \ldots, z_{3k+3}].
$$


Define $z^{(M)}$ to be the vector value of either $z_\ell^{(+)}$ or $z_\ell^{(-)}$ for $\ell = 1, \ldots, 3k+3$ such that $A(z^{(M)})$ is maximized. Then the upper confidence limit is $U = G^{-1}_{\gamma_M,\beta_M}(1 - \alpha/2)$, where $\gamma_M = A(z^{(M)})^2 / V(z^{(M)})$ and $\beta_M = V(z^{(M)}) / A(z^{(M)})$. Note that if we let the population and the mean μ get larger by the same constant, say N, then the generalized gamma intervals approach the usual delta method intervals (see, for example, Lehmann [5]) as N → ∞ (see Appendix B). For small μ these generalized gamma intervals perform better and are calculated straightforwardly even when some $z_i = 0$, while the delta method requires modification whenever some $z_i = 0$ in order to prevent estimates of zero variance. For the delta method, the variances corresponding to the elements of z are estimated by replacing elements with $z_i = 0$ with 0.5.
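The interval construction can be sketched generically for any statistic of a count vector. The code below makes several assumptions that should be checked against the original article: the variance estimate is taken to be $\sum_\ell z_\ell \{A(z_\ell^{(+)}) - A(z)\}^2$ with forward differences, decrements are only tried for counts of at least 1, and the gamma quantile is obtained by bisection on a textbook incomplete-gamma routine. The argument `stat` stands in for A(x, y, ·); all names are hypothetical.

```python
import math

def gamma_cdf(x, shape, scale):
    """Gamma CDF via the regularized lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    t = x / scale
    log_front = -t + shape * math.log(t) - math.lgamma(shape)
    if t < shape + 1.0:
        # Series expansion, used when t is below shape + 1.
        term = total = 1.0 / shape
        n = shape
        for _ in range(100_000):
            n += 1.0
            term *= t / n
            total += term
            if term < total * 1e-16:
                break
        return total * math.exp(log_front)
    # Continued fraction (modified Lentz) for the upper tail.
    tiny = 1e-300
    b = t + 1.0 - shape
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 100_000):
        an = -i * (i - shape)
        b += 2.0
        d = an * d + b
        d = tiny if abs(d) < tiny else d
        c = b + an / c
        c = tiny if abs(c) < tiny else c
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-15:
            break
    return 1.0 - math.exp(log_front) * h

def gamma_quantile(p, shape, scale):
    """Invert gamma_cdf by bisection."""
    lo, hi = 0.0, shape * scale
    while gamma_cdf(hi, shape, scale) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid, shape, scale) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def bump(z, l, step):
    """Copy of z with the l-th count changed by step."""
    return z[:l] + [z[l] + step] + z[l + 1:]

def variance(stat, z):
    """Assumed form: sum of z_l times the squared forward difference of stat."""
    base = stat(z)
    return sum(z[l] * (stat(bump(z, l, 1)) - base) ** 2 for l in range(len(z)))

def gamma_interval(stat, z, alpha=0.05):
    """Lower and upper limits built from gamma quantiles, as described in the text."""
    a, v = stat(z), variance(stat, z)
    lower = gamma_quantile(alpha / 2, a * a / v, v / a) if a > 0 and v > 0 else 0.0
    candidates = [bump(z, l, 1) for l in range(len(z))]
    candidates += [bump(z, l, -1) for l in range(len(z)) if z[l] >= 1]
    z_max = max(candidates, key=stat)  # count vector that maximizes the statistic
    a_m, v_m = stat(z_max), variance(stat, z_max)
    upper = gamma_quantile(1 - alpha / 2, a_m * a_m / v_m, v_m / a_m)
    return lower, upper
```

As a sanity check, for a single Poisson count with `stat = lambda z: float(z[0])` the construction reduces to the exact Poisson limits, roughly (4.80, 18.39) for a count of 10, and it yields a non-zero upper limit when the count is zero.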


Our examples use SEER cancer incidence data and NCHS mortality data associated with the corresponding SEER catchment areas (see Ries et al. [6]). We calculate our statistics for two types of cancer, invasive female breast cancer, one of the more common cancers, for all races from the expanded 11 SEER registries from $t_1$ = 1 January 1996 until $t_2$ = 31 December 1998, and acute lymphocytic leukaemia (ALL) for all races from the nine SEER registries active during $t_1$ = 1 January 1990 until $t_2$ = 31 December 1990. ALL was chosen because it is primarily a childhood cancer (see Table I) and provides an example which has high rates at young ages unlike many cancer sites, for example, breast cancer, which have increasing incidence for older age groups.

The raw data are listed in Table I, and A(x, y, z) with the associated 95 per cent confidence intervals for different values of x and y are listed in Table II. As expected from Appendix B, the delta method confidence intervals are very similar to the gamma method confidence intervals for these data. We also calculated the gamma confidence intervals using the exact derivatives ($\partial A(t)/\partial t$) instead of the numerical ones ($\Delta A(z)$), and these intervals (not shown) give essentially the same results as the gamma intervals listed in Table II (the values in terms of probabilities are equal up to five significant digits).

Table II
Estimated per cent developing cancer by age y, given no cancer before age x (with 95 per cent confidence intervals).

From Table II, the estimated probability of developing invasive breast cancer in one's lifetime is 0.1332 while the probability of developing breast cancer given alive and cancer free at 30 is 0.1348. It seems contradictory that by surviving from age 0 to age 30 without dying or getting breast cancer, a woman actually increases the probability of getting breast cancer in the remainder of her life. To gain more insight into this situation consider an example of a birth cohort of 100 females and assume that 12 will develop breast cancer over their lifetime. If by age 5 two of the girls have died of other causes and none have yet developed breast cancer, the risk of developing breast cancer after age 5 is 12/98 (> 12/100) in this cohort.
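The arithmetic of the hypothetical birth cohort can be written out as a short check; the variable names are illustrative.

```python
cohort_size = 100        # hypothetical birth cohort of females
lifetime_cases = 12      # women who will develop breast cancer
other_deaths_by_5 = 2    # deaths from other causes before age 5, none with cancer

risk_at_birth = lifetime_cases / cohort_size
risk_given_alive_at_5 = lifetime_cases / (cohort_size - other_deaths_by_5)

print(f"{risk_at_birth:.4f} {risk_given_alive_at_5:.4f}")  # 0.1200 0.1224
```

The same 12 cases are spread over a smaller group of survivors, so the conditional risk rises even though nothing about the disease has changed.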


We tested the coverage probabilities of our method in three situations. For the first two situations (female breast cancer and ALL) we assumed that the rates were exactly equal to the rates derived from Table I except we added 0.5 to zero counts. Then we simulated 10000 data sets assuming independent Poisson distributions with means equal to those counts (with 0.5 added to zeros). For the third situation, we checked our method for extremely low counts; we used incidence rates of eye and orbit cancer in the nine SEER areas in 1990 (after adding 0.5 to the zero value), and rates of eye and orbit cancer deaths and other deaths from the entire U.S. in 1990, and simulated these rates applied to the Vietnamese population in the nine SEER areas in 1990. The raw data are presented in Table III. Thus for example the expected value for $c_1$ is 5914 × (28/1817468) = 0.0911, and we have many expected count values that are much less than 1. Then we simulated 10000 data sets assuming independent Poisson distributions with means equal to those expected counts. We calculated a 95 per cent confidence interval for each simulation. In Table IV we list both $E_L$, the percentage of the lower confidence limits that are greater than the true value, and $E_U$, the percentage of the upper confidence limits that are less than the true value, where the true value refers to the estimator calculated from the counts with 0.5 added to the zeros.
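The expected count quoted for $c_1$ and the Poisson sampling step can be illustrated as follows. This is only a sketch of one ingredient of the simulation: reading 5914 as the Vietnamese person-years and 28/1817468 as the SEER rate is an interpretation of the quoted product, and the sampler, seed and variable names are assumptions rather than the authors' code.

```python
import math
import random

def poisson_draw(mean, rng):
    """Knuth's multiplication method; adequate for the small means used here."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

# Numbers quoted in the text for the expected value of c_1.
small_population_person_years = 5914
seer_cases, seer_person_years = 28, 1817468
expected_c1 = small_population_person_years * (seer_cases / seer_person_years)

rng = random.Random(2003)  # arbitrary seed, chosen only for reproducibility
draws = [poisson_draw(expected_c1, rng) for _ in range(10000)]
share_zero = draws.count(0) / len(draws)

print(round(expected_c1, 4))  # 0.0911
```

With a mean this small, about exp(−0.0911) ≈ 91 per cent of simulated counts are zero, which is why the handling of zero counts matters so much in the third situation.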

Table IV
Simulated error rates for 95 per cent confidence limits (ideal error rates are 2.5 per cent); 10000 simulations for each cancer/age range combination.

The situations with larger counts give better error rates, and the third situation with extremely low expected counts gives error rates that are very conservative. In each of the three situations the gamma intervals have error rates closer to the nominal 2.5 per cent than the delta method based confidence intervals, although there is essentially no difference in the first case. For ALL the asymmetric gamma confidence intervals produce more central confidence intervals (that is, the tails of the errors are more nearly equal) than the symmetric delta confidence intervals. For the eye and orbit situation, both methods perform very conservatively, but the gamma method is generally less conservative. In addition all the lower delta method confidence limits were less than 0, and most of the upper limits were greater than the gamma method upper limits. Thus, in all situations the gamma method performed better than the delta method.


In this paper we derived a new estimator of A(x, y), the probability of developing a first time cancer during the age interval [x, y), conditioned on being alive and cancer free at age x. We assumed that $\lambda_c(a)$ is constant within an interval and computed $\lambda^*_c(a)$, which is not constant. However, it may have been more realistic to assume that the hazards among the actual at-risk populations, $\lambda^*_c(a)$, are constant over the interval and to compute the non-constant $\lambda_c(a)$. Unfortunately, this approach does not appear to be tractable.

We have generalized the gamma confidence intervals [1] to apply to our new statistic. Although these intervals appear conservative in cases with extremely low counts, we have shown that the delta method which adds 0.5 to zero counts in the estimation of the variance of the counts performs worse. A more general way of performing the delta method is to assume that variances associated with zero counts are equal to some constant, 0 < δ < 1. The problem is that there is no obvious choice of δ; we have arbitrarily chosen δ = 0.5 in this paper. Note in the most extreme case where all counts are zero, the generalized gamma interval gives a non-zero upper limit, while the delta method gives an upper limit that approaches zero as δ → 0. Other methods, such as parametric bootstrap confidence intervals, suffer from the same problem of having no satisfactory method for handling zero counts. In the simple case of linear combinations of independent Poisson variables, Fay and Feuer [1] discuss similar issues comparing the gamma intervals and the approximate bootstrap confidence (ABC) intervals.

Our new estimator happens to be numerically similar to the existing method of Wun et al. [2]. Because our approach is new and not simply a modification of Wun et al. [2] and because the notations are very different between the two approaches, we have relegated the full comparison between the two methods to a technical report [7]. In that report, we show through Taylor series approximations that the two methods are similar. In addition, using the new method we recalculated Table I-17 of the Cancer Statistics Review, 1973–1998 [6], which gives lifetime risk of developing cancer calculated for each of 30 different cancer categories on six subpopulations, and the new estimator differs by less than 2 per cent from that of Wun et al. [2] in every case (see [7]).

DEVCAN (Probability of DEVeloping CANcer) software [8] has been freely available to calculate the statistics of Wun et al. [2] and will be updated to calculate our new estimator in a future version. See for the most current version of the software.


We thank Chris Gunther and Alexander Korobkov for computing help.


We write our estimator of A(x, y) as A(x, y, z) (see Section 3.2 for motivation of this notation).

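Let $a_i \le x < a_{i+1}$ and $a_j < y \le a_{j+1}$ for $x < y$, $i \le j$, and $j \le k$. For convenience we regroup the ages after inserting group delimiters at x and y. Let the new delimiters be $0 = b_0 \le b_1 \le b_2 \le \cdots \le b_{k+3} = \infty$, where $b_0 = a_0, \ldots, b_i = a_i$, $b_{i+1} = x$, $b_{i+2} = a_{i+1}, \ldots, b_{j+1} = a_j$, $b_{j+2} = y$, $b_{j+3} = a_{j+1}, \ldots, b_{k+3} = a_{k+1} = \infty$. We let

$$
\hat S(b_\ell) = \exp\left\{-\int_0^{b_\ell} \hat\lambda(u)\,du\right\}, \qquad \hat\lambda(u) = \hat\lambda_d(u) + \hat\lambda_o(u),
$$

and similarly $\hat S_d(b_\ell) = \exp\left\{-\int_0^{b_\ell} \hat\lambda_d(u)\,du\right\}$ and $\hat S_o(b_\ell) = \exp\left\{-\int_0^{b_\ell} \hat\lambda_o(u)\,du\right\}$. In this notation, $A(x, y) = A(b_{i+1}, b_{j+2})$, and we estimate it with

$$
A(x, y, z) = \frac{\sum_{\ell=i+1}^{j+1} \hat\lambda_c(b_\ell) \int_{b_\ell}^{b_{\ell+1}} \hat S(u)\,du}{\hat S_o(b_{i+1}) \left\{1 - \sum_{\ell=0}^{i} \hat\lambda_c(b_\ell) \int_{b_\ell}^{b_{\ell+1}} \hat S_d(u)\,du\right\}}.
$$

Because $\hat\lambda(b_\ell)$ or $\hat\lambda_d(b_\ell)$ may equal zero and $b_{\ell+1}$ may equal infinity, we let $\phi(\lambda, \ell) = \int_{b_\ell}^{b_{\ell+1}} \exp(-(u - b_\ell)\lambda)\,du$. These integrals are

$$
\phi(\lambda, \ell) =
\begin{cases}
b_{\ell+1} - b_\ell & \text{if } \lambda = 0 \text{ and } b_{\ell+1} < \infty, \\
\{1 - \exp(-\lambda (b_{\ell+1} - b_\ell))\}/\lambda & \text{if } \lambda > 0 \text{ and } b_{\ell+1} < \infty, \\
1/\lambda & \text{if } \lambda > 0 \text{ and } b_{\ell+1} = \infty, \\
\infty & \text{if } \lambda = 0 \text{ and } b_{\ell+1} = \infty,
\end{cases}
$$

where the case λ = 0 and $b_{\ell+1} = \infty$ is one of the ‘impossible’ hypothetical cohorts (see Section 3.1). Thus, we obtain

$$
A(x, y, z) = \frac{\sum_{\ell=i+1}^{j+1} \hat\lambda_c(b_\ell)\, \hat S(b_\ell)\, \phi(\hat\lambda(b_\ell), \ell)}{\hat S_o(b_{i+1}) \left\{1 - \sum_{\ell=0}^{i} \hat\lambda_c(b_\ell)\, \hat S_d(b_\ell)\, \phi(\hat\lambda_d(b_\ell), \ell)\right\}}.
$$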



Fay and Feuer [1] stated that the gamma intervals approach the standard normal intervals if (using the application of this paper) A(z) goes to infinity in such a way that $V(z) A(z)^{-1}$ remains constant. This is not helpful for our situation (nor is it particularly helpful for studying directly standardized rates as in Fay and Feuer [1]). Here assume that the mean counts, μ, and the person-years, $n = \{n_0^{(c)}, n_1^{(c)}, \ldots, n_k^{(o)}\}$, both increase by the same factor, say N. Since A(μ) is a function of the rates only, this value does not change as N increases; however, as one would expect, the variance estimates will change by a factor of $N^{-1}$. We write the lower confidence limit in terms of the chi-square distribution, $\{V/(2NA)\}\,(\chi^2)^{-1}_{2NA^2/V}(\alpha/2)$, where A = A(z) and $N^{-1} V = V(z)$. The difference of the lower gamma confidence limit and the standard normal lower limit approaches zero as N → ∞:

$$
\lim_{N \to \infty} \left[ \frac{V}{2NA}\, (\chi^2)^{-1}_{\nu}(\alpha/2) - \left\{ A + \sqrt{V/N}\; \Phi^{-1}(\alpha/2) \right\} \right]
= \lim_{N \to \infty} \sqrt{V/N} \left[ \frac{(\chi^2)^{-1}_{\nu}(\alpha/2) - \nu}{\sqrt{2\nu}} - \Phi^{-1}(\alpha/2) \right] = 0, \qquad \nu = \frac{2NA^2}{V},
$$

where the result follows since $\lim_{\nu \to \infty} \{(\chi^2)^{-1}_{\nu}(p) - \nu\}/\sqrt{2\nu} = \Phi^{-1}(p)$ (Johnson and Kotz, reference [9, p. 170]), and $\Phi^{-1}(p)$ is the pth quantile of the standard normal distribution. One can similarly show that the upper confidence limits approach the standard normal limits.


This article is a US government work and is in the public domain in the U.S.A.


1. Fay MP, Feuer EJ. Confidence intervals for directly standardized rates: a method based on the gamma distribution. Statistics in Medicine. 1997;16:791–801.
2. Wun L-M, Merrill RM, Feuer EJ. Estimating lifetime and age-conditional probabilities of developing cancer. Lifetime Data Analysis. 1998;4:169–186.
3. Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. Wiley; New York: 1980. pp. 163–178.
4. Brillinger DR. The natural variability of vital rates and associated statistics (with Discussion). Biometrics. 1986;42:693–734.
5. Lehmann EL. Elements of Large-Sample Theory. Springer; New York: 1999.
6. Ries LAG, Eisner MP, Kosary CL, Hankey BF, Miller BA, Clegg L, Edwards BK, editors. SEER Cancer Statistics Review, 1973–1998. National Cancer Institute; Bethesda, MD: 2001. Accessed 3 September 2002.
7. Fay MP, Pfeiffer R, Cronin KA, Le C, Feuer EJ. Comparison of two methods for calculating age-conditional probabilities of developing cancer. Technical report #2002–01. Statistical Research and Applications Branch, National Cancer Institute; 2002. Accessed 3 September 2002.
8. National Cancer Institute and Information Management Services. DEVCAN: Probability of DEVeloping CANcer software, Version 4.1. National Cancer Institute and Information Management Services, Inc.; 2001. Accessed 3 September 2002.
9. Johnson NL, Kotz S. Distributions in Statistics: Continuous Univariate Distributions-1. Wiley; New York: 1970.