Example: Breast cancer incidence data
We will develop this commentary using as a concrete example the incidence of invasive female breast cancers in the United States. For this purpose, we obtained age-specific case and population data from the National Cancer Institute’s Surveillance, Epidemiology, and End Results 9 Registries Database (SEER9) for the 36-year time period from 1973 through 2008 (November 2010 submission) (35).
In general, for any given cancer and population group, the matrix $Y = [Y_{pa}]$, $p = 1, \ldots, P$, $a = 1, \ldots, A$, contains the number of cancer diagnoses in calendar period $p$ and age group $a$, and the matrix $O = [O_{pa}]$, indexed in the same way, contains the corresponding person-years. The observed incidence rates per 100,000 person-years are $\lambda_{pa} = 10^5\, Y_{pa}/O_{pa}$, and the expected log rates are $\rho_{pa} = \log(E(Y_{pa})/O_{pa})$.
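As a minimal sketch of these definitions, the following Python snippet computes observed rates per 100,000 person-years from a case matrix and a person-years matrix. The counts are made up for illustration (they are not SEER data), and the function name `observed_rates` is hypothetical.

```python
from math import log

def observed_rates(cases, person_years, per=100_000):
    """Observed rates per `per` person-years, cell by cell (rows = periods, columns = age groups)."""
    return [
        [per * y / o for y, o in zip(y_row, o_row)]
        for y_row, o_row in zip(cases, person_years)
    ]

# Illustrative 2 x 3 matrices (made-up numbers): P = 2 periods, A = 3 age groups.
Y = [[10, 40, 90],
     [12, 44, 99]]
O = [[50_000, 40_000, 30_000],
     [60_000, 40_000, 30_000]]

rates = observed_rates(Y, O)

# Empirical log rates log(Y/O); these only stand in for the expected log rates,
# which a fitted model would estimate.
log_rates = [[log(y / o) for y, o in zip(y_row, o_row)] for y_row, o_row in zip(Y, O)]

for row in rates:
    print([round(r, 1) for r in row])
```

The empirical log rates are undefined for cells with zero cases, which is one reason a model for the expected log rates is useful for rare outcomes.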
It is instructive to think of the rate matrix in terms of its corresponding Lexis diagram (Figure 1), which makes visually clear how the diagonals of the matrices $Y$ and $O$, from upper right to lower left, represent successive birth cohorts indexed by $c$, from the oldest ($c = 1$) to the youngest ($c = C = P + A - 1$). From this perspective, it becomes clear that a new cohort enters prospective follow-up with each consecutive calendar period. For this reason, one can think of a registry as a “cohort of cohorts.” Because cancer registries are operated in perpetuity, a substantial number of birth cohorts are followed over time. Our example includes $C = 24$ nominal 8-year cohorts born from 1892 through 1984 (referred to by mid-year of birth).
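The cohort indexing can be sketched in a few lines of Python. The grid size used here (9 periods and 16 age groups) is an assumption chosen so that the number of cohorts equals the 24 reported above; the function name `cohort_index` is hypothetical.

```python
def cohort_index(p, a, A):
    """Cohort index c = p - a + A for 1-based period p and age group a (oldest cohort is 1)."""
    return p - a + A

# Assumed grid: 9 periods and 16 age groups, so that C = P + A - 1 = 24.
P, A = 9, 16
C = P + A - 1

# Cohort index for every cell of the P x A rate matrix.
lexis = [[cohort_index(p, a, A) for a in range(1, A + 1)] for p in range(1, P + 1)]

# Cells on the same diagonal (one period later, one age group older) share a cohort.
same_diagonal = all(
    lexis[p][a] == lexis[p + 1][a + 1]
    for p in range(P - 1) for a in range(A - 1)
)

print(C, lexis[0][A - 1], lexis[P - 1][0], same_diagonal)
```

The oldest age group in the first period maps to cohort 1, and the youngest age group in the last period maps to cohort C, so each new calendar period adds exactly one new cohort.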
Figure 1 Rate matrix or Lexis diagram (20-22) for invasive female breast cancer. Data from the National Cancer Institute’s Surveillance, Epidemiology, and End Results 9 Registries Database (SEER 9) for cases diagnosed from 1973 through 2008 (35). Sixteen …
The APC model: formulation
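APC analysis is based on a log-linear model for the expected rates with additive effects for age, period, and cohort:

$$\rho_{pa} = \mu + \alpha_a + \pi_p + \gamma_c. \tag{1}$$

The generic additive effects in equation (1) can be partitioned into linear and non-linear components (28). There are a number of equivalent ways to make this partition while incorporating the fundamental constraint that $c = p - a + A$. Two of the most useful (36) are the age-period form

$$\rho_{pa} = \mu + (\alpha_L - \gamma_L)(a - \bar{a}) + (\pi_L + \gamma_L)(p - \bar{p}) + \tilde{\alpha}_a + \tilde{\pi}_p + \tilde{\gamma}_c \tag{2}$$

and the age-cohort form

$$\rho_{ca} = \mu + (\alpha_L + \pi_L)(a - \bar{a}) + (\pi_L + \gamma_L)(c - \bar{c}) + \tilde{\alpha}_a + \tilde{\pi}_p + \tilde{\gamma}_c. \tag{3}$$

Notation and parameters are summarized in the table below. Importantly, all the parameters in equations (2) and (3) can be estimated from the data without imposing additional constraints, and fitted rates from both forms are identical.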
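The claim that both forms give identical fitted rates follows from the constraint c = p − a + A: with indices centered at their midpoints, the centered cohort index equals the centered period index minus the centered age index. The sketch below checks this numerically. It assumes the age-period form has slopes (α_L − γ_L) on age and (π_L + γ_L) on period, and the age-cohort form has slopes (α_L + π_L) on age and (π_L + γ_L) on cohort; the grid size, the parameter values, and all function names are illustrative assumptions.

```python
def centered(n):
    """Index values 1..n centered at their midpoint (n + 1) / 2."""
    mid = (n + 1) / 2
    return [i - mid for i in range(1, n + 1)]

def age_period_form(mu, aL, pL, gL, a_c, p_c, dev):
    # Slopes: (aL - gL) on centered age and (pL + gL) on centered period.
    return mu + (aL - gL) * a_c + (pL + gL) * p_c + dev

def age_cohort_form(mu, aL, pL, gL, a_c, c_c, dev):
    # Slopes: (aL + pL) on centered age and (pL + gL) on centered cohort.
    return mu + (aL + pL) * a_c + (pL + gL) * c_c + dev

P, A = 9, 16                      # assumed grid size
C = P + A - 1
age_c, per_c, coh_c = centered(A), centered(P), centered(C)

mu, aL, pL, gL = -7.0, 0.30, 0.02, 0.05   # arbitrary illustrative values
max_gap = 0.0
for p in range(1, P + 1):
    for a in range(1, A + 1):
        c = p - a + A
        dev = 0.0  # non-linear deviations are shared by both forms; set to zero here
        rho_ap = age_period_form(mu, aL, pL, gL, age_c[a - 1], per_c[p - 1], dev)
        rho_ac = age_cohort_form(mu, aL, pL, gL, age_c[a - 1], coh_c[c - 1], dev)
        max_gap = max(max_gap, abs(rho_ap - rho_ac))

print(max_gap < 1e-12)
```

Only the sums and differences of the linear trends enter either form, which is why these combinations can be estimated even though the individual trends cannot.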
Some estimable parameters and functions in the APC model*
There is a close correspondence between the APC parameters and estimable functions listed in the table above and fundamental aspects of the data investigated using the standard descriptive toolbox. Before highlighting some of these connections below, we hope to shed further light on the much-discussed identifiability problem.
Identifiability: “problem” or uncertainty principle?
The aspect of identifiability in question concerns whether log-linear trends in rates can be uniquely attributed to the influences of age, period, or cohort, quantified by the parameters $\alpha_L$, $\pi_L$, and $\gamma_L$. Mathematically, Holford (28) showed that one cannot do this without imposing additional unverifiable assumptions, because the three time scales are collinear (cohort equals period minus age, $c = p - a$). This issue has often implicitly been held out as a unique and unfortunate limitation of the APC model. In fact, the same issue affects time-to-event analysis of any cohort study.
To see this, consider the following thought experiment. Suppose one enrolls a cohort of exchangeable persons of identical age (e.g., the 1956 birth cohort in Figure 1) and follows them longitudinally over a decade for cancer. At the end of the study, one observes that the log incidence rate increases linearly with age. It is natural to attribute this trend entirely to the effects of ageing, and to equate the age-associated slope to the value of a parameter $\alpha_L$.
However, suppose one had also assembled an identical cohort of persons of the same age, but this study had been conducted ten years earlier. It is possible that the age-associated slopes of the two studies would be very different, if disease-causing exposures out of experimental control had been increasing or decreasing in prevalence over time. Hence, the observed age-associated slope actually estimates the parameter ($\alpha_L + \pi_L$), or longitudinal age trend (LAT; see the table above) (32), where $\alpha_L$ is the component of the trend that is attributable to ageing and $\pi_L$ is the component of the trend due to the net impact of unknown and uncontrollable exposures over successive calendar periods.
A similar issue affects any cross-sectional analysis. To “control” for the effects of ageing, suppose one studied in succession over time an event rate in persons of the same age (e.g., age group 65-69 years in Figure 1), to estimate the slope of the time trend $\pi_L$. By definition, each successive group in this cross-sectional study was born a year later. Hence, both unknown factors and factors out of experimental control associated with birth cohort could also play a role. Therefore, the observed slope over time actually estimates the parameter ($\pi_L + \gamma_L$), or net drift (see the table above) (29), where $\pi_L$ is the component of the trend that is attributable to calendar time and $\gamma_L$ is the component of the trend attributable to the successive cohorts enrolled in the study.
These simple thought experiments illustrate an important ‘uncertainty principle’ regarding the measurement of absolute rates in cohorts. Interestingly, this principle is seldom considered in the context of most epidemiological cohort and case-control studies, perhaps because these studies have a fairly narrow accrual window and often focus on relative rates rather than absolute rates. In contrast, this issue is often central in the analysis of registry data, because the follow-up has sufficient breadth and depth to reveal long-term secular trends in the population associated with age, period, and cohort. Indeed, a unique role of registry studies is to identify and quantify such trends, thereby providing direction and guidance regarding the needs for targeted analytical studies.
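The identifiability argument can be checked numerically. In the sketch below, the three linear trends are shifted by (+t, −t, +t); because the centered cohort index equals the centered period index minus the centered age index, the linear predictor is unchanged, and so are the longitudinal age trend and the net drift. The parameter values, the shift, and the function names are arbitrary illustrative assumptions.

```python
def linear_predictor(aL, pL, gL, a_c, p_c):
    """Linear part of the APC model; the centered cohort index is p_c - a_c."""
    c_c = p_c - a_c
    return aL * a_c + pL * p_c + gL * c_c

def estimable(aL, pL, gL):
    """Estimable combinations: (longitudinal age trend, net drift)."""
    return (aL + pL, pL + gL)

base = (0.30, 0.02, 0.05)        # arbitrary (alpha_L, pi_L, gamma_L)
t = 0.2                          # arbitrary shift
shifted = (base[0] + t, base[1] - t, base[2] + t)

# A few centered (age, period) positions at which to compare the two parameter sets.
grid = [(a_c, p_c) for a_c in (-2.0, 0.0, 3.0) for p_c in (-1.0, 0.0, 4.0)]
same_fit = all(
    abs(linear_predictor(*base, a_c, p_c) - linear_predictor(*shifted, a_c, p_c)) < 1e-12
    for a_c, p_c in grid
)
same_functions = all(
    abs(x - y) < 1e-12 for x, y in zip(estimable(*base), estimable(*shifted))
)
print(same_fit, same_functions)
```

Two very different sets of age, period, and cohort slopes therefore fit the data equally well, while the combinations that the thought experiments actually measure stay fixed.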
Estimable functions: separating signal from noise
The APC model provides a unique set of best-fitting log incidence rates, $\hat{\rho}_{pa}$ or equivalently $\hat{\rho}_{ca}$, obtained by plugging maximum likelihood estimators into equations (2) and (3), respectively. The corresponding variances are readily calculated. In our experience the fitted rates have an appealing amount of smoothing, and we use them routinely in our studies (36), especially for rare cancer outcomes. Experience suggests that for “moderate” sized rate matrices (in terms of $A$ and $P$), the APC model smooths the data conservatively, about as much as a 3-point moving average, yielding around a 40-60% reduction in the width of the confidence intervals. Of course, the precise amount of noise reduction depends on a number of technical details, including whether over-dispersion is present or accounted for.
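The comparison with a 3-point moving average can be made concrete with a back-of-envelope calculation. Assuming the averaged estimates are independent with equal variance (a simplification, not a property of the APC fit itself), averaging k of them scales the standard error, and hence the confidence interval width, by 1/√k. The function name is hypothetical.

```python
from math import sqrt

def ci_width_ratio(k):
    """Ratio of CI widths (k-point average vs. single estimate) for independent,
    equal-variance estimates: the standard error scales by 1 / sqrt(k)."""
    return 1 / sqrt(k)

reduction = 1 - ci_width_ratio(3)
print(f"{100 * reduction:.1f}% narrower")  # prints "42.3% narrower"
```

Under these assumptions a 3-point average sits at the lower end of the 40-60% range quoted above.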
This application of the APC model is illustrated in Figure 2 for the breast cancer data. The age-standardized rates (ASRs) over time calculated using the observed rates are nearly identical to the ASRs calculated using the APC fitted rates. However, the point-wise confidence intervals for the fitted rates are substantially narrower, by around 40% averaged over the 10-year time period.
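For readers less familiar with direct standardization, an ASR is a weighted average of age-specific rates, with weights taken from a standard population. The sketch below uses made-up rates and made-up weights (not the 2000 US standard population); the same function could be applied either to observed rates or to APC fitted rates for a given calendar period.

```python
def age_standardized_rate(rates, weights):
    """Directly standardized rate: weighted average of age-specific rates."""
    total = sum(weights)
    return sum(w * r for w, r in zip(weights, rates)) / total

# Made-up age-specific rates per 100,000 and made-up standard-population weights.
rates_by_age = [20.0, 100.0, 300.0]
standard_weights = [0.5, 0.3, 0.2]

asr = age_standardized_rate(rates_by_age, standard_weights)
print(round(asr, 1))
```

Because the weights are fixed, any narrowing of the ASR confidence intervals comes entirely from the reduced variance of the age-specific rates that are plugged in.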
Figure 2 Age-standardized rates (ASRs, 2000 standard US population) for invasive female breast cancer. Data from the National Cancer Institute’s SEER 9 Database. ASRs calculated using observed rates (grey) and age-period-cohort fitted rates (red). Point …
Estimable functions: connections to the classical approaches
The APC parameter called the net drift (see the table above and equations (2) and (3)) estimates the same quantity as the EAPC of the ASR, i.e., the overall long-term secular trend. The point estimates for these quantities are almost identical for the breast cancer data: net drift = 0.83% per year (95% CI: 0.78 to 0.85%/yr) and EAPC = 0.78% per year (95% CI: 0.18 to 1.39%/yr). However, for this example, the confidence interval is much narrower for the net drift.
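The EAPC is conventionally computed from the slope of a log-linear regression of the rate on calendar year, as 100 × (exp(slope) − 1). The sketch below uses unweighted least squares for simplicity (an assumption; surveillance software may use weights) and a synthetic series that grows by exactly 0.8% per year rather than the SEER data. The function name `eapc` is hypothetical.

```python
from math import exp, log

def eapc(years, rates):
    """Estimated annual percent change: 100 * (exp(slope) - 1), where slope is the
    least-squares slope of log(rate) on calendar year."""
    n = len(years)
    x_bar = sum(years) / n
    y = [log(r) for r in rates]
    y_bar = sum(y) / n
    sxy = sum((x - x_bar) * (yi - y_bar) for x, yi in zip(years, y))
    sxx = sum((x - x_bar) ** 2 for x in years)
    return 100 * (exp(sxy / sxx) - 1)

# Synthetic series that grows by exactly 0.8% per year (not the SEER data).
years = list(range(1973, 2009))
rates = [100.0 * 1.008 ** (yr - 1973) for yr in years]

print(round(eapc(years, rates), 3))
```

A net drift expressed as a slope on the log scale per calendar year can be converted to percent per year with the same transformation.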
We introduced a novel estimable function called the fitted age-at-onset curve to summarize the longitudinal (i.e., cohort-specific) age-associated natural history (Figure 3) (46). By construction, the fitted curve extrapolates from observed age-specific rates over the full range of birth cohorts to estimate past, current, and future rates for the referent cohort, e.g., the 1932 cohort in this example. The fitted age-at-onset curve provides a longitudinal age-specific rate curve that is adjusted for both calendar-period and birth-cohort effects. We view it as an improved version of the cross-sectional age-specific rate curve, improved because the cross-sectional curve is not adjusted for period and cohort effects (47). The fitted curve has proven very useful in practice (38).
Figure 3 Cohort-specific age-specific incidence rates for invasive female breast cancer. Data from the National Cancer Institute’s SEER 9 Database, stratified by 8-year birth cohorts. The age-period-cohort (APC) fitted age-at-onset curve (red line and …
Finally, period deviations in the APC model identify changes over time; such change points are often analyzed non-parametrically using joinpoint regression methods (25). Similarly, cohort deviations can provide an explanation for joinpoint patterns in age-specific rates over time.
APC analysis: beyond the basics
There are many useful extensions to the basic APC model. Estimable functions are amenable to formal hypothesis tests (29). Parameters associated with age, period, and cohort can be smoothed (49). Parametric assumptions about the shape of the age incidence curve derived from mathematical models of carcinogenesis can be incorporated (50). Other extensions have included parametric (33) and nonparametric (51) assessments of changes in period and cohort deviations, and simultaneous modeling of a moderate or large number of strata, such as geographic areas, using Bayes and Empirical Bayes methods (53).
Recently, we developed novel methods to compare age-related natural histories and time trends between distinct event rates, assuming that separate APC models hold for each (36). Using this approach one can formally contrast the incidence of a given tumor such as breast cancer in two populations, say Black versus White women (46), or the incidence of two tumor subtypes in the same population, say, ER positive versus ER negative breast cancers (46, supplemental Figure). We demonstrated that two event rates are proportional over age, period, or cohort if and only if certain sets of APC parameters are all equal across the respective event-specific models (36). We also developed corresponding tests of proportionality and estimators of rate ratios.
A number of authors have forecast future cancer rates using the APC model (54). Projections quantify the future implications of current trends, for example, the impact of a net drift of 1% versus 2% over time, or the future impact of recent changes in birth cohort patterns.
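To give a sense of scale for the 1% versus 2% comparison, the sketch below simply compounds a constant annual percent change over 20 years from a hypothetical baseline rate. This is a deliberately crude illustration: an actual APC projection would also carry forward age and cohort effects, and the baseline value and function name are assumptions.

```python
def project_rate(current_rate, net_drift_percent, years_ahead):
    """Rate after compounding a constant annual percent change."""
    return current_rate * (1 + net_drift_percent / 100) ** years_ahead

baseline = 100.0   # hypothetical current rate per 100,000 person-years
for drift in (1.0, 2.0):
    print(drift, round(project_rate(baseline, drift, 20), 1))
```

Under this simplification, a 1% drift raises the rate by about 22% over two decades, whereas a 2% drift raises it by nearly 49%.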