Knowledge of the likelihood that a screen-detected cancer case has been overdiagnosed is vitally important for treatment decision making and screening policy development. An overdiagnosed case is an excess case detected because of cancer screening. Estimates of the frequency of overdiagnosis in breast and prostate cancer screening are highly variable across studies. In this article we identify features of overdiagnosis studies that influence results and illustrate their impact using published studies. We first consider different ways to define and measure overdiagnosis. We then examine contextual features and how they affect overdiagnosis estimates. Finally, we discuss the effect of estimation approach. Many studies use excess incidence under screening as a proxy for overdiagnosis. Others use statistical models to make inferences about lead time or natural history and then derive the corresponding fraction of cases that are overdiagnosed. We conclude with a list of questions that readers of overdiagnosis studies can use to evaluate the validity and relevance of published estimates and recommend that authors of publications quantifying overdiagnosis provide information about these features of their studies.
Cancer screening is a firmly established component of efforts to reduce cancer mortality in the US. The goal of screening is the detection of disease at an early and treatable stage, preventing the morbidity and mortality associated with late-stage presentation. Mammograms are now routine elements of a woman’s health care, and the majority of older men are screened regularly with prostate-specific antigen (PSA) testing. However, in recent years there has been growing concern about the adverse effects of cancer screening, in particular the problem of overdiagnosis.
Overdiagnosis occurs when screening detects a tumor that would not have presented clinically in the absence of screening. Thus, an overdiagnosed case is an excess case in the sense that it is only identified because of screening. Treatment of such a case is harmful because it cannot, by definition, improve disease outcomes. Accurately quantifying the frequency of overdiagnosis is important for informed decision making and clinical policy development since the scale of harms relative to benefits determines the individual and societal value of screening.
Overdiagnosis has long been a concern in prostate cancer screening. Early prostate autopsy studies revealed a high prevalence of latent prostate cancer among older men (1), and subsequent work (2) indicated that roughly 75% of all prostate cancers are latent and never surface clinically, indicating an enormous pool of individuals at risk of overdiagnosis. Concerns about overdiagnosis in breast cancer screening are more recent but have been growing. Several high-profile articles published in the last few years (3–5) have focused attention on the possibility that overdiagnosis in breast cancer screening may be an issue of much greater magnitude than previously recognized.
Clearly, knowing how many cancers are overdiagnosed by screening is necessary to inform policy and clinical practice. However, in both the prostate and breast cancer literatures, overdiagnosis estimates are highly variable across studies. In prostate cancer, estimates range from as low as 23% (6) to over 60% (7) of screen-detected cases. In breast cancer, estimates are also inconsistent, ranging from 10% or fewer (8, 9) to 30% or more (3, 4, 10) of cases overdiagnosed. In an effort to reach consensus for breast cancer, The Lancet recently commissioned an independent review of the evidence (11), but the resulting report concluded that, despite the plethora of studies, further research was still needed to accurately assess the magnitude of overdiagnosis.
In this article we examine features of published studies that influence estimates of overdiagnosis. The most influential features are definition (12) and measurement (8), study design and context (13), and estimation approaches (14, 15). Understanding how these features affect results will facilitate proper interpretation and use of published estimates of overdiagnosis by policy makers, clinicians, and patients. Prior reviews and methodology publications (8, 14–16) have discussed the importance of these features and even highlighted how certain choices in study design and execution can bias outcomes, but these have tended to focus on a specific cancer (either breast or prostate). Our goal is not to recapitulate these articles; rather, we integrate studies from the breast and prostate cancer literature to emphasize general principles that go beyond disease-specific considerations. To identify these studies, we searched PubMed from January 1, 1995, to November 30, 2012, using the search terms “overdiagnosis” and either “prostate cancer” or “PSA screening.” In parallel, we also searched on the terms “overdiagnosis” and either “mammography” or “breast cancer screening.” Our search identified 172 publications in the prostate literature and 195 publications in the breast literature, from which we selected examples of quantitative studies to highlight different definitions and measures, study designs and contextual factors, and estimation approaches. The selected studies are summarized in Tables 1 and 2.
There are two major concepts of overdiagnosis in the literature. The first and most commonly used defines overdiagnosis as a screen-detected cancer that would have remained latent for the remainder of the patient’s lifetime in the absence of screening. According to this definition, an overdiagnosed case is a true excess case of cancer, essentially “caused” by screening. Such a cancer may be biologically indolent and therefore clinically non-progressive. Alternatively, it may be progressive, but the patient’s life expectancy at the time of screen detection may be short enough that death due to other causes occurs before the disease can cause symptoms. Figure 1 shows that when the time interval from screen detection to clinical presentation (diagnosis in the absence of screening), also known as the lead time, is longer (for instance, if the tumor is biologically indolent), there is a greater chance that other-cause death will occur first and therefore a higher risk of overdiagnosis. Conversely, for a given lead time, a higher risk of other-cause death (due to advanced age or poor health) implies a higher risk of overdiagnosis. Estimates of the frequency of overdiagnosis based on this definition generally increase with age at screen detection.
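The interplay between lead time and other-cause death under this first definition can be illustrated with a minimal Monte Carlo sketch. This is not a method from the studies reviewed here; the exponential waiting-time distributions and the parameter values are illustrative assumptions only.

```python
import random

def overdiagnosis_probability(mean_lead_time, mean_survival, n=100_000, seed=1):
    """Sketch: a screen-detected case is overdiagnosed when other-cause death
    occurs before the tumor would have surfaced clinically, i.e., when
    remaining survival is shorter than the lead time. Exponential waiting
    times are an illustrative assumption, not an empirical model."""
    rng = random.Random(seed)
    overdiagnosed = 0
    for _ in range(n):
        lead_time = rng.expovariate(1 / mean_lead_time)   # years to clinical presentation
        survival = rng.expovariate(1 / mean_survival)     # years to other-cause death
        if survival < lead_time:
            overdiagnosed += 1
    return overdiagnosed / n

# Longer lead time (e.g., an indolent tumor) implies a higher risk of
# overdiagnosis; with exponentials this approaches L / (L + S) analytically.
p_short = overdiagnosis_probability(mean_lead_time=2, mean_survival=15)
p_long = overdiagnosis_probability(mean_lead_time=10, mean_survival=15)
```

Holding lead time fixed and shortening `mean_survival` (advanced age or poor health) raises the probability in the same way, matching the figure's two-directional logic.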
The second concept defines as overdiagnosed only biologically indolent tumors, based on their clinical and/or pathologic characteristics. This definition does not take into account life expectancy at the time of screen detection, and the resulting estimates do not exhibit the same age dependency as those based on the first definition. In general, the two definitions may yield very different estimates of overdiagnosis. In the case of prostate cancer, for example, a study (17) that defined overdiagnosis in terms of clinical and pathologic tumor features concluded that most prostatic cancers detected by PSA screening were likely to be clinically significant, but studies that estimated the frequency of overdiagnosis based on the first concept indicated that it is not uncommon (Table 1). A similar divergence of estimates using the different definitions was noted by Bach in studies of overdiagnosis due to lung cancer screening (12).
In the rest of this article we use the first definition since this is the one used in most studies of overdiagnosis in breast and prostate cancer.
Studies vary considerably in how the frequency of overdiagnosis is measured and presented (8, 15). De Gelder et al (8) cite no fewer than seven different measures of the extent of overdiagnosis. The many options arise because estimates of overdiagnosis are generally presented as a ratio with the numerator being the estimated number of cases overdiagnosed and with many choices for the denominator. Some studies (6, 18) report overdiagnosis as a fraction of screen-detected cases. Others (4, 19) present the number overdiagnosed as a fraction of the total number of cases detected or the total number invited to screening. Many (5, 10) consider the number overdiagnosed relative to the number of cases expected without screening as an expression of the magnitude of screening-induced excess diagnoses. At least one study (20) presents the ratio of overdiagnosed cases to deaths prevented by screening, but this measure effectively conflates screening harm and benefit and is highly sensitive to follow-up duration (21). Different metrics can produce very different results. De Gelder et al (8) concluded that estimates of overdiagnosis could vary by a factor of 3.5 when different denominators were used. However, some studies present results using only a single metric, and this can make comparison with other studies that use a different metric difficult if not impossible.
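The denominator problem is pure arithmetic and can be sketched directly. All counts below are hypothetical round numbers chosen for illustration, not figures from any study cited here.

```python
def overdiagnosis_measures(overdiagnosed, screen_detected, all_detected,
                           expected_without_screening):
    """One overdiagnosis count, three of the commonly reported ratios.
    All inputs are hypothetical case counts."""
    return {
        "fraction of screen-detected cases": overdiagnosed / screen_detected,
        "fraction of all detected cases": overdiagnosed / all_detected,
        "relative to expected cases without screening":
            overdiagnosed / expected_without_screening,
    }

measures = overdiagnosis_measures(
    overdiagnosed=100, screen_detected=500,
    all_detected=800, expected_without_screening=700)
# The same 100 overdiagnosed cases read as 20%, 12.5%, or roughly 14%
# depending solely on which denominator a study chose to report.
```

A reader comparing a "20%" study with a "12.5%" study could wrongly infer a real difference when only the metric differs, which is why single-metric reporting hampers comparison.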
The study design and context are important sources of variation in published overdiagnosis frequencies. Here, study design and context refers to: (1) the type of study (observational or clinical trial), (2) the population used to estimate overdiagnosis, and (3) the diagnostic intensity (fraction of the population tested and/or biopsied) as reflected by the incidence with and without screening.
The type of study is important because it will dictate the specific protocol used for screening and whether there is a concurrent control group. Screening trials generally implement protocols calling for regularly scheduled examinations whereas opportunistic screening in a population setting may be more variable. The specific screening strategy used can strongly influence the frequency of overdiagnosis (22). The presence of a concurrent control group can be of great value when estimating overdiagnosis using information on the excess incidence of cancer under screening since it will provide an appropriate baseline for computing the extent of the excess. The absence of a control group necessitates projecting what baseline disease incidence rates would have been in the absence of screening and this presents its own set of challenges.
The population under study is important because populations vary in terms of their underlying disease prevalence and natural history. Puliti et al (15) highlight this point in their review of European studies of overdiagnosis due to mammography screening. Differences in underlying disease risk across populations are a concern when estimating overdiagnosis based on excess incidence in screened versus unscreened countries or regions.
The difference in diagnostic intensity as reflected by incidence with and without screening is important because an overdiagnosed case is an excess case relative to no screening; lower incidence in the absence of screening creates a larger latent pool of cases in the population and a greater potential for overdiagnosis. Consequently, higher incidence in the presence of screening due to more complete compliance with screening invitations or better adherence to biopsy recommendations will increase the reach into the latent pool and with it the likelihood of overdiagnosis.
In the case of prostate cancer screening, design and contextual factors are responsible for much of the difference between overdiagnosis estimates from the US population and from the European population based on the European Randomized Study of Screening for Prostate Cancer (ERSPC), one of the two large randomized trials of prostate cancer screening. Draisma and colleagues (13, 23) found that overdiagnosis among men aged 50–84 was 66% based on a model developed for the Rotterdam section of the ERSPC but 42% when the same model was adjusted to reflect incidence in the US setting (24). There are several identifiable contextual reasons for this difference. First, prostate cancer incidence was considerably higher in the US than in ERSPC centers before screening was introduced (25). Thus, at the start of the ERSPC trial, there was a relatively greater pool of latent cases with the potential to be screen detected than in the US population. Second, participants on the screening arm of the ERSPC were highly compliant with the trial protocol; compliance with biopsy referral was 86% on average across ERSPC centers (20) compared with approximately 40% in the US (26). Furthermore, the ERSPC centers generally had a lower threshold for referral to biopsy (PSA > 3 μg/L) than was standard in the US. These differences in contextual factors produced a greater potential for and a correspondingly higher estimate of the frequency of overdiagnosis in the ERSPC compared with the US population setting.
The definition of an overdiagnosed case as an excess or extra diagnosis gives rise to estimation approaches based on the excess incidence of disease in the presence of screening. There are two main approaches. The first uses the observed excess incidence—the difference between incidence in the presence and incidence in the absence of screening—as a proxy for overdiagnosis. We refer to this as the excess-incidence approach. The second uses disease incidence under screening to make inferences about the lead time or the natural history of the disease and estimates the corresponding frequency of overdiagnosis. We refer to this as the lead-time approach. The motivation for each approach is shown in Figure 2, which is loosely based on a study by Feuer and Wun (27) that linked patterns of breast cancer incidence with lead time and overdiagnosis.
Figure 2 presents three scenarios in which screening is introduced into a population with a background annual incidence rate of 100 cases per 100,000 individuals. In the first scenario, the screening test has a 1-year lead time and in the second and third scenarios the screening test has a 2-year lead time. Scenario 3 is the only one in which there is overdiagnosis, with 3 of every 13 (23%) screen-detected cases being overdiagnosed. In each scenario, the introduction of screening generates a peak in disease incidence, which is followed by a decline in incidence. The initial part of the incidence peak arises from a series of incidence “gains” caused by screen detection of cases from the latent incidence pool. Incidence gains are greatest in the beginning and then reach a steady state as the latent pool declines and screening use in the population stabilizes. The initial gains are followed by a pattern of incidence deficits as those cases whose diagnosis was advanced by screening are no longer present to be detected. The lead time determines the interval between the initial gains and the subsequent deficits and, consequently, the magnitude of the corresponding incidence peak. For a disease with a given latent prevalence and a test with a given sensitivity, a longer lead time causes the incidence peak to be higher and wider. Regardless of the lead time, if there is no overdiagnosis, incidence eventually returns to its original background level as annual gains and deficits equalize. In the figure, the background incidence is constant, but in reality the level may well change over time.
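The gains-then-deficits dynamic described above can be sketched with a deliberately simplified deterministic model: everyone is screened at once, the test is perfectly sensitive, the lead time is a fixed number of years, and there is no overdiagnosis. These assumptions are ours, made for illustration, and they collapse the incidence peak into a single year rather than the higher-and-wider peak of the figure.

```python
def incidence_with_screening(background, lead_time, years, screening_start):
    """Deterministic sketch of the incidence dynamics described in the text.
    Screening advances every case's diagnosis by a fixed lead time, so the
    first screening year detects the latent pool (a peak), after which annual
    incidence returns to the background level: no overdiagnosis occurs."""
    detected_through = 0  # latest year whose would-be clinical cases are already detected
    series = []
    for year in range(1, years + 1):
        if year < screening_start:
            series.append(background)          # pre-screening: clinical presentation only
            detected_through = year
        else:
            horizon = year + lead_time         # screening reaches cases due up to this year
            new_cases = background * max(0, horizon - detected_through)
            series.append(new_cases)
            detected_through = max(detected_through, horizon)
    return series

# Background of 100 cases/yr, 2-year lead time, screening begins in year 4.
series = incidence_with_screening(background=100, lead_time=2, years=8,
                                  screening_start=4)
```

In this sketch the peak year detects `background * (lead_time + 1)` cases, so a longer lead time yields a larger peak, and incidence settles back to the background rate thereafter, consistent with the no-overdiagnosis scenarios. Note that the cumulative excess within the window (200 cases here) reflects lead time alone, not overdiagnosis, foreshadowing the bias discussed below.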
Figure 2 demonstrates that the excess incidence under screening is highly informative about overdiagnosis and lead time; both the excess-incidence and lead-time approaches to estimation are based on this observation. However, the figure also shows that both the excess-incidence and lead-time approaches have important caveats and limitations.
The excess-incidence approach is the predominant approach used in the breast cancer literature, and the methodological considerations of this approach have been the topic of several recent reviews (14–16). This approach may yield a biased result, particularly if the early years of screening dissemination are included. This is because, as shown in the third panel of the figure, the excess incidence in the early years of screening consists of a mixture of overdiagnosed and non-overdiagnosed cases and therefore will overestimate the frequency of overdiagnosis. In this figure, the excess incidence estimate of the fraction overdiagnosed among screen-detected cases amounts to 40% (47/117) when including the first six years of the screening program, but 23% (6/26)—the correct answer—when restricting the estimate to the last two years. Similarly, De Gelder et al (8) estimated overdiagnosis based on breast cancer incidence within different intervals relative to the implementation of screening in the Netherlands and found that estimates based on years before the program was fully implemented were four times higher than those based on years after this point. Thus, it is important for excess-incidence estimates of overdiagnosis to be appropriately timed relative to the dissemination of screening.
Second, if the incidence under screening is computed based only on age groups eligible for screening, it will overestimate excess cases because it will not account for deficits in older age groups due to screen detection at younger ages. Thus, excess incidence calculations must account for what is commonly termed a “compensatory drop” in incidence at older ages.
Third, the baseline or control incidence trend must reflect the incidence for the screened population that would be expected in the absence of screening. In the figure, the baseline incidence is set to a constant level, but in reality this is frequently an unreasonable assumption. It can be challenging to project baseline incidence correctly when it is not observable (e.g., via a concurrent control group from the same population). In this case, baseline incidence may be based on historical trends (5, 10) or concurrent trends among age groups not eligible for screening (4, 10). In countries where screening programs have been implemented in some regions but not in others, baseline incidence may be based on the trends in the regions not offered screening (5, 28). In some cases trends or changes in disease risk factors (e.g. over time or across regions) may have to be assessed and used to adjust estimated baseline incidence. Further, to account for lead time, the observed incidence in a screened population of age A should theoretically be compared with baseline incidence in a population of age A+L, where L is the lead time for that age.
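The age A versus age A+L adjustment can be made concrete with a small sketch. The age-indexed rates below are hypothetical (per 100,000 per year) and are chosen only to show that, when baseline incidence rises with age, ignoring lead time overstates the excess.

```python
def excess_incidence(observed, baseline, age, lead_time=0):
    """Sketch of the lead-time adjustment described in the text: observed
    incidence at age A under screening is compared with baseline incidence
    at age A + L. Inputs are hypothetical dicts of age -> annual rate."""
    return observed[age] - baseline[age + lead_time]

# Hypothetical rates per 100,000: baseline rises with age, as is typical.
baseline = {60: 200, 62: 220, 64: 240}
observed = {60: 250}

naive = excess_incidence(observed, baseline, age=60)                 # vs age 60
adjusted = excess_incidence(observed, baseline, age=60, lead_time=4) # vs age 64
```

Here the unadjusted comparison yields an excess of 50 per 100,000, while comparing against the age-64 baseline (a 4-year lead time) shrinks it to 10, illustrating how the choice of baseline comparison alone can move an overdiagnosis estimate.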
Variation across studies in how the baseline incidence is estimated can significantly impact results. Even within a single study the effects of different choices in the modeling of baseline incidence can be considerable. As an example, Morrell et al (10) used two different methods to project expected breast cancer incidence in the absence of screening in 1999–2001, after organized mammography screening had become well established in New South Wales, Australia. The first method interpolated between the oldest and youngest age groups in the population who did not undergo screening. The second method extrapolated pre-screening incidence trends accounting for changes in breast cancer risk factors such as obesity and use of hormone replacement therapy. Corresponding overdiagnosis estimates expressed relative to the projected incidence without screening were 42% and 30%, indicating that the baseline incidence corresponding to the interpolation method was lower than the baseline incidence corresponding to the extrapolation method.
Rather than using empirical differences between observed and baseline incidence, the lead-time approach uses modeling techniques to infer the lead time and the corresponding fraction of cases overdiagnosed from the pattern of excess incidence under screening. This approach has the advantage that it can be applied to data from the beginning of screening dissemination. However, like the excess-incidence approach, it requires an estimate of the expected baseline incidence in the absence of screening, and results can be sensitive to this estimate. In addition, this approach requires knowledge of screening dissemination and practice patterns in the population.
While the statistical literature provides theoretical underpinnings (29, 30) and some standard methods (31), there are many different ways to estimate lead time and overdiagnosis by modeling. Some modeling studies use observed incidence under screening to estimate elaborate models of the underlying progression of disease, effectively imputing times of disease onset, progression to metastasis, and transitions from a latent to symptomatic state (8, 13). In these models, the baseline incidence is generally estimated along with the underlying disease progression. Then, the fraction overdiagnosed given specified screening patterns is calculated empirically from the imputed disease histories and age-specific risks of other-cause death. Other studies (6, 18, 19) simply focus on estimating the lead time and overdiagnosis frequency that are most consistent with the observed trends in disease incidence under screening.
Like the excess-incidence studies, choices made regarding model structure and assumptions will affect results. As an example, Draisma et al (23) used three different models to estimate lead time and overdiagnosis corresponding to the peak in prostate cancer incidence observed in the early 1990s. The models used similar assumptions about the background trend in incidence but had quite different underlying structures. The resulting overdiagnosis estimates ranged from 23% to 42% of screen-detected cases.
Tables 1 and 2 provide examples of prostate and breast cancer studies that used the lead time and excess-incidence approaches. In the case of prostate cancer (Table 1), estimates derived using the excess-incidence approach are generally considerably higher than those derived using the lead-time approach. It is difficult, however, to say whether the higher estimates are due to context or estimation method, since most excess-incidence studies were conducted in non-US settings, while most lead-time studies were based on US data. The single US-based excess-incidence study that we reviewed reported that 1.3 million cases had been overdiagnosed from 1987 (when PSA screening began) through 2005 (32). This corresponds to an overdiagnosis frequency of 37% among all detected cases based on inflating counts of prostate cancer incidence in SEER to the US population. However, this study included the early years of PSA screening which likely inflated the results. In contrast, the lead-time approach used by Draisma et al (23) results in a range from 9% to 19% overdiagnosed among all detected cases, corresponding to an absolute number overdiagnosed that is at most about half of the 1.3 million (32). Table 2 shows an even clearer dichotomy between excess-incidence and lead-time studies in breast cancer screening, with the lead-time studies generating overdiagnosis estimates that are markedly lower than those from the excess-incidence studies.
Our examination of variation in study features and methods raises the question of whether it is possible to compare and integrate results across published studies of overdiagnosis. Clearly, the conceptual and analytic choices made by study investigators can dramatically impact overdiagnosis estimates. For consumers of the overdiagnosis literature, therefore, a necessary step is to understand what choices were made. In Table 3 we propose a list of questions that readers of overdiagnosis studies may ask to clarify these choices so as to better understand study results.
The first set of questions addresses the definition of overdiagnosis used in the study and the measure of its frequency. The second set of questions addresses contextual factors; here, information about diagnostic practices in both the absence and the presence of screening is of particular value for clarifying differences in diagnostic intensity. The final set of questions addresses estimation approach. This is undoubtedly the most complex of the features that we have considered and requires careful examination. The main limitation of the excess-incidence approach is that observed excess incidence is not an unbiased estimate of the incidence of overdiagnosis. Often, ad-hoc adjustments need to be applied to the empirical measures, and understanding these adjustments is key to evaluating these studies. The main limitation of the lead-time approach is that the links between model choices, assumptions, and results are often not transparent, which can make evaluation of these studies difficult. Prior publication of the model in the peer-reviewed statistics or biostatistics literature can be a strong positive indicator of model validity, and ongoing efforts (33) aim to improve and standardize model reporting in the interests of greater transparency.
Given that there are identifiable features of overdiagnosis studies that will influence and even bias results, what does this mean for policy makers, clinicians, and patients? First, knowledge of these features should help all consumers of overdiagnosis studies to avoid using clearly biased estimates. Second, the fact that studies use different measures of overdiagnosis should direct consumers to the ones that most meet their needs. Policy makers may want to focus on studies that present results in terms of the number of overdiagnoses per invited participant; a patient newly diagnosed following a screening test may be more interested in overdiagnoses expressed as a fraction of screen-detected cases. Recognition of the importance of contextual factors should enable consumers to select those studies that are most suited to their setting and their screening protocol. For example, in considering prostate cancer screening policies for the US population setting, it will not be appropriate to use overdiagnosis estimates from a trial conducted in Europe with a PSA cutoff that is lower than that typically used in this country and with a much higher rate of compliance with biopsy referral. Finally, knowledge of the limitations of the different approaches may help with selecting estimates based on the approach that uses relevant data sources and makes clinically reasonable assumptions.
In conclusion, we remain far from a consensus regarding how best to estimate the likelihood of overdiagnosis. Our goal has not been to make conclusions about the frequency of overdiagnosis in breast and prostate cancer screening but to help consumers of overdiagnosis publications navigate this growing, confusing, and often controversial literature. Our focus on overdiagnosis as a harm of screening has precluded discussion of screening benefit, an equally controversial topic and one in which study features and methods also almost certainly influence results. We encourage investigators publishing overdiagnosis studies to ensure that their reports address the questions in Table 3 and, if possible, include the numbers needed to translate results across commonly used metrics. Doing so will provide the transparency to adequately compare and integrate across studies of what is possibly the most important potential harm of screening.
Financial support: This work was supported by Award Numbers U01CA157224 (RE, RG, LM) and U01CA088283 and U01CA152958 (JSM) from the National Cancer Institute and the Centers for Disease Control and Prevention. Additional funding provided by Award Numbers KO5CA96940 and P01CA154292 (JSM). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Cancer Institute, the National Institutes of Health, or the Centers for Disease Control and Prevention.