The lead time and the likelihood of overdiagnosis are quantities that are critical in the assessment of the likely benefits and costs of any screening test; yet, in the case of PSA screening, results have been variable and confusing. This article is the first, to our knowledge, to closely examine the reasons for discrepancies across studies. Our results clearly show that the context or population used to derive the estimates, the definition of lead time used, and the estimation methodology all have important roles.
We considered three definitions of lead time that have been used in previous publications and showed that results differ depending on the definition used. The uncensored definition yields the longest estimated lead times and the non-overdiagnosed definition the shortest. We feel strongly that for future studies to be correctly interpreted, analysts should specify the definition used in their publications. Other definitions have also been reported. For example, McGregor et al. (14) defined overdiagnosis as the detection by screening of disease that would not have led to prostate cancer death. Because the majority of prostate cancer patients do not die of the disease (32, 33), the estimates of overdiagnosis due to PSA screening reported by McGregor et al. were considerably higher than ours, exceeding 80%.
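To make the dependence on definition concrete, the following minimal sketch computes mean lead time under the three definitions for a handful of hypothetical screen-detected men. It assumes the definitions work as follows: uncensored lead time runs from screen detection to the date at which clinical diagnosis would have occurred, ignoring other-cause death; censored lead time runs to clinical diagnosis or other-cause death, whichever comes first; and non-overdiagnosed lead time is averaged only over men whose clinical diagnosis would precede other-cause death. The function name and the toy values are invented for illustration and are not data from any study.

```python
from statistics import mean


def lead_time_summaries(years_to_clinical_dx, years_to_other_cause_death):
    """Summarize lead time under three definitions for screen-detected cases.

    Both inputs are in years from the date of screen detection. They are
    hypothetical: in real data neither quantity is directly observed.
    """
    pairs = list(zip(years_to_clinical_dx, years_to_other_cause_death))
    # Assumption: a case is overdiagnosed if other-cause death would occur
    # before the date of clinical diagnosis.
    not_overdiagnosed = [dx for dx, death in pairs if dx <= death]
    return {
        "uncensored": mean(dx for dx, _ in pairs),
        "censored": mean(min(dx, death) for dx, death in pairs),
        "non_overdiagnosed": mean(not_overdiagnosed),
        "overdiagnosis_fraction": 1 - len(not_overdiagnosed) / len(pairs),
    }


# Toy values (assumed, in years), one entry per screen-detected man
summary = lead_time_summaries([2, 4, 10, 12], [9, 20, 8, 7])
print(summary)
```

With these invented values, the mean lead time is 7 years under the uncensored definition, 5.25 years under the censored definition, and 3 years under the non-overdiagnosed definition, and half of the cases are overdiagnosed. The same four men therefore yield three different "mean lead times", which is why the definition must be stated.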
The definition of lead time may be constrained or even dictated by the study design. In studies that use stored serum samples, for example, mean lead time is estimated empirically as the average time from the first abnormal PSA test result to prostate cancer diagnosis among the cancer patients with serum samples in the repository. Gann et al. (5) used this method to estimate a mean lead time of 5 years that was based on one serum sample per patient, and Pearson et al. (34) estimated a mean lead time of 3 years by use of serial serum samples. Note that the lead times estimated in these studies refer to patients who were clinically diagnosed during the study (excluding overdiagnosed cancers), that is, corresponding to non-overdiagnosed lead time as shown in . However, this approach has some deficiencies. First, the estimates could be seriously affected by the limited follow-up time, which was 10 years in Gann et al. (5). Tornblom et al. (8), for example, studied prostate cancer incidence in Gothenburg (Sweden) in a cohort of men who were aged 67 years in 1980 and had a blood sample taken that year. For PSA levels of 3 ng/mL and greater, they estimated a median lead time of 7.8 years with 12 years of follow-up and 10.7 years with 20 years of follow-up; that is, longer follow-up yielded a longer estimate. Second, this approach assumes that cancer would have been identified by biopsy examination at the time of the abnormal PSA test.
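The effect of limited follow-up on empirically estimated lead times can be illustrated with a small sketch. It assumes a hypothetical set of true intervals from the first abnormal PSA result to clinical diagnosis and keeps only those that fall within the follow-up window, as a stored-serum study would. The intervals and the function name are invented for illustration; they are not the Gothenburg data.

```python
from statistics import median


def observed_median_lead_time(true_intervals, follow_up_years):
    """Median interval among cases diagnosed within the follow-up window.

    Cases whose clinical diagnosis would occur after the end of follow-up
    are never observed, so they drop out of the estimate.
    """
    observed = [t for t in true_intervals if t <= follow_up_years]
    return median(observed)


# Hypothetical years from first abnormal PSA result to clinical diagnosis
true_intervals = [2, 4, 6, 8, 11, 14, 18, 25]

median_12 = observed_median_lead_time(true_intervals, 12)
median_20 = observed_median_lead_time(true_intervals, 20)
print(median_12, median_20)
```

In this toy example the observed median rises from 6 years with 12 years of follow-up to 8 years with 20 years of follow-up, because the longer window admits cases with longer intervals. The underlying intervals are unchanged; only the truncation differs.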
There are also different definitions of overdiagnosis. From an epidemiological or public health perspective, the standard definition is the one that we used in this analysis, namely, the event of other-cause death before the date of clinical diagnosis. However, the clinical literature has suggested an alternative definition, namely, the detection of “clinically insignificant” disease: tumors smaller than 0.2 cm³, organ confined, and with Gleason score less than 7 (35). By this definition, the frequency of overdiagnosis is substantially lower than that reported in the present article (36). However, autopsy studies have shown that tumors that are clinically significant in this sense have a considerable chance of going undiagnosed during a lifetime, as recently reviewed (37). Therefore, we argue that this alternative definition of overdiagnosis, although potentially useful in the future, is likely premature now.
Regarding the issue of context, comparing the results from the MISCAN ERSPC and MISCAN SEER models is revealing. Lead time and overdiagnosis estimates from the original model that was based on the Rotterdam data were comparable with those published for PSA screening in the Netherlands (10). Prostate cancer and PSA screening in the US population evidently differ from the trial setting in Rotterdam (see also ). Two sets of parameters were changed: In the SEER model, the sensitivity of the screening test was lower than that in the ERSPC model, and the hazard of clinical diagnosis was higher, implying an earlier diagnosis in the absence of PSA screening. The lower sensitivity is justified by the lower PSA cutoff at 3 ng/mL in Rotterdam vs 4 ng/mL in the 1990s in the United States and, probably more importantly, by the higher biopsy compliance rate (90%) in the ERSPC Rotterdam study than in the Prostate, Lung, Colorectal, and Ovarian (PLCO) cancer screening trial (approximately 40%) (29), which is supposedly representative of US practice. Partially counterweighting these differences may be adherence to the less sensitive sextant biopsy scheme in ERSPC Rotterdam, whereas US biopsy practices gradually adopted extended-core schemes. For the assumed earlier diagnosis in the absence of PSA screening, there is less evidence, but it allowed a higher predicted incidence rate in 1985–1987 without raising incidence over the entire study period. Because lead time and overdiagnosis are defined relative to clinical diagnosis, this assumption also resulted in lower estimates, consistent with the other models. This exercise shows that baseline clinical incidence and the intensity of screening follow-up, both of which may differ across populations, may be important drivers of reported estimates of lead time and overdiagnosis in different studies.
Another source of variation could be model parameterization. In the multiparameter MISCAN model, it is likely that different combinations of parameter values might fit the data equally well, which might affect lead time and overdiagnosis estimates. By contrast, in the more parsimonious UMich model, parameters are well identified and have narrow confidence intervals (13). However, the impact of this source of variation is likely to be much smaller than that of model structure and assumptions. In this respect, the UMich model differs from both the MISCAN and FHCRC models in that its parameter estimates are based on SEER incidence only, whereas in the other models, data from other sources were also used for parameter estimation.
Finally, we discuss the role of the methods used to estimate lead times and overdiagnosis. In the present investigation, the specific model used plays a relatively minor role. The models yielded lead time and overdiagnosis estimates that were fairly consistent. It is important to note that these estimates depend on a common assumption in all three models: the dissemination of PSA screening is assumed to be the main causal factor of incidence trends since 1985. Although the models do reproduce overall incidence trends, the fit is not perfect. For example, the observed reduction of distant disease incidence is only partially reproduced by the models, replicating results of Etzioni et al. (38), who, using a different model (not calibrated to stage-specific incidence), also found that the model-projected decline in distant-stage incidence was less extreme than that observed in SEER. Also, the estimates of the mean uncensored lead time and overdiagnosis frequency are higher than those reported by Telesca et al. (39). Assuming observed incidence to be the sum of a smooth incidence trend in the absence of screening and an excess incidence that is a function of screening patterns and exponentially distributed lead times, they obtained estimates of mean uncensored lead times of 6.34 years for whites and 7.67 years for blacks. Telesca et al. (39) also showed that their estimates, which were based on population incidence, are sensitive to assumptions about background incidence. Thus, the specific modeling approach used can be influential, although our experience suggests that context and lead time definition are probably more important in explaining the heterogeneity of published lead time and overdiagnosis estimates across studies.
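As a worked illustration of what an exponential lead time distribution implies, the sketch below converts the mean uncensored lead times quoted above into the corresponding median and the fraction of lead times that exceed a given horizon. It relies only on standard properties of the exponential distribution (median = mean × ln 2; fraction exceeding t = exp(-t/mean)). The 10-year horizon and the function name are arbitrary choices made here for illustration and do not come from Telesca et al.

```python
import math


def exponential_lead_time_summary(mean_years, horizon_years=10.0):
    """Median and tail fraction of an exponential lead time distribution.

    mean_years: mean lead time in years.
    horizon_years: illustrative horizon (an assumption, not a study input).
    """
    return {
        "median": mean_years * math.log(2),
        "fraction_beyond_horizon": math.exp(-horizon_years / mean_years),
    }


# Mean uncensored lead times quoted in the text (years)
reported_means = {"whites": 6.34, "blacks": 7.67}

summaries = {
    group: exponential_lead_time_summary(mean_years)
    for group, mean_years in reported_means.items()
}
for group, values in summaries.items():
    print(group, round(values["median"], 2), round(values["fraction_beyond_horizon"], 3))
```

Under the exponential assumption, the medians (about 4.4 and 5.3 years) lie well below the means because the distribution has a long right tail, which is one reason mean and median lead times from different studies are not directly comparable.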
This study has several limitations. The estimates depend on the following assumptions: 1) All incidence trends since 1985 are due to PSA screening, which amounts to assuming an unobserved flat incidence rate in the absence of screening. This assumption may be reasonable, but we do not have independent evidence to support it. 2) We assumed that the tests captured by Mariotto's model of PSA testing practice (19), which we used, are screening tests. In the construction of her model, all follow-up PSA tests taken after diagnosis were eliminated, as were PSA tests occurring within 3 months of a previous PSA test. A fraction of the remaining tests might be diagnostic tests that were used to confirm a suspicion of prostate cancer. The size of this fraction is unknown, but its existence would imply that the screening rate is lower than we assumed. Finally, it is clear that these models were not perfect in predicting observed incidence. Incidence as predicted by the models shows a lag of 1 or 2 years with respect to observed incidence, and the models fail to explain fully the decline in distant disease. Consequently, the estimates of mean lead time and overdiagnosis rate will not be perfect either, although it is not clear in what direction they might be biased.
In conclusion, we have presented estimates of lead time and overdiagnosis from three models with different natural history descriptions and estimation strategies, all applied to the US (SEER 9) population and all using common inputs for PSA screening trends and pre-PSA clinical incidence. We have highlighted the critical roles of lead time definition, population context, and estimation methodology. We propose that future studies of lead time clearly define the specific measure used (non-overdiagnosed, censored, or uncensored) and describe key inputs (background incidence, screening protocols, biopsy compliance and sensitivity) that might differ across populations and hence might explain differing estimates of lead time and overdiagnosis associated with PSA screening. We hope that our findings will help explain the substantial variability in the reported estimates of these important measures.