To determine if U.S. radiologists accurately estimate their own interpretive performance of screening mammography and how they compare their performance to their peers’.
174 radiologists from six Breast Cancer Surveillance Consortium (BCSC) registries completed a mailed survey between 2005 and 2006. Radiologists’ estimated and actual recall, false positive, and cancer detection rates and positive predictive value of biopsy recommendation (PPV2) for screening mammography were compared. Radiologists’ ratings of their performance as lower, similar, or higher than their peers were compared to their actual performance. Associations with radiologist characteristics were estimated using weighted generalized linear models. The study was approved by the institutional review boards of the participating sites, informed consent was obtained from radiologists, and procedures were HIPAA compliant.
While most radiologists accurately estimated their cancer detection and recall rates (74% and 78% of radiologists), fewer accurately estimated their false positive rate and PPV2 (19% and 26%). Radiologists reported having similar (43%) or lower (31%) recall rates and similar (52%) or lower (33%) false positive rates compared to their peers, and similar (72%) or higher (23%) cancer detection rates and similar (72%) or higher (38%) PPV2. Estimation accuracy did not differ by radiologists’ characteristics except radiologists who interpret ≤1,000 mammograms annually were less accurate at estimating their recall rates.
Radiologists perceive their performance to be better than it actually is and at least as good as their peers. Radiologists have particular difficulty estimating their false positive rates and PPV2.
There is a movement to assess physicians’ clinical performance in everyday practice to allow identification of poorly performing physicians and to guide quality improvement and professional development. However, little is known about how well physicians understand their own performance measures. A recent review found that thirteen out of twenty studies demonstrated little, no, or an inverse relationship between self-assessment measures and physician indicators of clinical performance. However, the physician populations evaluated in these studies were not concurrently receiving information on their performance that would enable them to self-assess accurately.
Radiologists who interpret mammography are an example of a unique specialty that has received standardized performance information in the U.S. since the 1990s through the Mammography Quality Standards Act (MQSA) requirement that mammography facilities “collect and review outcome data for all mammograms performed”. The extent to which radiologists review and retain information provided by the audit reports is unknown. A concern is that if radiologists are not accurate in their perceptions of their performance, they will not know if, and in which areas, their performance needs improvement. Criteria have been established in screening mammography for identifying low performing radiologists who might benefit from educational interventions.
A 2001 survey of U.S. radiologists from 3 mammography centers found that they overestimated their rates of recommending further evaluation after screening mammograms. We have extended this prior research by assessing a larger number of radiologists from 3 more geographic sites than the previous report after a five year interval of performance feedback. We also evaluate interpretive performance for a broader array of performance measures, including false-positive and cancer detection rates. In this report we also assess radiologists’ perceptions of how their interpretative performance compares to the performance of other U.S. radiologists. Peer comparison feedback has resulted in improvements in physician performance in other areas [6, 7] and comparisons to others may be important to fully appreciate any performance gaps. In this study we tested the hypothesis that radiologists would not be able to accurately estimate their performance, but would be better at rating how their performance compares to their peers. We also hypothesized that some radiologist characteristics, including receipt of audit reports, clinical experience, and fellowship training, would increase radiologists’ accuracy at estimating their own interpretive performance.
Six mammography registries from the National Cancer Institute-funded Breast Cancer Surveillance Consortium (BCSC; http://breastscreening.cancer.gov) contributed data for this study: San Francisco (SF), North Carolina (NC), New Mexico (NM), New Hampshire (NH), Vermont (VT), and Western Washington (WA). These registries collect patient demographic and clinical information each time a woman receives a mammography examination at a participating facility. This information is linked to regional cancer registries and pathology databases to determine cancer outcomes. Data from the registries were pooled at the BCSC Statistical Coordinating Center (SCC) for analysis.
Radiologists who interpreted mammograms at a BCSC facility between January 2005 and December 2006 were invited to participate in a mailed survey in 2005 (NH, VT, and WA) or 2006 (SF, NC, NM), using survey methods previously described (Survey: http://breastscreening.cancer.gov/collaborations/favor_ii_mammography_practice_survey.pdf). Included in the analysis were mammograms indicated to be routine screening and interpreted by a surveyed radiologist between January 1, 2000 and December 31, 2005 or December 31, 2006, depending upon when the respective site administered the survey. This date restriction was made because radiologists were asked to estimate their screening interpretive performance since 2000 at the time of taking the survey. Of those 212 survey completers (68.6% completion rate=214/312), 38 additional radiologists were removed due to not performing any screening mammography during the study period for a total of 174 radiologists eligible for the study.
Each registry and the SCC received Institutional Review Board (IRB) approval to consent radiologists to take the survey and to link their survey data to BCSC performance data. Sites also received IRB approval for either active, or passive, consenting processes or a waiver of consent to collect women-level and mammography data, link data, and perform analytic studies. All procedures are compliant with Health Insurance Portability and Accountability Act (HIPAA) and all registries and the SCC have received a Federal Certificate of Confidentiality and other protection for the identities of women, physicians, and facilities that are subjects of this research.
For analyses comparing actual performance outcomes and radiologists’ estimated performance, we restricted to the 155 radiologists (155/174=89%) who provided an estimate to at least one of the survey questions on recall, false-positive, or cancer detection rates. These performance outcomes were based on the initial assessment of 1,082,639 screening mammograms interpreted by the 155 radiologists. Analysis of the positive predictive value of biopsy recommended (PPV2) was based on the final assessment after all imaging was performed up to 90 days after the screen and prior to the breast biopsy from 990,574 screening mammograms interpreted by 132 radiologists (132/174=76%).
Data on radiologist characteristics were obtained from the survey, which included the following questions on screening performance: 1) “For the following questions about screening mammograms, please estimate your values since 2000 for recall rate (%), PPV2 (%), False-positive Rate (%), and Cancer Detection Rate (per 1000)” and 2) “How do you think your current screening performance compares to others in the U.S.?” with a Likert scale for response of “Much Lower”, “Lower”, “Same”, “Higher”, or “Much Higher”. The following definitions were provided in the survey: “recall rate is % of all screens with a positive assessment leading to immediate additional work-up”; “positive predictive value of biopsy recommended (PPV2) is % of all screens with biopsy or surgical consultation recommended that resulted in cancer”; “false-positive rate is % of all screens interpreted as positive and no cancer is present”; and “cancer detection rate is # of cancers detected by mammography per 1000 screens”. We define the estimated performance values as the radiologist’s perceived performance. We categorized each radiologist’s perceived relative performance compared to others in the US into 3 categories: 1) lower (much lower or lower), 2) same, and 3) higher (higher or much higher).
The survey included demographic, experience, and clinical practice characteristics in the prior year, including age, sex, affiliation with an academic institution, fellowship training, years of mammography experience, hours working in breast imaging per week, number of continuing medical education (CME) hours in breast imaging in the last 3 years, self-reported number of mammograms interpreted in the prior year, and whether the radiologist received audit reports with information on their own interpretive performance (yes/no). We also included responses to questions evaluating the radiologist’s confidence in the use of numbers and statistics and the frequency of using numbers and statistics when discussing a positive mammogram with a patient, which were both measured on a 5-point Likert scale. We defined numeric confidence as reporting being confident, or very confident, in both interpreting medical literature and audit reports. We categorized how often a radiologist reports using numbers and statistics when discussing a positive mammogram with a patient into three categories: 1) never or rarely, 2) sometimes, and 3) often or always.
We defined a mammogram as being associated with a cancer if the woman was diagnosed with invasive breast cancer or ductal carcinoma in situ (DCIS) within one year of the mammography examination and before the next screening mammogram. We used standard definitions developed by the BCSC to define a mammogram assessment as positive and to calculate interpretative performance measures.
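The four performance measures can be illustrated with a short calculation. The sketch below is not the BCSC implementation: the field names are hypothetical, and the denominators (all screens for recall and cancer detection rates, screens without cancer for false-positive rate, biopsy-recommended screens for PPV2) are an assumption inferred from the analysis populations described in this section.

```python
from dataclasses import dataclass


@dataclass
class ScreeningCounts:
    """Hypothetical per-radiologist tallies (all field names are illustrative)."""
    screens: int             # all screening mammograms interpreted
    positive_screens: int    # screens with a positive initial assessment
    cancer_screens: int      # screens followed by a cancer diagnosis within one year
    true_positives: int      # positive screens followed by a cancer diagnosis
    biopsy_recommended: int  # screens with biopsy recommended at final assessment
    biopsy_cancers: int      # biopsy-recommended screens that resulted in cancer


def performance_measures(c: ScreeningCounts) -> dict:
    """Return recall rate, false-positive rate, cancer detection rate, and PPV2."""
    false_positives = c.positive_screens - c.true_positives
    screens_without_cancer = c.screens - c.cancer_screens
    return {
        "recall_rate_pct": 100 * c.positive_screens / c.screens,
        # Assumed denominator: screens without a cancer diagnosis.
        "false_positive_rate_pct": 100 * false_positives / screens_without_cancer,
        "cancer_detection_per_1000": 1000 * c.true_positives / c.screens,
        "ppv2_pct": 100 * c.biopsy_cancers / c.biopsy_recommended,
    }


# Made-up counts for illustration only; these are not study data.
example = ScreeningCounts(
    screens=10_000, positive_screens=1_000, cancer_screens=50,
    true_positives=40, biopsy_recommended=150, biopsy_cancers=38,
)
measures = performance_measures(example)
```

With these made-up counts the recall rate is 10% and the cancer detection rate is 4 per 1,000, while the false-positive rate (about 9.6%) is only slightly below the recall rate because cancers are rare among screens.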
To evaluate how well the radiologists estimated their performance measures compared to their actual performance measures, we created a binary outcome, accurate perceived performance, defined as accurate if they estimated their performance within +/− 5 percentage-points of their actual performance (e.g., if the actual false-positive rate is 10%, then a radiologist who provides an estimated false-positive rate between 5% and 15% would be defined as accurate) and inaccurate otherwise. The +/−5 percentage-point cut-off was chosen a priori and allowed for radiologists to round their estimated performance and still be defined as accurate.
To visually assess how well radiologists were able to estimate their actual performance outcomes, we plotted perceived versus actual performance for each of the four performance outcomes. We calculated the concordance correlation coefficient to summarize how well perceived performance agreed with actual performance. For cancer detection rate we conducted the analysis without six outliers who estimated cancer detection rate as greater than 25 per 1,000, because they were unduly influential.
For each performance measure, we calculated the mean perceived performance and percent of radiologists who had accurate perceived performance, evaluated their unadjusted associations with radiologist characteristics, and calculated 95% confidence intervals (CI) using linear and logistic regression, respectively. For each perceived performance question, we calculated the frequency with which radiologists provided no response, and the frequency with which perceived performance was >5 percentage-points below actual performance, an accurate estimate, or >5 percentage points above actual performance.
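The accuracy rule can be written as a small classifier. This is a minimal sketch: the function name and category labels are hypothetical, and treating a difference of exactly 5 percentage points as accurate is an assumption based on the "between 5% and 15%" example.

```python
from typing import Optional


def classify_estimate(perceived: Optional[float], actual: float,
                      tolerance: float = 5.0) -> str:
    """Classify a perceived rate against the actual rate (both in percent).

    A missing estimate is reported separately; otherwise the estimate is
    accurate when it lies within +/- `tolerance` percentage points of the
    actual value (boundary values count as accurate by assumption).
    """
    if perceived is None:
        return "no response"
    difference = perceived - actual
    if difference < -tolerance:
        return "underestimate"
    if difference > tolerance:
        return "overestimate"
    return "accurate"


# The false-positive example from the text: actual rate of 10%.
labels = [classify_estimate(p, 10.0) for p in (None, 4.0, 5.0, 12.0, 15.0, 16.0)]
```

For the six inputs above, only the estimates of 4% and 16% fall outside the 5% to 15% window.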
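For readers unfamiliar with the concordance correlation coefficient, the sketch below computes Lin's version, which penalizes both scatter and systematic shift away from the line of perfect agreement. Which estimator the authors' software used is not stated here, so the use of population (1/n) moments is an assumption.

```python
from typing import Sequence


def concordance_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Lin's concordance correlation coefficient using 1/n moments.

    ccc = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y)) ** 2)
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denominator = var_x + var_y + (mean_x - mean_y) ** 2
    if denominator == 0:
        raise ValueError("coefficient is undefined when both inputs are constant and equal")
    return 2 * cov_xy / denominator


# Made-up values: perfect agreement versus a constant 1-point overestimate.
perfect = concordance_correlation([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
shifted = concordance_correlation([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

Unlike the Pearson correlation, which would be 1 in both cases, the coefficient drops to 4/7 when every estimate is shifted by one unit, which is why it suits a comparison of perceived with actual values.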
We calculated the unadjusted associations between a radiologist’s perceived relative performance and their perceived and actual performance. For actual performance outcomes we applied weighted regression with weights corresponding to number of mammograms interpreted by a radiologist for a given analysis population. Specifically, for actual recall and cancer detection rates, weights were the number of screening mammograms interpreted by the radiologist; for false-positive rate weights were the number of mammograms without a diagnosis of cancer; and for PPV2 weights were the number of mammograms interpreted by the radiologist as positive at the end of the imaging work-up.
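The effect of weighting can be seen in a simplified form. A weighted regression on group indicators alone reproduces weighted group means, so the sketch below computes those directly; it is an illustration of the weighting idea under that simplification, not the authors' SAS model, and the function name and input layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def weighted_mean_by_group(rows: Iterable[Tuple[str, float, float]]) -> Dict[str, float]:
    """Weighted mean rate per group.

    Each row is (group, rate, weight), where the weight is the number of
    mammograms in the relevant analysis population for that radiologist.
    """
    weighted_sum: Dict[str, float] = defaultdict(float)
    total_weight: Dict[str, float] = defaultdict(float)
    for group, rate, weight in rows:
        weighted_sum[group] += rate * weight
        total_weight[group] += weight
    # Groups whose total weight is zero are skipped to avoid division by zero.
    return {g: weighted_sum[g] / total_weight[g]
            for g in weighted_sum if total_weight[g] > 0}


# Made-up radiologists: (perceived relative recall rate, actual recall %, screens read).
rows = [
    ("lower", 8.0, 3000),
    ("lower", 6.0, 1000),
    ("higher", 15.0, 2000),
]
means = weighted_mean_by_group(rows)
```

In this made-up example the "lower" group mean is 7.5% rather than the unweighted 7.0%, because the radiologist who read 3,000 screens contributes three times the weight of the one who read 1,000.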
All confidence intervals and p-values are two-sided based on the Wald statistic with statistically significant associations at the P<0.05 level. Data analyses were conducted using SAS® software, Version 9.2 (SAS institute, Cary, NC).
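As an illustration of a Wald interval for a percentage, the sketch below uses the simplest case: a logistic model with an intercept only, for which the standard error of the estimated log-odds is sqrt(1 / (n p (1 - p))). This is an assumption made for illustration; the study's models also related accuracy to radiologist characteristics, and the function name and counts are hypothetical.

```python
import math
from typing import Tuple


def wald_logit_ci(successes: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Two-sided Wald interval for a proportion, built on the log-odds scale.

    The interval is logit(p) +/- z * SE, back-transformed to a proportion.
    It is undefined when the observed proportion is 0 or 1.
    """
    if not 0 < successes < n:
        raise ValueError("need 0 < successes < n for a finite log-odds estimate")
    p = successes / n
    log_odds = math.log(p / (1 - p))
    standard_error = math.sqrt(1.0 / (n * p * (1 - p)))
    lower = log_odds - z * standard_error
    upper = log_odds + z * standard_error

    def expit(value: float) -> float:
        return 1.0 / (1.0 + math.exp(-value))

    return expit(lower), expit(upper)


# Hypothetical counts: 121 accurate estimates among 155 radiologists (about 78%).
low, high = wald_logit_ci(121, 155)
```

Working on the log-odds scale keeps the back-transformed limits between 0 and 1, which a symmetric interval on the percentage scale does not guarantee.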
Of the 174 radiologists who returned the survey and had interpreted screening mammograms during the study period, the percentage providing estimates of their performance varied by the measure: recall rate 89% (155/174); false-positive rate 63% (110/174); cancer detection rate 76% (133/174); and PPV2 76% (132/174).
Most radiologists accurately estimated their recall (78%) and cancer detection (72%) rates, but only 19% and 26% accurately estimated their false-positive and PPV2 rates, respectively (Table 1). While all 155 radiologists completed the survey question estimating their recall rate, the number responding to the other measures was much lower (n=110 false-positive rate, n=127 cancer detection rate, n=132 PPV2). The most common reason for low accuracy of perceived false-positive rate was not providing an estimate.
Only 50% of radiologists who interpreted ≤1,000 mammograms annually accurately estimated recall rate compared to 73% of radiologists who interpreted 1001–2000 mammograms and 80% who interpreted >2000 mammograms (Table 1). Those who reported never, or rarely, using numbers or statistics when discussing mammography results with patients were less accurate in estimating their own cancer detection rate (66%) compared to those who sometimes (83%) or often/always (90%) use numbers or statistics when communicating with patients. No other radiologist characteristics were statistically significantly related to perceived performance accuracy.
Among all radiologists, 19.4% (30/155) underestimated and 2.6% (4/155) overestimated their actual recall rate, while 35% (53/155) overestimated their false-positive rate, 50% (57/114) overestimated PPV2, and 11% (17/155) overestimated their cancer detection rate (Table 2). Among radiologists who provided a performance measurement, 49% (53/108) overestimated false-positive rate and 13% (17/131) overestimated cancer detection rate.
Figure 1 shows the relationship between actual and perceived performance. Recall rate was estimated by radiologists with the most accuracy (concordance correlation (Corr) =0.55) compared to PPV2 (Corr=0.32), cancer detection rate (Corr=0.13, excluding 6 outliers), and false-positive rate (Corr=0.02).
Radiologists in general perceived their screening performance as equal, or better, relative to others (Table 3). Radiologists reported having similar (43%) or lower (31%) recall rates compared to other radiologists and similar (52%) or lower (33%) false-positive rates. For cancer detection rate, radiologists perceived having similar (72%) or higher (23%) values and similar (72%) or higher (38%) values for PPV2. Only a few radiologists viewed their cancer detection rate (5%) or PPV2 (3%) as worse than their peers.
Radiologists with low, medium, or high perceived relative performance for recall rate and PPV2 had correspondingly low, medium, or high actual and estimated performance measures (Figure 1). There was no obvious relationship between perceived relative performance and actual, or estimated, false-positive or cancer detection rates. Table 3 similarly shows that radiologists who perceive their recall rate as being lower relative to their peers have lower mean actual and perceived recall rates of 7.8% and 6.0%, compared to 15.0% and 13.5%, respectively, among those who perceive having a higher relative recall rate. Similar patterns were observed for PPV2 and cancer detection rate, but only for actual, and not perceived, false-positive rate. Further, radiologists who perceived their recall rate as lower relative to their peers were more accurate in estimating their actual recall rate (92% accurate) compared to those who perceive their recall rates as the same (66% accurate) or higher (83% accurate). Similar results occurred for false-positive rate and PPV2, but not for cancer detection rate, for which those who perceived having a lower cancer detection rate relative to their peers were less accurate in estimating their actual cancer detection rate (63% accurate) compared to those perceiving having the same or higher (both 75% accurate).
Radiologists reading mammograms are mandated to produce performance data (audit reports) for their MQSA certified institution, and these measures are deliberately designed to be used as a performance improvement metric. However, in our study we noted that radiologists were relatively good at estimating their recall and cancer detection rates, but most were unable to accurately estimate their false-positive rate or PPV2. Radiologists tended to underestimate their recall rate and overestimate their false-positive, cancer detection, and PPV2 rates. Many radiologists perceive themselves as having better interpretative performance than they actually do. This is an important finding, because without an accurate understanding of their performance, it is unrealistic to expect radiologists to know whether improvement is needed and which areas are most in need of improvement. Performance feedback should include both definitions of the performance measures and display results relative to national guidelines or peer performance to assist highly motivated physicians to improve if needed.
While almost all radiologists (96%) reported receiving audit reports, receipt of these data did not appear to fully inform them of their own performance on the outcome measures in this study. We had hypothesized that receipt of audit reports, clinical experience and fellowship training would all improve radiologists’ accuracy at estimating their own interpretative performance, but we found minimal evidence of this relationship. Only a higher volume of mammograms was associated with more accurate estimation of recall rate, and radiologists who more frequently used numbers or statistics when discussing mammography results with patients were more accurate in estimating their cancer detection rate. While audit report information was available to all of the study radiologists since they work at a BCSC facility, the audit data varied across sites with most sites providing information at the radiologist level, but others at the facility level only. The type of information provided also varies across sites, with all reports providing recall rates and none reporting false-positive rates. An important next step would be to evaluate how audit reports are actually reviewed and considered by individual radiologists. Research on physician behavior change indicates that predisposing physicians to change requires showing them the gap between their own performance and that of national targets [12, 13]. Prior work in this area suggests that the format of audit reports may make a difference in how physicians use data to improve their clinical practice.
Another important finding of our study was how few radiologists were able to accurately estimate their false-positive rate. Only 62% of radiologists even attempted to provide an estimate and among those who did, only 28% provided an accurate estimate. The provided values suggest some radiologists may have confused false-positive rate and specificity when completing the survey, even though definitions were provided. However, even if we assume that radiologists who provided very high estimates of their false-positive rate (i.e., >50%) were actually providing estimates for their specificity and calculate false-positive rate from those values as 100-specificity, still only 36% accurately estimated their false-positive rate. Further, in a recent study of this same population evaluating whether radiologists could identify a reasonable goal for these performance measures, only 22% reported goals for false-positive rate within the range recommended by the American College of Radiology. False-positive rates are much higher in the US relative to Canada and European countries with screening programs, likely due in part to malpractice concerns. However, this represents an opportunity for improvement in order to minimize the negative consequences of over-treatment, anxiety and cost for women. It is also possible that radiologists interpreting screening mammograms do not typically conduct the diagnostic work-ups on these same patients, thus they are not always aware of the outcome. It will be difficult to motivate radiologists to reduce their false-positive rates (while maintaining sensitivity) if they do not understand what their false-positive rates currently are, how they are calculated, or how they compare with their peers.
Only one previous study compares estimated versus actual mammography performance, but it was restricted to three geographic regions in the US and did not evaluate false-positive and cancer detection rates. Our analysis was also conducted on a more recent survey, such that radiologists had 5 additional years of audit feedback about their performance, and had more cumulative years of actual performance data from the BCSC to accurately estimate their performance. Our results are not directly comparable to this early report as the statistical methods were different and the screening population was more restrictive in the previous study.
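The specificity re-interpretation described above amounts to a one-line adjustment. A minimal sketch follows, with a hypothetical function name; the 50% threshold and the 100 minus specificity conversion come from the text.

```python
def reinterpret_false_positive_estimate(estimate_pct: float,
                                        threshold_pct: float = 50.0) -> float:
    """Treat very high false-positive estimates as specificity.

    Estimates above the threshold are assumed to be specificity and are
    converted to a false-positive rate as 100 - specificity; all other
    estimates are returned unchanged.
    """
    if estimate_pct > threshold_pct:
        return 100.0 - estimate_pct
    return estimate_pct


# A reported "false-positive rate" of 90% is re-read as 10%; 12% is left as is.
adjusted = [reinterpret_false_positive_estimate(v) for v in (90.0, 12.0)]
```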
Our study has several strengths. We had a good response to our survey tool (68.6%) and we examined radiologists’ perceived interpretive performance compared to their peers and also their actual mammography performance data from clinical practices in six geographically distinct regions in the US. This suggests that the findings of our study are generally applicable across the country.
One weakness of our study was the low response rate for estimating false-positive and cancer detection rates. This may indicate that participants were not comfortable estimating these two measures. We suspect that radiologists purposely skipped these questions, as all but one radiologist who did not answer these questions provided responses for the subsequent survey questions. Having variable audit data across BCSC sites is also a limitation. However, since almost all study radiologists reported receiving audit reports, radiologists could have looked up their performance measures in their BCSC audits; thus, our results could overestimate the percentage of US radiologists who can accurately describe their interpretive performance.
For most performance measures, radiologists overestimate their ability, including perceiving their screening interpretive performance as better than that of their peers, and they have particular difficulty in estimating their false-positive rate and PPV2. Given the study findings, opportunities for improving radiologists’ understanding of their performance could include a standardized facility and physician audit reporting form such as a “Radiologist Report Card” for screening mammography interpretation, with clear reporting of recall rate, PPV2, false-positive rate, and cancer detection rate relative to national guidelines or peer cohort. Similarly, providing a website that allows radiologists to compare themselves to their peers in the United States and other countries may improve their ability to understand their own interpretive performance measures, and to know if, and in which areas, their interpretive performance needs improvement. Development of widely available CME specific to MQSA reporting of performance measures, including radiologists’ individual audit data relative to peers’, could also be an effective tool for self-assessment, and could ultimately improve clinical interpretation. The routine submission of local data to the National Mammography Database, developed by the American College of Radiology, https://nrdr.acr.org/Portal/NMD/Main/page.aspx (accessed 12/1/11), is another powerful tool which should assist radiologists in their understanding of their own performance relative to their peers.
Radiologists perceive their performance to be better than it actually is and at least as good as their peers. Radiologists have particular difficulty estimating their false positive rates and PPV2. Future study of strategies to improve audit feedback to and education of radiologists is warranted, but encouragement for radiologists to join the ACR National Mammography Database would address many of these findings.
This work was supported by the National Cancer Institute and Agency for Healthcare Research and Quality (R01 CA107623), the National Cancer Institute (K05 CA104699; Breast Cancer Surveillance Consortium: U01CA63740, U01CA86076, U01CA86082, U01CA70013, U01CA69976, U01CA63731, U01CA63736, U01CA70040), the Breast Cancer Stamp Fund, and the American Cancer Society, made possible by a generous donation from the Longaberger Company’s Horizon of Hope® Campaign (SIRGS-07-271-01, SIRGS-07-272-01, SIRGS-07-273-01, SIRGS-07-274-01, SIRGS-07-275-01, SIRGS-06-281-01). The collection of cancer data used in this study was supported in part by several state public health departments and cancer registries throughout the U.S. For a full description of these sources, please see: http://breastscreening.cancer.gov/work/acknowledgement.html. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Cancer Institute or the National Institutes of Health. We thank the participating women, mammography facilities, and radiologists for the data they have provided for this study. A list of the BCSC investigators and procedures for requesting BCSC data for research purposes are provided at: http://breastscreening.cancer.gov/.