To compare radiologists’ performance during interpretation of screening mammograms in the clinic to their performance when reading the same examinations in a retrospective laboratory study.
This study was conducted under an Institutional Review Board approved HIPAA compliant protocol where informed consent was waived. Nine experienced radiologists rated an enriched set of examinations that they personally had read in the clinic (“reader-specific”) mixed with an enriched “common” set of examinations that none of the participants had read in the clinic, using a screening BI-RADS rating scale. The original clinical recommendations to recall the women for a diagnostic workup, or not, for both reader-specific and common sets were compared with their recommendations during the retrospective experiment. The results are presented in terms of reader-specific and group averaged “sensitivity” and “specificity” levels and the dispersion (spread) of reader-specific performance estimates.
On average, radiologists performed significantly better in the clinic as compared with their performance in the laboratory (p=0.035). Inter-reader dispersion of the computed performance levels was significantly lower during the clinical interpretations (p<0.01).
Retrospective laboratory experiments may not represent well either expected performance levels or inter-reader variability during clinical interpretations of the same set of examinations in the clinical environment.
Important progress has been made in our understanding of the use of retrospective observer performance studies in the evaluation of diagnostic imaging technologies and clinical practices as well as the methodologies needed for the analysis of such studies [1–8]. A frequently used approach is a Receiver Operating Characteristic (ROC) type study that provides information on how sensitivity varies as specificity changes while accounting for reader and case variability [9–12].
The most relevant question in all these studies is not whether the results can be generalized to cases, readers, abnormalities, and modalities under the study conditions, but rather whether the results of a given study lead to valid inferences on the potential impact of different technologies or practices in the actual clinical environment. Experimental conditions that are required in the vast majority of observer performance studies could affect human behavior in a manner that would limit the clinical relevance of inferences made. There are very few data attempting to assess the possibility of a “laboratory effect” in observer performance studies and how it could impact the generalizability of results.
Since large observer variability has been reported in many studies, in particular during the interpretation of mammograms [15–19], we performed a comprehensive large observer study designed to compare radiologists’ performance during interpretation of screening mammograms in the clinic to their performance when reading the same examinations in a retrospective laboratory study.
Nine board-certified Mammography Quality Standards Act (MQSA) qualified radiologists (with 6–32 years of experience in interpreting breast imaging procedures, each performing over 3000 breast examinations per year) were selected to participate in the study based on the number of screening examinations read during the time period from which we selected the examinations. Each reader interpreted between 276 and 300 screen-film mammography (SFM) examinations that were ascertained under an Institutional Review Board approved, Health Insurance Portability and Accountability Act (HIPAA) compliant protocol. Informed consent was waived.
Radiologists read examinations three times over a period of 20 months (September, 2005 – May, 2007). These included a mode we termed clinic - Breast Imaging Reporting and Data System (BI-RADS), a mode rated under the ROC paradigm with an abnormality presence probability rating scale of 0–100, and a Free-Response ROC (FROC) mode. The study was “mode balanced” in that three radiologists read each of the modes first, three radiologists read each of the modes second and three radiologists read each of the modes third (last) using a block randomization scheme. The results of the clinic - BI-RADS mode are the focus of this paper because it is most similar to clinical practice. In the future, we plan to report the results of the two other modes (ROC and FROC) and their relationship to the results from the clinic - BI-RADS mode.
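The “mode balanced” design described above can be illustrated with a short sketch. It is not the scheme actually used in the study, whose details are not given here; it assumes the three reading orders are the cyclic shifts of the mode list (a 3×3 Latin square) and that readers are randomly split into blocks of three, with each block receiving every order once. The function name `mode_balanced_orders` and the reader identifiers are hypothetical.

```python
import random
from collections import Counter

MODES = ("clinic-BI-RADS", "ROC", "FROC")


def mode_balanced_orders(readers, seed=None):
    """Assign a reading order to each reader so that every mode is read
    first, second, and third by the same number of readers.

    Assumption: orders are cyclic shifts of MODES (a Latin square) and
    readers are randomized in blocks of len(MODES).
    """
    rng = random.Random(seed)
    # Each mode appears exactly once in each position across the square.
    square = [MODES[i:] + MODES[:i] for i in range(len(MODES))]
    shuffled = list(readers)
    if len(shuffled) % len(square) != 0:
        raise ValueError("number of readers must be a multiple of the number of modes")
    rng.shuffle(shuffled)

    assignment = {}
    for start in range(0, len(shuffled), len(square)):
        block = shuffled[start:start + len(square)]
        orders = square[:]
        rng.shuffle(orders)  # randomize which reader in the block gets which order
        for reader, order in zip(block, orders):
            assignment[reader] = order
    return assignment


if __name__ == "__main__":
    orders = mode_balanced_orders(range(1, 10), seed=0)
    for position in range(len(MODES)):
        # Each mode should be counted three times in every position.
        print(position + 1, Counter(order[position] for order in orders.values()))
```

With nine readers, each mode ends up in each reading position exactly three times, which is the balance property the text describes.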
Four view screen film mammograms (i.e., “current” examination) as well as a comparison examination (2 or more years prior when available, or one year prior when that was the only available examination) as used during the original clinical interpretation were made available to the radiologists during the readings. Radiologists interpreted each examination as they would in the clinic and rated the right and left breast separately. The set read by each radiologist included a “common” set of 155 SFM examinations originally read in the clinic by other radiologists not participating in the study and a “reader-specific” set of examinations that had been clinically read by him/her between 2 and 6 years previously. “Common” and “reader-specific” examinations were mixed and radiologists read all cases in one mode before moving to the next in the mode balanced, case randomized study that was managed by a comprehensive computer program. All ratings were recorded electronically and saved in a database.
The distribution of examinations in the different categories was designed so that approximately 25% depicted positive findings (associated with verified cancers), approximately 10% depicted verified benign findings, and approximately two thirds were examinations either rated as negative during screening or recalled for a suspected abnormality but rated as negative during the diagnostic work-up that followed. Actually negative examinations included: 1) those originally given a BI-RADS rating of 1 or 2 (not recalled at screening) and verified for not having cancer at least one year thereafter, and 2) those originally given a BI-RADS rating of 0 and later found to be negative during a subsequent diagnostic workup. Actually positive examinations included: 1) all available examinations depicting pathology confirmed cancers detected as a result of the diagnostic follow-up of a recall, and 2) all false negative examinations, namely those examinations actually depicting an abnormality that had been originally rated as negative (BI-RADS 1) or benign (BI-RADS 2) but later verified as positive for cancer within one year. Negative examinations were selected such that approximately one third of the examinations did not have a prior examination during the original interpretation, to reflect our approximate clinical distribution. Therefore, 63% (635/1013) of the actually negative and 83% (293/354) of the actually positive examinations had a “prior” comparison examination.
Actually positive and negative examinations were obtained from databases of the total screening population that are carefully maintained for quality assurance purposes and from our tumor registry. Actually negative examinations were selected consecutively from our total screening population beginning with the first day of each calendar quarter from 2000 and continuing through 2003 (when the inclusion criteria and verification conditions were met) until the required predetermined number of cases in each category was reached. The time frame for searching for actually positive examinations included all screening examinations between 2000 and 2004 in order to assure the inclusion of as many consecutive screening-detected cancers by each of the nine participants as possible. Examinations were rejected if any of the current or prior films were missing, if “wires” marking scars of previous biopsies were visible, if markers indicating a palpable finding at the time of the screen (e.g., BB) had been placed during the examination and hence were visible on the images, or if a screening examination had been converted to a diagnostic examination during the same visit because of symptoms reported or discovered at the time of the screen. As a result of the selection protocol, a total of 354 “positive” examinations (both screening detected and missed but proven cancers) depicting the abnormalities in question were included in the study and 107 examinations were rejected (45 with BB markings for palpable masses and 62 with wires marking scars and/or previous biopsies). The distributions of negative and positive examinations with depicted abnormalities that were ultimately included in the study are summarized in Table 1. The average age of women whose examinations were selected was 53.96 years (range, 32 to 93).
The use of computer-aided detection (CAD) was introduced into our clinical practice in mid-2001. Therefore, 752/1367 (55%) of all cases, comprising 671/1212 (55%) in the “reader-specific” and 81/155 (52%) in the “common” set of cases, had been originally read in the clinic with CAD. However, we did not supply CAD results since we previously determined that the impact of CAD on recall and detection rates in our practice, including these very radiologists, was small, and CAD results had not been consistently kept throughout the period of interest.
Each examination was assigned a random identification number, cleaned, and all identifying information, including time marks, were taped over with black photographic tape. Study ID labels were affixed to all films. Prior films were identified and specifically marked with the number of months between the “prior” and “current” examination.
Observers were unaware of the specific aims of the study (i.e., they were not told they had previously read either all or some of the examinations in the clinic) and received a general and a mode specific “Instruction to Observers” document. The document included a general overview of the study set up and the process for reviewing and rating examinations during a session, and stated that prior examinations would be provided, if applicable, and labeled with the approximate number of months between the relevant (“current” and “prior”) examinations. The document also described in detail how certain abnormalities (e.g. asymmetric density) should be scored, and stated that the set of examinations was enriched, without giving any specific numbers or proportions. The “clinic – BI-RADS” mode instructions specifically stated that the reader was expected “to read and rate (interpret) the examinations as though they are being read in a screening environment”. The readers were not made aware of the specifics of each mode until the time the reader was scheduled to start that mode. A training and discussion session was implemented prior to the commencement of interpretations. The training and discussion included a clear definition of abnormalities of interest and how these should be rated as well as the protocol for using the computerized rating forms.
Examinations to be read within each session included a randomized mix of examinations from the “reader-specific” and “common” sets. For example, as shown in Table 2, reader #1 read a total of 297 examinations (155 common + 142 reader-specific), reader #2 read 295 examinations (155 common + 140 reader-specific), etc. Note that each reader read a different number of “reader-specific” examinations since the set of examinations included reader specific examinations unique only to that particular reader. For each reading session a randomized examination list was generated by a computer program assigning an examination number to a specific slot number on the viewing alternator. All examinations to be read during the specific session were loaded onto the film alternator according to the examination list generated by the computerized scheme. After matching the case number, observers reported their recommendations for each examination on a computerized scoring form. The number of examinations interpreted during each reading session varied from 20 to 60 depending on what each participant’s schedule would allow and their own pace of reading but on average about 15% of the examinations were read per session. Answers could be changed while viewing an examination until the “done” command was entered and final ratings were recorded.
During the “clinic – BI-RADS” mode observers were first presented with a choice of rating the examination and each breast as “negative” (1), “definitely benign” (2), or recommended for “recall” (0). If a “benign” or a “recall” rating was entered, observers were asked to identify the type of abnormality or abnormalities in question (i.e. “mass”, “microcalcifications”, “other”) and could list more than one abnormality. If a “recall” rating was entered, a list of recommended follow up procedures appeared and observers had to select one or more recommended procedures (e.g., spot CC/spot 90, spot CC/whole breast, magnification CC/90 degrees, exaggerated CC, tangential for calcifications, and/or ultrasound).
We focused our analysis on an examination based rating, namely an examination in which only one breast contains malignancy is treated as “true positive” if a “recall” rating was given to either breast. For the purpose of this analysis, primary sensitivity (or True Positive Fraction-TPF) is estimated as the proportion of examinations rated “positive” (i.e., given a “recall” rating) out of all examinations depicting verified cancers, and specificity (or 1-False Positive Fraction-FPF) is estimated as the proportion of examinations rated “negative” (i.e., not recalled) out of all verified “cancer free” cases. In our primary analysis we summarize performance over readers as a simple average.
We conducted a statistical analysis which accounts for both the correlations between ratings on the same examinations and for heterogeneity between observers’ levels of performance. The difference between performance levels in the clinic, namely the actual ratings (0,1,2) during the prospective clinical interpretation of each examination, and in the laboratory retrospective observer study was assessed by hypothesis testing in the framework of a generalized linear mixed model using proc GLIMMIX in SAS software (SAS Institute v.9.13, Cary, NC).
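The examination based definitions above can be sketched in a few lines. This is an illustration only, not the study's analysis code: the `Exam` record, its field names, and the toy data are hypothetical, and the rating codes follow the screening scale described in the text (0 = recall, 1 = negative, 2 = benign).

```python
from dataclasses import dataclass

RECALL = 0  # screening BI-RADS 0: recall for diagnostic work-up


@dataclass
class Exam:
    left_rating: int   # 0 = recall, 1 = negative, 2 = benign
    right_rating: int
    has_cancer: bool   # verified truth for the whole examination


def exam_recalled(exam):
    """Examination based rule: a recall on either breast recalls the exam."""
    return exam.left_rating == RECALL or exam.right_rating == RECALL


def sensitivity_specificity(exams):
    """Return (TPF, 1 - FPF) for one reader's set of examinations."""
    positives = [e for e in exams if e.has_cancer]
    negatives = [e for e in exams if not e.has_cancer]
    true_pos = sum(exam_recalled(e) for e in positives)
    true_neg = sum(not exam_recalled(e) for e in negatives)
    return true_pos / len(positives), true_neg / len(negatives)


def reader_average(per_reader):
    """Simple (unweighted) average of per-reader (sensitivity, specificity)."""
    n = len(per_reader)
    return (sum(s for s, _ in per_reader) / n,
            sum(sp for _, sp in per_reader) / n)


if __name__ == "__main__":
    toy = [
        Exam(0, 1, True),   # cancer, recalled       -> true positive
        Exam(1, 1, True),   # cancer, not recalled   -> false negative
        Exam(1, 2, False),  # no cancer, not recalled -> true negative
        Exam(0, 1, False),  # no cancer, recalled    -> false positive
        Exam(1, 1, False),
        Exam(2, 2, False),
    ]
    print(sensitivity_specificity(toy))  # (0.5, 0.75)
```

The unweighted average in `reader_average` mirrors the "simple average" over readers mentioned in the text; it gives every reader equal weight regardless of how many examinations each one read.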
We tested whether the average performance in the clinic and laboratory could be described by a single ROC curve. For this purpose, reader-specific and common sets of the data were analyzed separately. We also verified the results of this analysis by performing an analysis conditional on the examinations which have discordant ratings between the clinic and the laboratory.
In a separate analysis, we compared the dispersions (the average distance to the mean performance level) of the computed reader-specific operating characteristics. The comparison of performance dispersion was conducted only on reader-specific subsets using Levene’s test for paired data [24, 25]. We assessed the differences in spread of specificities, sensitivities, and distances from reader-specific to reader-averaged operating point.
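As an illustration of the dispersion measure described above, the sketch below computes the average distance from each reader's operating point to the reader-averaged operating point. It assumes the distance is Euclidean in the (specificity, sensitivity) plane, which the text does not state explicitly, and it does not implement the paired Levene-type test itself; the function names and sample values are hypothetical.

```python
import math


def distances_to_mean(points):
    """Distance from each reader's (specificity, sensitivity) operating
    point to the reader-averaged operating point.

    Assumption: Euclidean distance in the (specificity, sensitivity) plane.
    """
    n = len(points)
    mean_spec = sum(spec for spec, _ in points) / n
    mean_sens = sum(sens for _, sens in points) / n
    return [math.hypot(spec - mean_spec, sens - mean_sens)
            for spec, sens in points]


def dispersion(points):
    """Average distance to the mean operating point."""
    dists = distances_to_mean(points)
    return sum(dists) / len(dists)


if __name__ == "__main__":
    # Hypothetical operating points for three readers in two settings.
    clinic = [(0.62, 0.92), (0.63, 0.91), (0.64, 0.93)]
    laboratory = [(0.35, 0.97), (0.55, 0.90), (0.70, 0.80)]
    print(dispersion(clinic), dispersion(laboratory))
```

The per-reader distances returned by `distances_to_mean` for the two settings are the kind of paired quantities that a Levene-type comparison would operate on.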
Since a fraction of the actually benign examinations should have led to a recall recommendation in the clinic regardless of the ultimate outcome hence affecting recall rates, we also analyzed the data after excluding the 425 examinations (388 in the “reader-specific” and 37 in the “common” sets of cases) with verified benign findings. In addition, since screening BI-RADS ratings were available for each breast separately, both from the actual clinical interpretations as well as the retrospective laboratory study, we also computed and compared the breast based performance levels (sensitivity and specificity) and dispersion in performance levels among the nine radiologists. Namely, each breast (left and right) was considered a diagnostic unit, rather than a case based analysis in which the most suspicious finding (hence, the corresponding rating) for either of the breasts is taken into account as the examination’s final recommendation (or outcome).
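The difference between the breast based and the examination based analyses can be made concrete with a small sketch. The dictionary keys and the toy data are hypothetical; the rating codes follow the screening scale used in the study (0 = recall, 1 = negative, 2 = benign).

```python
RECALL = 0  # screening BI-RADS 0: recall for diagnostic work-up


def breast_units(exams):
    """Flatten examinations into (rating, has_cancer) pairs, one per breast."""
    units = []
    for exam in exams:
        units.append((exam["left_rating"], exam["left_cancer"]))
        units.append((exam["right_rating"], exam["right_cancer"]))
    return units


def breast_based_performance(exams):
    """Sensitivity and specificity with each breast as a diagnostic unit."""
    units = breast_units(exams)
    positive_ratings = [rating for rating, cancer in units if cancer]
    negative_ratings = [rating for rating, cancer in units if not cancer]
    sensitivity = sum(r == RECALL for r in positive_ratings) / len(positive_ratings)
    specificity = sum(r != RECALL for r in negative_ratings) / len(negative_ratings)
    return sensitivity, specificity


if __name__ == "__main__":
    toy = [
        # Cancer in the left breast, but only the right breast was recalled:
        # a true positive per examination, yet a false negative (left) and a
        # false positive (right) per breast.
        {"left_rating": 1, "left_cancer": True,
         "right_rating": 0, "right_cancer": False},
        # Cancer in the left breast, correctly recalled; right breast negative.
        {"left_rating": 0, "left_cancer": True,
         "right_rating": 1, "right_cancer": False},
    ]
    print(breast_based_performance(toy))  # (0.5, 0.5)
```

Under the examination based rule both toy examinations would count as true positives, whereas the breast based analysis penalizes the recall of the wrong breast in the first one.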
We also assessed the possible influence, if any, of the use of CAD during the original clinical interpretations of 55% (671/1212) of the “reader-specific” examinations on the results of our two primary analyses, namely, the comparison of the average performance levels and the comparison of dispersions in performance levels among the nine radiologists. We estimated and compared the tendency of the readers to perform on different ROC curves in the clinic and the laboratory for the two groups of cases initially evaluated “with” and “without” CAD. We computed the dispersions of reader-specific performance levels in the clinic and the laboratory for each of the “with” and “without CAD” groups of examinations and verified the significance of the difference in dispersions adjusted for possible CAD effects. Last, we assessed whether there was an interaction between the possible effect of using CAD (or not) and the possible effect of the inclusion (or exclusion) of actually benign examinations in the analysis.
Table 2 provides the computed performance levels in the clinic and the laboratory for each of the 9 radiologists. Both mean sensitivity and specificity were higher in the clinic as compared with the laboratory (Sensitivity: 0.919 versus 0.895; Specificity: 0.626 versus 0.528, Fig 1), although the differences for either sensitivity alone or specificity alone did not achieve statistical significance (p>0.1). This tendency was observed in both reader-specific and common sets. Four readers did achieve higher sensitivity in the laboratory, albeit with a correspondingly lower specificity.
Although the differences in sensitivity alone and in specificity alone were not statistically significant, there were statistically significant differences between the clinic and laboratory for performing on different ROC curves due to the simultaneous decreases in the laboratory in both sensitivity and specificity levels (p=0.035). The results of the unconditional analysis of the common set of examinations agreed with the results for reader-specific sets (p=0.027). A conditional model based test on discordant examinations only was significant (p<0.01), supporting the hypothesis that combined performance was higher in the clinic as compared with the laboratory.
There was a substantial difference in the spreads of the actual operating points in the clinic and laboratory on the reader-specific sets (fig. 1). The sample standard deviations of reader-specific specificities differed by a factor of 7.8 (0.0253 versus 0.1976); and the standard deviations of reader-specific sensitivities differed by a factor of 2.6 (0.0382 versus 0.0999). There was a significant difference (p<0.01) between the dispersions (average distance to the mean performance levels) of the computed reader-specific operating points (0.0395 and 0.1870) in the clinic and laboratory, respectively.
After excluding benign examinations, the differences between specificity levels in the clinic and laboratory increased as compared with the complete dataset, and the test for performing on different ROC curves was statistically significant (p<0.01) for both reader-specific and common sets. The difference in spreads on reader-specific sets remained significant (p<0.01) after exclusion of actually benign examinations.
The breast-based analyses demonstrated the same trend as the examination-based results. Both average sensitivity and average specificity levels in the clinic were higher than those in the laboratory (Sensitivity: 0.901 versus 0.847; Specificity: 0.792 versus 0.730). The sample standard deviations of reader-specific specificities differed by a factor of 4.3 (0.0254 versus 0.1096); and the standard deviations of reader-specific sensitivities differed by a factor of 1.8 (0.0473 versus 0.0867).
The use of CAD (or not) did not significantly (p=0.61) affect the observation that performance levels in the clinic were superior to those in the laboratory. As related to the possible effect of the use of CAD on variability, the spread in performance levels (average distance from the mean performance level) in the clinic for the set interpreted “with” CAD was not smaller than that for the set interpreted “without” CAD (0.1252 and 0.0986, respectively). The ratios of performance dispersions between the clinic and the retrospective laboratory experiment were similar for the set of examinations read in the clinic “with” CAD and the set of examinations read in the clinic “without” CAD. The adjusted difference in dispersions of performance levels in the clinic and the laboratory was statistically significant (p=0.025). We note, however, that our study did not allow for an efficient unbiased assessment of the possible effect of CAD on performance levels of individual readers or the dispersion in performance levels among readers, as is possible in studies where the same cases are read either prospectively or retrospectively “with” and “without” CAD by the same readers. There were no interactions (p=0.31) between the effect of using CAD and the effect of inclusion of actually benign examinations, in that the effect of the inclusion or exclusion of examinations depicting benign findings was similar whether CAD was used or not.
Several retrospective studies demonstrated that radiologists’ performance is relatively poor when interpreting screening mammograms and radiologists’ inter-reader variability is substantial [15–19, 26]. Inferences generated from these studies have been quoted numerous times and used as one of the primary reasons for the need for corrective measures. However, there are no substantial data regarding the “laboratory effect” or the correlation between performance in the clinic and laboratory experiments. This type of study is difficult to design since in most areas we do not have adequate quantifiable estimates of performance levels in the clinic. This is not the case in screening mammography, where the BI-RADS ratings can be used to estimate radiologists’ performance levels in recalling, or not, women who ultimately are found to have breast cancer (or not). Hence, screening mammography examinations were used in this study because the endpoint is typically binary (e.g., recommendation to recall the woman for additional work-up, or not) and the majority of those not recalled can be verified through periodic follow up.
The laboratory results in our study are similar to and consistent with those reported by Beam et al. [16, 19]. However, on average, radiologists performed better in the clinic as compared with their performance in a laboratory retrospective experiment when interpreting the examinations they themselves had read in the clinic. Reading “order effect”, if any, would increase the observed differences in that all clinical readings were done first. These seemingly surprising results can be explained if one accepts that in the laboratory radiologists are aware that there is no impact on patient care; hence, the reporting pattern of at least some readers may change substantially. In addition, in the laboratory, they are not affected by the pressure to reduce recommendations for a recall per practice guidelines; hence, on average their recall rate is higher. Interestingly, their average performance on examinations they had actually read in the clinic was better than their performance on examinations other radiologists had read in the clinic. This could be due to remembering some of the examinations, but this is unlikely since, in addition to these examinations being mixed with others they did not read, there was a long time delay between the two readings and they had interpreted a very large number of examinations in between. It is quite possible that there is a “self selection” bias, namely, if a radiologist is better at detecting certain types of depictions of cancers, then over time the type and distribution of cancers he/she detects is affected; hence, when all cancers actually detected by a particular radiologist are used in a retrospective study, this set will differ in type and distribution from the cancers detected by other radiologists. Therefore, he or she will also be better at detecting “their own” type of cancers in a retrospective study.
This finding suggests that continuous training (feedback) on cases missed by the individual radiologist rather than those missed by others may prove to be a better approach to continuing improvements in performance. Other clinical information (e.g. patient and/or family history) may also affect clinical decisions but it is not expected to be an important factor in the screening environment evaluated here.
We note that we define “sensitivity” differently from other studies in that here it is radiologists’ sensitivity to actually depicted abnormalities. Also, we do not have a full account of all false negative findings and examinations “lost to the system” because some women with cancer may have relocated or decided to be treated elsewhere. These cases are not accounted for. Hence, our results are conditional on the dataset and readers in this study and our conclusions will have to be independently validated. Our observations may be applicable solely to experienced high volume radiologists.
Lastly, the significantly higher performance in the clinic observed here may be contributing to the difficulty in demonstrating actual significant improvements due to the use of CAD in some of the observational studies [22, 30].
The examinations used in this study were sampled in a manner that could have improved apparent estimates of sensitivity in the clinic because of the possible incomplete sampling of false negative cases, making them potentially not representative of the true performance levels. However, we expect the observed relationship between clinical and laboratory performance levels, on a relative scale, to be similar in a truly representative, randomly selected sample of examinations. This study may have implications for the clinical relevance of retrospective observer studies designed to assess and compare different technologies and/or practices.
In conclusion, when deciding whether to recall a woman for additional diagnostic examinations, experienced radiologists performed on average significantly better and, as importantly, more consistently in the clinic than in the laboratory when interpreting the same examinations. Radiologists’ inter-reader spread in performance levels was significantly lower during prospective clinical interpretations when the same clinical rating scale was utilized.
The authors thank Glenn Maitz, Jill King, Amy Klym, and Jennifer Stalder for their diligent and meticulous effort on this project.
Funding: This work is supported in part by Grants EB001694 and EB003503 (to the University of Pittsburgh) from the National Institute of Biomedical Imaging and Bioengineering (NIBIB), National Institutes of Health.