This test-set study is the largest to date to examine agreement between radiologist recall and a consensus-derived gold standard interpretation. It is the first to examine the factors of interpretive difficulty and finding type at the woman, breast, and lesion levels. As expected, agreement was high for obvious findings and markedly lower for subtle findings, most notably among non-cancer cases. Overall, agreement by finding type was relatively high for cancer cases compared with non-cancer cases, but calcifications, architectural distortions, and asymmetries all contributed to lower agreement. Much of the difference in agreement appeared to be due to nomenclature, particularly for architectural distortion and asymmetries. Most unnecessary recalls by the participating radiologists were for asymmetries.
Our analysis of woman-, breast-, and lesion-level findings suggested that radiologists may arrive at the same recall decision even if they are not basing their decision on the same lesion. We also found that woman- and breast-level agreement was consistently stronger than lesion-level agreement. These findings imply a similar clinical course for any case in which there is at least breast-level, and particularly lesion-level, agreement, because radiologists recommend follow-up that is pursued to a final interpretation. Accuracy at the breast level may indeed lead to appropriate work-up and treatment; however, there will be times when the recalled lesion is benign, when a cancerous lesion is missed, or when a more efficient work-up and/or better clinical course would result if the specific lesion were correctly identified on screening.
Nonetheless, specific lesions may require particular courses of clinical care, and differences in finding type may alter management. An advantage of our multilevel approach is that we were able to identify the most challenging finding types, which will inform strategies to improve interpretive performance. Educational interventions that focus on non-mass findings, particularly in cases of intermediate and subtle difficulty, can be designed to yield clinically important improvements.
This study provides a unique perspective on variability in mammography interpretation by measuring agreement with a well-defined standard interpretation. Several other studies have examined interobserver agreement, particularly for assessing performance and use of the BI-RADS lexicon [5]. A focus on recall/no recall, as in this study, has high clinical relevance because it determines whether a woman will undergo further follow-up, which provides additional information for a final assessment. Previous studies of lesion agreement focused on detailed characterization of mass characteristics [5], and we found the highest level of agreement between study radiologists and the expert panel for masses. Our results showed the lowest agreement for architectural distortion and asymmetric densities, consistent with studies suggesting that classification of finding type, rather than detection, reduces interobserver agreement [5].
Several important aspects of this study should be noted in interpreting the results. First, radiologists were in a testing situation rather than a usual clinical care setting. Gur et al. reported significantly lower performance among radiologists (n=9) in the laboratory compared with the clinic, and lower inter-reader dispersion in a clinical setting [17]. In contrast, another study comparing clinical and test-set performance in 27 mammographers found no correlation between the two settings [18]. Our study used the same cases and test situation for the gold standard interpretations and the study radiologists’ interpretations, possibly providing a more valid comparison. Our findings in a testing environment are congruent with those of Venkatesan et al., who examined the positive predictive value (PPV) of specific findings in actual practice and showed that asymmetries had the lowest PPV, i.e., that most recalled findings were benign [19].
Radiologists were instructed to report the most significant finding for the mammogram. In some cases, although study radiologists noted the same lesions as the expert panel, as indicated by the “clicks” on the screen, they differed from the gold standard interpretation in the assignment of the “most significant” lesion, i.e., the final lesion type ascribed to the case. This observation is also consistent with the greater agreement at the woman level: although there was greater disagreement about which finding was most significant, the clinical importance assigned to the different findings led to the same action.
The images were converted from analog to digital, with some loss of image quality. The digitization process used in the study (from the American College of Radiology) is the same as that used for the Committee on Mammography Interpretive Skills Assessment (COMISA) exam (now MCR - Mammographic Case Review); however, in our study, cases with findings of interest were not specifically chosen based on the image quality of the features. Although all participants were invited to use a study laptop, some participating radiologists may have reviewed the cases on personal computers; the software program, however, required minimum viewing criteria.
A major strength of our study is the relatively large number of participating radiologists. In addition, the custom-made software contained important features for viewing and interpreting, including availability of comparison films, pan and zoom features, and a standard mammography image set for each exam (left and right MLO, left and right CC, and comparison images). Another key strength was the development of gold standard interpretations through a rigorous consensus process with three nationally recognized mammography experts. We had an explicit goal of creating a test set representative of clinical practice. Thus, we randomly selected exams from clinical practice, thereby including some difficult cases, which introduced more variability, but increased generalizability.
This study provides important insights into the types of mammographic cases that contribute the most to interpretation variability. By understanding the extent to which case difficulty and finding type affect interpretive agreement, we can develop targeted training modules and educational interventions that yield the greatest improvement in radiologist interpretive performance. Our analysis of mammography interpretation agreement between radiologists and an expert panel suggests that mammography training should focus on identification and correct interpretation of asymmetric densities and architectural distortion.