The false positive rate of CAD requires radiologists to dismiss large numbers of CAD marks relative to the number of cancers detected in a screening population [1]. With a false positive rate of approximately two marks per four-image mammogram, a typical screening population of 1,000 women would generate 2,000 false positive marks to be dismissed while detecting approximately five cancers, based on a mixed incidence/prevalence of five cancers per 1,000 screening mammograms. With each cancer detected in two views (craniocaudal and mediolateral oblique), there would be 200 marks to dismiss for each view of a detected cancer. Although these are large numbers of marks to dismiss, our study is the first that we are aware of to provide context regarding the ease of dismissing marks. Our study showed that 12% (v7.2) or 16% (v5.0) of these marks were average, hard, or very hard to dismiss. Because recall rates from screening mammography approach these rates of average-to-very-hard marks to dismiss, rather than a 200:1 ratio of distracting to relevant marks (all false positives compared to views of detected cancers), CAD could be considered to have approximately a 1:1 ratio of distracting to relevant marks (average-to-very-hard marks to dismiss compared to views of lesions warranting recall).
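The arithmetic behind the 200:1 and approximately 1:1 ratios can be sketched in a few lines. This is a minimal illustration; the 12% recall rate and the variable names are our assumptions, chosen to mirror the observation that recall rates approach the rate of average-to-very-hard marks.

```python
# Illustrative arithmetic for the distracting/relevant mark ratios discussed
# above; all inputs come from the text except the assumed 12% recall rate.

population = 1000        # screened women
fp_per_case = 2          # false positive CAD marks per four-image case
cancers = 5              # mixed incidence/prevalence per 1,000 mammograms
views_per_lesion = 2     # craniocaudal and mediolateral oblique

total_fp_marks = population * fp_per_case   # 2,000 marks to dismiss
cancer_views = cancers * views_per_lesion   # 10 views of detected cancers

# Naive ratio: every false positive counted against views of detected cancers.
naive_ratio = total_fp_marks / cancer_views            # 200:1

# Refined ratio: only the marks that are average-to-very-hard to dismiss
# (12% with v7.2), counted against views of lesions warranting recall
# (assumed 12% recall rate, i.e. 120 recalls, two views per recalled lesion).
hard_marks = total_fp_marks * 0.12
recall_views = 120 * views_per_lesion
refined_ratio = hard_marks / recall_views              # approximately 1:1

print(f"{naive_ratio:.0f}:1 vs {refined_ratio:.1f}:1")
```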
Zheng et al. [31] have shown the impact of CAD sensitivity and false positive rate on radiologist performance using the four combinations of a CAD system with a sensitivity of 90% or 50% and a false positive rate of two or eight marks per four-image mammogram case (multiplying their per-image false positive rates by 4). Their results showed that a CAD system with 90% sensitivity and two false positives per case improved radiologist performance; a CAD system with 90% sensitivity and eight false positives per case, or one with 50% sensitivity and two false positives per case, had no significant impact on radiologist performance; and a CAD system with 50% sensitivity and eight false positives per case was detrimental to radiologist performance. Since CAD sensitivity is approximately 90% or higher [22], Zheng’s results are a further indication that the distracting/relevant mark ratio of the CAD systems reported in our study and in other studies of current CAD with approximately two false positives per case [25] should improve radiologist performance [3].
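The four operating points compared by Zheng et al. can be summarized compactly as follows; the keys are (CAD sensitivity, false positives per four-image case), and the effect labels are our paraphrase of their reported results.

```python
# Sketch of the four CAD operating points from Zheng et al. [31].
# Keys: (sensitivity, false positives per four-image case).
# Values: reported effect on radiologist performance (our paraphrase).
zheng_effects = {
    (0.90, 2): "improved",
    (0.90, 8): "no significant impact",
    (0.50, 2): "no significant impact",
    (0.50, 8): "detrimental",
}

# Current CAD (sensitivity about 90% or higher, about two false positives
# per case) falls in the favorable cell.
print(zheng_effects[(0.90, 2)])  # prints "improved"
```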
Although the two versions of the CAD system in our study did not have statistically different specificities when considering all marks, v5.0 had higher specificity than v7.2 for masses alone, and v7.2 had higher specificity than v5.0 for calcifications alone. Although CAD specificity is not reported in the literature as commonly as false positive rate, four studies assessed specificity. Two studies were based mostly or entirely on two-image cases [25]; thus, comparison with our results is difficult. Two other studies used four-image cases but reported CAD specificity only for calcifications. Brem et al. [22] reported specificities for calcifications of 63% and 58% in non-dense and dense breasts, and Yang et al. [26] reported specificities for calcifications of 67% and 69% in non-dense and dense breasts; in both studies, there was no statistically significant difference in specificity for calcifications between non-dense and dense breasts. Our study had similar results, with specificities of 57.4% and 70.5% for v5.0 and of 69.2% and 82.1% for v7.2 in non-dense and dense breasts, respectively. Interestingly, both versions had slightly lower, but not statistically significantly lower, specificities for all marks, masses alone, and calcifications alone in non-dense breasts than in dense breasts.
False positive rate is commonly reported in the CAD literature, with earlier studies reporting means of 3–5 false positives per four-image case [3] and more recent studies reporting means of 2–3 false positives per four-image case [26]. Recent studies estimate a range of 0.45 to 0.55 false positive marks per image for digital mammograms [26], indicating that the false positive rate is comparable between screen-film and digital mammograms, with a mean of 2–3 marks per four-image mammogram. With v7.2, The et al. [28] reported a mean false positive rate of 2.3 false positives per four-image case with digital mammography, the same as our mean of 2.3 false positives per four-image case with screen-film mammography.
In our study, v5.0 and v7.2 did not differ significantly in the mean and median numbers of all marks and of marks on masses alone, but differed significantly in the mean and median numbers of marks on calcifications alone. The most relevant comparisons to our results come from studies of the same versions of the CAD system we studied. With v5.0, Malich et al. [29] and Brem et al. [23] reported mean false positive rates for all marks of 2.5 and 2.4 per four-image case (multiplying Malich’s rates by 2, since they used two-image unilateral cases, and Brem’s rates by 4, since they reported per-image rates), which compare well with our mean of 2.3 false positives per four-image case. The Malich and Brem studies yielded mean false positive rates for masses and calcifications of 1.7 and 0.8 and of 1.6 and 0.8, respectively, which also compare well with our mean results of 1.6 and 0.8 with v5.0.
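The unit normalization used when comparing studies that report false positive rates in different denominators can be sketched as below. The function and unit names are ours, and the input rates are back-computed from the per-case values quoted above (2.5/2 for Malich, 2.4/4 for Brem).

```python
# Minimal sketch of normalizing reported false positive rates to a common
# unit of marks per four-image case; names and structure are our own.
def fp_per_four_image_case(reported_rate, unit):
    """Convert a reported false positive rate to marks per four-image case."""
    factor = {
        "per_four_image_case": 1,  # already in the common unit
        "per_two_image_case": 2,   # e.g. two-image unilateral cases (Malich)
        "per_image": 4,            # e.g. per-image rates (Brem)
    }[unit]
    return reported_rate * factor

print(fp_per_four_image_case(1.25, "per_two_image_case"))  # 2.5, all marks
print(fp_per_four_image_case(0.6, "per_image"))            # 2.4, all marks
```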
Interestingly, both CAD versions in our study had slightly higher, but not statistically significantly higher, mean and median numbers of all marks, marks on masses alone, and marks on calcifications alone in non-dense breasts than in dense breasts. Brem did not stratify false positive rate by breast density, but Malich did, and although they showed a trend toward lower false positive rates with lower breast densities, the only statistically significant result they reported was a lower mean false positive rate for masses in entirely fatty breasts compared with the other breast density categories. Fewer CAD marks and higher specificity in dense breasts are likely the result of the difficulty of identifying or differentiating patterns from dense tissue.
Versions 5.0 and 7.2 did not differ significantly in the proportion of marks classified as very easy or easy to dismiss, either among the 76 marks identified by both versions or among marks identified by one version but not the other. Therefore, it appears that although the false positive rate remains at a mean of 2.3 false positive marks per case, the majority of these marks are quickly evaluated by the radiologist and easily dismissed.
Although the total number of marks did not change between v5.0 and v7.2, it is interesting that only about one third of the marks were common to both versions. Although reproducibility may be a factor, owing to redigitization of the same screen-film mammograms several years later, this result is likely largely a reflection of the significant changes in the CAD algorithms from v5.0 to v7.2. Although CAD sensitivity was not assessed in this study, the manufacturer’s strategy from v5.0 to v7.2 was to increase CAD sensitivity while maintaining the CAD false positive rate. Since detection of masses is fundamentally more challenging for CAD than detection of calcifications, the false positive rate for calcifications was intentionally reduced to allow an increase in the false positive rate for masses without a change in the overall false positive rate. This strategy could potentially facilitate increases in CAD sensitivity for masses while maintaining CAD sensitivity for calcifications (Hoffmeister JW, personal communication).
The retrospective nature of this study resulted in some limitations. First, all mammograms evaluated by CAD were screen-film. Although there is an increasing trend toward digital mammography, this study still reflects the majority of practices. Second, the assessment of the ease of dismissing CAD marks was made by a single radiologist, who knew that all of the marks were false positives. However, the consistency of the comparison between the two versions, several years apart, is an important result of this study.