Early work involving CADx models in mammography interpretation dates back to 1993. A summary of the primary mammography CADx models is presented in the table below.
Summary of computer-aided diagnostic models in mammography interpretation.
Early CADx research used artificial neural networks (ANNs) and Bayesian networks (BNs). The first CADx model was proposed by Wu et al., who developed an ANN to classify lesions detected by radiologists as malignant or benign [18]. They demonstrated that their simple ANN, which was built using 14 radiologist-extracted mammography features and trained on a small dataset, achieved a higher area under the curve (AUC) of the receiver operating characteristic (ROC) curve than a group of attending radiologists without computer aid (0.89 vs 0.84). Baker et al. later built more complex ANN models, whose inputs included Breast Imaging Reporting and Data System (BI-RADS) descriptors as well as variables related to the patient's medical history [19]. Their approach was later extended and evaluated by others [20]. Fogel et al. also built one of the early ANN models, which prospectively examined suspicious masses as a second opinion to radiologists [24]. Kahn et al. developed one of the first BN models to classify mammographic lesions as benign or malignant [25]. They used radiologist-extracted mammography features as the input to their model and demonstrated that BNs had the potential to help radiologists make diagnostic decisions.
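The basic recipe behind these early ANN models, a small feedforward network mapping hand-coded lesion features to a malignancy probability, can be sketched as follows. The feature values, network size and training details here are illustrative stand-ins, not those of Wu et al.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for radiologist-extracted features: 200 lesions,
# 14 hand-coded descriptors each (all values hypothetical).
X = rng.normal(size=(200, 14))
true_w = rng.normal(size=14)
y = ((X @ true_w + 0.5 * rng.normal(size=200)) > 0).astype(float)  # 1 = malignant

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

# One hidden layer of 8 tanh units, trained by full-batch gradient
# descent on the cross-entropy loss.
W1 = rng.normal(scale=0.1, size=(14, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.1, size=8); b2 = 0.0

lr = 0.5
for _ in range(2000):
    H = np.tanh(X @ W1 + b1)       # hidden activations
    p = sigmoid(H @ W2 + b2)       # predicted probability of malignancy
    g = (p - y) / len(y)           # output-layer error signal
    dH = np.outer(g, W2) * (1 - H ** 2)
    W2 -= lr * (H.T @ g); b2 -= lr * g.sum()
    W1 -= lr * (X.T @ dH); b1 -= lr * dH.sum(axis=0)

train_acc = float(((p > 0.5) == y).mean())
```

On this toy data the network separates the classes well above chance; the published models differed in feature sets and training procedures but shared this overall structure.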
Jiang et al. trained an ANN to differentiate malignant from benign clustered microcalcifications [26]. The microcalcifications were initially identified by radiologists, and eight features of these microcalcifications were automatically extracted by an image-processing algorithm. The training and testing data included 107 cases (40 malignant) from 53 patients. This retrospective study only included microcalcifications that underwent biopsy. Five radiologists participated in the observer study, and ROC analysis was used to assess performance. The average AUC values for the ANN and the radiologists were 0.92 and 0.89, respectively. While the overall AUCs did not differ significantly (p = 0.22), the comparison of AUCs above the 0.90 sensitivity threshold yielded a statistically significant difference (p < 0.05). Jiang et al. later extended this model to classify lesions as malignant or benign on multiple-view mammograms [27]. They found that the use of a CADx model decreased the number of biopsied benign lesions while increasing the biopsy recommendations for malignant clusters. In a follow-up study, Jiang et al. demonstrated that, in addition to its diagnostic power, their ANN model had the potential to reduce the variability among radiologists in the interpretation of mammograms [28]. In another study, they compared their CADx model with independent double reading on 104 mammograms (46 malignant) containing clustered microcalcifications and reported greater improvements in ROC performance when the CADx model was used than with independent double reading [29]. More recently, Rana et al. applied the CADx model developed by Jiang et al. on screen-film mammograms [26] to full-field digital mammograms [30]. They concluded that their CADx model maintained consistently high performance in classifying calcifications in full-field digital mammograms without requiring substantial modifications from its initial development on screen-film mammograms.
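The ROC summaries used throughout these observer studies reduce to simple computations over case scores. Below is a minimal sketch, with made-up reader scores, of the empirical AUC and of reporting specificity once sensitivity reaches 0.90, the high-sensitivity operating region emphasized in the Jiang et al. comparison.

```python
# Hypothetical observer scores on a scale where higher = more suspicious.
malignant = [0.90, 0.80, 0.85, 0.70, 0.60, 0.95]
benign = [0.40, 0.50, 0.30, 0.65, 0.20, 0.55, 0.45, 0.35]

def auc(pos, neg):
    """Empirical AUC via the Mann-Whitney statistic: the probability a
    malignant case scores above a benign one (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def spec_at_sens(pos, neg, target=0.90):
    """Specificity at the highest threshold whose sensitivity >= target."""
    for t in sorted(pos, reverse=True):
        if sum(p >= t for p in pos) / len(pos) >= target:
            return sum(n < t for n in neg) / len(neg)
    return 0.0
```

For these toy scores the AUC comes out near 0.98 and the specificity at the first threshold reaching 90% sensitivity is 0.875; the studies above applied the same kind of summaries to real reader data.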
Markopoulos et al. compared three radiologists' diagnostic accuracies with and without computer aid [31]. The computer analysis used an ANN to diagnose clustered microcalcifications on mammograms. This retrospective study included 240 suspicious microcalcifications (108 malignant), which were identified by radiologists and whose features were extracted by an image-processing algorithm. The inputs to the ANN were eight features of the calcifications, and biopsy was the reference standard. The AUC of the CADx model was 0.937, significantly higher than that of the best-performing physician (AUC = 0.835, p = 0.012). The authors concluded that CADx models have the potential to help improve the diagnostic accuracy of radiologists.
Huo et al. also used ANNs to classify mass lesions detected on screen-film mammograms [32]. They automated the feature-extraction process to reduce intra-observer variability [28]. In a follow-up study, Huo et al. used separate sets of data for training and testing instead of a single database [35]. Their database included 50 biopsy-proven malignant masses, 50 biopsy-proven benign masses and ten cysts proven by fine-needle aspiration. The inputs to the ANN were four characteristics of masses (margin, sharpness, density and texture) that were automatically extracted by an image-processing algorithm. When the CADx model was used, the average AUC of the radiologists increased from 0.93 to 0.96 (p < 0.001), demonstrating the generalizability of CADx models to distinct datasets. More recently, Li et al. converted the CADx model developed by Huo et al. on screen-film mammograms to apply to full-field digital mammograms [36]. They evaluated the performance of this CADx model using the AUC at various stages of the conversion process and concluded that CADx models have the potential to aid physicians in the clinical interpretation of full-field digital mammograms.
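Automated feature extraction of the kind described by Huo et al. can be illustrated on a synthetic patch. The three features below (mean density, intensity texture, border-gradient sharpness) are simplified stand-ins for the published measures, not the actual algorithms.

```python
import numpy as np

def binary_erode(mask):
    # Minimal 4-neighbour binary erosion (avoids a SciPy dependency).
    m = mask.copy()
    m[1:, :] &= mask[:-1, :]
    m[:-1, :] &= mask[1:, :]
    m[:, 1:] &= mask[:, :-1]
    m[:, :-1] &= mask[:, 1:]
    return m

def simple_mass_features(patch, mask):
    """Toy analogues of density, texture and margin sharpness; the
    formulas are illustrative, not Huo et al.'s."""
    inside = patch[mask]
    density = float(inside.mean())       # average intensity inside the mass
    texture = float(inside.std())        # coarse texture measure
    gy, gx = np.gradient(patch.astype(float))
    border = mask & ~binary_erode(mask)  # one-pixel inner border of the mass
    sharpness = float(np.hypot(gx, gy)[border].mean())  # margin gradient
    return density, texture, sharpness

# Synthetic patch: a bright disk ("mass") on a darker background.
yy, xx = np.mgrid[:20, :20]
mask = (yy - 10) ** 2 + (xx - 10) ** 2 <= 25
patch = np.where(mask, 0.8, 0.2)
density, texture, sharpness = simple_mass_features(patch, mask)
```

Replacing manual ratings with deterministic measurements like these is what removes the intra-observer variability the studies cite.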
Floyd et al. proposed a case-based reasoning (CBR) approach, in which classification is based on the ratio of matched malignant cases to total matches in the database [37]. The primary advantage of the CBR method over an ANN is the transparent reasoning process that leads to the system's diagnosis. A key limitation of CBR, however, is that a new case might not have any match in the database. This CBR analysis included 500 cases (174 malignant). Of these 500 cases, 232 were masses alone, 192 were microcalcifications alone and 29 were combinations of masses and associated microcalcifications. The inputs to the CBR model were ten features from the BI-RADS lexicon (five mass descriptors and five calcification descriptors) and one descriptor from clinical data. Biopsy was the reference standard. Two radiologists were asked to describe each lesion using the BI-RADS lexicon. The input dataset contained both retrospective (206 cases) and prospective (194 cases) data. The performance of the CBR model was compared with that of an ANN; while the ANN slightly outperformed the CBR (AUC = 0.86 vs 0.83, respectively), the study did not report the statistical significance of this difference.
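The core of such a CBR classifier, scoring a new case by the malignant fraction among its database matches and returning no score when nothing matches, can be sketched as follows; the feature values and the exact-match rule are hypothetical simplifications.

```python
def cbr_malignancy_score(new_case, database, match=None):
    """Fraction of matched cases that are malignant; returns None when
    the case has no precedent -- the limitation noted above."""
    if match is None:
        match = lambda a, b: a == b  # exact feature match (deliberately crude)
    hits = [label for feats, label in database if match(new_case, feats)]
    return sum(hits) / len(hits) if hits else None

# Toy case base: (BI-RADS-style descriptors, label) with 1 = malignant.
database = [
    (("spiculated", "high density"), 1),
    (("spiculated", "high density"), 1),
    (("spiculated", "high density"), 0),
    (("circumscribed", "low density"), 0),
]
score = cbr_malignancy_score(("spiculated", "high density"), database)
unmatched = cbr_malignancy_score(("oval", "low density"), database)
```

The transparency claim follows directly: the score is an auditable ratio over retrievable precedent cases, whereas an ANN's output is a weighted nonlinear combination with no such case-level explanation.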
Elter et al. evaluated two novel CADx approaches for predicting breast biopsy outcomes [38]. The study retrospectively analyzed cases that contained masses or calcifications, but not both. The dataset included 2100 masses (1045 malignant) and 1359 calcifications (610 malignant) that were extracted from mammograms in a public database and double-reviewed by radiologists. The positive cases were histologically proven cancers, while negative cases were followed up for a 2-year period. The inputs to the CADx models included patient age and five features from the BI-RADS lexicon (two mass descriptors and three calcification descriptors). Elter et al. used two types of CADx systems: a decision tree and a CBR. An ANN was also implemented to compare its performance with that of the two proposed models, and the models were evaluated with ROC analysis. Contrary to the findings of Floyd et al. [37], they found that the CBR outperformed the ANN (AUC = 0.89 vs 0.88, respectively, p < 0.001), while the ANN performed better than the decision tree (AUC = 0.88 vs 0.87, respectively, p < 0.001). The authors concluded that both systems could potentially reduce the number of unnecessary biopsies through more accurate prediction of breast biopsy outcomes. However, the differences in AUC were small, raising the possibility that they are not clinically significant.
Chan et al. retrospectively evaluated the effects of a linear discriminant classifier on radiologists' characterization of masses [34]. The dataset included 253 mammograms (127 malignant), and biopsy was the reference standard. The findings were initially identified by a radiologist, and 41 texture and morphologic features of these findings, extracted by an image-processing algorithm, were used as inputs to the linear discriminant classifier. Six reading radiologists evaluated the mammograms with and without CADx, and classification performance was evaluated by ROC analysis. The average AUC of the reading radiologists improved from 0.87 without CADx to 0.91 with CADx (p < 0.05). Hadjiiski et al. performed similar studies to evaluate a CADx model and specifically investigated the extent to which diagnostic accuracy increased when more mammographic information was available [39]. They evaluated two scenarios: the increase in CADx performance when trained on serial mammograms [39], and the increase when trained with interval change analysis, which used interval change information extracted from prior and current mammograms [40]. For both scenarios, they reported superior AUCs for radiologists with CADx compared with radiologists without CADx (AUC = 0.85 vs 0.79, p = 0.005, for the first scenario; AUC = 0.87 vs 0.83, p < 0.05, for the second) and, thus, a significant improvement in the radiologists' diagnostic accuracy.
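A linear discriminant classifier of the kind used in these studies projects the feature vector onto a single learned direction and thresholds the result. A minimal Fisher-discriminant sketch on synthetic two-class data (the features and class geometry are invented, not taken from Chan et al.):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-ins for extracted texture/morphology features.
benign = rng.normal(loc=0.0, size=(100, 5))
malignant = rng.normal(loc=1.0, size=(100, 5))

# Fisher's linear discriminant: w = Sw^-1 (mu1 - mu0), with the decision
# threshold at the midpoint of the projected class means.
mu0, mu1 = benign.mean(axis=0), malignant.mean(axis=0)
Sw = np.cov(benign, rowvar=False) + np.cov(malignant, rowvar=False)
w = np.linalg.solve(Sw, mu1 - mu0)
threshold = w @ (mu0 + mu1) / 2

X = np.vstack([benign, malignant])
labels = np.array([0] * 100 + [1] * 100)
accuracy = float(((X @ w > threshold) == labels).mean())
```

The projected score `X @ w` plays the same role as the ANN posterior in the other studies and can be fed into the same ROC analysis.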
Gupta et al. retrospectively studied 115 biopsy-proven mass or calcification lesions (51 malignant) using a linear discriminant analysis (LDA)-based CADx model [41]. The images and case records were obtained from a public database. This study compared the performance of the LDA using descriptors from one mammographic view versus two mammographic views. The attending radiologists described each abnormality using BI-RADS descriptors and categories, and the inputs to the CADx model included patient age and two features from the BI-RADS lexicon (mass shape and mass margin). While the CADx model with two mammographic views outperformed that with one view (AUC = 0.920 vs 0.881, respectively), the difference was not statistically significant (p = 0.056).
Wang et al. built and evaluated three BNs [42]. The first BN was constructed from a total of 13 mammographic features and patient characteristics. The other two were hybrid classifiers built on two subnetworks, one of mammographic-only and one of nonmammographic features: the first hybrid averaged the outputs of the two subnetworks, while the second used logistic regression (LR) to combine the outputs from the same subnetworks. This retrospective study included 419 cases (92 malignant). Positive cases were verified by biopsy and/or surgical reports, while negative cases were followed up for at least a 2-year period. The input features included four mammographic findings and nine descriptors from clinical data, all manually extracted by radiologists. The AUC for the BN that incorporated all 13 features was 0.886, and the AUCs for the BNs that included only mammographic features or only patient characteristics were 0.813 and 0.713, respectively. The BN that included the full feature set was significantly better than both hybrid BNs (p < 0.05).
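The two hybrid schemes, averaging the subnetwork posteriors versus feeding them through a logistic-regression combiner, amount to the following; the probabilities and LR weights are illustrative placeholders, not Wang et al.'s fitted values.

```python
import math

# Hypothetical posteriors from the two subnetworks for five cases.
p_mammo = [0.9, 0.2, 0.7, 0.4, 0.8]     # mammographic-feature subnetwork
p_clinical = [0.6, 0.3, 0.5, 0.1, 0.9]  # nonmammographic subnetwork

# Hybrid 1: average the two posteriors.
p_avg = [(a + b) / 2 for a, b in zip(p_mammo, p_clinical)]

# Hybrid 2: logistic-regression combiner over the two outputs
# (weights w0, w1, w2 are made up, not fitted to data).
w0, w1, w2 = -1.5, 2.0, 1.5
p_lr = [1 / (1 + math.exp(-(w0 + w1 * a + w2 * b)))
        for a, b in zip(p_mammo, p_clinical)]
```

Averaging weights the two sources equally; the LR combiner lets the data decide how much each subnetwork should count, which is the design difference between the two hybrids.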
Recently, Chhatwal et al. and Burnside et al. developed an LR model and a BN, respectively, based on a consecutive dataset from a breast imaging practice consisting of 62,219 mammography records (510 malignant). The input features included 36 variables based on BI-RADS descriptors for masses, calcifications, breast density and associated findings, as well as patients' clinical descriptors. The input dataset was recorded in the national mammography database format, which would allow the use of these models in other healthcare institutions. Contrary to most studies in the literature, they included nonbiopsied mammograms in their training dataset and used cancer registries, rather than biopsy results, as the reference standard. They analyzed the performance of the CADx models using ROC analysis and concluded that their CADx models performed better than the radiologists in aggregate (AUC = 0.963 and 0.960 for the LR and BN models, respectively, vs 0.939 for the radiologists; p < 0.05). More recently, Ayer et al. developed an ANN model using the same dataset and demonstrated that it achieved a slightly higher AUC (0.965) than the LR and BN models as well as the radiologists [45]. Additionally, Ayer et al. extended the performance analysis of CADx models from discrimination (classification) to calibration metrics, which assess the ability of a model to accurately predict the cancer risk for individual patients.
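Discrimination asks whether cancers score above non-cancers; calibration asks whether a predicted risk of, say, 20% corresponds to a 20% observed cancer rate. A toy sketch of two calibration-oriented summaries (the predictions and outcomes below are invented, and these particular metrics are common choices rather than necessarily the ones Ayer et al. used):

```python
# Hypothetical predicted risks and confirmed outcomes (1 = cancer).
preds = [0.05, 0.10, 0.20, 0.80, 0.90, 0.30, 0.15, 0.70]
outcomes = [0, 0, 0, 1, 1, 0, 0, 1]

# Brier score: mean squared error of the risk predictions
# (lower is better; sensitive to miscalibration, unlike AUC).
brier = sum((p - y) ** 2 for p, y in zip(preds, outcomes)) / len(preds)

# Crude reliability check: observed cancer rate within risk strata.
low = [y for p, y in zip(preds, outcomes) if p < 0.5]
high = [y for p, y in zip(preds, outcomes) if p >= 0.5]
obs_low, obs_high = sum(low) / len(low), sum(high) / len(high)
```

A model can have an excellent AUC yet systematically over- or under-state individual risk, which is why calibration matters for patient-level risk communication.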
Bilska-Wolak et al. conducted a preclinical evaluation of a previously developed CADx model, a likelihood ratio-based classifier, on a new set of data [46]. The model was retrospectively evaluated on 151 new and independent cases (42 malignant), with biopsy as the reference standard. Suspicious masses were detected and described by an attending radiologist using 16 different features from the BI-RADS lexicon and patient history. The authors evaluated the CADx model using ROC analysis and sensitivity statistics: the average AUC was 0.88, and the model achieved 100% sensitivity at 26% specificity. The results were compared with those of an ANN model created using the same datasets; the AUC of the ANN was lower than that of the likelihood ratio-based classifier. Bilska-Wolak et al. concluded that their CADx model showed promising results and could reduce the number of false-positive mammograms.
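A likelihood ratio-based classifier scores a case by how much more probable its features are under the malignant class than under the benign class. A minimal discrete-feature sketch follows; the feature values and the Laplace smoothing are our illustrative choices, not the published method.

```python
from collections import Counter

# Toy discrete BI-RADS-style feature (e.g. mass margin) per case.
malignant_cases = ["spiculated", "spiculated", "ill-defined", "spiculated"]
benign_cases = ["circumscribed", "circumscribed", "ill-defined",
                "circumscribed", "obscured"]

def likelihood_ratio(feature, pos, neg, alpha=1.0):
    """Estimated P(feature | malignant) / P(feature | benign), with
    add-one (Laplace) smoothing so unseen values stay finite -- the
    smoothing is our choice, not Bilska-Wolak et al.'s."""
    vocab = set(pos) | set(neg)
    p_pos = (Counter(pos)[feature] + alpha) / (len(pos) + alpha * len(vocab))
    p_neg = (Counter(neg)[feature] + alpha) / (len(neg) + alpha * len(vocab))
    return p_pos / p_neg

lr_spiculated = likelihood_ratio("spiculated", malignant_cases, benign_cases)
lr_circumscribed = likelihood_ratio("circumscribed", malignant_cases, benign_cases)
```

A ratio above 1 pushes the case toward malignancy and below 1 toward benignity; thresholding the ratio (or the product of ratios over independent features) yields the operating points, such as 100% sensitivity at 26% specificity, reported in the study.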