In this study, we confirm the conclusions of previous research by applying ILP and calculating conditional probabilities on a dataset of structured mammographic findings. Both the ILP method and the conditional probabilities showed that high mass density is a potentially important predictor of breast malignancy. Most importantly, we validated these conclusions in an independent dataset of mammographic findings collected prospectively during routine clinical practice.
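As an illustration of the conditional probability calculation described above, the following minimal sketch estimates P(malignant | mass density) from structured records. The field names (`mass_density`, `malignant`), the helper `conditional_probability`, and the toy records are all hypothetical and chosen only for illustration; they are not the study's data or schema.

```python
def conditional_probability(records, given, outcome):
    """Estimate P(outcome | given) by counting matching records.

    `given` and `outcome` are dicts of field -> required value.
    Returns None when no record satisfies `given`.
    """
    matching = [r for r in records
                if all(r.get(k) == v for k, v in given.items())]
    if not matching:
        return None
    hits = sum(1 for r in matching
               if all(r.get(k) == v for k, v in outcome.items()))
    return hits / len(matching)


# Made-up toy records (NOT study data): one dict per mass finding.
records = [
    {"mass_density": "high", "malignant": True},
    {"mass_density": "high", "malignant": True},
    {"mass_density": "high", "malignant": False},
    {"mass_density": "low", "malignant": False},
    {"mass_density": "low", "malignant": False},
    {"mass_density": "low", "malignant": True},
    {"mass_density": "equal", "malignant": False},
    {"mass_density": "equal", "malignant": False},
]

p_high = conditional_probability(
    records, {"mass_density": "high"}, {"malignant": True})
p_low = conditional_probability(
    records, {"mass_density": "low"}, {"malignant": True})
print(f"P(malignant | high density) = {p_high:.2f}")
print(f"P(malignant | low density)  = {p_low:.2f}")
```

Comparing the two estimates side by side is the simplest way to see whether a descriptor shifts the risk of malignancy; a real analysis would also report confidence intervals and repeat the calculation in the independent validation dataset.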
One of the final steps in the knowledge discovery process is the evaluation of hypotheses generated using data mining techniques. This evaluation serves several important roles, including determining whether a hypothesis is novel, medically important, and generalizable outside of the mined dataset. Although ILP has been used in the past [1] to show that high mass density may be an important predictor of malignancy, that research only raised the possibility of the association; it did not quantify the results or test the conclusion in an independent dataset, as we have done here. This step is critical both for verifying that the association is neither random nor dataset-specific and for establishing its generalizability. Before the results of data mining techniques can be used, validation analysis must be conducted to provide convincing evidence of the utility of the discovered knowledge.
The association between high mass density and malignancy is interesting because, although experts have asserted the association in the past [19], it has not been established in the literature. The only study to evaluate this association concluded that the contribution of mass density to predicting malignancy was less than previously thought. Our study, which is larger and was performed in a sample of consecutive biopsies collected prospectively, showed that mass density is an important adjunct descriptor that can be used to help stratify the risk of malignancy.
The conclusions of this study are clinically important for several reasons. First, in the setting of mammography, identifying novel predictors of breast cancer is important because the positive predictive value (PPV) of biopsy remains quite low (10–35%). By identifying additional features that aid in the diagnosis of cancer, we may be able to improve the PPV of biopsy. Second, our research shows the importance of structured clinical reporting and the collection of structured data. These data can be mined for knowledge discovery and used to validate the hypotheses that are discovered. Finally, our study shows that we can combine the computer's ability to survey large amounts of data and generate hypotheses with human expertise in evaluating those hypotheses.
There were limitations to our study. First, although the datasets used in our study were from different institutions, they were collected in the same geographic region, so they may be more homogeneous than datasets from distinct geographic regions. Second, our method relies on the input of a radiologist to determine the importance of generated rules. In the future, we may be able to further automate the rule selection process using quantitative metrics, which would make rule selection more efficient. In this way, techniques such as ILP may help identify future research areas. For example, mammography data that include biopsy results could help determine the pathophysiologic factors that link high mass density with malignancy, which may lead to improvements in diagnosis and treatment. Despite these limitations, we believe the results of this study are valid and that important conclusions can be drawn from this line of research.
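The metric-based rule selection mentioned among the limitations could look something like the sketch below. It is only one possible approach, not the method used in this study: the choice of coverage and precision as metrics, the `min_coverage` threshold, the rule names, and the toy records are all assumptions made for illustration.

```python
def score_rule(rule, records, label_key="malignant"):
    """Return coverage (records matched) and precision (fraction malignant)."""
    covered = [r for r in records if rule(r)]
    if not covered:
        return {"coverage": 0, "precision": 0.0}
    positives = sum(1 for r in covered if r[label_key])
    return {"coverage": len(covered), "precision": positives / len(covered)}


def rank_rules(rules, records, min_coverage=2):
    """Drop rules that match too few records, then sort by precision."""
    scored = [(name, score_rule(fn, records)) for name, fn in rules.items()]
    kept = [(n, s) for n, s in scored if s["coverage"] >= min_coverage]
    return sorted(kept, key=lambda item: item[1]["precision"], reverse=True)


# Made-up toy records (NOT study data).
records = [
    {"mass_density": "high", "margin": "spiculated", "malignant": True},
    {"mass_density": "high", "margin": "circumscribed", "malignant": True},
    {"mass_density": "high", "margin": "circumscribed", "malignant": False},
    {"mass_density": "low", "margin": "circumscribed", "malignant": False},
    {"mass_density": "low", "margin": "circumscribed", "malignant": False},
    {"mass_density": "low", "margin": "spiculated", "malignant": True},
]

# Hypothetical candidate rules, each a predicate over one record.
rules = {
    "high_density": lambda r: r["mass_density"] == "high",
    "circumscribed_margin": lambda r: r["margin"] == "circumscribed",
    "high_and_spiculated": lambda r: (r["mass_density"] == "high"
                                      and r["margin"] == "spiculated"),
}

ranked = rank_rules(rules, records)
for name, stats in ranked:
    print(name, stats)
```

A coverage threshold keeps rules supported by only one or two cases from dominating the ranking; a radiologist would still review the shortlisted rules for clinical plausibility.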