Several recent genome-wide association studies have identified genetic variants associated with breast cancer. However, how much these genetic variants may help advance breast cancer risk prediction based on other clinical features, like mammographic findings, is unknown. We conducted a retrospective case-control study, collecting mammographic findings and high-frequency/low-penetrance genetic variants from an existing personalized medicine data repository. A Bayesian network was developed using Tree Augmented Naive Bayes (TAN) by training on the mammographic findings, with and without the 22 genetic variants collected. We analyzed the predictive performance using the area under the ROC curve, and found that the genetic variants significantly improved breast cancer risk prediction on mammograms. We also identified the interaction effect between the genetic variants and collected mammographic findings in an attempt to link genotype to mammographic phenotype to better understand disease patterns, mechanisms, and/or natural history.
Large multi-relational databases containing variables that confer disease risk are increasingly available, providing the opportunity for informatics tools to better stratify individuals for appropriate healthcare decisions and to explore disease mechanism and behavior. At the same time, policy-makers have recommended that interventions, like breast cancer screening with mammography, be increasingly based on individualized risk and shared decision-making1,2. Targeting at-risk individuals for intervention after mammographic screening has the potential to decrease recommendations for breast biopsy in women most likely to have an unnecessary procedure for benign findings. Recent large-scale genome-wide association studies have identified 22 susceptibility loci associated with breast cancer (Table 1). In addition, there is a long history of development and codification of features observed by radiologists on mammography that also predict a woman’s risk of breast cancer. However, genetics and mammography abnormality findings have not yet been used together to predict risk. Furthermore, using these data to interpret genotype/phenotype association, explain family aggregation of breast cancer, and shed light on disease mechanism or natural history is only now becoming possible.
There have been several attempts to incorporate these genetic variants into the Gail model3, a standard clinical breast cancer risk model that includes the number of first-degree relatives with a diagnosis of breast cancer, age at menarche, age at first live birth and the number of previous breast biopsies. When seven associated SNPs are added to the Gail model4,5, the area under the receiver operating characteristic (ROC) curve increases from 0.607 to 0.632. When ten associated SNPs are added to the Gail model on another dataset6, the area under the ROC curve increases from 0.580 to 0.618. However, the Gail model does not include any of the mammography features that radiologists use clinically. Therefore, it is still unknown how much these genetic variants improve breast cancer diagnosis and clinical decision-making after an abnormal mammogram.
The first purpose of this study is to examine the impact of genetic information on improving breast cancer risk prediction on mammograms. We incorporate genetic polymorphisms with the descriptors that radiologists observe on mammograms while making medical decisions, the American College of Radiology Breast Imaging Reporting and Data System (BI-RADS)7, version 4, including the shape and the margin of masses, the shape and the distribution of microcalcifications, background breast density and other associated findings as defined by this standard lexicon in breast imaging. We also include a small number of predictive variables not included in BI-RADS currently. Specifically, we employ these mammographic findings (49 mammography descriptors) and the 22 genetic variants associated with breast cancer in 404 case subjects and 399 control subjects from a personalized medicine data repository at the Marshfield Clinic. We train a Bayesian network using Tree Augmented Naive Bayes (TAN)8 on the mammographic findings, with and without the 22 genetic variants.
The second purpose of this study is to identify the interaction effect between the genetic variants and the mammographic findings toward risk prediction, in order to understand genotype/phenotype relationships and reveal disease patterns that may not otherwise be evident in this complex multi-relational data. The interaction between the genetic variants and the mammographic findings also sheds light on how the associated SNPs function to increase or decrease the risk of breast cancer. Specifically, we calculate the conditional mutual information between the mammography features and the genetic variants given the class variable on the entire dataset.
[Subjects] The Personalized Medicine Research Project21 at the Marshfield Clinic was used as the sampling frame to identify breast cancer cases and controls. The project was reviewed and approved by the Marshfield Clinic IRB. Subjects were selected using clinical data from Marshfield Clinic Cancer Registry and Data Warehouse. We employed a retrospective case-control design. Women with a plasma sample available, a mammogram, and a breast biopsy within 12 months after the mammogram were included in the study. Cases were defined as women having a confirmed diagnosis of breast cancer obtained from the institutional cancer registry. Controls were confirmed through the electronic medical records (and absence from the cancer registry) as never having had a breast cancer diagnosis. In our case cohort, we included both invasive breast cancer (ductal and lobular) as well as ductal carcinoma in situ. In order to construct case and control cohorts that were similar in age distribution, we employed an age matching strategy. Specifically, we selected a control whose age was within five years of the age of each case. Of note, we decided to focus on high-frequency/low-penetrance genes that affect breast cancer risk as opposed to low frequency genes with high penetrance (BRCA1 and BRCA2) or intermediate penetrance (CHEK-2). High-frequency/low-penetrance SNPs generally have frequencies for the rarest allele of > 25% as opposed to the low-frequency, high-penetrance mappings with population frequencies of < 1%. We excluded individuals who had a known high penetrance genetic mutation.
[Genetic Variants] Our study included 22 genetic variants which have been identified by recent large-scale genome-wide association studies. Table 1 summarizes detailed information about the 22 SNPs, including the IDs, the original publications associating them with breast cancer, their chromosomes, the minor alleles and the allelic odds-ratios of the SNPs in the Marshfield Clinic population. The seven SNPs used in the Gail study4,5 were also included in our study. Nine of the ten SNPs used in the Wacholder et al. study6 were included in our study, and the remaining SNP from that study, rs7716600, had a proxy, rs10941679, in our study. We observed that each SNP confers only a slight increase or decrease in the risk of breast cancer, in accordance with prior literature. Among the 22 associated SNPs, 11 are associated with an increased risk of breast cancer (OR>1.0) and 11 are associated with a decreased risk of breast cancer (OR<1.0). When we built the models with the genetic variants, we coded each genetic variant by whether the subject carries the minor allele, rather than by the specific genotype the subject carries.
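The carrier coding described above can be sketched as follows. This is a minimal illustration, not the study's actual preprocessing: the two-letter genotype strings, the function name `carries_minor_allele`, and the example minor allele "A" are all assumptions introduced here.

```python
def carries_minor_allele(genotype: str, minor_allele: str) -> bool:
    """Binary carrier coding: True if either allele is the minor allele.

    Heterozygous and homozygous-minor genotypes are coded identically,
    so the specific genotype is deliberately discarded.
    """
    return minor_allele in genotype


# Hypothetical genotypes for a SNP whose minor allele is assumed to be "A".
example_genotypes = ["GG", "AG", "AA"]
coded = [carries_minor_allele(g, "A") for g in example_genotypes]
```

Under this coding each SNP becomes a single binary feature, which matches the "present"/"not present" form of most mammography features.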
[Mammography Features] The American College of Radiology developed the BI-RADS lexicon7 to homogenize mammographic findings and recommendations. The BI-RADS lexicon consists of a number of mammography descriptors, including the characteristics of masses and microcalcifications, background breast density and other associated findings, which can be organized in a hierarchy as shown in Figure 1. Datasets containing mammography descriptors have been used to build several successful breast cancer risk models and classifiers22,23. Mammography data was originally recorded as free text reports in the Marshfield database, and thus it was difficult to directly access the information contained therein. We used a parser to extract mammography features from the text reports; the parser has been shown to outperform manual extraction24,25. After extraction, every mammography feature takes the value “present” or “not present”, except that the variable mass size is discretized into three values, “not present”, “small” and “large”, depending on whether a mass size is reported and whether any dimension of the reported mass size is larger than 30mm.
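The three-valued coding of mass size can be illustrated with a short sketch. The function name, the representation of a reported size as a list of dimensions in millimetres, and the treatment of a dimension of exactly 30mm as "small" (the text says "larger than 30mm") are assumptions made for this example.

```python
from typing import Optional, Sequence


def discretize_mass_size(
    dimensions_mm: Optional[Sequence[float]], threshold_mm: float = 30.0
) -> str:
    """Map a reported mass size to "not present", "small" or "large"."""
    if not dimensions_mm:
        # No mass size was reported for this mammogram.
        return "not present"
    if any(d > threshold_mm for d in dimensions_mm):
        # At least one reported dimension exceeds the threshold.
        return "large"
    return "small"
```

For instance, a hypothetical mass reported as 12 x 31 mm would be coded "large" because one dimension exceeds the threshold.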
Each mammogram also has a BI-RADS category assigned by the radiologist who read the mammogram. The BI-RADS category indicates the radiologist’s opinion of the absence or presence of breast cancer. In our study, the BI-RADS assessment category can take values, with an order of increasing probability of malignancy, of 1, 2, 3, 0, 4a, 4, 4b, 4c and 5. We used the BI-RADS assessment category as the predictions from the radiologists. Our experiment only included diagnostic mammograms, and all the screening mammograms were excluded. Since most of the subjects have multiple diagnostic mammograms in the electronic medical records, we selected one mammogram for each subject as follows, to mimic the scenario of the most important doctor visit before diagnosis. For cases, we selected the mammograms within one year prior to diagnosis. For controls, we selected the mammograms within one year prior to biopsy. If there were still multiple mammograms left for each subject, we selected the mammogram with a more suspicious BI-RADS category, with subsequent tiebreakers being, in order, recency and the number of extracted mammography features.
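The selection rule above (most suspicious BI-RADS category first, then recency, then number of extracted features) can be sketched as a single sort key. The `Mammogram` record, its field names, and the reading of "within one year" as 365 days are assumptions for illustration; the category ordering is the one stated in the text.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# BI-RADS categories in order of increasing probability of malignancy,
# as listed in the text.
BIRADS_ORDER = ["1", "2", "3", "0", "4a", "4", "4b", "4c", "5"]
BIRADS_RANK = {category: rank for rank, category in enumerate(BIRADS_ORDER)}


@dataclass
class Mammogram:
    exam_date: date
    birads: str
    n_features: int  # number of extracted mammography features


def select_mammogram(
    candidates: List[Mammogram], reference_date: date
) -> Optional[Mammogram]:
    """Pick one diagnostic mammogram from the year before the reference date.

    The reference date is the diagnosis date for cases and the biopsy
    date for controls.
    """
    eligible = [
        m for m in candidates
        if 0 <= (reference_date - m.exam_date).days <= 365
    ]
    if not eligible:
        return None
    # Tuples compare element by element, which gives the tiebreak order:
    # suspicion, then recency, then number of extracted features.
    return max(
        eligible,
        key=lambda m: (BIRADS_RANK[m.birads], m.exam_date, m.n_features),
    )
```

Expressing the tiebreakers as a tuple keeps the priority order explicit and avoids nested conditionals.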
We build breast cancer risk models using Bayesian networks, which have been used with mammography data to improve breast cancer diagnosis and clinical decision-making for physicians involved in breast cancer care26,27. Bayesian networks are directed acyclic graphs that allow efficient and effective representation of the joint probability distribution over a set of random variables. Each vertex in the graph represents a random variable, and edges represent direct probabilistic dependencies between the variables, so that missing edges encode conditional independence. In this paper, we use a special type of Bayesian network model, namely TAN8, which is an effective, provably efficient supervised learning model that captures the strongest pairwise interactions between the features in a compact way. Training a TAN model starts with learning a Naive Bayes model with the case/control output being the class variable and all the other variables being the features. Naive Bayes assumes that all features are conditionally independent of one another given the class28. Because this assumption may be too strong, the TAN learning algorithm next builds a maximum spanning tree over the feature variables, with the weight between two features being their conditional mutual information given the class variable. Finally, the parameters in the model, namely the conditional probability tables, are estimated from the data. In our experiments, we use the TAN implementation in WEKA29.
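The structure-learning step of TAN can be sketched in a few lines. This is a simplified illustration and not the WEKA implementation used in the study: it uses plug-in (count-based) estimates, natural logarithms, an arbitrary root (the first feature), and Prim's algorithm for the maximum spanning tree. The function names and the dictionary-of-columns data layout are assumptions.

```python
import math
from collections import Counter
from typing import Dict, Hashable, Optional, Sequence


def conditional_mutual_information(
    x: Sequence[Hashable], y: Sequence[Hashable], c: Sequence[Hashable]
) -> float:
    """Plug-in estimate of I(X; Y | C) in nats from paired observations."""
    n = len(c)
    n_xyc = Counter(zip(x, y, c))
    n_xc = Counter(zip(x, c))
    n_yc = Counter(zip(y, c))
    n_c = Counter(c)
    total = 0.0
    for (xi, yi, ci), count in n_xyc.items():
        # p(x,y,c) * log[ p(x,y,c) p(c) / (p(x,c) p(y,c)) ], with the
        # sample size cancelling inside the logarithm.
        ratio = count * n_c[ci] / (n_xc[(xi, ci)] * n_yc[(yi, ci)])
        total += (count / n) * math.log(ratio)
    return total


def tan_feature_parents(
    columns: Dict[str, Sequence[Hashable]], labels: Sequence[Hashable]
) -> Dict[str, Optional[str]]:
    """Return each feature's feature-parent in the TAN tree.

    Every feature additionally has the class variable as a parent,
    exactly as in Naive Bayes; that edge is implicit here.
    """
    names = list(columns)
    parents: Dict[str, Optional[str]] = {names[0]: None}  # arbitrary root
    while len(parents) < len(names):
        # Prim's algorithm: attach the outside feature with the heaviest
        # edge to any feature already in the tree.
        _, parent, child = max(
            (
                (conditional_mutual_information(columns[u], columns[v], labels), u, v)
                for u in parents
                for v in names
                if v not in parents
            ),
            key=lambda edge: edge[0],
        )
        parents[child] = parent
    return parents
```

Directing all edges away from the root gives each feature at most one feature-parent, which is what keeps the conditional probability tables small.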
In total, we construct three TAN models built on different sets of features. The first model is built purely on the 49 mammography features, namely the breast imaging model. The second model is based purely on the 22 associated SNPs, namely the genetic model. The third model is built on the 49 mammography features and the 22 associated SNPs together, namely the combined model. We treat the BI-RADS category scores from the radiologists as the predictions from the radiologists, namely the baseline clinical assessment. We construct ROC curves for each model, and use the area under the curve (AUC) as a measure of performance of the models. We also provide the precision-recall (PR) curves for the models. We evaluate the models using 10-fold cross-validation. The 404 cases and 399 controls are randomly divided into 10 folds. In each round of the 10-fold cross-validation, we select one fold as the testing data and the remaining nine folds as the training data, so that each fold is used exactly once for testing.
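The evaluation protocol can be sketched with two small helpers: one that randomly partitions subjects into folds and one that computes the AUC as the probability that a randomly chosen case receives a higher score than a randomly chosen control (ties counted as one half). The function names, the fixed random seed, and the label coding (1 for case, 0 for control) are assumptions for this illustration, not details taken from the study.

```python
import random
from typing import List, Sequence


def make_folds(n_subjects: int, k: int = 10, seed: int = 0) -> List[List[int]]:
    """Randomly split subject indices into k nearly equal folds."""
    indices = list(range(n_subjects))
    random.Random(seed).shuffle(indices)
    # Taking every k-th index gives fold sizes that differ by at most one.
    return [indices[i::k] for i in range(k)]


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise case/control comparison."""
    case_scores = [s for s, y in zip(scores, labels) if y == 1]
    control_scores = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        (p > q) + 0.5 * (p == q) for p in case_scores for q in control_scores
    )
    return wins / (len(case_scores) * len(control_scores))


# 404 cases + 399 controls = 803 subjects.
folds = make_folds(404 + 399, k=10)
for test_fold in folds:
    held_out = set(test_fold)
    train_indices = [i for f in folds for i in f if i not in held_out]
    # A model would be trained on train_indices and scored on test_fold here.
```

The pairwise form of the AUC is quadratic in the number of subjects, which is acceptable at this sample size and makes the interpretation of the statistic explicit.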
We further evaluate the interaction between the SNPs and the mammography features toward predicting the class label (case or control). Specifically, we calculate the conditional mutual information (CMI) between the 22 SNPs and the 49 mammography features given the class label. We also calculate the 95% confidence intervals for the CMI between each SNP and each mammography feature via bootstrapping. We randomly draw samples with replacement from the 404 cases and the 399 controls, and calculate the conditional mutual information. We repeat this bootstrap 1,000 times to obtain 1,000 CMI values, sort them from smallest to largest, and report the 26th smallest and the 26th largest values as the bounds of the 95% confidence interval.
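The percentile bootstrap described above can be sketched as follows. Several details are assumptions made for this example rather than facts from the study: cases and controls are resampled separately so that group sizes are preserved, the CMI is a plug-in estimate in nats, and the function names and seed are hypothetical.

```python
import math
import random
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def cmi(x: Sequence[Hashable], y: Sequence[Hashable], c: Sequence[Hashable]) -> float:
    """Plug-in estimate of I(X; Y | C) in nats."""
    n = len(c)
    n_xyc, n_c = Counter(zip(x, y, c)), Counter(c)
    n_xc, n_yc = Counter(zip(x, c)), Counter(zip(y, c))
    return sum(
        (count / n) * math.log(count * n_c[ci] / (n_xc[(xi, ci)] * n_yc[(yi, ci)]))
        for (xi, yi, ci), count in n_xyc.items()
    )


def bootstrap_cmi_interval(
    x: Sequence[Hashable],
    y: Sequence[Hashable],
    c: Sequence[Hashable],
    n_boot: int = 1000,
    seed: int = 0,
) -> Tuple[float, float]:
    """95% percentile interval for the CMI from n_boot resamples."""
    rng = random.Random(seed)
    groups: dict = {}
    for i, ci in enumerate(c):
        groups.setdefault(ci, []).append(i)
    values: List[float] = []
    for _ in range(n_boot):
        # Assumption: resample with replacement within each class.
        idx = [rng.choice(members) for members in groups.values() for _m in members]
        values.append(cmi([x[i] for i in idx], [y[i] for i in idx], [c[i] for i in idx]))
    values.sort()
    k = (n_boot * 25) // 1000  # 25 when n_boot = 1000
    # With 1,000 resamples these are the 26th smallest and 26th largest values.
    return values[k], values[-(k + 1)]
```

Because a plug-in CMI estimate is never negative, the lower bound of such an interval sits at or above zero even for independent variables, which is one reason to read small CMI values cautiously.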
We succeeded in identifying 404 cases for which we could match a mammogram within a year prior to a biopsy. We then identified age-matched controls; however, at the end of data collection and verification, 5 of the controls were confirmed to have breast cancer, leaving us with 399 controls for which we could match a mammogram within a year prior to a biopsy. Among the 404 cases, there are 401 Caucasian cases, two Asian Hmong cases and one case whose race information is unknown. Among the 399 controls, there are 395 Caucasian controls, three Caucasian/American Indian controls, and one Caucasian/Asian Hmong control. We summarize the distribution of the ages and family breast cancer history of the cases and the controls in the Marshfield population in Table 2. There are more young people (age < 50) in the case group than in the control group, and the proportion of elderly people (age ≥ 65) is roughly the same in the case group and in the control group. For the family history of breast cancer, we observe a considerably larger proportion of people with a family history in the case group than in the control group, which demonstrates the family aggregation of breast cancer.
The ROC curves and the PR curves for the baseline clinical assessment, the breast imaging model, the genetic model and the combined model are provided in Figure 2. For each type of model we vertically average30 its ROC curves from the ten replications of the 10-fold cross-validation to obtain the final curve; we do likewise for the PR curves. The areas under the ROC curves for the genetic model, the breast imaging model and the combined model are 0.603, 0.693 and 0.731, respectively. The ROC curve of the combined model almost completely dominates the ROC curve of the breast imaging model, which suggests that adding the 22 genetic variants can help to improve the breast cancer risk prediction based on mammographic findings. We perform a two-sided paired t-test on the areas under the ten ROC curves of the breast imaging model and the areas under the ten ROC curves of the combined model from the 10-fold cross-validation, and the difference between them is significant, with a P-value of 0.021. From the PR curves, we also observe that the combined model dominates the breast imaging model and the baseline clinical assessment in the high recall region (> 0.8), which is where we would like to operate and therefore what we want to optimize.
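The paired comparison of per-fold AUCs can be sketched as below. The helper computes only the paired t statistic with the standard library; the function name is an assumption, and no AUC values from the study are reproduced here. With ten paired AUCs there are nine degrees of freedom, so a two-sided test at the 0.05 level rejects when the absolute statistic exceeds about 2.262.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(first: Sequence[float], second: Sequence[float]) -> float:
    """t statistic for the mean of the paired differences (second - first)."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    diffs = [b - a for a, b in zip(first, second)]
    # stdev uses the n - 1 denominator, as the paired t-test requires.
    standard_error = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs) / standard_error


# Two-sided critical value of Student's t at the 0.05 level, 9 degrees of freedom.
T_CRITICAL_DF9 = 2.262
```

Pairing by fold removes fold-to-fold variation that affects both models equally, so the test focuses on the difference attributable to the added features.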
For each SNP, we summarize the mammography feature with the highest conditional mutual information, the conditional mutual information value and the corresponding 95% confidence intervals in Table 3. Most of the interaction effect is moderate with small CMI values. There are four noteworthy interaction pairs between the genetic variants and the mammography features toward breast cancer risk prediction with the conditional mutual information above 0.01. The four interaction pairs are (1) SNP rs1045485 (rs17468277) and pleomorphic calcifications (CMI=0.0141), (2) SNP rs2180341 and dystrophic calcifications (CMI=0.0115), (3) SNP rs2981582 and diffuse calcifications (CMI=0.0112) and (4) SNP rs4666451 and oval masses (CMI=0.0100).
We found that adding the 22 genetic polymorphisms to the 49 radiologist-reported mammographic findings statistically significantly increased the accuracy (as measured by AUC-ROC) of our Bayesian network model, despite a small sample size. In our preliminary exploration of genotype/phenotype relationships, we identified 4 potentially noteworthy interacting pairs between the genetic variants and the mammographic findings. These observations imply that radiologists may benefit from the availability of patient genotype information when they are making their interpretations of mammogram results.
Statistical models by Gail4,5 and Wacholder et al6 added genetic risk factors to epidemiologic risk factors and found modest improvements in predictive performance. All of these studies used all or a portion of the carefully validated and widely disseminated Gail model (a logistic regression model) as the baseline model. The variables included in the most recent analysis6 were the number of first-degree relatives with a diagnosis of breast cancer, age at menarche, age at first live birth, study entry year, and the number of previous breast biopsies. These investigators added 10 common genetic variants associated with breast cancer in 5,590 case subjects and 5,998 controls. They found that the AUC of the non-genetic model was 0.580, whereas the model with genetic component added (10 SNPs) revealed an AUC of 0.618. Importantly, no model to date has included mammography features describing breast findings. It is not surprising that our discriminative abilities are superior to prior models because we are using highly predictive features from mammography including abnormality descriptors and breast density, which to date, have not been included in previous models. Therefore, we are encouraged by our promising preliminary results.
With the contingency tables in Table 4, we further explore the four interacting pairs from a clinical standpoint:
There are some unavoidable limitations in our study, due to the inherent difficulty of collecting a rich multi-relational dataset. First, the sample size is small compared with large-scale genome-wide association studies10,18,16,15,9. However, other studies do not include mammography features or abnormality data4,5,6. Second, all the mammogram reports in the original database are in free text, rather than structured reports. Although we extract the features with an accurate parser24,25, this extra step introduces noise, and in particular may miss important features for certain subjects. Third, for each SNP we pick the mammography feature with the highest CMI value among the 49 mammography features. Although we evaluate the 95% confidence intervals for the CMIs, the selected mammography features may appear promising by chance. The risk of false positive associations generated by this CMI analysis is further exacerbated by the small sample size. However, exploring genotype/phenotype relationships is only a secondary goal of this project and we approach this analysis with caution, realizing we need more data and refined methodologies to validate our findings and to eliminate selection bias or the multiple comparison effect.
Our study represents the first exploration of breast cancer risk prediction using genetic polymorphisms along with mammography features. The fact that genetic risk factors improve risk prediction to a statistically significant degree raises the possibility that stratification based on these risk factors may provide an opportunity to personalize care. In addition, we plan to further develop the concept of exploring genotype/phenotype relationships to shed light on disease processes that may, in the future, improve diagnosis and treatment. Though we fully realize the necessity of increasing our sample size to validate these promising preliminary results, we are cautiously optimistic about the power of multi-relational databases, like the one we have constructed, both to test risk prediction hypotheses and to engage in data mining that would not otherwise be possible.
The authors acknowledge the support of the Wisconsin Genomics Initiative, NCI grant R01CA127379-01 and its ARRA supplement 3R01CA127379-03S1, NIGMS grant R01GM097618-01, NLM grant R01LM011028-01, NIEHS grant 5R01ES017400-03, the UW Institute for Clinical and Translational Research (ICTR) and the UW Carbone Cancer Center.