We present a genotypic-phenotypic model for breast cancer risk prediction, which incorporates the main components of mammary estrogen metabolism, enzyme variants, and traditional risk factors related to estrogen exposure. In contrast to the relatively small number of functional studies of estrogen metabolism, multiple epidemiological studies have investigated breast cancer risk in relation to genetic variation in the critical enzymes involved in estrogen metabolism with inconsistent findings (56, 57). Such studies were limited in that they could consider only one or two of the enzymes in the estrogen metabolic pathway. Even those studies that examined all of the component enzymes were not able to assess underlying metabolic interactions in the pathway (38, 58–60). The drawback of any purely genetic assessment is the lack of consideration of functional interactions inherent in complex metabolic pathways such as the estrogen metabolism pathway. A pathway-based functional and quantitative approach is necessary to overcome the current limitation in genotype assessment (61). Our original estrogen metabolism pathway-based model (27) not only combined kinetic and genetic data, but also provided the opportunity to incorporate traditional risk factors tied to estrogen exposure. In attempting to answer the important question of how best to incorporate these risk factors into our kinetic-genetic model, we were guided by biological reasoning, experimental data, and epidemiological findings. We chose the two principal components of the model, namely estrogen level and reaction time, to connect to the traditional risk factors.
It is obvious that the estrogen level is increased in women receiving exogenous estrogens in the form of OC or HRT. Moreover, there is general agreement that the risk associated with OC and HRT depends on the duration of exposure, being lowest in women who never used OC or HRT (62). In the United States, the most commonly prescribed HRT is Premarin, a complex mixture of estrogens, in particular the equine estrogens equilin and equilenin, which differ structurally from E2 and E1 by having an unsaturated B ring. The amount of human estrogens is much lower; for example, E2 accounts for only 1.5% of estrogens present in Premarin (63). In spite of the structural difference, equine estrogens are metabolized by CYP1B1 and CYP1A1 to the catechol 4-OH-equilenin, which contains aromatic A and B rings. Like 4-OHE2, 4-OH-equilenin is further metabolized to its quinone, and cell culture experiments showed that 4-OH-equilenin, via its quinone, induced DNA damage in breast cancer cell lines and cellular transformation in vitro (64, 65). Thus, all estrogens, including equine estrogens used in HRT, are metabolized via the same CYP-mediated oxidative pathway to generate catechols and quinones, which, in turn, cause DNA damage. However, equine estrogens appear to be metabolized less efficiently than human estrogens, which may explain why Premarin seems to have a weaker effect on risk of breast cancer than endogenous E2. HRT and OC were documented in both GENICA and NBC, although specific issues, such as timing of exposure (e.g., age at first use, time since first use, time since last use), were not recorded and therefore could not be addressed in our model. In general, it was our intent in designing the model to capture each risk factor without attempting to specify every possible subgroup.
Besides input from exogenous OC and HRT, a variety of other factors influence the estrogen concentration, especially body weight and exercise. The Endogenous Hormones and Breast Cancer Collaborative Group concluded that the increase in breast cancer risk with increasing BMI among postmenopausal women was largely the result of the associated increase in estrogens (66). Because of the importance of body weight and obesity, we included BMI as an integral component of the model, utilizing data available in the GENICA and NBC groups. Exercise also has a well-known effect on estrogen concentration and breast cancer risk, especially in postmenopausal women (67). We did not include exercise in the model because neither GENICA nor NBC had collected exercise data. If such data were available in another study population, we could readily integrate exercise as a phenotypic factor into a future model via its effect on estrogen concentration.
Family history of breast cancer is associated with 10 to 20% of breast cancer cases, and within that group approximately one half (5–10% of all cases) are strongly hereditary, for example linked to germline mutations in genes such as BRCA1 and BRCA2 (68). It has been recognized that BRCA1 and BRCA2 mutations exhibit variable penetrance, which is likely accounted for by other susceptibility genes among carriers (69). Thus, family history results from the combined input of high- and low-penetrance genes. There were no known patients with BRCA1 or BRCA2 mutations in either GENICA or NBC. To reflect family history, we used a weighting factor, MFH, to optimize separation of cases and controls.
Benign breast disease encompasses a spectrum of histological entities, usually subdivided into nonproliferative lesions, proliferative lesions without atypia, and atypical hyperplasia (41, 70). Analysis of the original NBC demonstrated that the latter two types of lesions have clinically significant pre-malignant potential (41). In a more recent study of 9087 women followed for a median of 15 years, the relative risk of breast cancer associated with proliferative changes without atypia was 1.88 (95% confidence interval 1.66–2.12) and increased for atypical hyperplasia to 4.24 (95% confidence interval 3.26–5.41) (70). As expected, the inclusion of proliferative disease as a risk factor in our model improved risk prediction for the NBC, and the model showed a progressive risk increase for proliferative disease without atypia and atypical hyperplasia compared to the absence of proliferative lesions.
should be compared to . The odds ratio curves from the NBC in were derived using the weights for BMI and FH derived from the GENICA data set. There are nine parameters in the GENICA model that are being fit to the data and there are over 200 premenopausal cases and controls and over 700 postmenopausal cases and controls. This gives us over 20 premenopausal and over 70 postmenopausal cases and controls per parameter. Typical rules of thumb are that you should have ≥ 20 cases and controls per parameter to avoid over-fitting (71). Hence, model over-fitting should not be a serious concern, particularly for the postmenopausal women. In the postmenopausal GENICA women, the breast cancer odds for women at the 90th control 4-OHE2-AUC percentile were 1.89 times those for women at the 10th control 4-OHE2-AUC percentile. This odds ratio was reduced to 1.81 (a 4% reduction) in postmenopausal NBC women using the model that excluded proliferative disease. Hence, the test set analysis of the NBC women provides considerable validation of the GENICA model for postmenopausal women. Adding a history of proliferative disease to the 4-OHE2-AUC model changes the range of 4-OHE2-AUC values and increases the level of statistical significance but does not greatly affect the odds ratios associated with equivalent percentile values. For example, adding a proliferative disease history increases the 90th vs. 10th 4-OHE2-AUC percentile odds ratio from 1.81 to 1.83. In marked contrast to , , shows no evidence of elevated breast cancer risk associated with the SNPs in our genotypic-phenotypic model. It is thus plausible that the variation in breast cancer risk shown in these figures is due to variation in the patient’s 4-OHE2-AUC rather than to variation in the individual SNPs that are used in this model.
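The sample-size reasoning and the odds ratio comparison above can be checked with simple arithmetic. The following is a minimal sketch, not part of the model itself: it treats the stated "over 200" and "over 700" as lower bounds on the number of subjects, and the helper names are hypothetical.

```python
def subjects_per_parameter(n_subjects: int, n_parameters: int) -> float:
    """Number of subjects available per fitted model parameter."""
    return n_subjects / n_parameters


def relative_change(old: float, new: float) -> float:
    """Fractional change from old to new (negative means a reduction)."""
    return (new - old) / old


N_PARAMETERS = 9          # parameters fit in the GENICA model (from the text)
MIN_PREMENOPAUSAL = 200   # lower bound: "over 200" cases and controls
MIN_POSTMENOPAUSAL = 700  # lower bound: "over 700" cases and controls

pre_ratio = subjects_per_parameter(MIN_PREMENOPAUSAL, N_PARAMETERS)
post_ratio = subjects_per_parameter(MIN_POSTMENOPAUSAL, N_PARAMETERS)

# 90th vs. 10th percentile odds ratio: GENICA (1.89) vs. NBC (1.81)
or_change = relative_change(1.89, 1.81)

print(f"premenopausal subjects per parameter:  {pre_ratio:.1f}")
print(f"postmenopausal subjects per parameter: {post_ratio:.1f}")
print(f"change in odds ratio: {or_change:.1%}")
```

Both ratios exceed the quoted threshold of 20 subjects per parameter, with a much wider margin for the postmenopausal group, and the change from 1.89 to 1.81 corresponds to the roughly 4% reduction quoted.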
Several models are currently available to predict the risk of breast cancer, of which the Claus and Gail models are used most often (22, 26). The Claus model, which is based on assumptions of the prevalence of high-penetrance genes for susceptibility to breast cancer, is only applicable for women with a family history of breast cancer (23). The Gail model incorporates primary and secondary family history as well as the age at menarche, the age at first live birth, the number of breast biopsies, the presence of atypical hyperplasia in these biopsies, and race (21, 24). Both of these models were developed on the basis of data from much larger study populations than the two study groups available to us. The advantage of our genotypic-phenotypic model is the underlying biologic reasoning inherent in a pathway-based model and the integration of endogenous and exogenous risk factors.
In a recent study, Wacholder et al. (72) reported that the Gail model achieves an area under the receiver operating characteristic (ROC) curve of 0.534. The addition of seven SNPs associated with breast cancer increased the area under the ROC curve to 0.586. We used the NBC, which includes information on most of the risk categories of the Gail model, i.e., patient age, age at menarche and first birth, number of biopsies, presence of atypical hyperplasia in these biopsies, and family history (21, 24), for a direct comparison of our new model with the Gail model. The area under the ROC curve associated with our 4-OHE2-AUC model that includes proliferative disease was 0.588 (95% CI 0.56 – 0.62). This was slightly greater than, but not significantly different from, that associated with the Gail model, 0.558 (95% CI 0.53 – 0.59). Hence, while these models can identify women at increased breast cancer risk, none of them is particularly effective at predicting who will develop breast cancer.
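To make the meaning of these values concrete, the area under the ROC curve equals the probability that a randomly chosen case receives a higher model score than a randomly chosen control, with ties counted as one half. The sketch below computes it directly from that definition. The scores are invented for illustration and are not study data, and the function name is hypothetical.

```python
def roc_auc(case_scores, control_scores):
    """Area under the ROC curve from its rank interpretation.

    Returns the fraction of (case, control) pairs in which the case has the
    higher score, counting tied pairs as one half.
    """
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Invented risk scores, for illustration only
cases = [0.9, 0.6, 0.4, 0.7]
controls = [0.5, 0.3, 0.6, 0.2]

auc = roc_auc(cases, controls)
print(f"AUC = {auc:.3f}")
```

A value of 0.5 corresponds to a model that ranks cases and controls no better than chance, which is why values near 0.56 to 0.59 indicate only modest discrimination at the level of the individual woman.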
A shortcoming of our current model is the omission of functional SNPs outside the coding region and the inclusion of only three genes, albeit of primary importance for mammary estrogen metabolism. Another important gene, CYP19A1, encodes aromatase, the main enzyme producing E2 and E1 from androgen precursors. Haplotype-tagging SNPs and common haplotypes spanning the coding and proximal 5′ region of CYP19A1 were shown to be significantly associated with a 10 to 20% increase in endogenous estrogen levels in postmenopausal women (73). The future addition of CYP19A1 in the form of haplotype-tagging SNPs would extend the range of our model by including information about the input E2 concentration to be converted by CYP1A1, CYP1B1, and COMT to carcinogenic metabolites. Among the phase II conjugating enzymes, COMT is the sole methylating enzyme while there are potentially three glutathione-conjugating enzymes, GSTA1, GSTM1, and GSTP1. COMT catalyzes the methylation of catechol estrogens to methoxy estrogens, which lowers the catechol estrogens available for conversion to estrogen quinones. In turn, the estrogen quinones undergo conjugation with glutathione (GSH) via the catalytic action of GSTs. The formation of GSH-estrogen conjugates would reduce the level of estrogen quinones and thereby lower the potential for DNA damage. Based on protein levels, GSTP1 is the most important member of the GST family expressed in breast tissue (74). The two other GST isoforms, GSTM1 and GSTA1, are expressed at lower levels. In fact, about 50% of Caucasian women possess the GSTM1 null genotype and therefore completely lack GSTM1 expression in all tissues including breast (75). Based on these considerations, we cloned wild-type GSTP1 cDNA and prepared the purified, recombinant enzyme to assess its role in the estrogen metabolic pathway. We showed that GSTP1 converted the estrogen quinones into estrogen-GSH conjugates (31). Several non-synonymous GSTP1 polymorphisms have been described with altered catalytic activity towards polycyclic aromatic hydrocarbon carcinogens (76, 77). With regard to estrogen substrates, it is presently unknown whether the variants differ from wild-type GSTP1 in their ability to convert carcinogenic estrogen quinones to nontoxic estrogen-GSH conjugates. In future experiments, we could determine the kinetic rate constants for the variant GSTP1 isoforms and utilize them to account for genetic differences between women in the production of these non-carcinogenic estrogen metabolites.
Another limitation of our model is the lack of actual estrogen metabolite measurements. However, it would be difficult, if not impractical, to obtain a sufficient number of samples to truly reflect a woman’s lifetime endogenous and exogenous estrogen exposure. Thus, we derived the overall exposure by taking into account her total years of ovulation as a function of current age, age at menarche, age at menopause, number of full-term pregnancies, and the use of OC and HRT. Our estimates could be improved by taking into account genetic information related to the CYP19A1 gene, which encodes aromatase as the sole enzyme producing the parent estrogens E2 and E1. As mentioned above, certain CYP19A1 haplotypes were shown to be associated with increased endogenous estrogen levels in postmenopausal women (73).
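As a purely illustrative sketch of how such an exposure estimate might be assembled from questionnaire data, the function below computes years of ovulation from age and reproductive history. The formula is an assumption made for this example and is not the one used in the model: in particular, the nine-month deduction per full-term pregnancy is hypothetical, and the contribution of OC and HRT is deliberately left out because the text does not specify how it enters.

```python
from typing import Optional

# Assumption for illustration: about nine months without ovulation
# per full-term pregnancy. This value is not taken from the model.
PREGNANCY_YEARS = 0.75


def ovulatory_years(
    current_age: float,
    age_at_menarche: float,
    age_at_menopause: Optional[float],
    full_term_pregnancies: int,
) -> float:
    """Hypothetical estimate of a woman's total years of ovulation.

    For premenopausal women (age_at_menopause is None) the count runs
    up to the current age; otherwise it stops at menopause.
    """
    if age_at_menopause is None:
        end_age = current_age
    else:
        end_age = min(current_age, age_at_menopause)
    years = end_age - age_at_menarche - PREGNANCY_YEARS * full_term_pregnancies
    return max(years, 0.0)  # never report a negative exposure time


# Postmenopausal woman: menarche at 13, menopause at 50, two pregnancies
print(ovulatory_years(60, 13, 50, 2))
# Premenopausal woman: menarche at 12, no pregnancies
print(ovulatory_years(40, 12, None, 0))
```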
In a discussion of mathematical modeling, A.M. Turing wrote: “This model will be a simplification and an idealization, and consequently a falsification. It is to be hoped that features retained for discussion are those of greatest importance in the present state of knowledge.” (78) The genotypic-phenotypic approach to modeling reflects this paradigm. The model contains different facets that can be manipulated to strengthen its predictive powers. Furthermore, its flexibility allows one to change the metabolism pathway and/or the phenotypic parameters. For example, incorporation of additional enzymes (e.g., CYP19A1, GSTP1) and their variants into the pathway is easily accomplished by adding suitable differential equations with appropriate kinetic constants to the set of differential equations of the metabolism pathway. Similarly, if another phenotypic parameter became available (e.g., alcohol consumption with categorical data), it could readily be incorporated into the model. Regular alcohol consumption has been linked to an increase in breast cancer risk. A meta-analysis of 98 studies showed an excess risk of 22% for drinkers versus nondrinkers with a dose-response relationship among women who drink moderate to high levels of alcohol (79, 80). Thus, the relationship between alcohol and breast cancer appears to be causal, but the mechanism for this association is not well understood. One potential mechanism is the influence of alcohol intake on estrogen metabolism. Animal experiments have shown that ethanol consumption increases hepatic aromatase activity, which, in turn, could increase the conversion of androgens to estrogens (81). Indeed, several studies observed a positive correlation between alcohol intake in women and both blood and urinary estrogen concentrations, but other studies found no correlation or even an inverse association (82, 83). However, postmenopausal women receiving HRT experienced a significant and sustained increase in circulating estrogen following ingestion of alcohol (84). Women drinking ≥20 g/day who used HRT had an increased risk of breast cancer (RR 2.24; 95% CI 1.59 – 3.14) compared to nondrinkers who never used HRT (85). In light of the latter association, the model could be refined by inclusion of alcohol consumption in the subgroup of women who received HRT. Other, seemingly unrelated, factors can also be tied into the AUC model. For example, since the genotypic-phenotypic model is based on the formation of DNA adducts in the estrogen metabolism pathway, a dynamic system (a submodel) for the enzymatic repair of these adducts can be integrated into the model (86). This would permit one to investigate women who have the genetic machinery that produces high 4-OHE2-AUC values and their accompanying risk, but have effective DNA repair machinery, thus mitigating the breast cancer risk. This flexibility allows us, as stated above by Turing, to experiment with the model by adding and/or removing components to enhance its ability to predict breast cancer risk.
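The idea of extending the pathway by adding differential equations can be sketched with a toy simulation. The example below is an illustration under stated assumptions, not the published model: it assumes Michaelis-Menten rate laws, uses invented kinetic constants and a simplified three-step pathway, and integrates with a basic forward Euler scheme. Each reaction is one tuple, so adding an enzyme such as GSTP1 amounts to appending one entry to the reaction list.

```python
def michaelis_menten(vmax, km, s):
    """Michaelis-Menten rate for substrate concentration s."""
    return vmax * s / (km + s)


def simulate_auc(reactions, initial, t_end=10.0, dt=0.01):
    """Integrate the pathway with forward Euler steps.

    reactions: list of (substrate, product, vmax, km) tuples.
    Returns the area under the concentration-time curve (trapezoidal
    rule) for every species named in `initial`.
    """
    conc = dict(initial)
    auc = {name: 0.0 for name in conc}
    for _ in range(int(round(t_end / dt))):
        rate = {name: 0.0 for name in conc}
        for substrate, product, vmax, km in reactions:
            v = michaelis_menten(vmax, km, conc[substrate])
            rate[substrate] -= v  # substrate is consumed
            rate[product] += v    # product is formed
        nxt = {n: max(conc[n] + dt * rate[n], 0.0) for n in conc}
        for n in conc:
            auc[n] += 0.5 * dt * (conc[n] + nxt[n])
        conc = nxt
    return auc


# Invented starting concentrations and kinetic constants (arbitrary units)
INITIAL = {"E2": 1.0, "4-OHE2": 0.0, "quinone": 0.0, "quinone-SG": 0.0}
BASE = [
    ("E2", "4-OHE2", 1.0, 0.5),       # CYP1B1-like hydroxylation
    ("4-OHE2", "quinone", 0.5, 0.5),  # oxidation of catechol to quinone
]
# Extending the pathway: one extra reaction for a GSTP1-like conjugation
EXTENDED = BASE + [("quinone", "quinone-SG", 0.8, 0.5)]

auc_base = simulate_auc(BASE, INITIAL)
auc_extended = simulate_auc(EXTENDED, INITIAL)
print(f"quinone AUC without conjugation: {auc_base['quinone']:.3f}")
print(f"quinone AUC with conjugation:    {auc_extended['quinone']:.3f}")
```

In this toy system the added conjugation step lowers the quinone AUC while leaving the upstream 4-OHE2 AUC unchanged, in line with the expectation that formation of GSH-estrogen conjugates reduces the level of estrogen quinones.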
In summary, the current study presents a model for the prediction of breast cancer risk that incorporates the mammary estrogen metabolism pathway, genetic enzyme variants, and traditional risk factors related to estrogen exposure. The model was applied to two separate case-control studies and has the potential to give a personalized risk estimate to allow more targeted screening and possibly earlier diagnosis and treatment of the disease.