The aromatase enzyme, encoded by the CYP19 gene, is vitally important for normal hormone function. Previous reports on the relationship between CYP19 variants and risk of breast cancer have not been comprehensive and have not yielded a clear picture of the gene's potential role in development of the disease. We took a haplotype-tagging approach to variant selection, so that all common variation within the gene was studied in relation to risk of breast cancer. We did this in a clinic-based case-control study of breast cancer with 756 cases and 726 matched controls.
Tagging methods are an efficient way to select and genotype a maximally informative set of common variants within a candidate gene [1]. Complete assay of all the variation within a gene is not usually feasible, nor is it statistically efficient, because many variants are correlated with each other. In addition, some variants may be so rare that very few subjects within a study would be expected to carry them, making statistical conclusions impossible. Two commonly used methods of selecting representative SNPs from a gene are the method of Carlson [1] and that of Stram [2]. One of the major differences between the two methods is that the Carlson method does not require variants within a haplotype bin to lie contiguously on the strand of DNA. Rather, location is ignored, and the correlation between pairs of SNPs takes precedence. Both methods disregard the rarer variants. Our analyses selected two sets of variants for study: one set of twelve selected using the haplotype-tag method of Stram and another set of twelve selected using the LD-tag method of Carlson and colleagues. Six variants were common to both sets. We used minor allele frequency (MAF) cutoffs of 0.05 for the Carlson method and 0.02 for the Stram method. Variants less common than these cutoffs were not examined in our analysis. No association with risk of breast cancer was detected in our population for any variant selected by either method. In addition, no association with risk of breast cancer was detected for any haplotype under either method.
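The LD-based binning idea described above can be sketched roughly as follows. This is a minimal greedy illustration in Python, not the implementation of Carlson and colleagues [1]. The function name `ld_select`, the 0/1/2 genotype coding, and the use of the squared genotype correlation as the r² measure are all assumptions made for the example.

```python
import numpy as np

def ld_select(genotypes, maf_min=0.05, r2_min=0.8):
    """Greedy LD binning (illustrative sketch, hypothetical helper).

    genotypes: (n_samples, n_snps) array of allele counts coded 0/1/2.
    Returns a list of (tag_index, member_indices) using original column indices.
    """
    g = np.asarray(genotypes, dtype=float)
    freq = g.mean(axis=0) / 2.0
    maf = np.minimum(freq, 1.0 - freq)

    # Drop variants below the MAF cutoff; position along the gene is ignored.
    kept = np.flatnonzero(maf >= maf_min)
    if kept.size == 0:
        return []

    # Assumption: squared Pearson correlation of genotype counts stands in for r^2.
    with np.errstate(invalid="ignore", divide="ignore"):
        r = np.corrcoef(g[:, kept], rowvar=False)
    r2 = np.nan_to_num(np.atleast_2d(r)) ** 2
    np.fill_diagonal(r2, 1.0)

    remaining = list(range(kept.size))
    bins = []
    while remaining:
        # Pick the SNP that exceeds the threshold with the most remaining SNPs.
        best = max(remaining,
                   key=lambda j: sum(r2[j, k] >= r2_min for k in remaining))
        members = [k for k in remaining if r2[best, k] >= r2_min]
        bins.append((int(kept[best]), [int(kept[k]) for k in members]))
        remaining = [k for k in remaining if k not in members]
    return bins
```

In this sketch, lowering `maf_min` or raising `r2_min` generally yields more bins, and therefore more tag SNPs to genotype.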
The first manuscripts published on CYP19 variants and risk of breast cancer examined the variable number of tandem repeats (VNTR) located in intron 4. Several reports [12] indicated a possible association with risk of breast cancer; however, because of the intronic location, the association was considered likely to be due to linkage disequilibrium with a functional site within the gene or gene region. Within our resequencing data, this variation was in high LD with nine other variants, none of which is responsible for a codon change. Five of the nine variants were also selected by the HT-Tag method, were genotyped, and were examined directly for association with risk of breast cancer. The other three were located in exon 3 (A240G), IVS5-16T>G, and IVS7-79A>G. Therefore, if the previously published results were truly due to linkage disequilibrium with a functional site located elsewhere, there is no evidence that the site was in CYP19. This VNTR was not associated with increased risk of breast cancer in our population. No association was detected when cases were limited to those whose tumors were estrogen receptor positive, when the analysis was stratified on menopausal status, or when cases were limited to invasive cancer. Similarly, we did not detect any association with risk of breast cancer for non-synonymous coding SNPs or for variants in the untranslated region (UTR) of exon 10. We examined both the C to T substitution (rs10046) previously reported to be linked to breast cancer by Kristensen [19] and a G to T substitution located 142 base pairs away (rs4646).
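For readers unfamiliar with how a single-variant association is typically summarized in a case-control design, the following is a minimal sketch of an odds ratio with a Wald 95% confidence interval computed from a 2x2 table. It is a generic textbook calculation, not necessarily the model fitted in this study, and the function name `odds_ratio_ci` and its argument names are hypothetical.

```python
import math

def odds_ratio_ci(case_carriers, case_noncarriers,
                  control_carriers, control_noncarriers, z=1.959964):
    """Odds ratio and Wald confidence interval from a 2x2 table (sketch).

    z is the standard normal quantile; the default gives a 95% interval.
    """
    a, b = case_carriers, case_noncarriers
    c, d = control_carriers, control_noncarriers
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")

    odds_ratio = (a * d) / (b * c)
    # Standard error of the log odds ratio.
    se_log = math.sqrt(1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper
```

A confidence interval that includes 1.0 is consistent with no detected association at the corresponding significance level.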
Another of our goals within this study was to evaluate whether the method of SNP tag selection influenced the results. Therefore, we systematically selected representative SNPs from the gene using two methods [1] to determine whether a specific method provided greater insight into the gene and disease association. The two methods led to the same scientific conclusions within our study. All of the variants selected by either method pointed to a lack of association between these variants and risk of breast cancer. We did, however, notice that the number of SNPs selected by the method of Carlson [1] was more sensitive to changes in the SNP selection parameters. For example, when we set the parameters to a minimum MAF of 0.02 (previously 0.05) and 90% (previously 80%) correlation within bins, the number of variants required to represent the majority of variation changed from 12 to 21.
One of the lessons from our analysis was that the residual correlation between haplotype tagging variants justifies the analysis of risk for disease by common haplotypes. We had earlier wondered whether analysis of risk by haplotype would be necessary. However, each tag selection method selected a group of tag SNPs that were strongly correlated, making haplotype analysis relevant. Two of our variants selected by the LD-tag method ((−628) and (596)) were 84% correlated. This is higher than expected as it exceeds the 80% correlation parameter used in the selection process. One possible reason for this is that the population on which our tag-selection process was conducted was not a subset of our own population. Rather, we used 60 Caucasian samples from the Coriell Cell Repository (Camden, NJ). This emphasizes the importance of checking residual correlations between SNPs within studies of this type. Study designs that use data from small groups of subjects available from public databases to select tag SNPs are at even greater risk of selecting tag SNPs that do not perfectly represent the variation within their own study population.
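The check of residual correlation recommended above can be sketched as follows. This is a minimal illustration in Python that assumes genotypes coded as 0/1/2 allele counts and uses the squared genotype correlation as the r² measure. The function name `residual_ld_pairs` and the SNP labels in the usage note are hypothetical.

```python
import numpy as np

def residual_ld_pairs(genotypes, names, r2_max=0.8):
    """List pairs of genotyped tag SNPs whose r^2 exceeds r2_max (sketch).

    genotypes: (n_samples, n_snps) array of 0/1/2 allele counts measured
               in the study population itself.
    names:     one label per column of genotypes.
    """
    g = np.asarray(genotypes, dtype=float)
    # Assumption: squared Pearson correlation of allele counts approximates r^2.
    r2 = np.corrcoef(g, rowvar=False) ** 2

    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if r2[i, j] > r2_max:
                flagged.append((names[i], names[j], float(r2[i, j])))
    return flagged
```

Calling `residual_ld_pairs(study_genotypes, ["snp_a", "snp_b", "snp_c"])` on the study's own genotype data would return any tag pairs that remain more strongly correlated than the threshold used during selection in the reference panel.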
A major strength of this study is the complete gene resequencing data from which we selected our haplotype tagging SNPs. This allowed us to investigate all of the common variation within this gene. Previous studies interrogated only a portion of the genetic variation within this gene. As mentioned earlier, rare variants were not examined. Another strength of this study is the reasonably large population of breast cancer cases and controls within which we examined our hypotheses. With the sample size available, we had power to detect odds ratios of 1.5 or greater, especially for allele frequencies above 10%. For alleles at the lower bound of frequency (0.05), only odds ratios of 1.69 or larger would have been detected.
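To illustrate how detectable effect size depends on allele frequency, the following is a rough sketch of a power calculation for an allelic case-control comparison using a two-sided normal approximation. The power model actually used in this study is not stated, so this sketch should not be expected to reproduce the figures quoted above. The function names `norm_cdf` and `allelic_power` are hypothetical.

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def allelic_power(maf, odds_ratio, n_cases, n_controls, z_crit=1.959964):
    """Approximate power of a two-sided allelic test (illustrative sketch).

    maf:        allele frequency in controls.
    odds_ratio: per-allele odds ratio under the alternative.
    z_crit:     critical value; the default corresponds to alpha = 0.05.
    """
    p0 = maf
    # Allele frequency in cases implied by the odds ratio.
    p1 = odds_ratio * p0 / (1.0 - p0 + odds_ratio * p0)
    # Each subject contributes two alleles (assumes independence of alleles).
    n1, n0 = 2 * n_cases, 2 * n_controls
    se = math.sqrt(p1 * (1.0 - p1) / n1 + p0 * (1.0 - p0) / n0)
    z = (p1 - p0) / se
    return norm_cdf(z - z_crit) + norm_cdf(-z - z_crit)
```

For example, `allelic_power(0.05, 1.5, 756, 726)` and `allelic_power(0.10, 1.5, 756, 726)` can be compared to see how, under this approximation, power for a fixed odds ratio rises with allele frequency.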
A limitation of this study is the clinic-based nature of the cases and controls. Our cases are somewhat younger than breast cancer cases overall in the Iowa SEER registry. This is somewhat expected, as women who are very old are unlikely to travel to a tertiary center for care of breast cancer. Ductal carcinoma in situ is slightly more common in Mayo cases than in Iowa SEER, but there were no strong differences between the Mayo cases and the Iowa SEER cases for tumor stage and ER status. From this comparison, we believe that the breast cancer cases in our sample are reasonably representative of all the breast cancer cases in the general population of this area. Women included in this study were over 90% Caucasian; therefore, these data are not generalizable to other population groups.
Another limitation of this study was the limited power to examine gene-environment interactions. Although we have thoroughly tested for main effects of CYP19 genetic polymorphisms on risk of breast cancer in this population, there may still exist interactions with environmental factors that have not been detected. Therefore, we encourage others with larger sample sizes to examine the joint effects of this gene and environmental factors.
In summary, we conducted a case-control analysis of breast cancer using a comprehensive selection of tagging variants to represent the majority of the common variants in the entire CYP19 gene. We also examined risk with cases stratified by menopausal status and with cases limited to those that were estrogen receptor positive. We found no evidence that variation in CYP19 is associated with risk of breast cancer.