The Cancer Genetic Markers of Susceptibility (CGEMS) initiative has conducted a three-stage genome-wide association study (GWAS) of breast cancer in 9,770 cases and 10,799 controls. In Stage 1, we genotyped 528,173 single nucleotide polymorphisms (SNPs) in 1,145 cases of invasive breast cancer among postmenopausal white women, and 1,142 controls; in Stage 2, 24,909 SNPs with low p values observed in Stage 1 were analyzed in 4,547 cases and 4,434 controls. In Stage 3, we investigated 21 loci with low p values from Stages 1 and 2 combined in 4,078 cases and 5,223 controls. Two novel loci achieved genome-wide significance. A pericentromeric SNP on chromosome 1p11.2, rs11249433 (p = 6.74 × 10−10, adjusted genotype test with 2 degrees of freedom), resides in a large block of linkage disequilibrium neighboring NOTCH2 and FCGR1B and is predominantly associated with estrogen receptor-positive breast cancer. A second SNP, rs999737 on chromosome 14q24.1 (p = 1.74 × 10−7), localizes to RAD51L1, a gene in the homologous recombination DNA repair pathway, a prior candidate pathway for breast cancer susceptibility. We confirmed previously reported markers on chromosome 2q35, 5q11.2, 5p12, 8q24, 10q26, and 16q12.1. Our results underscore the importance of large-scale replication in the identification of low penetrance breast cancer alleles.
Epidemiologic investigation of breast cancer has identified a number of environmental and lifestyle risk factors (e.g., age at menarche and menopause, parity, age at first birth, body mass index and exogenous hormone use)1. Breast cancer is nearly twice as frequent in first-degree relatives of women with the disease as in relatives of women without such a history, suggesting an important contribution of inherited susceptibility. Established causal variants from before the GWAS era account for only a small fraction of sporadic breast cancers. These include high penetrance germline mutations segregating in high-risk pedigrees, most notably in BRCA1 and BRCA2 (refs. 2,3); a handful of rare susceptibility variants with lower penetrance identified in DNA repair and apoptosis genes4–8; and only one locus with a minor allele frequency larger than 5% (CASP8), found using the candidate gene approach in association studies9.
Genome-wide association studies have identified multiple new common genetic variants influencing breast cancer risk. Easton et al. analyzed genotypes from 390 cases enriched for a strong family history of breast cancer and 364 controls with 227,876 SNPs and followed the top 10,405 SNPs in a two-stage replication study (primarily conducted in population-based studies of unrelated subjects), resulting in the identification of 5 loci (10q26 (FGFR2), 16q12.1 (TNRC9), 5q11.2 (MAP3K1), 8q24 and 11p15.5 (LSP1)) based on large-scale follow-up studies10. In the initial report from the NCI Cancer Genetic Markers of Susceptibility (CGEMS) initiative, based on a follow-up of the top ten SNPs from the Stage 1 GWAS, we independently identified SNPs in intron 2 of FGFR2 as associated with breast cancer at genome-wide significant levels11. Subsequently, the FGFR2 locus was also identified in an Icelandic population12 and a locus at 2q35 was also reported to confer susceptibility to estrogen receptor (ER)-positive breast cancer12. Finally, combined analysis of a promising signal using the three published GWAS led to the identification of an additional locus on 5p12 (ref. 13). Power calculations based on the available sample sizes (390–1,791 cases) in the three GWAS efforts suggest that each has limited power to detect the low observed relative risks (RRs of 1.1–1.3 per allele) at conventional levels of genome-wide significance (p < 5 × 10−7)14. Thus, it is likely that a high proportion of the susceptibility loci have not yet been detected.
In Stage 1 of CGEMS, we genotyped 1,145 post-menopausal women of European ancestry with invasive breast cancer (cases) and 1,142 matched controls nested within the prospective Nurses’ Health Study cohort11. This stage used 528,173 SNPs that were estimated to be correlated with an r2 > 0.8 to approximately 90% of the common HapMap Phase II SNPs. We report here a follow-up of this first stage. In Stage 2, we attempted to genotype 30,448 SNPs in 4,547 cases and 4,434 controls from four different studies (Table 1). These SNPs were selected using a stepwise procedure (Supplementary Methods); the majority were chosen by a hypothesis-free (agnostic) strategy, while approximately one fifth of the SNPs were selected by alternative approaches fully reported in the Supplementary Methods and described below.
Briefly, for Stage 2, 22,136 SNPs were first selected based on a p-value less than 0.05 in a logistic regression model using a two-degree of freedom (df) score test with indicator variables for heterozygous and homozygous carriers and four continuous variables representing principal components of population stratification. The 2-df score test was chosen because it makes minimal assumptions for the underlying genetic model. This set of SNPs was complemented with 2,773 SNPs with a p-value less than 0.06 in tests of dominant, recessive or multiplicative models that were not already included by virtue of their p-value in the score test (each test has 1 df - see Supplementary Methods). In the ‘agnostic’ category, SNPs with low p-values in strong linkage disequilibrium (r2≥0.8) were removed. We selected an additional 1,436 ‘agnostic’ SNPs not included in the two previous criteria based on a 2-SNP test that conditioned each SNP on a neighboring SNP, if this improved the p-value relative to the single SNP-statistics by an order of magnitude. Loci marked by SNPs previously established by GWAS were further explored with a dense set of 1,711 SNPs. Also included were 3,788 SNPs drawn from candidate genes in previously proposed pathways or identified in an analysis of suggested interaction with variants in intron 2 of the FGFR2 gene. Finally, to monitor population stratification, 1,508 SNPs with low pair-wise linkage disequilibrium were included15.
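As an illustration of the 2-df genotype test described above, the following minimal sketch computes an unadjusted version from a 2 × 3 table of genotype counts. It is not the CGEMS analysis code: it omits the four principal components and any other covariates, uses a Pearson chi-square statistic in place of the logistic regression score test, and the function name and counts are hypothetical.

```python
import math


def genotype_test_2df(case_counts, control_counts):
    """Unadjusted 2-df test on a 2 x 3 case/control-by-genotype table.

    Each argument holds counts of (reference homozygotes, heterozygotes,
    variant homozygotes). Assumes every genotype column has a nonzero total.
    """
    table = [list(case_counts), list(control_counts)]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(3):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected

    # For 2 degrees of freedom the chi-square upper tail is exactly exp(-x/2).
    p_value = math.exp(-chi2 / 2.0)
    return chi2, p_value


# Hypothetical counts, for illustration only (not study data).
chi2, p = genotype_test_2df((300, 500, 200), (360, 480, 160))
print(f"chi2 = {chi2:.3f}, p = {p:.4f}")
```

Treating heterozygotes and homozygotes as separate categories is what makes the test agnostic to the genetic model, at the cost of one extra degree of freedom relative to a trend test.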
A total of 30,278 SNPs (92.1%) provided reliable genotypes according to our quality control metrics (see Supplemental Methods). We removed subjects with greater than 20% admixture of non-European origin based on analysis using the STRUCTURE program16. We conducted a principal component analysis (PCA) using the SNPs chosen to monitor population stratification and observed minimal evidence of population stratification between cases and controls; the distribution of the p-values for the association statistics with a 2 degree-of-freedom test unadjusted for population heterogeneity was close to the expected distribution under the null hypothesis17. The inflation factor, λ, was 1.010 and was reduced to 1.009 when the first four principal components were included as covariates in the association test. A joint analysis of the genotypes18 in the first and second stages was performed using an age, study design and population stratification-adjusted multinomial regression analysis (2 df test).
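The inflation factor λ can be illustrated with a short sketch. It assumes the common genomic-control definition (the observed median test statistic divided by the median expected under the null hypothesis) applied to 2-df tests, for which the chi-square statistic can be recovered from a p value as −2 ln p. The text does not state the exact computation used, and the function name is hypothetical.

```python
import math
import statistics


def inflation_factor_2df(p_values):
    """Genomic inflation factor from p values of 2-df chi-square tests.

    Assumes all p values are strictly greater than 0.
    """
    # Invert the 2-df chi-square tail, p = exp(-x / 2), to recover x.
    chi2_stats = [-2.0 * math.log(p) for p in p_values]
    # The median of a chi-square distribution with 2 df is 2 * ln(2).
    expected_median = 2.0 * math.log(2.0)
    return statistics.median(chi2_stats) / expected_median


# Evenly spaced p values mimic a null (uniform) distribution.
null_like = [(i + 0.5) / 1000 for i in range(1000)]
print(round(inflation_factor_2df(null_like), 3))
```

Values close to 1, such as the 1.010 and 1.009 reported here, indicate little systematic inflation of the test statistics.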
In the combined analysis of the initial scan with the second stage, we note that markers in 6 of the 7 loci reported in prior GWAS were strongly associated with breast cancer risk in post-menopausal women (Table 2). SNPs in 2q35, 5q11.2 (MAP3K1), 5p12, 8q24, 10q26 (FGFR2) and 16q12.1 (TOX3/TNRC9) provided strong signals (Table 2 and Supplemental Table 1); in some cases, an alternative SNP to the originally reported SNP provided a smaller p value (see below). The lowest p value for a marker at 11p15.5 (LSP1, rs3817198) was minimally significant (p = 3.87 × 10−2, trend test with 1 df; see Supplemental Table 1), but its allele-specific odds ratio in our combined three-stage analysis was similar to that reported previously (heterozygote odds ratio [OR] 1.04; 95% CI 1.00–1.09; homozygote OR 1.09; 95% CI 1.00–1.19). For the single candidate gene variant that had previously been reported as genome-wide significant, the results for rs1045485 in CASP8 (p = 5.47 × 10−2, trend test with 1 df) were also consistent with previous findings (heterozygote OR 0.96; 95% CI 0.91–1.00; homozygote OR 0.92; 95% CI 0.84–1.00). After Stage 2, no indication of association (p2df = 0.50) was observed for rs2107425 in the H19 region, previously associated at a lower level of significance by Easton et al. (reported ptrend = 2 × 10−5)10. A GWAS in American Jewish women of Ashkenazi background had identified a locus on chromosome 6 (rs2180341) with a MAF of 0.21 and a per allele OR of 1.41 (p = 3.0 × 10−8)19. In CGEMS, SNP rs9398840, which was strongly correlated with rs2180341 (r2 = 1.0) in the CEU HapMap population, was not significantly associated (p2df = 0.58) and was not taken into Stage 2.
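To make the genotype-specific odds ratios and confidence intervals quoted above concrete, the sketch below computes an unadjusted odds ratio with a Wald (Woolf) 95% confidence interval from four cell counts. This is a simplification: the estimates in this study come from covariate-adjusted regression models, and the function name and counts here are hypothetical.

```python
import math


def odds_ratio_wald_ci(exposed_cases, unexposed_cases,
                       exposed_controls, unexposed_controls, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval.

    For a heterozygote OR, 'exposed' means heterozygous carriers and
    'unexposed' means reference homozygotes. All counts must be nonzero.
    """
    a, b = exposed_cases, unexposed_cases
    c, d = exposed_controls, unexposed_controls
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, for illustration only (not study data).
or_het, lo, hi = odds_ratio_wald_ci(500, 300, 480, 360)
print(f"OR = {or_het:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

An interval whose lower bound sits at 1.00, as for rs3817198 above, corresponds to an association that is only marginally significant at the 5% level.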
Stage 3 included a set of 24 SNPs, 21 of which were based on a preliminary combined analysis of the first two stages, in 4,078 cases and 5,223 controls drawn from five studies (Tables 1 and 2). Specifically, we examined 16 promising novel regions, each with one SNP, based on the lowest p values of the preliminary data build. Two novel regions were examined with two SNPs apiece. In a region of 3p24.1, two SNPs, rs724244 and rs4973768, separated by 170 kb (r2 = 0.35), each had low p values. In region 1p34.2, because of difficulty in the assay design, two SNPs separated by 40 kb and in strong LD (r2 = 0.88) were selected. In the region of the two SNPs in 5p12, in which rs4415084 and rs10941679 were recently reported by Stacey et al., we advanced two more SNPs, rs7716600 and rs2067980, separated by 100 kb (r2 = 0.50) (Figure 1)13. Thus, the 5p12 region was explored with four SNPs. For Stage 3, rs3817198 in LSP1 was also added to the set because of a prior publication10.
The results of Stage 3 are remarkable for only four SNPs. Two novel SNPs, rs11249433 in the pericentromeric region of chromosome 1, and rs999737 in the candidate gene, RAD51-like 1 gene (RAD51L1) on chromosome 14q24.1, reached genome-wide significance in the combined analysis of all three stages (Table 3). Two of the SNPs in 5p12, rs7716600 and rs4415084, confirmed the previously reported signals.
The results of a combined joint adjusted analysis of the initial genome-wide scan plus two stages of follow-up provide conclusive statistical significance for an association with a novel marker, rs11249433, located in the pericentromeric region of the short arm of chromosome 1 (p = 6.74 × 10−10) (Figure 1 and Table 3). Pericentromeric regions are known to be recombination-poor, and thus it is not surprising that rs11249433 maps to a large block of linkage disequilibrium. The boundaries of the block are difficult to determine for two reasons: (1) its close proximity to the centromere and (2) the presence of a SNP desert of approximately 220 kb immediately distal to the block (Figure 2A). The block contains several pseudogenes and a member of the highly paralogous low affinity Fc gamma receptor family, FCGR1B. Distal to the SNP desert is the promoter of NOTCH2, a gene recently shown to be associated with type 2 diabetes20. Some epidemiological studies have suggested an association between type 2 diabetes and post-menopausal breast cancer21. Further mapping and subsequent functional work are required to provide plausibility for the association signal observed with rs11249433.
The second novel marker, rs999737 (p = 1.74 × 10−7) (Table 3), lies in RAD51L1 (also known as RAD51B) on chromosome 14q24.1, a gene in a prior candidate pathway for breast cancer susceptibility, the double-strand break repair/homologous recombination pathway. The SNP maps to a 70 kb LD block defined by two recombination hotspots and is entirely contained within intron 12 of the gene (Figure 2B and Supplemental Figure 1). Its gene product is one of five paralogs that interact directly with the product of the RAD51 gene, which catalyzes key reactions in homologous recombination22. A polymorphism in the 5’UTR of RAD51 has recently been identified as a genetic modifier of outcome in women with deleterious BRCA2 mutations23. A copy number variation on chromosome 14q24.1 that includes RAD51L1 has been observed repeatedly in pedigrees with Li-Fraumeni syndrome, suggesting a possible contribution of this locus to the spectrum of cancers (that includes breast cancer) observed in this hereditary syndrome24. Further work is warranted to dissect the genetic signal and investigate potential functional variants.
Tumor estrogen receptor (ER) status was available for 6,386 cases25. Figure 3 shows the results of the analysis for the two novel SNPs, rs11249433 (chromosome 1) and rs999737 (chromosome 14) by estrogen receptor status. The association with rs11249433 is more apparent for ER+ compared to ER− breast cancer (Supplementary Tables 2, 3 and 4). The observed difference was significant in a case/case comparison (trend p value = 0.001), suggesting that the chromosome 1 locus could be more important in ER+ breast cancer susceptibility. Although there was also some evidence for a stronger association with ER+ disease for the chromosome 14 SNP, rs999737, it was not significant (trend p value = 0.20). An analysis stratified by age did not demonstrate any significant differences for the two SNPs, though it should be emphasized that the majority of cases are post-menopausal women.
Given the initial genome coverage of the CGEMS study using the Illumina HumanHap500 platform and the number of cases and controls investigated, it is unlikely that many more common loci with relative risks comparable to FGFR2 will be discovered for the European population. The present study has confirmed strong association signals for 6 genomic regions previously reported and identified novel associations at genome-wide significance for markers on chromosome 1p11.2 and 14q24.1. In addition, we provide supportive evidence for two loci, previously associated with genome-wide significance, namely, 2p24.1 (CASP8) and 11p15.5 (LSP1). Though the direction and magnitude of the association signal is consistent with prior reports, our study indicates that larger data sets are required to identify at genome-wide significance levels loci with smaller estimated per allele effect sizes, especially SNPs with low MAF or for which the per allele OR is estimated to be 1.1 or less. Moreover, our study suggests the value of combining scans for discovery with subsequent follow-up in large data sets, such as CGEMS and Breast Cancer Association Consortium (BCAC)9–11. The individual genotype data for the Stage 1 CGEMS GWAS in 1,145 cases and 1,142 controls, and the aggregate data for Stages 1, 2 and 3 are available to researchers registered after approval by the NCI Data Access Committee (DAC) through the CGEMS portal (http://cgems.cancer.gov).
To date, GWAS for breast cancer have been conducted largely among women of European ancestry, mainly with ER+ tumors. Well-designed scans in other populations should yield additional loci, some of which could be population-specific. Additional scans of ER− tumors will be needed to find loci specific to this subtype. Together these findings should accelerate the effort to dissect the genetic signals observed in multi-stage GWAS in an effort to nominate variants for further investigation of their biological basis. The evidence for two new associations presented in this study pinpoints genomic regions that could elucidate novel etiologic pathways contributing to the development of breast cancer. Carriage of the multiple loci reported so far, together with additional loci to be identified in follow-up of this and other studies, should refine estimates of the increased risk of sporadic breast cancer associated with inherited genetic loci, although the clinical utility of these estimates has yet to be determined26,27.
Briefly, this study reports follow-up genotyping based on the previously reported genome-wide scan conducted in the prospective Nurses’ Health Study using the HumanHap500 Infinium assay (Illumina) in 1,145 women with post-menopausal breast cancer and 1,142 controls; the details are reported elsewhere11. Quality control metrics included removal of samples with call rates under 90% and SNP assays with call rates under 95%. Subjects with more than 15% admixture of non-European background were removed from the analysis.
In Stage 2, we genotyped 30,278 SNPs in four follow-up studies of women of European background with breast cancer totaling 4,547 cases and 4,434 controls drawn from the American Cancer Society Cancer Prevention Study II, the Prostate, Lung, Colon and Ovarian Screening Trial, part of the available Polish Breast Cancer Study and the observational arm of the Women’s Health Initiative. In Stage 3, we genotyped 24 SNPs in 4,078 cases of breast cancer in women of European background and 5,223 controls drawn from the CONOR Norwegian cohort, the remaining cases and controls of the Polish Breast Cancer Study, the U.S. Radiologic Technologists Study, the Nurses’ Health Study II, and the Women’s Health Study. These studies were approved by the appropriate institutional review boards.
In Stages 2 and 3, we genotyped 18,282 unique subjects (excluding validation samples and study duplicates) passing sample handling quality control metrics in the Core Genotyping Facility of the National Cancer Institute. For NHS II and WHS, the 24 SNPs of Stage 3 were genotyped at the DF/HCC Genotyping Core at the Harvard School of Public Health, Boston, MA. Stage 2 was genotyped using a custom-designed iSelect assay from Illumina with content described above; 9,804 samples were attempted (including known duplicates). As quality control measures, samples with call rates under 90% and SNPs with call rates under 95% were removed. Fitness for Hardy-Weinberg proportion was assessed for each SNP in unique control subjects only but was not used to exclude SNP assays (see Supplemental Methods). In Stage 3, we genotyped 9,301 unique subjects for 24 TaqMan assays (ABI) selected on the criteria described above using custom designed assays that were subsequently optimized in the SNP500Cancer initiative.
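The call-rate filters and the Hardy-Weinberg check described above can be sketched as follows. The 90% and 95% thresholds come from the text; everything else is an assumption made for illustration, including the data layout, the order of filtering (samples first, then SNPs), the use of a 1-df chi-square test for Hardy-Weinberg proportion, and all function names.

```python
import math

SAMPLE_MIN_CALL_RATE = 0.90  # samples under 90% are removed (from the text)
SNP_MIN_CALL_RATE = 0.95     # SNPs under 95% are removed (from the text)


def call_rate(calls):
    """Fraction of non-missing genotype calls; None marks a failed call."""
    return sum(c is not None for c in calls) / len(calls)


def filter_by_call_rate(genotypes):
    """genotypes maps sample id -> list of calls (0, 1, 2 or None) per SNP.

    Returns (kept sample ids, kept SNP indices). Filtering samples before
    SNPs is an assumption; the text does not specify the order.
    """
    kept_samples = [s for s, calls in genotypes.items()
                    if call_rate(calls) >= SAMPLE_MIN_CALL_RATE]
    if not kept_samples:
        return [], []
    n_snps = len(genotypes[kept_samples[0]])
    kept_snps = [j for j in range(n_snps)
                 if call_rate([genotypes[s][j] for s in kept_samples])
                 >= SNP_MIN_CALL_RATE]
    return kept_samples, kept_snps


def hwe_chi2(n_aa, n_ab, n_bb):
    """1-df chi-square test of Hardy-Weinberg proportion from genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square distribution with 1 df.
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical toy data: s2 fails the sample filter, SNP 0 fails the SNP filter.
toy = {"s1": [0] * 20, "s2": [None] * 3 + [1] * 17, "s3": [None] + [2] * 19}
print(filter_by_call_rate(toy))
print(hwe_chi2(250, 500, 250))
```

In line with the text, the Hardy-Weinberg result is reported rather than used as an exclusion criterion in this sketch.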
A small fraction (less than 2%) of subjects who were successfully genotyped in Stage 2 were excluded from analysis due to one of the following reasons: 1. Unanticipated interstudy or intrastudy duplicates; 2. Unanticipated non-European admixture of greater than 20% (e.g., African or East Asian; notably, in Stage 1, the threshold for non-European admixture was 15%); and/or 3. Incomplete covariate data.
In Stage 2, a total of 16,715 discordant genotypes were detected out of a possible 7,255,923 genotype comparisons (237 duplicate pairs and one triplicate) yielding a discordance rate of 0.23%. Infinium cluster plots for notable SNPs are included in Supplemental Methods.
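The discordance rate quoted above follows directly from the two reported counts, as this short arithmetic check shows (the function name is hypothetical):

```python
def discordance_rate(discordant, comparisons):
    """Fraction of duplicate genotype comparisons that disagree."""
    return discordant / comparisons


# Counts as reported for the Stage 2 duplicate samples.
rate = discordance_rate(16_715, 7_255_923)
print(f"discordance: {rate:.2%}")      # discordance: 0.23%
print(f"concordance: {1 - rate:.2%}")  # concordance: 99.77%
```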
For the 24 SNPs analyzed in Stage 3, we validated genotype calls determined by Infinium HumanHap500 and custom iSelect assay by comparing TaqMan results in the entire Polish Breast Cancer Study. 1,110 samples were genotyped with both platforms and the overall concordance rate was 99.52% (see Supplemental Materials for results).
For the follow-up replication studies, all one-SNP analyses were conducted using unconditional logistic regression, adjusted for age in ten-year intervals and study. For Stages 1 and 2, four continuous covariates were included to account for population heterogeneity based on principal component analysis of genotype correlations. Separate analyses were conducted according to the individual studies, the pooled replication studies in Stage 2 and Stage 3 and for all studies combined. Genotype effects were modeled individually, and a single-SNP score test with two degrees of freedom was computed. To enable comparison with other published GWAS, a Cochran-Armitage trend test was also performed. To explore a possible difference in effect between estrogen receptor-positive and estrogen receptor-negative breast cancer, separate analyses were conducted for ER+ and ER− cases, using a trend test with 1 degree of freedom.
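The Cochran-Armitage trend test mentioned above can be sketched from genotype counts alone. This is a generic, unadjusted textbook form with additive weights (0, 1, 2), not the covariate-adjusted analysis performed in the study; the function name and counts are hypothetical.

```python
import math


def cochran_armitage_trend(case_counts, control_counts, weights=(0, 1, 2)):
    """1-df Cochran-Armitage trend test on a 2 x 3 genotype table.

    Counts are ordered by number of variant alleles (0, 1, 2). Assumes the
    SNP is polymorphic in the sample, so the variance is nonzero.
    """
    r, s = list(case_counts), list(control_counts)
    col = [ri + si for ri, si in zip(r, s)]  # genotype column totals
    n_cases, n_controls, n = sum(r), sum(s), sum(col)

    # Weighted difference between observed case and control counts.
    t = sum(w * (ri * n_controls - si * n_cases)
            for w, ri, si in zip(weights, r, s))

    # Variance of the statistic under the null hypothesis of no association.
    var = sum(w * w * c * (n - c) for w, c in zip(weights, col))
    for i in range(len(col)):
        for j in range(i + 1, len(col)):
            var -= 2 * weights[i] * weights[j] * col[i] * col[j]
    var *= n_cases * n_controls / n

    chi2 = t * t / var
    # Upper tail of a chi-square distribution with 1 df.
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical counts, for illustration only (not study data).
chi2, p = cochran_armitage_trend((300, 500, 200), (360, 480, 160))
print(f"chi2 = {chi2:.3f}, p = {p:.5f}")
```

Because it spends a single degree of freedom on a per-allele effect, the trend test is more powerful than the 2-df genotype test when risk increases roughly additively with each copy of the allele, and less so when the true model is strongly dominant or recessive.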
We used GLU (Genotyping Library and Utilities version 1.0), a suite of tools available as an open-source application for management, storage and analysis of GWAS data. The STRUCTURE and EIGENSTRAT programs were used to assess population heterogeneity (see URLs below).
CGEMS portal: http://cgems.cancer.gov/
The Nurses’ Health Studies are supported by NIH grants CA 65725, CA87969, CA49449, CA67262, CA50385 and 5UO1CA098233. The authors thank Barbara Egan, Lori Egan, Helena Judge Ellis, Hardeep Ranu, and Pati Soule for assistance, and the participants in the Nurses’ Health Studies.
The WHI program is supported by contracts from the National Heart, Lung and Blood Institute, NIH. The authors thank the WHI investigators and staff for their dedication, and the study participants for making the program possible. A full listing of WHI investigators can be found at http://www.whi.org
The ACS study is supported by UO1 CA098710. We thank Cari Lichtman for data management and the participants on the CPS-II. The U.S. Radiologic Technologists Study (USRT) is supported by the Intramural Research Program of the Division of Cancer Epidemiology and Genetics, National Cancer Institute, NIH, DHHS.
The PLCO study is supported by the Intramural Research Program of the Division of Cancer Epidemiology and Genetics and contracts from the Division of Cancer Prevention, National Cancer Institute, NIH, DHHS. The authors thank Dr Philip Prorok, Division of Cancer Prevention, National Cancer Institute; the Screening Center investigators and staff of the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial (PLCO) and Mr. Tim Sheehy, and staff at SAIC-Frederick. Most importantly, we acknowledge the study participants for their contributions to making this study possible. The authors thank the radiologic technologists who participated in the study; Jerry Reid of the American Registry of Radiologic Technologists for continued support of the study; Diane Kampa and Allison Iwan of the University of Minnesota for study coordination and data collection; Dr. Bill Kopp and staff at SAIC-Frederick for biospecimen processing; and Laura Bowen of Information Management Systems for data management.