Breast cancer is the most common cancer among women. Common variants at 27 loci have been identified as associated with susceptibility to breast cancer, and these account for ~9% of the familial risk of the disease. We report here a meta-analysis of 9 genome-wide association studies, including 10,052 breast cancer cases and 12,575 controls of European ancestry, from which we selected 29,807 SNPs for further genotyping. These SNPs were genotyped in 45,290 cases and 41,880 controls of European ancestry from 41 studies in the Breast Cancer Association Consortium (BCAC). The SNPs were genotyped as part of a collaborative genotyping experiment involving four consortia (Collaborative Oncological Gene-environment Study, COGS) and used a custom Illumina iSelect genotyping array, iCOGS, comprising more than 200,000 SNPs. We identified SNPs at 41 new breast cancer susceptibility loci at genome-wide significance (P < 5 × 10−8). Further analyses suggest that more than 1,000 additional loci are involved in breast cancer susceptibility.
Breast cancer is the most commonly occurring malignancy among women, with an estimated 1 million new cases and over 400,000 deaths annually worldwide1. Familial aggregation and twin studies have shown the substantial contribution of inherited susceptibility to breast cancer2,3. Many genetic loci are known to contribute to this familial risk, including genes with high-penetrance mutations (notably BRCA1 and BRCA2), moderate-risk alleles in genes such as ATM, CHEK2 and PALB2, and common lower penetrance alleles, of which 27 have been identified so far, principally through genome-wide association studies (GWAS)4–16. In total, these loci explain approximately 30% of the familial risk of breast cancer15. Global analysis of GWAS data suggests that a substantial fraction of the residual aggregation can be explained by other common variants not yet identified, but the relative contributions of common and rare variants are still uncertain.
To identify additional susceptibility loci for breast cancer, we first conducted a meta-analysis of 9 breast cancer GWAS in populations of European ancestry, including 10,052 cases and 12,575 controls (Supplementary Table 1). From this analysis, we selected 35,084 SNPs on the basis of evidence of association with breast cancer, derived from a 1-degree-of-freedom trend test, a test weighted for family history, a 2-degrees-of-freedom test and subset analyses based on cases of breast cancer diagnosed before 40 years of age and before 50 years of age (Online Methods). In particular, we were able to select all SNPs or surrogate SNPs with 1-degree-of-freedom Ptrend < 0.008. To evaluate these SNPs, we then designed a custom Illumina iSelect genotyping array (iCOGS) in collaboration with three other consortia studying, in addition to breast cancer risk, susceptibility to ovarian cancer, prostate cancer and breast and ovarian cancers in BRCA1 and BRCA2 mutation carriers (COGS)17–20. The array included, in addition to SNPs selected from GWAS, SNPs selected for fine mapping of known susceptibility loci, functional candidate SNPs and SNPs related to other traits (Online Methods and Supplementary Note). The iCOGS array comprised 211,155 SNPs. These arrays were used to genotype 114,255 DNA samples from 52 studies participating in BCAC (Supplementary Table 2). After quality control exclusions (Online Methods and Supplementary Table 3), data were obtained for 199,961 SNPs in 52,675 cases and 49,436 controls. The analyses presented here are based on data from subjects of European ancestry (45,290 cases and 41,880 controls from 41 studies) and focus on the 29,807 SNPs that were selected on the basis of the GWAS analysis, were successfully genotyped and were not located in regions previously known to be associated with breast cancer.
The association between each SNP and breast cancer risk was tested using a 1-degree-of-freedom trend test adjusted for study and seven principal components (Online Methods). There was some evidence for inflation in the test statistics, detected using data from 22,897 uncorrelated SNPs on iCOGS not selected on the basis of breast cancer risk (λ = 1.20, λ1000 = 1.005; Supplementary Fig. 1a). There was, however, clear evidence of an excess of statistically significant associations among the SNPs selected from the GWAS analysis (Table 1 and Supplementary Fig. 1b). Although some excess was also observed among the SNPs not selected from the breast cancer GWAS, the excess of statistically significant associations was much more marked among the GWAS SNPs at all levels of statistical significance. In addition, of 21,128 SNPs not selected for breast cancer association that were also present in the combined GWAS data set, 10,864 (51%) had effects in the same direction in the GWAS and iCOGS data, and, for these SNPs, inflation was 1.26 (λ1000 = 1.007) compared with 1.14 (λ1000 = 1.0035) for SNPs with effects in opposite directions in the two stages. A similar direction of effect was seen for these SNPs in the combined GWAS (λ = 0.87 for SNPs with effects in the same direction versus λ = 0.79 for SNPs with effects in the opposite direction, with inflation being <1 because SNPs showing evidence of association were excluded). Taken together, these results suggest that much of the inflation in the test statistics for SNPs not selected for breast cancer association is also due to the effect of true associations. Moreover, some of the excess of statistically significant associations seen in the SNPs not selected for breast cancer association was due to SNPs close to breast cancer–associated SNPs. 
For example, of the 45 SNPs with significant association at P < 0.00001, 21 were within 1 Mb of one of the breast cancer loci newly identified at our genome-wide significance threshold. Taken together, these results strongly suggest that most of the excess of significant associations for the GWAS-selected SNPs reflects true associations.
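The relationship between λ and λ1000 quoted above can be illustrated with a short sketch. The text does not spell out the rescaling formula, so the version below is an assumption: it uses the conventional standardization of the inflation factor to a study of 1,000 cases and 1,000 controls, and it assumes that the European-ancestry sample sizes (45,290 cases and 41,880 controls) apply to the SNPs in question. The function name is hypothetical.

```python
def lambda_1000(lam, n_cases, n_controls):
    """Rescale a genomic inflation factor to 1,000 cases and 1,000 controls.

    Assumed conventional formula (not stated in the text): the excess
    inflation (lam - 1) is scaled by the ratio of
    (1/n_cases + 1/n_controls) to (1/1000 + 1/1000).
    """
    scale = (1.0 / n_cases + 1.0 / n_controls) / (1.0 / 1000 + 1.0 / 1000)
    return 1.0 + (lam - 1.0) * scale


# Overall inflation reported for SNPs not selected for breast cancer risk.
value = lambda_1000(1.20, 45290, 41880)
print(round(value, 3))
```

Under these assumptions, λ = 1.20 rescales to a value that rounds to the reported λ1000 of 1.005, which shows how modest the inflation is once the very large sample size is taken into account.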
Of the 27 previously established breast cancer–associated loci, all but 4 showed clear evidence of association with overall breast cancer risk in the iCOGS stage (P = 2.2 × 10−5 – P = 5.9 × 10−125; Supplementary Table 4). Three loci showed weaker evidence for association: rs1045485, encoding an Asp302His variant in CASP8, whose association was previously identified in a candidate gene study (P = 0.054 in the iCOGS stage; P = 0.0013 in combined data from the GWAS and iCOGS stages)21; rs2380205 at 10p15, identified in a GWAS but suggested to be a possible false positive association in a previous BCAC analysis22,23 (iCOGS P = 0.075; combined P = 0.0021); and rs8170 at 19p13.1, for which the association has been shown to be specific to estrogen receptor (ER)-negative breast cancer24 (P = 0.0027 in iCOGS; combined P = 0.0012). One locus, rs2284378 at 20q11, recently shown to be associated with ER-negative breast cancer, was not selected for the iCOGS array16.
When the results from the GWAS and the iCOGS array were combined, 263 SNPs in 37 new regions had associations that reached P < 5 × 10−8 (Fig. 1, Table 2 and Supplementary Figs. 2 and 3). In four regions (5q11.2, 8q21.11, 10p12.31 and 18q11.2), this set of SNPs included SNPs within 1 Mb of each other that were uncorrelated, such that a second SNP was associated with disease after adjustment for the most significantly associated SNP (Supplementary Fig. 4 and Supplementary Table 5). There was little or no evidence for heterogeneity in the per-allele odds ratios (ORs) among studies for any SNP (per-SNP I2 and P values are given in Supplementary Fig. 2 and Supplementary Table 6). Genotype-specific OR estimates were consistent with a log-additive (allele dose) model for most SNPs, with the exception of three SNPs (rs616488, rs204247 and rs720475) for which heterozygotes had an OR similar to that of homozygotes for the high-risk allele and two SNPs (rs11242675 and rs6472903) that were more consistent with a recessive model (Supplementary Table 6). Consistent with the pattern seen for previously established loci, there was strong evidence for specificity of the association to tumor subtype. For 13 of the loci, the per-allele OR was higher for ER-positive disease than for ER-negative disease (case-only P < 0.05), in most instances with little or no evidence of an association with ER-negative disease (based on data from 7,465 ER-negative cases and 27,074 ER-positive cases; Supplementary Table 7a). The most notable differences were for SNP rs6828523 at 4q34.1 (ER-positive OR = 0.87 (95% confidence interval (CI) = 0.84–0.90); ER-negative OR = 1.01 (95% CI = 0.96–1.07); P for difference = 1.2 × 10−7) and for rs7072776 at 10p12.31, where the estimated effects were in opposite directions (ER-positive OR = 1.09 (95% CI = 1.06–1.12); ER-negative OR = 0.94 (95% CI = 0.90–0.98); P for difference = 3.1 × 10−10). 
No such difference was observed for the neighboring SNP rs11814448, which was associated with both ER-positive and ER-negative disease in the same direction. For one locus, SNP rs17817449 on chromosome 16, the association was stronger for ER-negative than for ER-positive disease (P for difference = 0.039). All SNPs showed comparable ORs for invasive and in situ disease (based on data from 2,335 ductal carcinoma in situ, DCIS, and 42,118 invasive cases), with the exceptions of rs12493607 and rs3903072, for which associations seemed to be restricted to invasive disease (Supplementary Table 7b). Two loci (rs2588809 at 14q24.1 (P = 0.001) and rs941764 at 14q32.12 (P = 0.007)) showed higher per-allele ORs for cases diagnosed at a young age (Supplementary Table 7c). Consistent with the predictions of a polygenic model of susceptibility25, for 35 of the 41 loci, the estimated OR was higher when restricted to cases with a positive family history for disease (significant at P < 0.05 for 5 loci), whereas for only 6 loci was the OR lower when restricted to cases with a positive family history (Supplementary Table 7d).
Four of the newly associated loci (rs16857609 at 2q35, rs10759243 at 9q31, rs11199914 at 10q26 and rs2588809 at 14q24) lie close to regions previously associated with breast cancer risk. In each locus, however, the lead SNP was not correlated with the most strongly associated previously known SNP, and the association of the new SNP remained similarly statistically significant after adjustment for the previously associated SNP (Supplementary Table 5). In the case of rs2588809, which lies in RAD51B (also known as RAD51L1), the association was markedly stronger for ER-positive disease (P = 0.011; Supplementary Table 7a), whereas the previously associated SNPs (rs999737 and rs10483813), which lie ~370 kb telomeric, are associated with similar ORs for both ER-positive and ER-negative disease26.
Two associated loci lie within or close to known breast cancer susceptibility genes. rs11571833 is a polymorphic variant in BRCA2 that introduces a premature stop codon (p.Lys3326*), previously reported to have no association with breast cancer risk27. The results from the current study, however, indicate that this variant is associated with a modestly higher risk of breast cancer. Further work will be required to determine whether this association is due to a higher risk variant or variants in linkage disequilibrium (LD). SNP rs132390 at 22q12 lies within an intron of EMID1 but is ~500 kb upstream of CHEK2, raising the possibility that this association is mediated through the latter. CHEK2 c.1100delC, the major deleterious CHEK2 variant in European populations28, occurs more frequently in association with the risk allele at rs132390 (r2 = 0.06); however, the association between rs132390 and breast cancer risk persisted after adjustment for CHEK2 c.1100delC, although it was attenuated (unadjusted OR in iCOGS = 1.12, P = 5.9 × 10−6; adjusted OR = 1.09, P = 0.04).
In addition to rs11571833, one further SNP is a coding variant: rs11552449 encodes a missense substitution p.His61Tyr in DCLRE1B (also known as SNM1B), an evolutionarily conserved gene involved in DNA stability and the repair of interstrand cross-links29. The remaining loci are either intronic (20) or intergenic (19). Two loci lie within genes previously proposed as candidate breast cancer susceptibility genes. SNP rs12493607 lies in intron 2 of TGFBR2. An analysis of genes in the transforming growth factor (TGF)-β signaling pathway in European populations found weak evidence of an association between rs4522809 and breast cancer risk (P = 0.02)30. This SNP is weakly correlated with rs12493607 (r2 = 0.25) and also showed some evidence of association in our study, although weaker than that seen for rs12493607 (iCOGS P = 0.00096; combined analysis of GWAS and iCOGS P = 0.0029). A similar analysis of candidate SNPs in Asian populations identified SNP rs1078985 as a potential breast cancer susceptibility variant31. This variant, however, was uncorrelated with rs12493607 in Europeans and showed no evidence of association in our study (P = 0.33 in the iCOGS stage). SNP rs7904519 lies in intron 4 of TCF7L2. A previous candidate gene study found weak evidence for an association between a correlated SNP, rs12255372, associated with type 2 diabetes (r2 = 0.37 with rs7904519), and familial breast cancer (P = 0.04)32.
The identification of the genes and variants underlying these associations will require more detailed fine mapping and functional analysis. Nevertheless, it is possible to discern some patterns. We identified 53 genes within 50 kb of the lead SNPs in the newly associated regions, totaling 96 genes when including the previously known loci. Analysis using Ingenuity Systems Pathway Analysis (IPA) identified an excess of genes reported to be involved in tumorigenesis (34 genes; P = 0.0005), breast cancer (15 genes; P = 2 × 10−5) and tumor incidence in model systems (10 genes; P = 2 × 10−7). The most consistently over-represented functions were cell death (P = 0.0028), differentiation (P = 2 × 10−5) and expression (P = 2 × 10−8).
Three loci are located in the vicinity of susceptibility regions for other cancer types. SNP rs11780156 lies ~400 kb downstream of MYC. Previous GWAS have identified multiple loci upstream of MYC that are associated with different cancer types, including a locus for breast cancer. Functional studies have indicated that these associations might be mediated through transcriptional regulation of MYC. The newly associated locus is ~300 kb centromeric to a previously reported susceptibility locus for ovarian cancer, rs10088218, but is uncorrelated with it (r2 = 0.02, based on data from European subjects in BCAC), raising the possibility that these loci might also be regulating MYC33. SNP rs9790517 at 4q24 lies ~20 kb away from SNP rs7679673, previously reported to be associated with prostate cancer34, and is correlated with it (r2 = 0.53). SNP rs9790517 lies in intron 11 of TET2, which encodes a methylcytosine dioxygenase involved in myelopoiesis. Mutations in TET2 are frequent in hematological malignancies but have also been reported in 2 of 47 breast tumors in the Catalogue of Somatic Mutations in Cancer (COSMIC) database. In addition, Pharoah et al.18 have found an association between rs1243180 and ovarian cancer. This SNP is ~120 kb telomeric to rs7072776 and is partially correlated with it (r2 = 0.51); both SNPs and the neighboring breast cancer–associated locus rs11814448 lie within the region 400 kb upstream of DNAJC1.
To further investigate the likely genes underlying the susceptibility variants, we examined associations between the lead SNPs and the RNA expression of neighboring genes in 473 primary breast tumors and 61 normal breast tissue samples in The Cancer Genome Atlas (TCGA) database. We found strong evidence for an association between rs616402 (a surrogate for rs616488; r2 = 0.66) and expression of PEX14 in both tumor (P = 4.7 × 10−12) and normal tissue (P = 0.00018; Supplementary Table 8), between rs3760983 (a surrogate for rs3760982; r2 = 1) and expression of both ZNF404 (P = 1.2 × 10−6 in tumors) and ZNF283 (P = 0.0089) and between rs3903072 and expression of CTSW (P = 4.9 × 10−5). SNP rs3760982 was also found to be associated with the expression of ZNF45 (P = 0.0077), ZNF283 (P = 0.05) and ZNF222 (P = 0.01) in lymphoblastoid cell lines from HapMap samples using the Genevar database35 (Supplementary Table 8c). After adjustment for the SNP in the region most strongly associated with expression, SNP rs616488 and PEX14 (P = 0.0071) as well as rs1217396 (a proxy for rs11552449) and PTPN22 (P = 0.0055) and DCLRE1B (P = 0.0067) reached nominal significance at P < 0.01 (Supplementary Table 8a). Although none of these passed Bonferroni correction for multiple testing, the three associations found exceeded the number expected by chance with 46 associations tested. This supports some transcriptional effect from the risk-associated SNPs. PEX14 is involved in peroxisome organization and protein and transmembrane transport; mutations in PEX14 have been associated with Zellweger syndrome36. The functions of ZNF45, ZNF222 and ZNF283 are unknown but may involve transcriptional regulation.
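The comparison with chance in the paragraph above can be made concrete with a small calculation. This is only a sketch: it assumes that the 46 tests are independent and that each has a 1% chance of reaching P < 0.01 under the null, neither of which is stated in the text, and the function name is hypothetical.

```python
from math import comb


def prob_at_least(k, n_tests, alpha):
    """Binomial probability of k or more hits among n_tests independent
    tests, each with false-positive probability alpha."""
    below_k = sum(
        comb(n_tests, i) * alpha**i * (1 - alpha) ** (n_tests - i)
        for i in range(k)
    )
    return 1.0 - below_k


n_tests, alpha = 46, 0.01
expected_hits = n_tests * alpha              # hits expected by chance alone
p_three_or_more = prob_at_least(3, n_tests, alpha)

print(f"expected by chance: {expected_hits:.2f}")
print(f"P(3 or more hits):  {p_three_or_more:.3f}")
```

Under these assumptions, fewer than one nominally significant association would be expected by chance, so observing three is unlikely under the null, although correlation among the tested SNP-gene pairs would weaken this argument.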
In addition to the genes described above, plausible candidate genes exist in several of the newly associated regions. MUS81 at 11q13 has a key role in the maintenance of genomic stability and in DNA repair pathways37,38, and the cofilin gene (CFL1) is required for tumor cell motility and invasion, particularly in mammary tumors39,40. Several other genes have been associated with tumor aggressiveness; these include PTH1R at 3p21, FOXQ1 at 6p25, ARHGEF5 at 7q35 and MKL1 at 22q13. PTH1R is the receptor for PTHLH, encoded by a previously identified breast cancer susceptibility locus15. PTHLH is required for normal mammary gland function and has been shown to be involved in the metastasis of breast cancer cells to bone41,42. FOXQ1 encodes a transcription factor with a key role in cell proliferation and migration and in breast cancer metastasis43. Alterations in its expression level induce mesenchymal-epithelial transition44. Dysfunctional ARHGEF5 acts as an oncogene specific for human breast tissue, with a crucial role in tumorigenesis and metastasis in breast cancer45. MKL1 is also involved in tumor cell invasion and metastasis, particularly in human breast carcinoma46. Two of the newly associated SNPs lie within the TCF7L2 and FTO genes, previously associated with type 2 diabetes and/or obesity through GWAS47–49. TCF7L2 acts as a proto-oncogene and is involved in the Wnt pathway and in tumor formation50. PAX9 at 14q13.3 encodes a transcription factor that regulates cell proliferation, migration and resistance to apoptosis51,52. SSBP4 is involved in DNA recombination and repair and has been suggested to have tumor suppressor activity53,54. The expression of KREMEN1 at 22q12.1 is lower or absent in human tumors compared to normal tissue55,56. This gene encodes a negative regulator of the Wnt/β-catenin pathway, which has a key role in cell fate determination, stem cell regulation and cell differentiation and proliferation. 
It has been suggested that lack of KREMEN1 would activate the Wnt/β-catenin pathway, thereby enhancing susceptibility to tumorigenesis55,56. Finally, NTN4 at 12q22 encodes a secreted growth factor that regulates tumor growth. High levels of NTN4 have been found in ER-positive but not ER-negative breast tumors57. NTN4 expression in tumors has also been suggested as a potential prognostic marker for breast cancer57.
On the assumption that the risks conferred by common susceptibility loci combine multiplicatively (no interaction on a log-additive scale) and on the basis of the per-allele OR estimates from the iCOGS stage, we determined that the 41 newly associated loci explain approximately 5% of the familial risk of breast cancer. However, the overall excess of significant associations for SNPs selected from the breast cancer GWAS for genotyping in the iCOGS stage suggests that a much larger number of loci contribute to susceptibility, although they did not have associations reaching genome-wide levels of significance in the current study. To assess this hypothesis more formally, we identified a set of 10,668 SNPs selected from the GWAS that were uncorrelated (r2 < 0.1 between any pair). Of these, the estimated OR was in the same direction as in the combined GWAS for 5,918 SNPs and in the opposite direction for 4,750 SNPs. Assuming that SNPs with effects in opposite directions are not associated with risk, an estimated 1,168 loci selected from the GWAS are associated with risk. However, this is an underestimate because weakly associated SNPs might have effects in opposite directions in the two stages. As an alternative approach, we fitted the distribution of z scores for the iCOGS stage, aligned to the direction of the effect in the GWAS, as a mixture of two normal distributions representing those SNPs that were or were not associated with disease (Fig. 2 and Online Methods)58. On the basis of the posterior probabilities from this analysis, an estimated 92% of loci (n = 9,815) were associated with breast cancer risk (95% CI = 85–100%), and these contributed approximately 18% of the familial risk of breast cancer. It should be noted, however, that the large majority of the loci had very small individual effects on risk: for example, the estimated OR was >1.05 for only 10 loci, and 920 loci had an estimated OR of >1.02. 
When taking into account effects from the previously known loci, these analyses suggest that ~28% of familial risk is explained by common variants selected for iCOGS, of which ~14% can be explained by the 67 established loci (with a further ~20% due to higher penetrance loci).
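The concordance argument above reduces to simple arithmetic, sketched here. The reasoning assumed is that a SNP with no true effect is equally likely to show the same or the opposite direction in the two stages, so the number of null SNPs is roughly twice the number of discordant SNPs; the variable names are illustrative only.

```python
# Counts reported in the text for uncorrelated GWAS-selected SNPs.
n_same = 5918       # effect in the same direction in GWAS and iCOGS
n_opposite = 4750   # effect in opposite directions

n_total = n_same + n_opposite

# If every discordant SNP is null, and null SNPs split evenly between the
# two directions, about 2 * n_opposite SNPs are null and the rest are
# associated with risk.
n_null = 2 * n_opposite
n_associated = n_total - n_null          # equals n_same - n_opposite

# The mixture-model estimate is quoted as a fraction of the same SNP set.
n_mixture = round(0.92 * n_total)

print(n_total, n_associated, n_mixture)
```

The first estimate is conservative by construction, because truly associated SNPs with small effects can appear discordant by chance and are then counted as null twice over; the mixture model avoids this by using the full distribution of z scores.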
To our knowledge, this is the largest genetic association study in cancer so far. The power of this approach is demonstrated by the fact that we have found evidence, at genome-wide levels of significance, for more than 40 new susceptibility loci, more than doubling the number of susceptibility loci for breast cancer. The effect sizes of the newly identified loci are generally modest (the highest OR was 1.26). However, the very high levels of statistical significance, the lack of heterogeneity among studies, the generally higher effect sizes for familial cases and the fact that most of the excess of significant associations was concentrated among SNPs selected on the basis of an association in the combined breast cancer GWAS all indicate that these are robust associations. Although the majority of the data are from populations of Northern and Western European ancestry, there was little or no evidence of heterogeneity in the OR estimates between studies, indicating that the associations apply broadly to populations of European ancestry. With more than 60 established breast cancer susceptibility loci, it is becoming possible to discern some more general patterns among the loci. Although most of the underlying genes and variants remain to be identified, there is a clear excess of genes either known to be involved in tumorigenesis in model systems or involved in processes relevant to cancer, such as cell death and differentiation. However, for other loci, such as PEX14, there is no obvious link to cancer susceptibility. Nine of the new loci lie in chromosomal regions with no known genes, suggesting that these may provide further examples of long-range regulation similar to that seen in the 8q24 region59. We have identified three additional examples of loci in the vicinity of susceptibility loci for other cancers (TET2, 8q24 and DNAJC1). 
These associations might reflect the tissue-specific regulation of key genes, and understanding the functional mechanisms underlying these associations may be particularly informative.
On the basis of the current set of loci and assuming that all loci combine multiplicatively, the currently known loci now define a genetic profile for which 5% of the female population has a risk that is ~2.3-fold higher than the population average and for which 1% of the population has a risk that is ~3-fold higher. However, the large excess of significant associations among the SNPs selected from the GWAS suggests that many more susceptibility loci exist that have not met our threshold for genome-wide-significant association in this study and that these explain a similar fraction of the heritability as the currently known loci. The observation, made by comparing effect sizes in the iCOGS stage with those in the GWAS, that a very large number of loci, perhaps several thousand, contribute to polygenic susceptibility to breast cancer is consistent with results from GWAS in other complex disorders such as schizophrenia, using a different analytical approach60. Incorporating these loci into risk models should substantially improve disease prediction, even if not all loci can be identified individually. Moreover, fine-scale mapping of the identified regions may uncover more of the missing heritability, either through identifying a more strongly associated variant (as found for the CCND1 locus; see French et al.61) or by identifying additional signals (exemplified for the TERT region in Bojesen et al.62). Genetic profiling using these common susceptibility loci in combination with rarer high-risk loci and other risk factors may provide a rational basis for targeted breast cancer prevention.
TCGA, http://cancergenome.nih.gov/; IPA, http://www.ingenuity.com/products/ipa; COSMIC, http://www.sanger.ac.uk/genetics/CGP/cosmic/; BCAC, http://ccge.medschl.cam.ac.uk/consortia/bcac/index.html; CIMBA, http://ccge.medschl.cam.ac.uk/consortia/cimba/index.html; OCAC, http://ccge.medschl.cam.ac.uk/consortia/ocac/index.html; PRACTICAL, http://ccge.medschl.cam.ac.uk/consortia/practical/index.html; COGS, http://www.cogseu.org/; iCOGS, http://ccge.medschl.cam.ac.uk/research/consortia/icogs/; Illumina GenCall, http://www.illumina.com/Documents/products/technotes/technote_gencall_data_analysis_software.pdf; SNAP, http://www.broadinstitute.org/mpg/snap/ldplot.php.
Primary genotype data were obtained for nine breast cancer GWAS in populations of European ancestry (Supplementary Table 1). Standard quality control was performed on all scans as follows. We excluded all individuals with low call rate (<95%) and extremely high or low heterozygosity (P < 1 × 10−5), as well as all individuals evaluated to be of non-European ancestry (>15% non-European component, as determined by multidimensional scaling using the HapMap version 2 CEU, JPT/CHB and YRI populations as a reference). We excluded SNPs with MAF < 1%; call rate < 95%; or call rate < 99% and MAF < 5% and all SNPs with genotype frequencies that departed from Hardy-Weinberg equilibrium at P < 1 × 10−6 in controls or P < 1 × 10−12 in cases. For highly significant SNPs, genotype intensity cluster plots were examined manually to judge reliability, either centrally or by contacting the original investigators.
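The SNP-level exclusion criteria for the GWAS scans can be expressed as a single predicate, sketched below. The thresholds are those given in the text; the `SnpStats` container and its field names are hypothetical and would depend on the summary files actually produced by each scan.

```python
from dataclasses import dataclass


@dataclass
class SnpStats:
    """Per-SNP summary statistics (hypothetical field names)."""
    maf: float              # minor allele frequency, as a proportion
    call_rate: float        # proportion of samples with a genotype call
    hwe_p_controls: float   # Hardy-Weinberg P value in controls
    hwe_p_cases: float      # Hardy-Weinberg P value in cases


def passes_gwas_qc(snp: SnpStats) -> bool:
    """Return True if the SNP survives the GWAS-stage exclusions."""
    if snp.maf < 0.01:
        return False
    if snp.call_rate < 0.95:
        return False
    # Stricter call-rate requirement for less common SNPs.
    if snp.call_rate < 0.99 and snp.maf < 0.05:
        return False
    if snp.hwe_p_controls < 1e-6 or snp.hwe_p_cases < 1e-12:
        return False
    return True


common_ok = SnpStats(maf=0.20, call_rate=0.97, hwe_p_controls=0.3, hwe_p_cases=0.5)
rare_low_call = SnpStats(maf=0.03, call_rate=0.97, hwe_p_controls=0.3, hwe_p_cases=0.5)
print(passes_gwas_qc(common_ok), passes_gwas_qc(rare_low_call))
```

The looser Hardy-Weinberg threshold in cases presumably reflects the fact that a true disease association can itself distort genotype frequencies among cases.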
Data were imputed for all scans for ~2.6 million SNPs with the HapMap version 2 CEU panel (Utah residents of Northern and Western European ancestry) as a reference, using the program MaCH v1.0. Imputation was conducted separately for each scan. Estimated per-allele ORs and standard errors were generated from the imputed genotypes using ProbABEL63. For two studies (UK2 and HEBCS), estimates were adjusted by the first three principal components, as this was found to materially reduce the inflation of test statistics. Residual inflation was then adjusted for by multiplying the variance by a genomic control adjustment factor, based on the ratio of the median χ2 test statistic to its expected value64. BBCS and UK2 used the same control data (WTCCC2) but different genotyping platforms. Data were imputed separately for these studies. For the combined analysis, the control set was divided randomly between the two studies, in proportion to the size of the case series, to provide disjoint strata. Overall significance tests for each SNP were performed using a fixed-effects meta-analysis; data were only included for a given study if the imputation accuracy r2 was >0.3.
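The genomic control adjustment and the fixed-effects combination described above can be sketched as follows. This is an illustration under stated assumptions, not the analysis code: it uses inverse-variance weighting as the fixed-effects method, never deflates the variance when the inflation factor is below 1 (a common convention, not stated in the text), and all function and variable names are hypothetical.

```python
import math
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control_lambda(chi2_stats):
    """Ratio of the observed median test statistic to its expected value."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


def fixed_effects_meta(studies, min_r2=0.3):
    """Inverse-variance meta-analysis of per-allele log odds ratios.

    studies: iterable of (beta, se, imputation_r2, gc_lambda) tuples.
    Studies with imputation r2 <= min_r2 are skipped for this SNP.
    Returns (beta, se, two_sided_p), or None if no study qualifies.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for beta, se, r2, gc_lambda in studies:
        if r2 <= min_r2:
            continue
        # Genomic control: inflate the variance by lambda (assumed floor of 1).
        variance = se * se * max(gc_lambda, 1.0)
        weight = 1.0 / variance
        weighted_sum += weight * beta
        weight_total += weight
    if weight_total == 0.0:
        return None
    beta = weighted_sum / weight_total
    se = math.sqrt(1.0 / weight_total)
    p = math.erfc(abs(beta / se) / math.sqrt(2.0))  # two-sided normal P value
    return beta, se, p


# Three hypothetical studies; the third is dropped for poor imputation.
example = [(0.10, 0.05, 0.9, 1.0), (0.20, 0.05, 0.8, 1.0), (0.90, 0.05, 0.2, 1.0)]
beta, se, p = fixed_effects_meta(example)
print(round(beta, 3), round(se, 4), p)
```

With equal standard errors the pooled estimate is the simple mean of the two retained studies, which makes the weighting easy to check by hand.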
Details of SNP selection for the iCOGS array are given in the Supplementary Note.
For the purpose of the BCAC analyses, we included SNPs on the basis of the analysis of the nine GWAS described above. We ranked SNPs on the basis of the results from five analyses: an overall 1-degree-of-freedom trend test; a 1-degree-of-freedom trend test giving a weight of 2 to those studies selecting cases for a positive family history (UK2, BBCS, DFBBCS and GC-HBOC); a 2-degrees-of-freedom genotype test; and 1-degree-of-freedom tests based on cases diagnosed before the ages of 40 years or 50 years compared with all controls. We also defined lists based on 1-degree-of-freedom trend tests restricted to data from each of the nine component studies. SNPs were also selected from analyses of cases with ER-negative disease, but these are not reported here.
Samples for the iCOGS stage were drawn from 52 studies participating in BCAC, including 41 from populations of predominantly European ancestry, 9 of Asian ancestry and 2 of African-American ancestry. The majority of studies were population-based or hospital-based case-control studies, but some studies selected samples by age or oversampled for cases with a family history of breast cancer (Supplementary Table 2). Studies were required to provide ~2% of samples in duplicate.
Genotyping was conducted using a custom Illumina Infinium array (iCOGS) in seven centers, of which four were used for BCAC. Genotypes were called using Illumina’s proprietary GenCall algorithm. Initial calling used a cluster file generated from 270 samples from HapMap 2. To generate the final calls, we first selected a subset of 3,018 individuals, including samples from each of the genotyping centers, each of the participating consortia and each major ancestry group. Only plates with a consistently high call rate in the initial calling were used. We also included 380 samples of European, Asian or African ancestry genotyped as part of the HapMap Project and 1000 Genomes Project and 160 samples that were known positive controls for rare variants on the array. This subset was used to generate a cluster file that was then applied to call the genotypes for the remaining samples. We also investigated two other calling algorithms: Illuminus65 and GenoSNP66. All three algorithms were >99% concordant in their calling for 91% of the SNPs on the array. However, manual inspection of a sample of the SNPs with discrepancies indicated that the calls from GenCall were almost invariably superior (generally, because Illuminus or GenoSNP attempted to call SNPs that clustered poorly). Therefore, only the genotypes called by GenCall have been used in the analyses reported here.
We excluded individuals for any of the following reasons: genotypically not female XX (XY, XXY or XO); overall call rate < 95%; low or high heterozygosity (P < 1 × 10−6, determined separately for individuals of European, East Asian and African-American ancestry); genotypes discordant with those determined in previous BCAC genotyping such that the individual appeared to be different; genotypes for the duplicate sample that seemed to be from a different individual; and cryptic duplicates where the phenotypic data indicated that the individuals were different. We searched for cryptic duplicates, both within each study and between studies from the same country. For known and cryptic concordant duplicates, the sample with the lower call rate was excluded. We attempted to identify first-degree relative pairs using identity-by-state estimates based on ~37,000 uncorrelated SNPs. For apparent first-degree relative pairs, we removed the control from a case-control pair; otherwise, we excluded the individual with the lower call rate. For the main analyses presented here, we also excluded 1,880 individuals who were included in any of the GWAS to allow the GWAS and iCOGS stages to be combined.
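The rule for apparent first-degree relative pairs can be written as a small decision function. This is a sketch only: the dictionary keys are hypothetical, and the behavior when call rates are exactly tied (the second sample is dropped) is an arbitrary choice not specified in the text.

```python
def sample_to_exclude(a, b):
    """Pick which member of an apparent first-degree relative pair to drop.

    a, b: dicts with hypothetical keys 'id', 'is_case' and 'call_rate'.
    In a case-control pair the control is removed; otherwise the sample
    with the lower call rate is removed (ties drop the second sample).
    """
    if a["is_case"] != b["is_case"]:
        return b["id"] if a["is_case"] else a["id"]
    return a["id"] if a["call_rate"] < b["call_rate"] else b["id"]


case = {"id": "S1", "is_case": True, "call_rate": 0.96}
control = {"id": "S2", "is_case": False, "call_rate": 0.99}
other_case = {"id": "S3", "is_case": True, "call_rate": 0.98}

print(sample_to_exclude(case, control))      # control dropped despite higher call rate
print(sample_to_exclude(case, other_case))   # lower call rate dropped
```

Keeping the case in a mixed pair preserves the scarcer sample type, at the cost of retaining a genotype set that may have a slightly lower call rate.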
Ancestry outliers were identified by multidimensional scaling, combining the iCOGS data with genotypes from the HapMap 2 populations, on the basis of a subset of 37,000 uncorrelated markers that passed quality control (including ~1,000 that were selected as ancestry-informative markers). Most studies were predominantly of a single ancestry (European or East Asian), and individuals with >15% minority ancestry, as determined on the basis of the first two principal components, were excluded. Two studies from Singapore (SGBCC) and Malaysia (MYBRCA) contained a substantial fraction of individuals of mixed European and Asian ancestry (likely of South Asian ancestry). For these studies, no exclusions for ancestry outliers were made, but principal-components analysis adequately corrected for inflation in these studies. Similarly, for the two African-American studies (NBHS and SCCS), no exclusions for ancestry outliers were made.
Principal-components analyses were carried out separately for the European, Asian and African-American subgroups, on the basis of a subset of 37,000 uncorrelated SNPs. For the analyses of European subjects, we included the first six principal components as covariates, together with a seventh component derived specifically for one study (LMBC) for which there was substantial inflation not accounted for by the components derived from the analysis of all studies (this component was set to zero for all other studies). The addition of further principal components did not reduce inflation further. We included two principal components each for the studies in Asian and African-American populations.
We excluded SNPs with call rates of <95%. We also excluded SNPs that deviated from Hardy-Weinberg equilibrium in controls at P < 1 × 10−7, on the basis of a stratified 1-degree-of-freedom test in which the deviations were summed across strata67. We also excluded SNPs for which the genotypes were discrepant in more than 2% of duplicate samples across all COGS consortia. The final analyses were based on data from 199,961 SNPs.
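As an illustration of the SNP-level filters, the following Python sketch applies the call-rate threshold and a 1-degree-of-freedom Hardy-Weinberg χ2 test to genotype counts in controls. It handles a single stratum only; the stratified form of the test used above is not spelled out here, so this is a simplification, and the function names are hypothetical.

```python
import math
from typing import Tuple


def hwe_chi2(n_aa: int, n_ab: int, n_bb: int) -> Tuple[float, float]:
    """Return (chi-square, P value) for a 1-df Hardy-Weinberg test in one stratum."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 0.0, 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 0.0, 1.0  # monomorphic SNP: no deviation can be tested
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square on 1 df: P(Z^2 > x) = erfc(sqrt(x / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


def passes_snp_qc(
    snp_call_rate: float,
    control_counts: Tuple[int, int, int],
    min_call_rate: float = 0.95,
    hwe_threshold: float = 1e-7,
) -> bool:
    """Apply the call-rate and Hardy-Weinberg filters to one SNP."""
    _, p_value = hwe_chi2(*control_counts)
    return snp_call_rate >= min_call_rate and p_value >= hwe_threshold
```

For example, `passes_snp_qc(0.99, (50, 0, 50))` rejects a SNP with no heterozygous controls despite an allele frequency of 0.5.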
Genotype intensity cluster plots were examined manually for SNPs in each new region in which a genome-wide significant association was obtained, and SNPs were eliminated if the clustering was judged to be poor.
For each SNP, we estimated a per-allele log(OR) and standard error by logistic regression, including study and principal components as covariates. Genotype-specific ORs were also computed. Overall significance levels were obtained by combining the estimates from the combined GWAS and iCOGS using a fixed-effects meta-analysis to derive a 1-degree-of-freedom test. Inflation of the test statistics (λ) was estimated by dividing the 45th percentile of the test statistic by 0.357 (the 45th percentile for a χ2 distribution on 1 degree of freedom). For this purpose, we used a subset of 22,897 SNPs that were uncorrelated (r2 < 0.1), which were not selected by BCAC and were not within 1 of the 4 common fine-mapping regions. This subset was used to minimize the selection of SNPs associated with disease, on the assumption that such SNPs are likely to be representative of common SNPs in terms of population structure. The inflation statistic was converted to an equivalent inflation statistic for a study with 1,000 cases and 1,000 controls (λ1,000) by adjusting by effective study size, namely
$$\lambda_{1,000} = 1 + \frac{500(\lambda - 1)}{\sum_k \left(1/n_k + 1/m_k\right)^{-1}}$$
where nk and mk are the number of cases and controls, respectively, for study k. Heterogeneity in the per-allele OR by ER status, age at diagnosis, family history and tumor invasiveness (DCIS versus invasive) was evaluated using a case-only analysis.
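The combination of stage-specific estimates and the inflation calculation can be sketched in Python as follows. Two assumptions are made: the fixed-effects meta-analysis is taken to be the standard inverse-variance weighted form, and the effective study size is taken to be the sum over studies of (1/nk + 1/mk)−1, so that a single study with 1,000 cases and 1,000 controls leaves λ unchanged. All function names are illustrative.

```python
import math
from statistics import NormalDist, quantiles
from typing import Sequence, Tuple


def fixed_effects(estimates: Sequence[Tuple[float, float]]) -> Tuple[float, float, float]:
    """Combine (log OR, SE) pairs by inverse-variance weighting.

    Returns the pooled log OR, its SE and the 1-df chi-square statistic.
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / total
    se = math.sqrt(1.0 / total)
    return beta, se, (beta / se) ** 2


def chi2_1df_quantile(q: float) -> float:
    """Quantile of a chi-square on 1 df, as the square of a standard normal quantile."""
    z = NormalDist().inv_cdf((1.0 + q) / 2.0)
    return z * z


def inflation(chi2_stats: Sequence[float], percentile: int = 45) -> float:
    """Ratio of the observed to the expected percentile of the test statistics."""
    observed = quantiles(chi2_stats, n=100, method="inclusive")[percentile - 1]
    return observed / chi2_1df_quantile(percentile / 100.0)


def lambda_1000(lam: float, study_sizes: Sequence[Tuple[int, int]]) -> float:
    """Rescale lambda to a study of 1,000 cases and 1,000 controls.

    study_sizes holds (cases, controls) for each study.
    """
    effective = sum(1.0 / (1.0 / n + 1.0 / m) for n, m in study_sizes)
    return 1.0 + 500.0 * (lam - 1.0) / effective
```

Here `chi2_1df_quantile(0.45)` evaluates to approximately 0.357, the divisor quoted above.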
Gene expression, copy number and genotype data were retrieved from the TCGA breast cancer study. Gene expression profiles were measured by TCGA using a custom Agilent 244K expression array. We downloaded the raw expression data and performed preprocessing using the limma R package. Copy number and germline genotype were both measured using the Affymetrix Genome-Wide Human SNP 6.0 array. We used the segmented copy number and called genotype data as provided by TCGA. Intersecting the different genomic data types, we collected 458 primary tumor samples with germline genotypes from blood and both gene expression and somatic copy number data from the tumor. In addition, for 61 samples, we had germline genotype and gene expression data from normal breast tissue from individuals in the TCGA breast cancer study. Expression quantitative trait locus (eQTL) analysis was performed on both sets separately. For cis-eQTL analysis, we considered all genes 50 kb upstream or downstream of the lead SNP. Fourteen of the risk-associated SNPs are represented directly on the Affymetrix SNP array. For an additional 23, we were able to select proxies on the basis of maximum LD with minimum r2 of 0.5. In case of equal LD, we used proximity on the genome to break the tie. LD estimates were extracted from the HapMap data for the CEU population. eQTL analysis was performed by regressing the gene expression of selected candidate genes on the genotype followed by a significance test of the t statistic for the genotype covariate. For both the normal and tumor analyses, the linear regression was adjusted for potential batch effects by including indicator variables for the plate identifier component of the TCGA sample barcode. In addition, the first principal component of the complete gene expression matrix was added as a covariate to adjust for other global, typically non-genetic contributions to the gene expression signal. 
To prevent spurious associations due to confounding by nearby eQTLs, we corrected the model for the most strongly associated eQTL SNP in the region. For the tumor analysis only, we also added the copy number of the candidate gene as a covariate because apparent associations between germline genotype and tumor expression may be confounded or obscured by somatic copy number alterations.
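The eQTL regression described above can be illustrated with a small ordinary least squares sketch in Python. It regresses expression on genotype plus an arbitrary list of covariates (which could hold plate indicators, the first expression principal component, the strongest regional eQTL SNP and, for tumors, copy number) and returns the t statistic for the genotype term. The function names and the pure-Python matrix inversion are illustrative choices rather than the implementation used for the analysis, and the P value is omitted because the Python standard library has no t distribution.

```python
import math
from typing import List, Sequence, Tuple


def invert(matrix: List[List[float]]) -> List[List[float]]:
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def eqtl_t_statistic(
    expression: Sequence[float],
    genotype: Sequence[float],
    covariates: Sequence[Sequence[float]],
) -> Tuple[float, float, int]:
    """Return (genotype coefficient, t statistic, residual degrees of freedom)."""
    # Design matrix columns: intercept, genotype, then the covariates.
    rows = [[1.0, float(g)] + [float(c) for c in cov]
            for g, cov in zip(genotype, covariates)]
    n, p = len(rows), len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, expression)) for i in range(p)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
    residuals = [y - sum(b * x for b, x in zip(beta, r))
                 for r, y in zip(rows, expression)]
    df = n - p
    sigma2 = sum(e * e for e in residuals) / df
    se = math.sqrt(sigma2 * inv[1][1])  # standard error of the genotype coefficient
    return beta[1], beta[1] / se, df
```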
To assess the potential effects of the new SNPs on nearby gene expression in lymphocytes, we identified all genes that lie within a 500-kb window surrounding each of the SNPs and used Genevar (Gene Expression Variation), a public database with gene expression data quantified in lymphocytes from individuals in the HapMap 2 populations35,68.
To estimate the total number of newly associated loci selected for the iCOGS array, we first used the set of 29,807 SNPs selected from the GWAS and not selected for fine mapping, to exclude previously known loci. We then defined a set of 10,668 SNPs that were uncorrelated (r2 < 0.1 between any pair) and determined the number of loci for which the estimated effect size in the iCOGS stage was in the same direction as in the combined GWAS and the number of loci for which the effect was in the opposite direction. Similar results were obtained using cutoffs of r2 < 0.05 and r2 < 0.2. On the assumption that none of the loci with effects in opposite directions in the two stages were associated with disease, the number of loci associated with disease can be estimated as the difference between the number of loci with effects in the same direction and the number with effects in opposite directions. This, however, is an underestimate because loci with weak effects may have estimated effects in opposite directions in the two stages. To allow for this possibility, we fitted the distribution of z scores as a mixture of a standard normal distribution (representing SNPs with no effect) and a normal distribution with unknown mean and variance, using an expectation-maximization algorithm58. The total contribution to heritability was then computed from the posterior estimates. To allow for the potential effect of residual population stratification, we conducted an additional analysis in which the null distribution was assumed to have variance of 1.2, based on the estimated inflation from the non-BCAC SNPs, but the estimates were essentially identical.
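The two estimators described above can be sketched in Python under some stated assumptions. The first function takes pairs of effect estimates (GWAS, iCOGS) for uncorrelated SNPs and returns the difference between the numbers of concordant and discordant directions. The second fits a two-component mixture, a null normal distribution with mean zero and fixed variance plus a normal distribution with unknown mean and variance, by expectation-maximization. Which z scores are fitted and how they are aligned across stages is not detailed here, so the input is simply a list of z scores, and the starting values are arbitrary.

```python
from statistics import NormalDist
from typing import Sequence, Tuple


def concordance_estimate(pairs: Sequence[Tuple[float, float]]) -> int:
    """Loci with same-direction effects minus loci with opposite-direction effects."""
    same = sum(1 for a, b in pairs if a * b > 0)
    opposite = sum(1 for a, b in pairs if a * b < 0)
    return same - opposite


def fit_mixture(
    z: Sequence[float], null_var: float = 1.0, iterations: int = 300
) -> Tuple[float, float, float]:
    """EM fit of (1 - pi) * N(0, null_var) + pi * N(mu, var); returns (pi, mu, var)."""
    null = NormalDist(0.0, null_var ** 0.5)
    pi, mu, var = 0.5, 2.0, 1.0  # arbitrary starting values (an assumption)
    for _ in range(iterations):
        alt = NormalDist(mu, var ** 0.5)
        # E-step: posterior probability that each SNP belongs to the non-null component.
        resp = []
        for x in z:
            a = pi * alt.pdf(x)
            b = (1.0 - pi) * null.pdf(x)
            resp.append(a / (a + b) if a + b > 0.0 else 0.0)
        total = sum(resp)
        if total == 0.0:
            break
        # M-step: update the mixing proportion and the non-null mean and variance.
        pi = total / len(z)
        mu = sum(r * x for r, x in zip(resp, z)) / total
        var = max(sum(r * (x - mu) ** 2 for r, x in zip(resp, z)) / total, 1e-6)
    return pi, mu, var
```

Under this sketch, `pi * len(z)` gives the expected number of associated SNPs, and setting `null_var=1.2` corresponds to the sensitivity analysis for residual stratification.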
The authors wish to thank all the individuals who took part in these studies and all the researchers, clinicians, technicians and administrative staff who have enabled this work to be carried out. BCAC is funded by Cancer Research UK (C1287/A10118 and C1287/A12014) and by the European Community’s Seventh Framework Programme under grant agreement 223175 (HEALTH-F2-2009-223175) (COGS). Meetings of BCAC have been funded by the European Union European Cooperation in Science and Technology (COST) programme (BM0606). Genotyping of the iCOGS array was funded by the European Union (HEALTH-F2-2009-223175), Cancer Research UK (C1287/A10710), the Canadian Institutes of Health Research (CIHR) for the CIHR Team in Familial Risks of Breast Cancer program and the Ministry of Economic Development, Innovation and Export Trade of Quebec (grant PSR-SIIRI-701). Combining the GWAS data was supported in part by the US National Institutes of Health (NIH) Cancer Post-Cancer GWAS initiative grant 1 U19 CA 148065-01 (DRIVE, part of the GAME-ON initiative). A full description of funding and acknowledgments is provided in the Supplementary Note.
Note: Supplementary information is available in the online version of the paper.
COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.
Reprints and permissions information is available online at http://www.nature.com/reprints/index.html.
AUTHOR CONTRIBUTIONS
K. Michailidou and D.F.E. performed the statistical analysis and drafted the manuscript. D.F.E. conceived and coordinated the synthesis of the iCOGS array and led BCAC. P.H. coordinated COGS. J. Benitez led the iCOGS genotyping working group. A.G.-N., G.P., M.R.A., J. Benitez, D.V., F.B., D.C.T., J. Simard, A.M.D. and C.L. coordinated genotyping of the iCOGS array. M.G.-C., P.D.P.P. and M.K.S. led the BCAC pathology and survival working group. J.C.-C. led the BCAC risk factor working group. A.M.D. and G.C.-T. led the iCOGS quality control working group. J.D., E.D., M. Ghoussaini and A. Lee provided bioinformatics support. M.K.B. and Q. Wang provided data management support for BCAC. S.C. and L.F.A.W. provided analysis of the TCGA expression data. C.T., N.R. and D.F.E. led the UK2 GWAS. O.F., J.P. and I.d.S.S. led the BBCS GWAS. H.N., T.A.M., K. Aittomäki and C.B. led the HEBCS GWAS. P.H., K.C., A.I. and J. Liu led the SASBAC GWAS. Q. Waisfisz, H.M.-H., M.A. and R.B.v.d.L. led the DFBBCS GWAS. J.C.-C., R.H., N.D. and L. Beckman led the MARIE GWAS. A. Meindl, R.K.S., B.M.-M. and P.L. led the GC-HBOC GWAS. J.L.H., M.C.S., E.M., D.F.S. and H.T. led the ABCFS GWAS. A.G.U. and A. Hofman led the genotyping in the Rotterdam study. D.J.H. and S.J.C. led the CGEMS GWAS. F.J.C. and S. Slager coordinated TNBCC. C.A.H., B.E.H., F.S. and L.L.M. coordinated MEC. P.D.P.P., D.F.E. and M. Shah coordinated SEARCH. R.L. coordinated EPIC-Norfolk. J. Brown coordinated SIBS. P.H., K.C., N.S., K.H. and J. Li coordinated SASBAC and pKARMA. S.E.B., B.G.N., S.F.N. and H.F. coordinated CGPS. F.J.C., X.W., C.V. and K.N.S. coordinated MCBCS. D.L., M.M., R.P. and M.-R.C. coordinated LMBC. J.C.-C., A.R., S.N. and D.F.-J. coordinated MARIE. N.J., L.G. and Z.A. coordinated BBCS. K. Aaltonen and T.H. coordinated HEBCS. M.K.S., A.B., L.J.V.t.V. and C.E.v.d.S. coordinated ABCS. P.G., T.T., P.L.-P. and F. Menegaux coordinated CECILE. F. Marme, A. Schneeweiss, C. Sohn and B. Burwinkel coordinated BSUCH. R.L.M., A.G.-N., M.P.Z., J.I.A.P. and J. Benitez coordinated CNIO-BCS. A.C., I.W.B., S.S.C. and M.W.R.R. coordinated SBCS. E.J.S., I.T., M.J.K. and N.M. coordinated BIGGS. I.L.A., J.A.K., G.G. and A.M.M. coordinated OFBCR. A. Lindblom and S. Margolin coordinated KARBAC. M.J.H., A. Hollestelle, A.M.W.v.d.O. and A. Jager coordinated RBCS. J.L.H., M.C.S., Q.M.B., J. Stone, G.S.D. and C.A. coordinated ABCFS. J.L.H., M.C.S., G.G.G., G.S. and L. Baglietto coordinated MCCS. P.A.F., L.H., A.B.E. and M.W.B. coordinated BBCC. H. Brenner, H. Müller, V.A. and C. Stegmaier coordinated ESTHER. A. Swerdlow, A.A., N.O., M.J. and M.G.-C. coordinated UKBGS. M.G.-C., J.F., J. Lissowska and L. Brinton coordinated PBCS. M.S.G., F.L., M.D. and J. Simard coordinated MTLGEBCS. R.W., K.P., A.J.-V. and M. Grip coordinated OBCS. H. Brauch, U.H. and T.B. coordinated GENICA. P.R., P.P., S. Manoukian and B. Bonanni coordinated MBCSG. P.D., R.A.E.M.T., C. Seynaeve and C.J.v.A. coordinated ORIGO. A. Jakubowska, J. Lubinski, K.J. and K.D. coordinated SZBCS. A. Mannermaa, V.K., V.-M.K. and J.M.H. coordinated KBCP. N.V.B., N.N.A. and T.D. coordinated HMBCS. V.N.K. coordinated NBCS. H.A.-C. coordinated UCIBCS. A.E.T. coordinated OSU. S.E. coordinated RPCI. F.F. coordinated DEMOKRITOS. D.K., K.-Y.Y. and D.-Y.N. coordinated SEBCS. K. Matsuo, H. Ito, H. Iwata and A. Sueta coordinated HERPACC. A.H.W., C.-C.T., D.V.D.B. and D.O.S. coordinated LAABC. W.Z., X.-O.S., W.L., Y.-T.G. and H.C. coordinated SGBCS. S.H.T., C.H.Y., S.Y.P. and B.K.C. coordinated MYBRCA. M.H., H. Miao, W.Y.L. and J.-H.S. coordinated SGBCC. K. Muir, A. Lophatananon, S.S.-B. and P.S. coordinated ACP. C.-Y.S., C.-N.H., P.-E.W. and S.-L.D. coordinated TWBCS. S. Sangrajrang, V.G., P.B. and J.M. coordinated TBCS. W.J.B., L.B.S., Q.C. and W.Z. coordinated SCCS. W.Z., S.D.-H., M. Shrubsole and J. Long coordinated NBHS. G.C.-T. coordinated the genotyping component of kConFab.
All authors provided critical review of the manuscript.