A genome-wide association study was conducted among Chinese women to identify risk variants for breast cancer. By analyzing 607,728 SNPs in 1,505 cases and 1,522 controls, we selected 29 promising SNPs for a fast-track replication in an independent set of 1,554 cases and 1,576 controls. Four replicated loci were further investigated in a third set of samples including 3,472 cases and 900 controls. SNP rs2046210 at 6q25.1, located upstream of the estrogen receptor 1 gene (ESR1), exhibited a strong and consistent association with breast cancer across all three stages. Adjusted odds ratios (95% CI) were 1.36 (1.24–1.49) and 1.59 (1.40–1.82), respectively, for genotypes A/G and A/A versus G/G (P for trend, 2.0 × 10⁻¹⁵) in the pooled analysis of samples from all three stages. A similar, although weaker, association was also found in an independent study including 1591 cases and 1,466 controls of European ancestry (P for trend, 0.01). These results provide strong evidence implicating 6q25.1 as a susceptibility locus for breast cancer.
Breast cancer, a complex multifactorial disease, is one of the most common malignancies among women in the world. Genetic factors play an important role in the pathogenesis of both sporadic and familial breast cancer [1–3]. However, only a small fraction of breast cancer cases can be explained by the breast cancer susceptibility genes identified thus far, such as the BRCA1 and BRCA2 genes [4–6]. Family-based linkage studies have been successful in mapping genes associated with Mendelian disorders [1–3, 7]. However, this approach has had limited success in identifying common genetic variants that confer small to moderate risk of disease. Over the past 15 years, a large number of association studies have evaluated genetic variants in many candidate genes in relation to breast cancer risk [1–3, 7–9]. Although numerous genetic variants have been implicated, only a few of them have been replicated in subsequent studies [10, 11]. Four recent GWA studies have identified several novel risk alleles for breast cancer [12–15]. All of these studies, however, were conducted in women of European descent, who differ from women of other ethnic groups in certain aspects of genetic architecture. Therefore, additional GWA studies, particularly those conducted in populations of non-European descent, are needed to fully uncover the genetic basis of breast cancer susceptibility.
Since 1996, we have initiated multiple population-based epidemiologic studies of cancer in Shanghai, China, including the Shanghai Breast Cancer Study (SBCS) (see Supplementary Methods). Included in Stages I to III of the current GWA study were genomic DNA samples from 6,531 incident breast cancer cases and 3,998 community controls who participated in these studies (Table 1). The pilot phase of the GWA study was initiated in 2005 with 150 cases and 150 controls genotyped using the Affymetrix GeneChip Human Mapping 500K Array Set, which contains approximately 500,000 SNPs. An additional 1,374 cases and 1,402 controls were genotyped in 2008 using the Affymetrix Genome-Wide Human SNP Array 6.0, which contains 906,602 SNPs. Cases and controls were matched on age. Included in the current analysis were 607,728 SNPs in the Affymetrix SNP Array 6.0 and 330,885 SNPs in the Affymetrix 500K Array Set that met the following criteria: 1) ≥ 5% minor allele frequency (MAF), 2) ≥ 95% call rate, and 3) ≥ 95% genotyping concordance rate in quality control (QC) samples. In addition, SNPs on the Affymetrix 500K Array Set were included only if they were also present on the Affymetrix 6.0 array (Supplementary Table S1). Of the initial 3,076 samples included in the GWA scan, 49 samples were excluded because of a < 95% call rate (n = 4) or sample duplication or contamination (n = 45). A total of 1,505 cases and 1,522 controls remained for the GWA analyses. Multidimensional scaling analyses based on pairwise identity-by-state showed no evidence of apparent genetic admixture in this study population (Supplementary Figure S1).
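The SNP-level inclusion criteria above (MAF, call rate, and QC concordance) can be illustrated with a minimal sketch. It assumes per-SNP genotype counts are already available; the `SnpCounts` container, the function names, and the example counts are hypothetical and are not part of the study's actual pipeline.

```python
from dataclasses import dataclass


@dataclass
class SnpCounts:
    """Genotype counts for one biallelic SNP (hypothetical container)."""
    n_aa: int       # samples homozygous for allele A
    n_ab: int       # heterozygous samples
    n_bb: int       # samples homozygous for allele B
    n_missing: int  # samples without a genotype call


def call_rate(s: SnpCounts) -> float:
    """Fraction of samples with a called genotype."""
    called = s.n_aa + s.n_ab + s.n_bb
    total = called + s.n_missing
    return called / total if total else 0.0


def minor_allele_freq(s: SnpCounts) -> float:
    """Frequency of the less common allele among called genotypes."""
    called = s.n_aa + s.n_ab + s.n_bb
    if called == 0:
        return 0.0
    freq_a = (2 * s.n_aa + s.n_ab) / (2 * called)
    return min(freq_a, 1.0 - freq_a)


def passes_qc(s: SnpCounts, qc_concordance: float,
              min_maf: float = 0.05,
              min_call_rate: float = 0.95,
              min_concordance: float = 0.95) -> bool:
    """Apply the three thresholds described in the text."""
    return (minor_allele_freq(s) >= min_maf
            and call_rate(s) >= min_call_rate
            and qc_concordance >= min_concordance)


# Made-up counts for illustration only.
common_snp = SnpCounts(n_aa=900, n_ab=90, n_bb=10, n_missing=0)
rare_snp = SnpCounts(n_aa=980, n_ab=19, n_bb=1, n_missing=0)
print(passes_qc(common_snp, qc_concordance=0.99))
print(passes_qc(rare_snp, qc_concordance=0.99))
```

Working from counts rather than raw genotype matrices keeps the sketch dependency-free; a real analysis would normally rely on a dedicated tool such as PLINK for these filters.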
Multiple genomic locations were revealed as potentially related to breast cancer risk (Figure 1), and the observed number of SNPs with a small P-value was larger than that expected by chance (Supplementary Figure S2). Similar results were obtained after excluding the 292 subjects genotyped using the Affymetrix 500K Array Set from the analyses (data not shown). P-values presented in Figure 1 are derived from trend tests using logistic regression (df = 1) after adjusting for age. Six of the 11 SNPs identified in published GWA studies [12–15] are included in the Affymetrix 6.0 array, and four of them showed an association with breast cancer risk consistent with that reported previously (Supplementary Table S2). Specifically, elevated risk of breast cancer was found to be associated with the minor allele of rs1219648 (FGFR2, P for trend = 0.0025), rs2981582 (FGFR2, P for trend = 0.001), rs3803662 (TNRC9, P for trend = 0.012), and rs8051542 (TNRC9, P for trend = 0.098). No apparent association, however, was found for rs3817198 (LSP1, P for trend = 0.75), and the association with rs2180341 (6q22.33, P for trend = 0.068) was in the opposite direction to the one reported initially in a study based on the Ashkenazi Jewish population [15].
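The 1-df trend tests described above were run as age-adjusted logistic regressions. As a rough illustration of what a genotype trend test measures, the sketch below implements the unadjusted Cochran-Armitage trend test with additive weights (0, 1, 2). This is a simpler, closely related substitute, not the age-adjusted model used in the study, and the function name and example counts are hypothetical.

```python
import math


def cochran_armitage_trend(cases, controls, weights=(0, 1, 2)):
    """Unadjusted Cochran-Armitage trend test.

    cases, controls: genotype counts ordered by number of risk alleles.
    Returns (chi-square statistic with 1 df, two-sided P-value).
    """
    if not (len(cases) == len(controls) == len(weights)):
        raise ValueError("cases, controls and weights must have equal length")
    n_cases = sum(cases)
    n_controls = sum(controls)
    n_total = n_cases + n_controls
    if n_cases == 0 or n_controls == 0:
        raise ValueError("need at least one case and one control")
    col_totals = [r + s for r, s in zip(cases, controls)]

    # Trend statistic: weighted difference between case and control counts.
    t_stat = sum(w * (r * n_controls - s * n_cases)
                 for w, r, s in zip(weights, cases, controls))

    # Variance of the trend statistic under the null hypothesis.
    k = len(weights)
    sum_sq = sum(w * w * n * (n_total - n)
                 for w, n in zip(weights, col_totals))
    cross = sum(weights[i] * weights[j] * col_totals[i] * col_totals[j]
                for i in range(k) for j in range(i + 1, k))
    variance = (n_cases * n_controls / n_total) * (sum_sq - 2 * cross)
    if variance <= 0:
        raise ValueError("variance is zero; SNP may be monomorphic")

    chi2 = t_stat ** 2 / variance
    # Upper tail of a chi-square(1) equals the two-sided normal tail.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Made-up genotype counts (G/G, A/G, A/A) for illustration only.
chi2, p = cochran_armitage_trend(cases=(100, 200, 200),
                                 controls=(200, 200, 100))
print(chi2, p)
```

Adjusting for covariates such as age requires fitting a regression model, which is why the study used logistic regression rather than this closed-form test.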
For our fast-track replication, the 29 most promising SNPs were genotyped in an independent set of 1,554 cases and 1,576 controls recruited in the SBCS. These SNPs were selected from those that 1) had MAF ≥ 10%; 2) had very clear genotyping clusters; 3) had not previously been confirmed as genetic risk variants for breast cancer; and 4) had either P ≤ 1 × 10⁻⁴ for all samples along with a consistent association at P ≤ 0.05 in both the first batch (754 cases/741 controls) and the second batch (751 cases/781 controls), or P ≤ 5 × 10⁻⁴ for all samples along with a consistent association at P ≤ 0.01 in both batches.
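The selection rule above combines several conditions, so it may help to see it written as a single predicate. The sketch below is one possible reading: it assumes that a "consistent association" means the per-allele effect has the same direction in both batches, and the `StageOneResult` container and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StageOneResult:
    """Stage I summary for one SNP (hypothetical container)."""
    maf: float                  # minor allele frequency
    clear_clusters: bool        # genotype clusters judged very clear
    previously_confirmed: bool  # already an established risk variant
    p_all: float                # P-value in all Stage I samples
    p_batch1: float             # P-value in the first batch
    p_batch2: float             # P-value in the second batch
    beta_batch1: float          # per-allele log odds ratio, first batch
    beta_batch2: float          # per-allele log odds ratio, second batch


def selected_for_replication(r: StageOneResult) -> bool:
    """Apply criteria 1-4 from the text to one SNP."""
    if r.maf < 0.10 or not r.clear_clusters or r.previously_confirmed:
        return False
    # Assumption: "consistent" means the same effect direction in both batches.
    if r.beta_batch1 * r.beta_batch2 <= 0:
        return False
    worst_batch_p = max(r.p_batch1, r.p_batch2)
    rule_a = r.p_all <= 1e-4 and worst_batch_p <= 0.05
    rule_b = r.p_all <= 5e-4 and worst_batch_p <= 0.01
    return rule_a or rule_b


# Made-up values for illustration only.
example = StageOneResult(maf=0.30, clear_clusters=True,
                         previously_confirmed=False, p_all=5e-5,
                         p_batch1=0.03, p_batch2=0.02,
                         beta_batch1=0.20, beta_batch2=0.25)
print(selected_for_replication(example))
```

The two alternatives in criterion 4 trade off overall significance against per-batch significance: a weaker pooled P-value is accepted only when both batches individually show stronger evidence.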
Of the 29 SNPs included in fast-track replication (Supplementary Table S3), four SNPs in Stage II showed a significant association with breast cancer risk at P ≤ 0.05 and the fifth one had a P-value of 0.077 (Table 2). A highly significant association with breast cancer risk was identified for rs2046210 (P = 3.9 × 10⁻⁵) and rs10872676 (P = 1.6 × 10⁻³). Both SNPs are located at 6q25.1, approximately 4.4 kb apart, and show a high degree of LD (r² = 0.69). Therefore, rs2046210 was selected for further validation, as rs10872676 showed a weaker association with breast cancer risk than rs2046210, and its association was not statistically significant after adjusting for rs2046210.
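For readers less familiar with the LD measure quoted above, the following minimal sketch computes r² for two biallelic SNPs from haplotype and allele frequencies. The function name and the example frequencies are hypothetical; the frequencies underlying the reported r² of 0.69 are not given in the text and are not reproduced here.

```python
def ld_r_squared(p_ab: float, p_a: float, p_b: float) -> float:
    """Squared correlation (r^2) between alleles at two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A at the first SNP
          and allele B at the second SNP
    p_a:  frequency of allele A at the first SNP
    p_b:  frequency of allele B at the second SNP
    """
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denom <= 0.0:
        raise ValueError("both SNPs must be polymorphic")
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / denom


# Made-up frequencies: alleles that always travel together give r^2 near 1,
# and independent alleles give r^2 near 0.
print(ld_r_squared(p_ab=0.4, p_a=0.4, p_b=0.4))
print(ld_r_squared(p_ab=0.12, p_a=0.3, p_b=0.4))
```

In practice haplotype frequencies are not observed directly from unphased genotypes and must be estimated, which tools such as Haploview handle internally.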
Four SNPs were further evaluated in Stage III (Table 2), which included 3,472 cases who were recruited between 2002 and 2006 as part of the Shanghai Breast Cancer Survivor Study (SBCSS), along with 900 healthy women recruited from the same source population as the control group for a population-based endometrial cancer study that was conducted in parallel with the SBCS. Again, rs2046210 was associated with breast cancer risk (P = 3.3 × 10⁻⁷) (Table 2), and the P-value reached 2.0 × 10⁻¹⁵ in the pooled analysis of samples from all three stages (Table 3). This P-value is substantially lower than the genome-wide significance level based on a conservative Bonferroni adjustment for multiple comparisons at α = 0.05, providing unequivocal evidence for an association of this SNP with breast cancer risk. This SNP was associated with a population attributable risk (PAR) of 18.9% and an estimated 2.1% excess familial risk of breast cancer. The positive association of this SNP with breast cancer risk was found for both pre- and post-menopausal women, and the association was stronger for ER-negative cancer than for ER-positive cancer (P = 0.02) (Table 3). None of the other three SNPs, however, was replicated in Stage III (Table 2).
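The Bonferroni comparison above comes down to simple arithmetic. The sketch below assumes that the number of tests equals the 607,728 SNPs analyzed in Stage I, which gives a threshold of roughly 8.2 × 10⁻⁸. That choice of test count is an assumption for illustration, and counting imputed SNPs as well would make the threshold stricter still.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the family-wise error at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Assumption: one test per genotyped SNP retained in the Stage I analysis.
N_SNPS_STAGE_ONE = 607_728
threshold = bonferroni_threshold(0.05, N_SNPS_STAGE_ONE)

pooled_p = 2.0e-15  # pooled P for trend for rs2046210, as reported in the text
print(f"Bonferroni threshold: {threshold:.2e}")
print(f"Pooled P below threshold: {pooled_p < threshold}")
```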
Figure 2 shows the 6q25.1 locus where rs2046210 is located. A cluster of SNPs in strong LD with rs2046210 all showed a significant association with breast cancer risk, with P ≤ 0.001 in Stage I. Using data from Stage I, haplotype analyses of a haplotype block including rs2046210 and seven other SNPs, as defined by the method of Gabriel et al. [16], or of a larger block including seven additional SNPs, failed to identify any particular SNP that may explain the observed association at this locus (Supplementary Table S4).
We also evaluated SNP rs2046210 in association with breast cancer risk among 1,590 cases and 1,466 controls of European ancestry, recruited as part of the Nashville Breast Health Study (NBHS), a population-based case-control study conducted in Tennessee, USA (Table 1). Consistent with the findings from the Shanghai studies, the variant allele of this SNP was associated with an elevated risk of breast cancer, and the association was stronger in post-menopausal than in pre-menopausal women (Table 4).
Several genes are located in the 1 Mb region centered on SNP rs2046210, including PLEKHG1, MTHFD1L, AKAP12, ZBTB2, RMND1, C6orf211, C6orf97, ESR1, C6orf98, SYNE1, and NANOGP11. Of these, the ESR1 gene is perhaps of particular interest for breast carcinogenesis. The ESR1 gene encodes estrogen receptor α (ERα), which regulates signal transduction of estrogen, a sex hormone that plays a central role in the etiology of breast cancer. Elevated estrogen levels have been shown to be associated with an increased risk of breast cancer in multiple prospective studies [17]. Since the biological effects of estrogen are mediated primarily through high-affinity binding to ERs, genetic variants in ER genes, including ESR1 and ESR2, have been the focus of multiple previous epidemiologic studies [18–21]. The identified SNP (rs2046210) associated with breast cancer risk is located 29 kb upstream of the first untranslated exon and 180 kb upstream of the transcription start site of exon 1 of the ESR1 gene [22]. None of the SNPs at this locus has previously been reported to be associated with breast cancer, nor are they in LD with two of the most widely studied polymorphisms in ESR1, rs2234693 and rs9340799 (r² < 0.05 in both HapMap Asian and European-descent samples). SNP rs2234693 was genotyped in Stage I of the study and carried an OR (95% CI) of 0.95 (0.80–1.12) for the C/T genotype and 0.79 (0.63–1.00) for the T/T genotype in relation to breast cancer risk. Because of its relatively close location to the ESR1 gene and the biological function of ERα, it is possible that rs2046210 or SNPs in LD with it may alter ESR1 gene expression and affect susceptibility to breast cancer. It is noteworthy that a recent GWA study found that the 6q25.1 locus is associated with bone mineral density [23], a phenotype that is affected by estrogen.
SNP rs2046210 is located 6 kb downstream of C6orf97 (chromosome 6 open reading frame 97). The function of C6orf97 is unknown. The LD block that includes rs2046210 spans a region of about 41 kb (151,971,942 to 152,013,380), which contains part of C6orf97. By running BLAST with the C6orf97 coding peptide as the query sequence, a structural maintenance of chromosomes (SMC) domain was found in the C-terminal region of the C6orf97 protein. SMC proteins appear to play an important role in chromosome dynamics [24]. Further research into the function of C6orf97 and its potential association with breast cancer may be warranted.
For detailed descriptions of the component studies see Supplementary Materials. The study protocol was approved by the Institutional Review Boards of the Vanderbilt University Medical Center, the Shanghai Cancer Institute, the Shanghai Center for Disease Prevention and Control, and Meharry Medical College. Informed consent was obtained from all participants.
For detailed descriptions of the genotyping and quality control procedures see Supplementary Materials. Briefly, in Stage I, the initial 300 subjects were genotyped using the Affymetrix GeneChip Mapping 500K Array Set and the remaining 2,776 samples were genotyped using the Affymetrix Genome-Wide Human SNP Array 6.0. In each of the 96-well plates genotyped using the Affymetrix SNP 6.0 array, three positive QC samples (NA15510, NA10851, and NA18505) purchased from Coriell Cell Repositories (http://ccr.coreill.org/) and a negative QC sample (water) were included. SNP data obtained from the positive QC samples showed a very high concordance rate of called genotypes based on 79,764,872 comparisons (mean, 99.87%; median, 100%). In addition, 742 SNPs were genotyped using the Affymetrix Target Genotyping System, TaqMan, or Sequenom for a subset of subjects included in previous studies. A high concordance rate was also observed between genotypes determined using these platforms and the Affymetrix SNP Array 6.0 based on 1,478,383 comparisons (mean, 99.1%; median, 99.8%). Samples with genotyping call rates of less than 95% were excluded, and the remaining samples were recalled using Birdseed v2. The gender of all study samples was confirmed to be female. Identity-by-descent analysis based on identity-by-state was performed to detect first-degree cryptic relationships using PLINK 1.04 [25] (http://pngu.mgh.harvard.edu/purcell/plink/). The final data set included 3,027 individuals and 607,728 markers in the Affymetrix SNP Array 6.0 and 330,885 in the Affymetrix 500K Array Set that met the following criteria: (1) genotype call rate ≥ 95%; (2) MAF ≥ 5%; and (3) genotyping concordance rate in QC samples ≥ 95%. Also excluded from the analyses were 21,223 SNPs that are on the Affymetrix 500K Array Set but not on the SNP Array 6.0.
Genotyping for the replication sets was completed using the iPLEX™ Sequenom MassArray® platform. Included in each 96-well plate as QC samples were two negative controls (water), two blinded duplicates, and two samples included in the HapMap project. The mean concordance rate was 99.7% for blinded duplicates (2,572 comparisons) and 99.2% for HapMap samples (1,751 comparisons).
PLINK version 1.04 was used to analyze the genome-wide data obtained in Stage I. A set of 4,305 SNPs with MAF ≥ 0.35 and ≥ 100 kb between adjacent SNPs was selected to evaluate the population structure. The inflation factor λ was estimated to be 1.024, suggesting that population substructure, if present, should not have any appreciable effect on the results. Odds ratios (OR) and 95% confidence intervals (CI) were estimated from logistic regression analysis. The GWA analyses of Stage I data used to select promising SNPs were adjusted for age. Additional adjustment was made for education in the analyses of Stage II and III data. Finally, multivariate analyses including age, education, study stage, body mass index, age at menarche, and age at first live birth were performed in the pooled analyses of data from all stages. None of the 29 markers genotyped in Stage II showed deviations from Hardy-Weinberg equilibrium (P > 0.05). P-values based on two-tailed tests are presented.
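The inflation factor λ mentioned above is commonly estimated as the median of the observed 1-df chi-square statistics divided by the median of the chi-square distribution with 1 df (about 0.455). The sketch below follows that common definition using only the Python standard library. Whether it matches the exact computation performed in PLINK 1.04 is an assumption, and the function names and example P-values are hypothetical.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()

# Median of a chi-square(1) variable: square of the 75th normal percentile.
CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def p_to_chi2_1df(p: float) -> float:
    """Convert a two-sided P-value into the matching 1-df chi-square value."""
    if not 0.0 < p <= 1.0:
        raise ValueError("P-values must lie in (0, 1]")
    z = _STD_NORMAL.inv_cdf(p / 2.0)  # lower-tail quantile, z <= 0
    return z * z


def genomic_inflation_factor(p_values) -> float:
    """Median observed chi-square divided by its expected median."""
    stats = [p_to_chi2_1df(p) for p in p_values]
    if not stats:
        raise ValueError("at least one P-value is required")
    return median(stats) / CHI2_1DF_MEDIAN


# Made-up, evenly spaced P-values mimic a null distribution.
null_like = [(i + 0.5) / 1000 for i in range(1000)]
print(genomic_inflation_factor(null_like))
```

A value close to 1, such as the 1.024 reported here, indicates that the bulk of the test statistics behaves as expected under the null hypothesis.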
Haplotype analysis: SNPs genotyped in Stage I in a region of ~100 kb containing rs2046210 were used to reconstruct haplotypes. LD between SNPs was assessed using Haploview [26]. Haplotype blocks were defined using the method of Gabriel et al. [16], and nine blocks were observed (Supplementary Figure S3). Block 5 includes rs2046210 and seven other SNPs: rs7740686, rs7763637, rs9397436, rs9397437, rs6908732, rs852004, and rs865898 (Supplementary Table S4). Associations of haplotypes with breast cancer risk were analyzed with the HAPSTAT software [27] based on additive models. Additional analyses were performed for haplotypes defined using the SNPs included in blocks 4 and 5, a total of 15 SNPs (Supplementary Figure S3, Supplementary Table S4).
Imputation analysis: We used PLINK to impute genotypes for autosomal SNPs that were present in HapMap Phase II release 23a but not genotyped in our GWA scan. Genotype data from the 90 Asian HapMap subjects, restricted to SNPs with MAF ≥ 1% and genotyping rate ≥ 95%, were used as the reference. During the imputation, an information score was generated for each imputed SNP that reflected how confidently its genotypes were inferred. SNPs with scores below 0.80 were considered to be of poor quality and were not analyzed. In addition to the 0.61 million observed SNPs, 1.14 million imputed SNPs were tested for association with breast cancer.
This research was supported in part by NIH grants R01CA124558, R01CA64277, R01CA70867, R01CA90899, and R01CA100374, as well as Ingram professorship funds and research award funds to WZ, R01CA118229, R01CA92585, and DOD Idea Award BC011118 to XOS, and R01CA122756 and DOD Idea Award BC050791 to QC. The authors wish to thank the study participants and research staff for their contributions and commitment to this project, Regina Courtney and Qing Wang for DNA preparation, and Brandy Venuti for clerical support in the preparation of this manuscript. Sample preparation and Stage I genotyping were conducted at the Survey and Biospecimen and Microarray Shared Resources, which are supported in part by the Vanderbilt-Ingram Cancer Center (P30 CA68485). Stage II and III genotyping was carried out at Proactive Genomics, Winston-Salem, NC.