Colorectal cancer is the second leading cause of cancer death in developed countries. Genome-wide association studies (GWAS) have successfully identified novel susceptibility loci for colorectal cancer. To follow up on these findings and to try to identify novel colorectal cancer susceptibility loci, we present results for GWAS of colorectal cancer (2,906 cases, 3,416 controls) that have not previously published main associations. Specifically, we calculated odds ratios (ORs) and 95% confidence intervals (CIs) using log-additive models for each study. To improve our power to detect novel colorectal cancer susceptibility loci, we performed a meta-analysis combining the results across studies. We selected the most statistically significant single nucleotide polymorphisms (SNPs) for replication using 10 independent studies (8,161 cases and 9,101 controls). We again used a meta-analysis to summarize results for the replication studies alone, and for a combined analysis of GWAS and replication studies. We measured 10 SNPs previously identified in colorectal cancer susceptibility loci and found eight to be associated with colorectal cancer (p-value range: 0.02 to 1.8 × 10−8). When we excluded studies that have previously published on these SNPs, five SNPs remained significant at p<0.05 in the combined analysis. No novel susceptibility loci were significant in the replication study after adjustment for multiple testing, and none reached genome-wide significance from a combined analysis of GWAS and replication. We observed marginally significant evidence for a second independent SNP in the BMP2 region at chromosomal location 20p12 (rs4813802; replication p-value 0.03; combined p-value 7.3 × 10−5). 
In a region on 5p15.33, which includes the coding regions of the TERT-CLPTM1L genes and has been identified in GWAS to be associated with susceptibility to at least seven other cancers, we observed a marginally significant association with rs2853668 (replication p-value 0.03; combined p-value 1.9 × 10−4). Our study suggests a complex nature of the contribution of common genetic variants to risk for colorectal cancer.
Colorectal cancer is the second leading cause of cancer death in developed countries, with the lifetime risk estimated to be 5% to 6% (Ries et al., 2007). Linkage studies have identified important rare germline mutations, such as those in the APC gene and DNA mismatch repair genes, leading to severe syndromes, e.g. familial adenomatous polyposis and Lynch syndrome (also called hereditary non-polyposis colorectal cancer) (de la Chapelle, 2004). However, these high-penetrance mutations explain only a small fraction of the genetic risk. To date, genome-wide association studies (GWAS) have identified fourteen low-penetrance genetic variants that, together, explain approximately 8% of the familial association of this disease (Broderick et al., 2007; Gruber et al., 2007; Houlston et al., 2008; Houlston et al., 2010; Tenesa et al., 2008; Tomlinson et al., 2007; Tomlinson et al., 2008; Zanke et al., 2007). Based on a recent method by Chatterjee and Park (Park et al., 2010) that estimates the amount of familial association explained by common genetic variants, we estimate that about 60 to 70 common variants (95% confidence interval: 31–173) would explain approximately 17% (95% confidence interval: 11.6–35.8%) of the familial association in colorectal cancer. Accordingly, we hypothesize that additional common colorectal cancer susceptibility loci exist that have yet to be identified, and that these loci can be identified through a genome-wide analysis of single nucleotide polymorphism (SNP) data.
As has been demonstrated in studies of other common complex diseases, power to detect novel loci is enhanced by performing meta-analyses that combine GWAS results (Zeggini and Ioannidis, 2009). Therefore, we conducted a combined analysis of two recently completed scans that have not previously published main associations, followed by a replication study of the most significant findings using ten independent studies (Table 1; Supplemental Note; Supplemental Table 1) to follow up on the currently established colorectal cancer susceptibility loci and to try to identify additional susceptibility loci. The GWAS meta-analysis included a total of 2,906 cases and 3,416 controls recruited as part of the Colon Cancer Family Registry (CCFR), the Diet, Activity and Lifestyle Study (DALS), the Prostate, Lung, Colorectal and Ovarian Screening Trial (PLCO), and the Women’s Health Initiative (WHI). The replication included a total of 8,161 cases and 9,101 controls from the Nurses’ Health Study (NHS), the Health Professionals Follow-up Study (HPFS), the Physicians’ Health Study (PHS), the Assessment of Risk in Colorectal Tumors In Canada (ARCTIC), additional samples from DALS and CCFR, and case-control studies from Germany, France, Israel, and Newfoundland. Most of these studies are part of the Genetics and Epidemiology Colorectal Cancer Consortium (GECCO; details in Supplemental Note).
From the fixed-effects meta-analysis of GWAS scans, the inflation factor λ was 1.008, indicating little evidence of residual population substructure, cryptic relatedness, or differential genotyping between cases and controls (Supplemental Figure 1). When analyzed separately, λ was similarly low for each scan (range: 1.005 to 1.01).
Initially, we attempted to validate the ten established susceptibility SNPs (p-value <5 × 10−8) that had been published at the time we selected SNPs for replication (Broderick et al., 2007; Gruber et al., 2007; Houlston et al., 2008; Tenesa et al., 2008; Tomlinson et al., 2007; Tomlinson et al., 2008; Zanke et al., 2007). We found nominal evidence for association in the same direction with p<0.05 from combined analyses of GWAS and replication for eight of these ten loci (rs4939827/SMAD7, rs4779584/GREM1, rs16892766/EIF3H, rs3802842/11q23, rs961253/BMP2, rs4444235/BMP4, rs9929218/CDH1, rs6983267/MYC; Table 2). When we excluded results from studies that had been previously published (Supplemental Table 2; Supplemental Figure 2), we found evidence for replication at p<0.05 for five out of nine SNPs. This latter analysis did not include rs6983267, since results for that SNP have already been published by a majority of the studies.
The most significant novel SNP in both the replication study, and in the combined analysis of the GWAS and replication, was rs7315438 located on chromosome 12q24 near MED13L (replication OR=0.92; replication p-value 1.0 × 10−3; combined OR=0.92; combined fixed-effects p-value [pfixed] 5.6 × 10−6; combined random-effects p-value [prandom] 1.5 × 10−3; Table 3; Supplemental Table 3). For two other SNPs, their association with colorectal cancer was nominally significant within the replication study: rs4925386 located on chromosome 20q13 near LAMA5 (replication p-value 2.5 × 10−3; pfixed 2.1 × 10−4; prandom 0.015) and rs16888522 located on 8q23.3 near EIF3H (replication p-value 4.1 × 10−3; pfixed 1.7 × 10−5; prandom 4.5 × 10−4). We note that the rs4925386 SNP did not have strong evidence for association in the random effects model. Since we selected rs4925386 for our replication study, it has been identified as being associated with colorectal cancer by a published GWAS (Houlston et al., 2010). The variant rs16888522 is in the region of rs16892766/EIF3H, previously identified to be associated with colorectal cancer by a GWAS (Tomlinson et al., 2008). The variant rs16888522/EIF3H was in weak linkage disequilibrium (LD) with rs16892766/EIF3H (D′=0.255; r2=0.043; Supplemental Figure 4). Conditional analysis, including both variants in the same model, resulted in less significant results for both variants (Supplemental Table 4) and showed weak correlation between the beta coefficients (r = −0.269), which suggests that these variants may not be independently associated SNPs.
We identified five other loci with p<0.05 in our replication study and combined p-value <10−4 (Table 3). The associated SNPs were located near BMP2, POLS, SLC8A1, TERT-CLPTM1L and TPK1. One was in a region previously identified to be a colorectal cancer susceptibility locus by a GWAS (rs4813802/BMP2; replication p-value 0.03; pfixed 7.3 × 10−5; prandom 0.014; Table 3; Supplemental Figure 3; Supplemental Table 3). The higher p-value for the random effects for this SNP reflects the fact that we observed evidence of heterogeneity among GWAS for rs4813802 (I2=76.1% and p = 0.06) (Ioannidis et al., 2007); however, this was less pronounced among the replication studies (I2=40.4% and p = 0.08), suggesting that the result is consistent among studies after accounting for those that may be subject to the “winner’s curse” (Garner, 2007). rs4813802/BMP2 was not in LD with the known colorectal cancer susceptibility SNP in the region, rs961253/BMP2 (D′=0.02, r2 <0.001) (Houlston et al., 2008), and the joint conditional analysis demonstrates the independent association of both variants with colorectal cancer risk (correlation of beta coefficients = 0.018; Supplemental Table 4; Figure 1).
For all SNPs in Table 3, we tested if the risk estimates of these variants may vary by mode of inheritance or sex. While for some variants (rs4813802/BMP2, rs275454/POLS, rs2373859/SLC8A1, and rs2853668/CLPTM1L), the recessive model tended to provide stronger risk estimates and slightly lower p-values than the log-additive or dominant model, the AIC value was >2 in all cases, indicating no statistical evidence for improvement over the log-additive model (Supplemental Table 5). We also explored if results vary by sex and found that for rs16888522/EIF3H the statistical evidence for association was stronger in men (OR=1.25, p-value=0.002) than in women (OR=1.10; p-value=0.25), although the effect estimates were in the same direction and similar in magnitude for both men and women (Supplemental Table 6).
As a sensitivity analysis, we reran the combined fixed-effects meta-analysis leaving out one study at a time for all SNPs in Table 2. In no case did the point estimate change by more than 3%. Further, all pfixed remained <5×10−3 except when we removed the French study from the analysis of rs4925386. In that case the OR remained similar (OR=0.94) but the p-value was slightly attenuated (pfixed=8.2×10−3).
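The leave-one-out procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code: the study names, betas and standard errors are hypothetical, and the helper names `pooled_or` and `leave_one_out` are introduced here only for the example.

```python
import math


def pooled_or(estimates):
    """Inverse-variance weighted fixed-effects odds ratio.

    estimates: list of (beta, se) pairs, beta being the per-allele log odds ratio.
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    pooled_beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / sum(weights)
    return math.exp(pooled_beta)


def leave_one_out(study_estimates):
    """Re-run the pooled estimate with each study removed in turn.

    Returns {study: (OR without that study, percent change versus the full OR)}.
    """
    full = pooled_or(list(study_estimates.values()))
    result = {}
    for name in study_estimates:
        rest = [v for k, v in study_estimates.items() if k != name]
        or_without = pooled_or(rest)
        result[name] = (or_without, 100.0 * (or_without - full) / full)
    return result


# Hypothetical per-study results for one SNP (not taken from the paper).
example = {
    "StudyA": (math.log(0.92), 0.05),
    "StudyB": (math.log(0.95), 0.06),
    "StudyC": (math.log(0.90), 0.07),
    "StudyD": (math.log(0.94), 0.05),
}
loo = leave_one_out(example)
largest_change = max(abs(change) for _, change in loo.values())
```

A SNP would pass the stability criterion stated in the text if `largest_change` stays below 3.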
From the analysis of GWAS and replication, including a total of up to 11,067 cases and 12,517 controls, we found that SNPs in eight out of ten previously identified colorectal cancer susceptibility loci were associated with the disease in our replication study at p<0.05. We found evidence that a second SNP (rs4813802) near the BMP2 gene could be associated with colorectal cancer, independently of the association with the previously identified susceptibility SNP in that region (rs961253). Furthermore, our study reports for the first time a potential new association of a variant in the TERT-CLPTM1L region with colorectal cancer risk.
Our results provide further support for eight of ten previously identified GWAS hits. When excluding studies that have previously published results on these known loci, five loci showed evidence of replication in this independent subsample. The 8q24 SNP rs6983267 has already been heavily studied, including published reports for many of the studies included in this paper (Figueiredo et al., 2011;Hutter et al., 2010), so we were not able to examine independent replication of this SNP in this study. Among the remaining four loci that did not show a significant association at p<0.05, three showed a trend towards replication (with p<0.2 and an odds ratio in the same direction as the original GWAS report). However, one SNP, rs10795668, did not show any evidence for association with disease (OR=1.00; 95% CI: 0.93–1.08; p=0.96; Supplemental Figure 2). Several papers have reviewed potential reasons for the lack of replication of GWAS findings (Chanock et al., 2007;Kraft et al., 2009). As in any observational study, it is possible this represents either a false positive in the initial report, or a false negative in this replication; although that seems unlikely since both the discovery GWAS and this report are based on large, well-powered studies. We used the same genetic model and similar trait definitions as the discovery GWAS. Further, all studies were restricted to non-Hispanic Whites, limiting the possibility of differences in linkage disequilibrium patterns. It is possible that there may be differences in the distribution of a key effect modifier between the studies used to identify rs10795668 and the studies presented in this paper. A full exploration of underlying gene-gene or gene-environment interactions is beyond the scope of the current paper, but we did explore if the effect of rs10795668 varied by sex. 
Although the results were not significant for either sex, and the 95% CIs overlap, we do note an interesting pattern where the odds ratios are in opposite directions for women and men, with men showing a trend in the direction of the discovery GWAS. Specifically, we found ORwomen =1.07 (95% CI: 0.92–1.26; pfixed=0.38) and ORmen =0.95 (95% CI: 0.84–1.07; pfixed=0.40).
The rs4925386/LAMA5 SNP was also recently identified in another GWAS meta-analysis (Houlston et al., 2010). Although it was not a known colorectal cancer susceptibility locus at the time that we selected SNPs for replication, this SNP met our criteria for selection and showed evidence for association in our replication sample. The rs4925386 variant lies in an intron of LAMA5, the gene encoding the large laminin A5 protein. As previously reported, the variant is in LD (r2>0.5) with four nonsynonymous SNPs in LAMA5 (Houlston et al., 2010). However, each of these amino acid changes is predicted to be benign. Overall, our finding provides additional independent support that this variant is associated with susceptibility to colorectal cancer.
None of the loci were significantly associated with colorectal cancer in our replication study after adjusting for multiple testing (0.05/321=1.6 × 10−4), and none of the loci reached “genome-wide significance” at the suggested p-value of 1.6 × 10−7 after accounting for the two-stage design (for details, see the Methods section) (Dudbridge and Gusnanto, 2008; Hoggart et al., 2008; International HapMap Consortium, 2005; Pe’er et al., 2008; Risch and Merikangas, 1996; Wellcome Trust Case Control Consortium, 2007). However, for some of the variants with p<0.05 in our replication and combined p-value <10−4, additional lines of evidence provide support for the hypothesis that we may have identified genomic regions harboring causal variants for colorectal cancer susceptibility. The variant rs4813802 is about 295.3kb centromeric to the previously identified rs961253/BMP2 GWAS hit (Houlston et al., 2008); both statistical models and LD data support the idea that these are independent signals. The closest gene is bone morphogenetic protein 2 (BMP2). The new variant of interest, rs4813802, is closer to BMP2 (49.2kb upstream) than the previously identified SNP rs961253 (344.5kb upstream of BMP2). Interestingly, rs4813802 lies within an ENCODE Digital DNaseI Hypersensitivity Cluster; it is also within an ENCODE region showing H3K4Me1 enhancer-associated histone marks (Rosenbloom et al., 2010), and the flanking 15 bp show strong placental mammal conservation by PhastCons (Siepel et al., 2005). While not conclusive, all of these are consistent with the region flanking the SNP acting as a long-range enhancer element, plausibly for BMP2. The BMP2 gene belongs to the transforming growth factor-β (TGFβ) superfamily, which plays an important role in cell proliferation, differentiation, and apoptosis (Massague, 2000). Five of the 10 known colorectal cancer susceptibility SNPs are located in or near TGFβ superfamily genes (Tenesa and Dunlop, 2009). 
Furthermore, loss in BMP signaling has been reported at the transition from advanced adenoma to early cancer stage, compatible with a role in tumor progression (Hardwick et al., 2008). Support for a role for BMP signaling in colorectal cancer comes from the identification of mutations in the bone morphogenetic protein receptor, type IA protein (BMPR1A) in juvenile polyposis (Howe et al., 2001). Individuals with familial juvenile polyposis have a 20% risk of colon cancer by age 35 and 68% by age 60 (Schreibman et al., 2005). Our finding supports the possibility of allelic heterogeneity at the BMP2 locus, which is consistent with findings for the 8q24 cancer locus (Al Olama et al., 2009;Witte, 2007) and recent findings for height showing evidence for allelic heterogeneity at as many as 19 loci (Lango et al., 2010). Similar to our finding, these 19 secondary signals in height were rather distant (on average 177kb) from the initial index SNP that was found to be associated through GWAS (Lango et al., 2010). Accordingly, a comprehensive exploration of already discovered colorectal cancer loci may uncover additional independent variants. However, this example demonstrates that defining the boundaries of a susceptibility locus may be challenging, because the SNP we identified (rs4813802) would not have been included if we had defined the region around the initial index SNP (rs961253) by LD.
The 8q24 region has been shown to have multiple independent variants that are associated with cancers. Several of these variants are associated with more than one cancer, and some cancers are associated with multiple variants in this region (Al Olama et al., 2009; Witte, 2007). Similarly, multiple variants associated with various cancer sites, including cancers of lung, pancreas, testes, and bladder, as well as glioma, basal cell carcinoma, and melanoma, are found in the TERT-CLPTM1L region (Figure 2) (Hsiung et al., 2010; Landi et al., 2009; McKay et al., 2008; Miki et al., 2010; Petersen et al., 2010; Rafnar et al., 2009; Shete et al., 2009; Stacey et al., 2009; Turnbull et al., 2010; Wang et al., 2008). Ours is the first report suggesting that a variant in the TERT-CLPTM1L region could be associated with colorectal cancer. The variant rs2853668 is 4.9kb upstream of telomerase reverse transcriptase (TERT) and 18.0kb downstream of cleft lip and palate transmembrane protein 1-like protein (CLPTM1L). Both genes have been implicated in cancer: CLPTM1L has been shown to be altered in cisplatin-resistant cell lines and potentially impacts apoptosis (Yamamoto et al., 2001); TERT encodes the telomerase catalytic subunit that is important for the replication and stabilization of telomere ends, and subsequently impacts chromosome replication and suppression of cell senescence. Malfunction of telomerase can result in chromosomal abnormality and subsequent tumor formation (Rafnar et al., 2009). Our finding provides further evidence that TERT-CLPTM1L is a general cancer susceptibility locus that impacts critical functions for cancer development, similar to the 8q24 region. The candidate gene in the 8q24 locus is MYC, and, as noted by Johnatty and colleagues (Johnatty et al., 2010), these two loci could act in concert. 
Specifically, the TERT promoter has several binding sites for MYC, the nearest gene to the 8q24 locus (Wu et al., 1999); however, we did not observe a statistically significant interaction between rs2853668/TERT-CLPTM1L and rs6983267/MYC at 8q24 (p for interaction term=0.8).
The SNP rs7315438, which showed the most statistically significant association both in the replication study alone and in the combined meta-analysis of GWAS and replication studies, is located on chromosome 12q24 about 76.9kb upstream of the T-box 3 protein (TBX3). The SNP is also located 50.4kb downstream of MED13L, which encodes a subunit of the mediator complex, a large complex of proteins that functions as a transcriptional coactivator for most RNA polymerase II-transcribed genes. Since it has been implicated in transcription, this gene is a plausible candidate for further study. However, this SNP is in a large LD region containing numerous other potential candidate genes, including the kinase suppressor of RAS2 (KSR2).
Other SNPs identified as potentially associated with colorectal cancer in this study are rs27545 (POLS), rs2373859 (SLC8A1) and rs1525461 (LOC643308/TPK1). The gene closest to rs27545 is POLS (59 kb downstream), which encodes a DNA polymerase that is likely involved in DNA repair and, hence, is a potentially interesting candidate gene (Hubscher et al., 2002). Other genes close to rs27545 are SRD5A1 (146 kb upstream), which converts testosterone into the more potent dihydrotestosterone, and the methyltransferase NSUN2 (183 kb downstream), which methylates tRNA (Brzezicha et al., 2006). The SNP rs2373859 resides in an intronic region of SLC8A1 (also known as NCX1), which encodes a cell membrane protein involved in rapid Ca2+ transport (Annunziato et al., 2004). It is in a gene-rich region including other interesting candidates, such as MAP4K3 (954 kb upstream), a member of the mitogen-activated protein kinases, which is involved in regulating both cell growth and death and has altered gene expression in many cancer types (Cuadrado and Nebreda, 2010), and SOS1, which may act as a positive regulator of RAS (Freedman et al., 2006). The closest gene to rs1525461 is TPK1 (195kb upstream). TPK1 is involved in the regulation of thiamine metabolism (Timm et al., 2001). TPK1 flanks a gene-rich region that includes several olfactory receptors, but none of the genes has an obvious link to colorectal cancer development. However, the assignment of SNPs to candidate genes should be done with caution, as recently shown by additional fine mapping and in silico analysis of the previously identified colorectal cancer loci 8q23.3 (EIF3H) and 16q22.1 (CDH1/CDH3), which suggested functional variation in unexpected candidate target genes (Carvajal-Carmona et al., 2011).
Overall, our study suggests a complex nature of the contribution of common genetic variants to risk for colorectal cancer, and suggests the need for additional studies to identify variants with marginal effects, as well as studies to examine potential sources and role of heterogeneity, including gene-gene and gene-environment interactions. We note that this study focused on the log-additive model. Although we present results for other genetic models for our top findings, our results may have been biased for SNPs that do not follow this assumed log-additive model (Minelli et al., 2005). Further, this study was not set up to investigate less frequent (allele frequency 1–5%) and rare variants (allele frequency <1%), which have the potential to contribute substantially to the genetic susceptibility of colorectal cancer (Bodmer and Bonilla, 2008;Cirulli and Goldstein, 2010;Manolio et al., 2009).
In summary, we replicated the majority of SNPs that have previously been found to be associated with colorectal cancer in GWAS. We also report suggestive evidence for an additional independent signal for colorectal cancer risk in the BMP2 locus and a possible new association of colorectal cancer with a variant in the multi-cancer susceptibility locus around TERT-CLPTM1L. Future studies are needed to try to replicate these findings and, if successful, to identify the underlying variants directly responsible for the association and to study the underlying molecular mechanisms.
The studies and their abbreviations are listed in Table 1, and each study is described in detail in the Supplemental Note. In brief, all cases were defined as colorectal adenocarcinoma (International Classification of Disease Code 153–154) and confirmed by medical records, pathologic reports, or death certificate. All cases and controls were self-reported as White, which was confirmed in GWAS samples based on genotype data. All participants gave written informed consent and studies were approved by the Institutional Review Board.
The GWAS meta-analysis results are based on two scans. One GWAS was conducted within the CCFR, including population-based cases and unrelated population-based controls from three sites: USA, Canada, and Australia (Figueiredo et al., 2011). In total, 1,191 cases and 999 controls were successfully genotyped on the Illumina 1M/1M Duo platform and passed all quality-control (QC) steps. The second scan was conducted across three US study populations: the WHI and PLCO cohorts and the DALS population-based case-control study. A total of 1,715 colon cancer cases and 2,417 controls were successfully genotyped on the Illumina HumanHap 550K, 610K or combined Illumina 300K and 240K platforms and passed all QC steps. After applying rigorous genotyping QC filters (see below), a total of 378,739 directly genotyped single-nucleotide polymorphisms (SNPs) commonly shared among the scans were included in the GWAS meta-analysis. To further boost the power and inform the ranking of SNPs we included summary statistics from a previously published colorectal cancer GWAS (Colorectal Tumour Gene Identification Consortium, CORGI) in the meta-analysis (The Institute of Cancer Research, 2008;Tomlinson et al., 2008). However, to ensure independence of results from prior published scans we did not include any CORGI results in any of the presented odds ratios or p-values.
Fixed effects p-values from the GWAS meta-analysis were used to select SNPs for replication. We rank ordered the top SNPs. We used linkage disequilibrium (LD) information in our controls to prune out “redundant” signals (defined as r2>0.5 for SNPs ≥100kb apart and r2>0.1 for SNPs <100kb apart). For the top five SNPs, with p<10−5, we selected two other SNPs with r2>0.9 to ensure against potential genotyping failure. We then went down the ranked list until we filled our SNP platform (total number of SNPs selected for this project=343). SNPs were excluded based on p-value for heterogeneity <0.001 (n=1) and poor clustering in visual inspection of cluster plots (n=3). If SNPs had a low design score, we replaced them with an alternative SNP with r2>0.9. The lowest ranked SNP had p-value 1.2 × 10−3. Our platform also included SNPs for the ten known colorectal cancer susceptibility loci published in previous GWAS at the time we designed the platform. These 343 SNPs were genotyped in samples from DACHS, DALS, French, HPFS, NHS and PHS studies (N=4,062 cases and 4,718 controls) (Table 1; Supplemental Note) and 306 SNPs were successfully genotyped in all studies (see details below). After we selected SNPs for replication, the ARCTIC genome-wide scan became available (769 cases and 665 controls), and we used imputed data from that study for analysis of the 343 SNPs (12 SNPs were not included due to low imputation quality or low HWE p-values).
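The pruning rule above (r2>0.5 for SNPs ≥100kb apart, r2>0.1 for SNPs <100kb apart) can be illustrated with a small sketch. The greedy walk down the ranked list, the treatment of SNPs on different chromosomes as non-redundant, the function name `prune_ranked_snps`, and all example SNP IDs and r2 values are assumptions made for illustration; the paper does not describe the exact procedure at this level of detail.

```python
def prune_ranked_snps(snps, r2, near_kb=100.0, r2_near=0.1, r2_far=0.5):
    """Greedy LD pruning of a p-value-ranked SNP list.

    snps: list of (snp_id, chromosome, position_bp, p_value) tuples.
    r2:   function (snp_id_a, snp_id_b) -> pairwise r2 estimated in controls.
    A SNP is dropped as redundant if its r2 with an already kept SNP on the
    same chromosome exceeds the distance-dependent threshold.
    """
    kept = []
    for snp in sorted(snps, key=lambda s: s[3]):  # most significant first
        redundant = False
        for other in kept:
            if snp[1] != other[1]:
                continue  # assumption: no LD across chromosomes
            distance_kb = abs(snp[2] - other[2]) / 1000.0
            threshold = r2_near if distance_kb < near_kb else r2_far
            if r2(snp[0], other[0]) > threshold:
                redundant = True
                break
        if not redundant:
            kept.append(snp)
    return kept


# Hypothetical SNPs and r2 values, for illustration only.
ranked = [
    ("snpA", "12", 1_000_000, 1e-6),
    ("snpB", "12", 1_050_000, 5e-6),  # 50 kb from snpA, r2 = 0.3 -> pruned
    ("snpC", "12", 1_500_000, 2e-5),  # 500 kb from snpA, r2 = 0.2 -> kept
    ("snpD", "20", 6_000_000, 3e-5),  # different chromosome -> kept
]
r2_table = {frozenset(("snpA", "snpB")): 0.3, frozenset(("snpA", "snpC")): 0.2}
kept = prune_ranked_snps(ranked, lambda a, b: r2_table.get(frozenset((a, b)), 0.0))
kept_ids = [s[0] for s in kept]
```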
As of April 2010, we had genotyped and analyzed the GWAS data and replication data from ARCTIC, DACHS, DALS and the French case control study. We selected 32 SNPs with p<0.1 in this replication set and/or a pfixed<10−4 in the combined replication and GWAS for further genotyping in 2,550 cases and 3,539 controls, including additional samples from NHS, PHS, and HPFS, and samples from MECC and NFCCR. The top SNPs were also analyzed in a second set of data from the CCFR (780 cases and 780 controls). We present results for the total replication sample of 8,161 cases and 9,101 controls.
Genomic DNA was extracted from blood samples or, in the case of a subset of PLCO samples, from buccal cells using conventional methods.
Genotyping was completed on the Illumina Human1M and Human1M-Duo Bead Array in accordance with the manufacturer’s protocol.
The following sample exclusion criteria were applied: call rate <95% (n=75), any stripe (physical/analytical location on BeadChip) call rate <80% (n=9), discordance with prior genotyping (n=3), non-White (n=29), samples that showed admixture identified using the program STRUCTURE (n=33) (Falush et al., 2003; Pritchard et al., 2000), high identity by descent using PLINK (n=2), and mismatch between called and phenotypic sex (n=4). The final analysis was based on 1,191 cases and 999 controls.
SNPs were excluded if they did not overlap between the Illumina Human1M and Human1M-Duo (n=190,301), were annotated as “Intensity Only” (n=8,263), had call rates <90% on either the Illumina Human1M or Human1M-Duo (n=9,229), or by study center or case-control status (n=12,695). When further restricting analysis to SNPs with Hardy-Weinberg-Equilibrium (HWE) p>0.0001, MAF >0.05, and SNP call rate >0.98, a total of 739,733 SNPs remained in the analysis.
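As an illustration of the final SNP-level thresholds stated above (HWE p>0.0001, MAF >0.05, call rate >0.98), the sketch below applies them to genotype counts for a single SNP. It is not the authors' pipeline: the function name `snp_qc` is hypothetical, and the HWE test used here is a 1-degree-of-freedom chi-square goodness-of-fit test, which is an assumption (the source does not state which HWE test was used).

```python
import math


def snp_qc(n_aa, n_ab, n_bb, n_missing,
           min_hwe_p=1e-4, min_maf=0.05, min_call_rate=0.98):
    """Compute call rate, MAF and an HWE p-value from genotype counts."""
    called = n_aa + n_ab + n_bb
    total = called + n_missing
    if called == 0:
        return {"call_rate": 0.0, "maf": 0.0, "hwe_p": 1.0, "passed": False}
    call_rate = called / total
    freq_a = (2 * n_aa + n_ab) / (2 * called)
    maf = min(freq_a, 1.0 - freq_a)
    # Expected genotype counts under Hardy-Weinberg equilibrium.
    expected = [called * freq_a ** 2,
                2 * called * freq_a * (1.0 - freq_a),
                called * (1.0 - freq_a) ** 2]
    observed = [n_aa, n_ab, n_bb]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    hwe_p = math.erfc(math.sqrt(chi2 / 2.0))
    passed = call_rate > min_call_rate and maf > min_maf and hwe_p > min_hwe_p
    return {"call_rate": call_rate, "maf": maf, "hwe_p": hwe_p, "passed": passed}


# Hypothetical counts: a SNP in HWE, one with no heterozygotes, one rare SNP.
in_hwe = snp_qc(90, 420, 490, 5)
no_hets = snp_qc(300, 0, 700, 5)
rare = snp_qc(0, 20, 980, 5)
```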
Average sample call rate was equal to 98.6% with >94% of samples having a call rate >98%. Intra- and interplate replicate concordance rates were equal to 99.97% and 98.7%, respectively.
Genotyping was completed using Illumina HumanHap300 and HumanHap240S (PLCO), 550k (WHI, DALS) and 610k (DALS, PLCO) BeadChip Array System on the Infinium platform in accordance with the manufacturer’s protocol or as previously described for HumanHap300 and HumanHap240S (Yeager et al., 2007).
Samples were excluded if the average call rate was <97% (DALS n=110, PLCO n=63, WHI n=66) or there was a mismatch between called and phenotypic sex (DALS n=6, PLCO n=1). To search for unexpected duplicates and closely related individuals we calculated identity-by-state values. We excluded unexpected duplicates (DALS n=2). Additionally, we excluded samples based on low concordance with prior genotyping (DALS: n=10, WHI: n=1) as well as samples that did not cluster with the CEU samples in principal component analysis including the three HapMap populations as a reference (DALS n=20, PLCO n=2, WHI n=6). The final analysis was based on 698 cases and 719 controls in DALS, 534 cases and 1,168 controls in PLCO, and 483 cases and 530 controls in WHI.
Because we combined data from different platforms, we took precautions to exclude SNPs that do not perform consistently across platforms. This included SNPs reported by Illumina as not performing consistently across platforms (n=78), SNPs found to have more than one discordant call across the 550K and 610K platforms in HapMap data or our interplatform duplicates (n=185), and SNPs with different MAF calls on the two platforms in our control populations (n=9). We further filtered SNPs within each study (DALS, PLCO, WHI) based on MAF <0.05 or HWE p-value in controls <0.0001. We applied a call rate per chip type per study of >98%. A total of 392,361 SNPs passed all QC checks for all three studies.
The average sample call rate was ≥98.8% in any of the three studies, and the concordance rate of blinded duplicates (n=98 pairs) was >97%.
When we combined data across all scans a total of 378,739 autosomal SNPs were successfully genotyped across all studies and used in our final GWAS meta-analysis of 2,906 cases and 3,416 controls.
Genotyping of 343 SNPs in DACHS, DALS, French, and the first sub-sets of HPFS, NHS and PHS were carried out using BeadXpress technology according to the manufacturer’s protocol. Problematic genotype clusters were visually inspected by lot number and the calling algorithm was adjusted, if indicated. 35 SNPs were excluded from the analysis due to poor cluster quality and 2 SNPs were excluded for being out of HWE (p<0.0001) in controls of at least one study. The 306 SNPs in the replication had call rates >92% across studies (average call rate per SNP per study 97.8%). MECC and NFCCR samples were genotyped using Matrix-assisted Laser Desorption/Ionization Time-of-Flight on the Sequenom® MassARRAY 7K platform (Sequenom, Inc., San Diego, CA). A total of 23 and 30 SNPs were successfully genotyped in MECC and NFCCR, respectively. Additional samples from NHS, HPFS and PHS were genotyped on 29 SNPs using the TaqMan® OpenArray® Genotyping Instrument Platform Assays (Applied Biosystems, Carlsbad, CA). Overall, 32 SNPs had call rates >98% across studies (average call rate per SNP per study 99.5%; Supplemental Table 7), indicating excellent quality.
Two GWAS data sets (ARCTIC and CCFR II) became available after the GWAS meta-analysis and were used only for replication as described above. ARCTIC has been previously published (Zanke et al., 2007). Because ARCTIC was genotyped on the Affymetrix platform with limited overlap of SNPs with the Illumina platforms, we made use of imputed data for this study. Imputation was done with BEAGLE, using the phased HapMap release 22 as the reference sample (http://ftp.hapmap.org/phasing/2007-08_rel22/) (Browning and Browning, 2009). SNPs were removed if they were out of HWE (p<0.0001) in the controls (n=1) or had an imputation r2<0.3 (n=11). For CCFR phase II, samples were genotyped using the Illumina 1M Omni. Inclusion/exclusion criteria for cases in phase II were consistent with those described for phase I.
To estimate the association between each genetic marker and risk for colorectal cancer we calculated odds ratios (ORs) and 95% confidence intervals (CIs) using log-additive genetic models relating the genotype dose (0, 1 or 2 copies of the minor allele) to risk of colorectal cancer. We adjusted for age, sex (when appropriate), center and the first three principal components from EIGENSTRAT to account for population substructure. The CCFR calculated these estimates with Cochran-Mantel-Haenszel analysis with strata defined by age, sex, and center.
Quantile-quantile (Q-Q) plots were assessed to determine whether the distribution of the p-values in each study population was consistent with the null distribution (except for the extreme tail; Supplemental Figure 1). To quantify the data in the Q-Q plots, we calculated the inflation factor (λ), which measures the over-dispersion of the test statistics from association tests, by dividing the mean of the test statistics by the mean of the expected values from a chi-square distribution with 1 degree of freedom.
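As a rough illustration of the inflation factor, the sketch below computes λ from a list of 1-degree-of-freedom chi-square association statistics. The function names are hypothetical. The first function follows the mean-based definition given above (the mean of a chi-square distribution with 1 degree of freedom is 1); the second shows the median-based variant that is also commonly used in GWAS, included here only for comparison.

```python
from statistics import NormalDist, mean, median

def inflation_factor_mean(chi2_stats):
    """Lambda as mean(observed statistics) / mean(expected chi-square, 1 df)."""
    expected_mean = 1.0  # mean of a chi-square distribution with 1 df
    return mean(chi2_stats) / expected_mean

def inflation_factor_median(chi2_stats):
    """Median-based variant (an alternative, not the definition used in the text)."""
    # The median of chi-square(1 df) is the squared 75th percentile of N(0, 1), about 0.455.
    expected_median = NormalDist().inv_cdf(0.75) ** 2
    return median(chi2_stats) / expected_median
```

Values of λ close to 1 indicate little over-dispersion of the test statistics.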
We conducted inverse-variance weighted fixed-effects meta-analysis to combine odds ratio estimates from log-additive models or multiplicative methods across individual study populations as described above. In this approach, we weighted the beta estimates of each study by their inverse variance and calculated a combined estimate by summing the weighted betas and dividing by the summed weights. We chose to focus on fixed-effects because we only had a small number of studies. When the number of studies is small, the between study variance may be poorly estimated, resulting in deflated test statistics for association. As such, fixed-effects analysis is better powered for discovery of novel variants (Kraft et al., 2009). We calculated I2, which is a measure of the percentage of total variation across studies due to heterogeneity beyond chance, and obtained the heterogeneity p-values based on Cochran’s Q statistic (Ioannidis et al., 2007).
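The pooling step described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used in the study: the function name and return layout are hypothetical, inputs are per-study log odds ratios (betas) and their standard errors, the p-value uses a two-sided normal approximation, and the Q-based heterogeneity p-value is omitted because it needs a chi-square distribution function that the Python standard library does not provide.

```python
import math
from statistics import NormalDist

def fixed_effects_meta(betas, ses):
    """Inverse-variance weighted fixed-effects meta-analysis of log odds ratios."""
    weights = [1.0 / se ** 2 for se in ses]           # inverse-variance weights
    total_w = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_w
    se = math.sqrt(1.0 / total_w)
    z = beta / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))        # two-sided p-value
    # Cochran's Q and I2 (percentage of variation beyond chance).
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return {
        "beta": beta,
        "se": se,
        "or": math.exp(beta),
        "ci95": (math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se)),
        "p": p,
        "q": q,
        "i2": i2,
    }
```

For example, two studies with betas 0.1 and 0.3 and equal standard errors of 0.1 receive equal weights, so the pooled beta is their simple average.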
To estimate the association between each genetic marker and risk for colorectal cancer we calculated odds ratios (ORs) and 95% confidence intervals using a log-additive genetic model relating the genotype dose (0, 1 or 2 copies of the minor allele) to risk of colorectal cancer and adjusting for age, sex, and study center (as appropriate) in logistic regression analysis.
We conducted inverse-variance weighted fixed-effects meta-analysis to combine odds ratio estimates from log-additive models across individual study populations and measured heterogeneity using I2 and Cochran’s Q statistic, as discussed above.
We again combined across studies using inverse-variance weighted fixed-effects meta-analysis. For novel SNPs with p < 5 × 10−4 based on combined analysis of GWAS and replication, we also report random-effects estimates that incorporate potential heterogeneity into the effect estimate. For these SNPs we also examined dominant, recessive and unrestricted genetic models and compared models by calculating the Akaike information criterion (AIC). We performed stratified analyses and evaluated whether the effects differed by sex. For novel SNPs in regions identified by previous GWAS, we also performed a conditional analysis including both the newly and previously identified SNPs in the region in one model to examine whether the effect of the newly identified SNP can be explained by the existing one. To quantify the independence of the novel SNPs from prior GWAS hits in the same region we calculated the variance-covariance matrix and reported the correlation between the two betas. Finally, we performed a sensitivity analysis where we removed the studies one at a time and examined the results from the fixed-effects meta-analysis. We report any situations where removing one study resulted in a >5% change in the OR point estimate and/or reduced the p-value of the combined fixed-effects meta-analysis to <5×10−3, since that would indicate the results might be driven by only one study.
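The leave-one-out sensitivity analysis can be illustrated with a short sketch. It assumes per-study log odds ratios and standard errors as inputs and uses hypothetical function names; only the >5% change in the OR point estimate is checked here, and the p-value criterion is left out for brevity.

```python
import math

def pooled_beta(betas, ses):
    """Inverse-variance weighted fixed-effects estimate of the log odds ratio."""
    weights = [1.0 / se ** 2 for se in ses]
    return sum(w * b for w, b in zip(weights, betas)) / sum(weights)

def leave_one_out(betas, ses, max_rel_change=0.05):
    """Return indices of studies whose removal shifts the pooled OR by more than max_rel_change."""
    full_or = math.exp(pooled_beta(betas, ses))
    flagged = []
    for i in range(len(betas)):
        # Re-pool after dropping study i.
        sub_betas = betas[:i] + betas[i + 1:]
        sub_ses = ses[:i] + ses[i + 1:]
        sub_or = math.exp(pooled_beta(sub_betas, sub_ses))
        if abs(sub_or - full_or) / full_or > max_rel_change:
            flagged.append(i)
    return flagged
```

With four equally weighted studies whose betas are 0.1, 0.1, 0.1 and 0.4, only removal of the last study changes the pooled OR by more than 5%, so only that study would be flagged.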
Based on an increasing number of papers (Dudbridge and Gusnanto, 2008; Hoggart et al., 2008; International HapMap Consortium, 2005; Pe’er et al., 2008; Risch and Merikangas, 1996; Wellcome Trust Case Control Consortium, 2007) providing a detailed discussion on the appropriate genome-wide significance threshold, which all arrive at similar values in the range of 5 × 10−7 to 5 × 10−8 for White populations, we decided to use a p-value of 5 × 10−8 as the genome-wide significance threshold. To account for the two-stage approach (GWAS and replication) we calculated that an overall p-value of 5 × 10−8 is equal to a combined two-stage p-value of 1.6 × 10−7 given our sample sizes in the GWAS and replication and a threshold for selecting SNPs from the GWAS of 1.2 × 10−3 as used here.
We used PLINK (Purcell et al., 2007; Purcell, 2011) and R (R Development Core Team, 2011) to conduct the statistical analysis and summarized results graphically using STATA (StataCorp, 2009), snp.plotter (Luna and Nicodemus, 2007), and LocusZOOM (Pruim et al., 2010).
The authors thank Dr. Ian Tomlinson at the Wellcome Trust Centre for Human Genetics, Oxford, UK, Dr. Richard Houlston at the Section of Cancer Genetics, Institute of Cancer Research, Sutton, UK and Dr. Malcolm Dunlop at Colon Cancer Genetics Group, Institute of Genetics and Molecular Medicine, University of Edinburgh and Human Genetics Unit, Medical Research Council, Edinburgh, UK for providing access to GWAS summary statistics of the Colorectal Tumour Gene Identification Consortium (CORGI) and allowing us to use these results to inform the ranking of the SNP selection for the replication.
ARCTIC: This work was supported by the Cancer Risk Evaluation (CaRE) Program grant from the Canadian Cancer Society Research Institute. TJH and BWZ are recipients of Senior Investigator Awards from the Ontario Institute for Cancer Research, through generous support from the Ontario Ministry of Research.
CCFR: This work was supported by the National Cancer Institute, National Institutes of Health under RFA # CA-95-011 and through cooperative agreements with members of the Colon Cancer Family Registry and P.I.s. This genome-wide scan was supported by the National Cancer Institute, National Institutes of Health by U01 CA122839. The content of this manuscript does not necessarily reflect the views or policies of the National Cancer Institute or any of the collaborating centers in the CFRs, nor does mention of trade names, commercial products, or organizations imply endorsement by the US Government or the CFR. The following Colon CFR centers contributed data to this manuscript and were supported by the following sources: Australasian Colorectal Cancer Family Registry (U01 CA097735), Familial Colorectal Neoplasia Collaborative Group (U01 CA074799), Mayo Clinic Cooperative Family Registry for Colon Cancer Studies (U01 CA074800), Ontario Registry for Studies of Familial Colorectal Cancer (U01 CA074783), Seattle Colorectal Cancer Family Registry (U01 CA074794), University of Hawaii Colorectal Cancer Family Registry (U01 CA074806).
DACHS: This work was supported by grants from the German Research Council (Deutsche Forschungsgemeinschaft, BR 1704/6-1, BR 1704/6-3, BR 1704/6-4 and CH 117/1-1), and the German Federal Ministry of Education and Research (01KH0404 and 01ER0814). We thank all participants and cooperating clinicians, and Ute Handte-Daub, Belinda-Su Kaspereit and Ursula Eilber for excellent technical assistance.
DALS: This work was supported by the National Cancer Institute, National Institutes of Health, U.S. Department of Health and Human Services (R01 CA48998 to MLS).
DALS, PLCO and WHI GWAS: Funding for the genome-wide scan of DALS, PLCO, and WHI was provided by the National Cancer Institute, National Institutes of Health, U.S. Department of Health and Human Services (R01 CA059045 to UP). CMH was supported by a training grant from the National Cancer Institute, National Institutes of Health, U.S. Department of Health and Human Services (R25 CA094880).
FRENCH: This work was funded by a regional Hospital Clinical Research Program (PHRC) and supported by the Regional Council of Pays de la Loire, the Groupement des Entreprises Françaises dans la LUtte contre le Cancer (GEFLUC), the Association Anne de Bretagne Génétique and the Ligue Régionale Contre le Cancer (LRCC).
GECCO: Funding for GECCO infrastructure is provided by the National Cancer Institute, National Institutes of Health, U.S. Department of Health and Human Services (U01 CA137088 to UP).
HPFS: This work was supported by the National Institutes of Health (P01 CA 055075 to C.S.F., R01 137178 to A.T.C, and P50 CA 127003 to C.S.F.). We acknowledge Patrice Soule and Hardeep Ranu for genotyping at the Dana-Farber Harvard Cancer Center High Throughput Polymorphism Core under the supervision of David J. Hunter, and Carolyn Guo for programming assistance.
MECC: This work was supported by the National Institutes of Health, U.S. Department of Health and Human Services (R01 CA81488 to SBG and GR).
NFCCR: This work was supported by an Interdisciplinary Health Research Team award from the Canadian Institutes of Health Research (CRT 43821); the National Institutes of Health, U.S. Department of Health and Human Services (U01 CA74783); and National Cancer Institute of Canada grants (18223 and 18226). The authors wish to acknowledge the contribution of Alexandre Belisle and the genotyping team of the McGill University and Génome Québec Innovation Centre, Montréal, Canada, for genotyping the Sequenom panel in the NFCCR samples.
NHS: This work was supported by the National Institutes of Health (P01 CA 087969 to ELG, R01 137178 to ATC, and P50 CA 127003 to CSF). We acknowledge Patrice Soule and Hardeep Ranu for genotyping at the Dana-Farber Harvard Cancer Center High Throughput Polymorphism Core under the supervision of David J. Hunter, and Carolyn Guo for programming assistance.
PHS: We acknowledge Patrice Soule and Hardeep Ranu for genotyping at the Dana-Farber Harvard Cancer Center High Throughput Polymorphism Core under the supervision of David J. Hunter, and Haiyan Zhang for programming assistance.
PLCO: This research was supported by the Intramural Research Program of the Division of Cancer Epidemiology and Genetics and supported by contracts from the Division of Cancer Prevention, National Cancer Institute, National Institutes of Health, U.S. Department of Health and Human Services. The authors thank Drs. Christine Berg and Philip Prorok at the Division of Cancer Prevention at the National Cancer Institute, and investigators and staff from the screening centers of the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial, Mr. Thomas Riley and staff at Information Management Services, Inc., Ms. Barbara O’Brien and staff at Westat, Inc., and Mr. Tim Sheehy and staff at SAIC-Frederick. Most importantly, we acknowledge the study participants for their contributions to making this study possible.
Control samples genotyped as part of the Cancer Genetic Markers of Susceptibility (CGEMS) prostate cancer scan were supported by the Intramural Research Program of the National Cancer Institute. The datasets used in this analysis were accessed with appropriate approval through the dbGaP online resource (http://www.cgems.cancer.gov/data) through dbGaP accession number 000207v.1p1.c1 (2009; Yeager et al., 2007). Control samples were also genotyped as part of the GWAS of Lung Cancer and Smoking. Funding for this work was provided through the National Institutes of Health, Genes, Environment and Health Initiative [NIH GEI] (Z01 CP 010200). The human subjects participating in the GWAS are derived from the Prostate, Lung, Colon and Ovarian Screening Trial and the study is supported by intramural resources of the National Cancer Institute. Assistance with genotype cleaning, as well as with general study coordination, was provided by the Gene Environment Association Studies, GENEVA Coordinating Center (U01 HG004446). Assistance with data cleaning was provided by the National Center for Biotechnology Information. Funding support for genotyping, which was performed at the Johns Hopkins University Center for Inherited Disease Research, was provided by the NIH GEI (U01 HG 004438). The datasets used for the analyses described in this manuscript were obtained from dbGaP at http://www.ncbi.nlm.nih.gov/gap through dbGaP accession number ph000093.v2.p2.c1.
WHI: The WHI program is funded by the National Heart, Lung, and Blood Institute, National Institutes of Health, U.S. Department of Health and Human Services through contracts N01WH22110, 24152, 32100-2, 32105-6, 32108-9, 32111-13, 32115, 32118-32119, 32122, 42107-26, 42129-32, 44221, and 268200764316C.
The authors wish to acknowledge Jacques Rossouw, Shari Ludlam, Joan McGowan, Leslie Ford, and Nancy Geller at the National Heart, Lung, and Blood Institute, Bethesda, Maryland; the following Clinical Coordinating Center investigators: (Fred Hutchinson Cancer Research Center, Seattle, WA) Ross Prentice, Garnet Anderson, Andrea LaCroix, Charles Kooperberg; (Medical Research Labs, Highland Heights, KY) Evan Stein; (University of California at San Francisco, San Francisco, CA) Steven Cummings; and (Wake Forest University School of Medicine, Winston-Salem, NC) Sally Shumaker with the Women’s Health Initiative Memory Study.
In addition, we wish to acknowledge the following Clinical Center investigators: (Albert Einstein College of Medicine, Bronx, NY) Sylvia Wassertheil-Smoller; (Baylor College of Medicine, Houston, TX) Haleh Sangi-Haghpeykar; (Brigham and Women’s Hospital, Harvard Medical School, Boston, MA) JoAnn E. Manson; (Brown University, Providence, RI) Charles B. Eaton; (Emory University, Atlanta, GA) Lawrence S. Phillips; (Fred Hutchinson Cancer Research Center, Seattle, WA) Shirley Beresford; (George Washington University Medical Center, Washington, DC) Lisa Martin; (Los Angeles Biomedical Research Institute at Harbor-UCLA Medical Center, Torrance, CA) Rowan Chlebowski; (Kaiser Permanente Center for Health Research, Portland, OR) Erin LeBlanc; (Kaiser Permanente Division of Research, Oakland, CA) Bette Caan; (Medical College of Wisconsin, Milwaukee, WI) Jane Morley Kotchen; (MedStar Research Institute/Howard University, Washington, DC) Barbara V. Howard; (Northwestern University, Chicago/Evanston, IL) Linda Van Horn; (Rush Medical Center, Chicago, IL) Henry Black; (Stanford Prevention Research Center, Stanford, CA) Marcia L. Stefanick; (State University of New York at Stony Brook, Stony Brook, NY) Dorothy Lane; (The Ohio State University, Columbus, OH) Rebecca Jackson; (University of Alabama at Birmingham, Birmingham, AL) Cora E. Lewis; (University of Arizona, Tucson/Phoenix, AZ) Cynthia A. Thomson; (University at Buffalo, Buffalo, NY) Jean Wactawski-Wende; (University of California at Davis, Sacramento, CA) John Robbins; (University of California at Irvine, CA) F. Allan Hubbell; (University of California at Los Angeles, Los Angeles, CA) Lauren Nathan; (University of California at San Diego, La Jolla/Chula Vista, CA) Robert D. Langer; (University of Cincinnati, Cincinnati, OH) Margery Gass; (University of Florida, Gainesville/Jacksonville, FL) Marian Limacher; (University of Hawaii, Honolulu, HI) J. David Curb; (University of Iowa, Iowa City/Davenport, IA) Robert Wallace; (University of Massachusetts/Fallon Clinic, Worcester, MA) Judith Ockene; (University of Medicine and Dentistry of New Jersey, Newark, NJ) Norman Lasser; (University of Miami, Miami, FL) Mary Jo O’Sullivan; (University of Minnesota, Minneapolis, MN) Karen Margolis; (University of Nevada, Reno, NV) Robert Brunner; (University of North Carolina, Chapel Hill, NC) Gerardo Heiss; (University of Pittsburgh, Pittsburgh, PA) Lewis Kuller; (University of Tennessee Health Science Center, Memphis, TN) Karen C. Johnson; (University of Texas Health Science Center, San Antonio, TX) Robert Brzyski; (University of Wisconsin, Madison, WI) Gloria E. Sarto; (Wake Forest University School of Medicine, Winston-Salem, NC) Mara Vitolins; (Wayne State University School of Medicine/Hutzel Hospital, Detroit, MI) Michael S. Simon.