Genome-wide association studies (GWAS) have led to the identification of a number of common susceptibility loci for colorectal cancer (CRC); however, none of these GWAS have considered gene-environment (GxE) interactions. Therefore, it is unclear whether current hits are modified by environmental exposures or whether there are additional hits whose effects are dependent on environmental exposures.
We conducted a systematic search for GxE interactions using genome-wide data from the Colon Cancer Family Registry that included 1,191 cases of microsatellite stable (MSS) or microsatellite instability (MSI)-low CRC and 999 controls genotyped using either the Illumina Human1M or Human1M-Duo BeadChip. We tested for interactions between genotypes and 14 environmental factors using three methods: a traditional case-control test, a case-only test, and the recently proposed two-step method by Murcray et al. All potentially significant findings were tested for replication in the ARCTIC Study.
No GxE interactions were identified that reached genome-wide significance by any of the three methods. When analyzing previously reported susceptibility loci, seven significant GxE interactions were found at a 5% significance level. We investigated these seven interactions in an independent sample and none of the interactions were replicated.
Identifying GxE interactions will present challenges in a GWAS setting. Our power calculations illustrate the need for larger sample sizes; however, since CRC is a heterogeneous disease, a tradeoff between increasing sample size and heterogeneity needs to be considered.
The results from this first genome-wide analysis of GxE in CRC identify several challenges, which may be addressed by large consortium efforts.
Recently, several genome-wide association studies (GWAS) have led to the identification and replication of a number of susceptibility loci for CRC (1–6). Incorporating environmental exposures into GWAS data may aid in the identification of additional susceptibility alleles that would be otherwise masked by heterogeneity in subgroups, and would also clarify whether certain environmental exposures may modulate risk in susceptible individuals. However, there are limited data on the interaction between other susceptibility alleles and environmental risk factors for CRC. To date, no studies have examined the interaction between a wide range of environmental factors and genome-wide genotype data with respect to cancer risk. Detecting gene-environment (GxE) interactions using a standard case-control test is challenging in a genome-wide context because of the stringent significance level required to adjust for multiple testing and because only weak GxE are expected. The case-only test is known to be more powerful than the case-control test but in the presence of population level gene-environment association it can yield a severely inflated type I error (7). Recently, new methods to test for GxE interactions in GWAS have been proposed. Murcray et al. introduced an efficient two-step approach that is performed independently of any initial scans for main effects (8). The method expands on the traditional test for GxE interaction in a case-control study by incorporating a preliminary screening step constructed to efficiently use all available information. This method has been shown to be more powerful for a wide range of environmental exposures, minor allele frequencies and genetic effects compared to the traditional 1-step test (8). 
In this study, we take advantage of these methodologies to systematically search for GxE interactions within a GWAS of MSS/MSI-L colorectal cancer from the Colon Cancer Family Registry considering lifestyle and environmental exposures known to be involved in the etiology of CRC.
Participants included in this analysis were recruited from three population-based registries based at the Fred Hutchinson Cancer Research Center (FHCRC, Seattle, WA), Cancer Care Ontario (Ontario Familial Colorectal Cancer Family Registry (OFCCR), Toronto, Canada), and the University of Melbourne (Victoria, Australia), which recruited families from both Australia and New Zealand as part of the Colon Cancer Family Registry (Colon CFR) (9).
Cases from these registries met the following eligibility criteria: invasive CRC; self-identified as non-Hispanic White; no identified germline mutations in mismatch repair (MMR) genes; MSS or MSI-L CRC and/or MMR protein immunohistochemistry positive determined using standard methods (10). All cases meeting these criteria and under age 50 or who had an affected first-degree relative with CRC were included, together with a 20% random sample of those over age 50 with no affected first-degree relative.
Population-based controls were randomly sampled from the same catchment areas as the three registries and frequency matched on age, as described recently (9). All controls were self-identified as non-Hispanic White and reported no personal or family history of CRC.
Written, informed consent was obtained from all participants. The study was approved by the Institutional Review Board at each of the institutions.
All participants completed mailed questionnaires (Cancer Care Ontario) or a telephone-based or face-to-face interview (FHCRC, University of Melbourne) at study enrollment. Questions focused on exposures 2 years before the date of diagnosis for cases and 2 years before the date of recruitment for controls. Data were collected on personal and family histories of colorectal cancer, colon polyps and other cancers, and on lifestyle risk factors, including medication use, reproductive history, physical activity, body height and weight, demographics, alcohol intake, tobacco use, diet and supplement use.
Ever-use (yes, no) of selected supplements (multivitamins, folic acid and calcium) and medications (non-steroidal anti-inflammatory drugs, NSAIDs) was defined as use at least 2 times per week for more than a month during a participant’s lifetime. NSAIDs included ibuprofen and aspirin. Since folic acid is contained in nearly all multivitamins, the derived variable for folic acid included use of folic acid supplements and multivitamins. Alcohol use was defined as the consumption of any alcoholic beverage (beer, wine, hard cider, sake, liquor, spirits, mixed drinks or cocktails) at least once a week for 6 months or longer during the most recent decade of life at enrollment. Being an ever-smoker was defined as ever smoking at least one cigarette per day for 3 months or longer. Pack-years of smoking were calculated from the number of cigarettes smoked per day and the number of years smoked. A person was considered to be physically active if they reported more than 20 metabolic equivalent (MET) hours per week of physical activity during the most recent decade of life at enrollment. The number of servings per week of fruits, vegetables and red meat was also calculated. Body mass index (BMI) was calculated as the person’s weight (kg) two years prior to study recruitment divided by adult height (m) squared.
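The derived variables above reduce to simple arithmetic. The sketch below shows one way they could be computed; the function names are illustrative, and the pack-year convention of 20 cigarettes per pack is an assumption, since the text states only that pack-years were based on cigarettes per day and years smoked.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by height (m) squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


def pack_years(cigarettes_per_day: float, years_smoked: float,
               cigarettes_per_pack: int = 20) -> float:
    """Pack-years, assuming the common 20-cigarettes-per-pack convention."""
    return cigarettes_per_day / cigarettes_per_pack * years_smoked


def physically_active(met_hours_per_week: float) -> bool:
    """Active means strictly more than 20 MET-hours per week."""
    return met_hours_per_week > 20
```

The strict inequality in `physically_active` follows the wording "more than 20" MET hours per week.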
All participants provided a blood sample at the time of recruitment. DNA samples were genotyped with the Illumina Human1M (n individuals=1,973; m=1,072,820 SNPs) or Human1M-Duo (n individuals=374; m=1,199,187 SNPs) BeadChip platforms. Samples with GenCall scores <0.15 at any locus were considered ‘no calls’. Each 96-well plate included one inter-plate positive quality control sample (NA06990 - Coriell Cell Repositories). In addition, 27 blinded and 22 un-blinded quality control replicates from the study sample were genotyped. SNP data obtained from both the Coriell and study sample replicates showed a very high concordance rate of called genotypes: 99.95% and >99.94%, respectively (for samples with call rates >90%). The Human1M and Human1M-Duo contain 415 and 436 SNPs, respectively, that were genotyped as part of a candidate gene study on the Illumina GoldenGate platform on a subset of the individuals genotyped in this study (N=444). A high concordance rate (>98%) was observed for >99% of the samples with a call rate >90%.
Individuals were excluded with (Figure 1):
SNPs were excluded from analysis if:
2,190 individuals and 770,098 SNPs were used in the final analysis.
All SNPs with borderline significant associations underwent additional quality control checks including:
We tested significant interactions for replication using data from the ARCTIC Study. Details of the ARCTIC Study are provided elsewhere (5). All eligible cases of colorectal cancer were included irrespective of MSI status. After excluding samples from the Colon CFR that were included in this study and samples not self-identified as white, we were left with a total of 872 cases and 810 controls. The selected SNPs were extracted from a larger set of SNPs genotyped in 2,433 unique samples on a custom 10,640-bead iSelect array from Illumina designed for 7,703 SNPs. The call rate for this panel was 99.96% after excluding three failed DNAs (call rate < 69%) and 332 failed SNPs (4.3%). There were no discordant genotypes in 23 pairs of duplicates. Data collection on environmental risk factors and variable definitions were performed in the same manner as described above. All subjects provided written informed consent. This study was approved by the ethics review boards of the Toronto Academic Health Sciences Council.
We considered three approaches for genome-wide GxE testing. In the first two approaches, we exhaustively tested every SNP in the GWAS panel for GxE interaction with each of the environmental exposures using either a case-control test or a case-only test. The case-control test is based on the logistic regression model:
$$\operatorname{logit}\,\Pr(Y=1 \mid G, E, Z) = \beta_0 + \beta_g G + \beta_e E + \beta_{ge}\,(G \times E) + \beta_z Z \qquad (1)$$
where Y indicates disease status, with Y=1 for cases and Y=0 for controls, E is an environmental exposure, G is the genotype at a particular SNP, and Z represents any additional covariates to be adjusted for such as sex, age (continuous) and center.
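To make the role of the interaction coefficient concrete, the following minimal sketch evaluates a logistic model with a G×E product term and shows that the per-allele odds ratio among the exposed differs from that among the unexposed by a factor of exp(βge). Covariates Z are omitted for brevity, and the function names and all coefficient values are hypothetical, chosen only for illustration.

```python
import math


def logistic_risk(g, e, b0, bg, be, bge):
    """Pr(Y=1 | G=g, E=e) under a logistic model with a G x E product term."""
    eta = b0 + bg * g + be * e + bge * g * e
    return 1.0 / (1.0 + math.exp(-eta))


def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


def per_allele_odds_ratio(e, bg, bge):
    """Odds ratio per additional allele copy at exposure level e."""
    return math.exp(bg + bge * e)


# Hypothetical coefficients: no SNP main effect, interaction odds ratio of 2.
b0, bg, be, bge = -2.0, 0.0, 0.3, math.log(2.0)
or_unexposed = per_allele_odds_ratio(0, bg, bge)
or_exposed = per_allele_odds_ratio(1, bg, bge)
interaction_or = or_exposed / or_unexposed
```

With no SNP main effect, the allele has no effect among the unexposed, so a scan based on main effects alone has reduced power to find such a SNP.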
For a binary exposure, the case-only test is based on logistic regression model:
For a quantitative exposure, the case-only test is based on logistic regression model:
We used an additive coding for the genotypes for both the case-control and the case-only tests, i.e. G indicates the number of copies of the reference allele (G=0,1,2). The additive model is known to have good power for a very wide range of true modes of action (recessive, dominant, multiplicative) (12). For the case-control test, the hypothesis of no SNP×E interaction corresponds to H0: βge=0 in model (1). For the case-only test, the hypothesis of no SNP×E interaction corresponds to H0: β=0 in the case-only model for a binary or a quantitative exposure, as appropriate. We tested every SNP that passed quality control and was polymorphic in our sample using the case-control and case-only tests. We refer to these scans for SNP×E interactions as exhaustive case-control or exhaustive case-only, respectively, in contrast with the ‘2-step’ scan described below. To control for multiple testing we used a simple Bonferroni correction for the number of SNPs that were actually tested (0.05/770,098 = 6.5 × 10⁻⁸). Since each exposure was considered an independent a priori hypothesis, we corrected across SNPs for each exposure, but not across exposures. For continuous exposures, we also report the stratified estimates of risk by dichotomizing at the median unless otherwise reported.
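The Bonferroni threshold quoted above can be reproduced directly. This is a small sketch with illustrative names; it simply divides the overall significance level by the number of SNPs tested for a given exposure.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the family-wise error at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# 770,098 SNPs were tested per exposure at an overall level of 0.05.
GENOME_WIDE_LEVEL = bonferroni_threshold(0.05, 770_098)


def is_genome_wide_significant(p_value: float) -> bool:
    """True if a single-SNP interaction p-value passes the corrected level."""
    return p_value < GENOME_WIDE_LEVEL
```

Because the correction is applied per exposure and not across exposures, the same threshold is reused for each of the environmental factors.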
The third method for genome-wide GxE testing we considered was the approach of Murcray et al. (8). This 2-step method consists of a screening first step followed by a formal test of interaction. Specifically, in the first step a test of association between the exposure E and SNP G is performed on the combined sample of cases and controls based on the logistic regression model:
$$\operatorname{logit}\,\Pr(E=1 \mid G) = \gamma_0 + \gamma G$$
for binary exposures, or the linear regression model:
$$E = \gamma_0 + \gamma G + \varepsilon$$
for continuous exposures.
The hypothesis H0: γ=0 is tested for each SNP at significance level α1=0.001 using a chi-square 1df Wald test. The m SNPs achieving α1 significance (i.e. with p-value <α1) pass the screening step and are tested for GxE interaction using model (1). To preserve a genome-wide type I error of α=0.05 for the overall two-step procedure, Murcray et al. (8) showed that it suffices to correct in step 2 for the m SNPs that pass the screening, i.e. to test at significance level α/m. For details of the rationale behind the screening step and the validity of the 2-step approach see Murcray et al. (8). All the genome-wide GxE analyses were carried out with the software PLINK (13).
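The control flow of the two-step procedure can be sketched as follows. The sketch assumes the step-1 (G-E association) and step-2 (GxE interaction) p-values have already been computed for each SNP; the function name, the SNP identifiers and the p-values are all hypothetical and serve only to illustrate the screening and the α/m correction.

```python
def two_step_scan(step1_p, step2_p, alpha1=0.001, alpha=0.05):
    """Screen SNPs on the G-E association p-value, then test GxE at alpha/m.

    step1_p, step2_p: dicts mapping SNP id -> p-value.
    Returns (significant SNP ids, step-2 threshold); threshold is None if
    no SNP passes the screen.
    """
    passed = [snp for snp, p in step1_p.items() if p < alpha1]
    m = len(passed)
    if m == 0:
        return [], None
    threshold = alpha / m  # correct only for SNPs carried into step 2
    hits = [snp for snp in passed if step2_p[snp] < threshold]
    return hits, threshold


# Toy example with made-up p-values.
step1 = {"snp_a": 0.0004, "snp_b": 0.2, "snp_c": 0.0009, "snp_d": 0.001}
step2 = {"snp_a": 0.01, "snp_b": 1e-9, "snp_c": 0.03, "snp_d": 1e-9}
hits, threshold = two_step_scan(step1, step2)
```

In this toy input, `snp_b` and `snp_d` have very small interaction p-values but are never tested in step 2 because they fail the screen, which is the price paid for the much milder step-2 correction.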
In addition to the genome-wide GxE analyses using the exhaustive case-control, exhaustive case-only, and 2-step methods described above, we performed focused testing of previously reported and replicated genetic variants associated with CRC from five published GWAS (1–5) and a meta-analysis (6) for GxE interaction with each of the exposures of interest: 8q23.3 (rs16892766, EIF3H) (2); 8q24 (rs6983267, rs7014346) (1, 4, 5, 14–16); 10p14 (rs10795668) (2); 11q23 (rs3802842) (1); 14q22.2 (rs4444235, BMP4) (6); 15q13 (rs4779584) (17); 16q22.1 (rs9929218, CDH1) (6); 18q21 (rs4939827, SMAD7) (1, 3); 19q13.1 (rs10411210, RHPN2) (6); and 20p12.3 (rs961253) (6). 8q24-rs1050477 and 9p24-rs719725 (5, 14) were not available on the Illumina Human1M or Human1M-Duo. We considered 9p24-rs7025295 and 9p24-rs7857628 as surrogates for the missing 9p24-rs719725 (r²=0.965 and r²=0.966, respectively, using HapMap2_r24 CEU). We tested each variant using the case-control test based on model (1) and the corresponding case-only test, each at the 5% significance level. In addition to testing individual SNPs, we tested a score that combines information from all 13 SNPs into a single variable for interaction with the exposures of interest. For each subject, the score was constructed by counting the number of CRC risk-increasing variants across the 13 SNPs (i.e. the score ranges from 0 to 26). We tested the interaction of the score and the exposures using the standard logistic regression model (1), with G now representing the quantitative score.
Only interactions that were identified as significant in this GWA study were tested in the ARCTIC Study, using the 1-step test on model (1) at the 5% significance level. All models were adjusted for age, center and sex unless otherwise specified.
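The score described above is a simple count of risk-increasing alleles. The sketch below is one possible implementation under stated assumptions: the helper names are illustrative, genotypes are assumed to be complete (the text does not say how missing genotypes were handled), and a genotype coded as copies of a non-risk allele is flipped with 2 − G.

```python
def to_risk_allele_count(coded_count: int, coded_allele_is_risk: bool) -> int:
    """Convert a 0/1/2 count of the coded allele to a count of the risk allele."""
    if coded_count not in (0, 1, 2):
        raise ValueError("genotype must be coded 0, 1 or 2")
    return coded_count if coded_allele_is_risk else 2 - coded_count


def risk_allele_score(risk_allele_counts) -> int:
    """Sum of risk-allele counts across SNPs; 13 SNPs give a range of 0 to 26."""
    counts = list(risk_allele_counts)
    if any(c not in (0, 1, 2) for c in counts):
        raise ValueError("each genotype must be coded 0, 1 or 2")
    return sum(counts)


# Hypothetical subject: 13 genotypes already oriented to the risk allele.
subject = [0, 1, 2, 1, 0, 0, 1, 2, 1, 1, 0, 2, 1]
score = risk_allele_score(subject)
```

Treating the score as a quantitative G gives each risk allele equal weight, which is a modelling choice implied by the counting definition.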
This study included 1,191 population-based cases of MSS/MSI-Low CRC and 999 unrelated population-based controls. Table 1 shows the distribution of selected characteristics for the study population. After adjustment for age, sex and study center, we found BMI, smoking and red meat intake were positively associated with risk of CRC. Ever-use of folic acid and multivitamins were associated with an increased risk of disease in unadjusted models only. Alcohol-use, NSAID-use and calcium-use were associated with statistically significant decreased risk of CRC. Among women, ever-use of post-menopausal hormones or oral contraceptives were associated with statistically significant decreased disease risks. Servings of fruits and vegetables, physical activity and height were not associated with risk of CRC.
Using the exhaustive case-control and case-only tests, we observed no statistically significant interaction between any SNP and any environmental exposure at the genome-wide significance level of 6.5 × 10⁻⁸. The lowest interaction p-values were between oral contraceptive use and rs17329226 (p=7.0 × 10⁻⁷), and between ever smoking and rs2486540 (p=3.1 × 10⁻⁷), rs2486538 (p=3.7 × 10⁻⁷) and rs538835 (p=5.3 × 10⁻⁷).
Using the two-step method, between 662 and 1,004 SNPs (depending on the exposure variable) passed the significance level in the screening step and were carried on to the second step. Therefore, the corrected significance threshold for the second step varied by exposure, depending on the total number of SNPs carried forward, from 5.0 × 10⁻⁵ to 7.6 × 10⁻⁵. We identified no significant GxE interactions; no interaction p-value was below 10⁻⁴.
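As a check on the arithmetic, with α = 0.05 and m SNPs carried into the second step, the step-2 threshold is α/m, so the two bounds correspond to the largest and smallest values of m. The last quantity, the number of SNPs expected to pass a screen at α1 = 0.001 if no SNP were associated with the exposure, is a back-of-envelope comparison added here and not a figure reported by the study:

$$\frac{0.05}{1004} \approx 4.98 \times 10^{-5}, \qquad \frac{0.05}{662} \approx 7.55 \times 10^{-5}, \qquad 770{,}098 \times 0.001 \approx 770$$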
Table 2 lists the known hits for CRC identified through published GWA studies (1–6). We tested whether any of these SNPs showed a significant interaction with the selected environmental exposures. We identified the following interactions (using a case-control test): rs3802842 and post-menopausal hormones (p=0.01); rs10795668 and oral contraceptive use (p=0.04); rs961253 and oral contraceptive use (p=0.01); rs9929218 and height (p=0.02); rs9929218 and alcohol use (p=0.04); rs4939827 and servings of vegetable intake (p=0.01); and rs9929218 and calcium use (p=0.045). In our replication sample, none of the interactions with the individual SNPs were significant at the 5% level. When we tested the interaction of the environmental covariates and the score that combines the previously reported GWAS hits, we found only two marginally significant interactions, with red meat consumption (p=0.01) and calcium (p=0.05). We did not attempt replication of these interactions.
In this genome-wide association study of early-onset MSS/MSI-Low CRC, we identified no selected personal or lifestyle characteristic that significantly modified the effect of genetic variants on the risk of CRC, either at the strict genome-wide significance level of 6.5 × 10⁻⁸ using an exhaustive case-control or case-only test, or at the appropriate significance levels for the two-step method of Murcray et al. (8). We identified seven significant interactions with previously identified hits from published GWAS in CRC. Interestingly, one of the interactions was between rs3802842 and post-menopausal hormone use; rs3802842 has been previously reported to be associated with an increased risk of CRC among females with Lynch syndrome (18). However, none of these seven interactions were statistically significant at the 5% level in an independent replication sample.
Little of the genetic variation in CRC has been explained and it is likely that many more variants remain to be identified. One potential way to identify additional susceptibility alleles is to search for GxE interactions, and thereby identify genetic variants that may have an effect only in a given subgroup of individuals, identified by a common environmental risk factor or molecular profile. We applied an efficient two-step approach described by Murcray et al. for detecting loci involved in GxE interactions. It is performed independently of any initial scans for main effects and incorporates a preliminary screening step constructed to efficiently use all available information (8). Other methods have been proposed, such as a 2-df test for assessing genetic main effects and interactions jointly (19) and approaches designed to combine the case-control and case-only analyses (20, 21), but there has been no formal comparison of these methods.
Achieving sufficient statistical power is challenging in a genome-wide context, even with these recently described methodologies. Our power calculations highlight this point, especially where the expected gene, exposure and interaction effects are modest. Figure 2 shows the sample size required to attain 80% power with the two-step approach for various combinations of minor allele frequencies, exposure prevalences, and interaction odds ratios. In this context, it was assumed that there were no SNP main effects, corresponding to the scenario where a GxE scan could detect a SNP that a standard GWAS based on SNP main effects would not. We found that a typical GWAS of 1,000 cases and 1,000 controls could detect only interaction odds ratios of 2 or higher, and only for highly prevalent exposures and high allele frequencies. There are likely to be many GxE interactions, but our study is underpowered to detect them. International consortia gathering GWAS data in CRC may aid in this effort if environmental covariates are available and there is potential for harmonization of variable definitions. However, even this increased sample size will not suffice to detect interaction odds ratios below 1.4, especially for less frequent exposures and lower allele frequencies.
We also investigated whether any of the recently reported and robustly replicated susceptibility loci identified through GWA studies of CRC were modulated by selected environmental factors. We considered only replicated susceptibility variants from published GWA studies of independent CRC cases and unaffected controls (1–6). We identified a few significant interactions at the <5% level, but none of these were significant in an independent case-control study of CRC that had collected epidemiologic data using the same questionnaires from individuals in one of the same geographical regions. One potential reason for our failure to replicate could be that we were unable to restrict our replication sample to only cases with early-onset MSS or MSI-low cancers. Common environmental exposures, such as alcohol intake, cigarette smoking, and obesity, have been reported to differ by MSI strata (22, 23). Furthermore, for four known susceptibility alleles we found no association with colorectal cancer in the Colon CFR and in the absence of a main effect the prospects of identifying a GxE interaction may be lower.
There are some limitations to this study. The main concern is the limited statistical power to investigate GxE interactions for less common exposures and less frequent alleles. Collaborative consortia offer important advantages of increasing sample size; however, they also have important limitations, including the potential introduction of heterogeneity due to combining different study designs, measures of exposures, and cancer outcome. Consortia with central quality control procedures and careful standardization and harmonization of definitions and measurements may be helpful. However, large sample size alone does not guarantee quality and reliable results (24). In this study, we had uniform data collection protocols and all cases were defined in a standard manner as MSS or MSI-Low. Another potential limitation is our relatively crude definitions of the environmental factors. Furthermore, because of the study design, we were unable to investigate the potential effects of ethnicity, family history of CRC, or other phenotypes of CRC (i.e., MSI-high). Lastly, there is no consensus about the correct statistical method to model gene-environment interactions and more research is required.
In summary, we identified no genome-wide significant GxE interactions in this genome-wide association study of early-onset MSS/MSI-low CRC. Much of the evidence from descriptive epidemiology, migrant studies, and changes in CRC rates in countries undergoing rapid economic development (most obviously Japan in the second half of the twentieth century; Japan now has the highest rates of CRC in the world) points to environmental risk factors as the major determinants of the international variation in CRC. It is crucial therefore that we gain a better understanding of susceptibility to these environmental factors. This, in turn, underscores the need to detect GxE interactions, which will require large collaborations of GWA studies with adequate data collection on exposures.
This work was supported by the National Cancer Institute, National Institutes of Health under RFA # CA-95-011 and through cooperative agreements with members of the Colon Cancer Family Registry and P.I.s. The content of this manuscript does not necessarily reflect the views or policies of the National Cancer Institute or any of the collaborating centers in the CFRs, nor does mention of trade names, commercial products, or organizations imply endorsement by the US Government or the CFR.
Funding: This work was supported by the National Cancer Institute, National Institutes of Health under RFA # CA-95-011 and through cooperative agreements with the Australasian Colorectal Cancer Family Registry (U01 CA097735), the Ontario Registry for Studies of Familial Colorectal Cancer (U01 CA074783), and the Seattle Colorectal Cancer Family Registry (U01 CA074794), as well as NIH/NCI U01CA122839 GWAS (Casey) and the Canadian Cancer Society Research Institute, the Ontario Institute for Cancer Research, and the Ontario Ministry of Research and Innovation.