|Home | About | Journals | Submit | Contact Us | Français|
Recent genome-wide association studies have identified independent susceptibility loci for prostate cancer (CaP) that could influence risk through interaction with other, possibly undetected, susceptibility loci. We explored evidence of interaction between pairs of 13 known susceptibility loci and single nucleotide polymorphisms (SNPs) across the genome to generate hypotheses about the functionality of CaP susceptibility regions. We used data from Cancer Genetic Markers of Susceptibility: Stage I included 523,841 SNPs in 1175 cases and 1100 controls; Stage II included 27,383 SNPs in an additional 3941 cases and 3964 controls. Power calculations assessed the magnitude of interactions our study is likely to detect. Logistic regression was used with alternative methods that exploit constraints of gene-gene independence between unlinked loci to increase power. Our empirical evaluation demonstrated that an empirical Bayes (EB) technique is powerful and robust to possible violation of the independence assumption. Our EB analysis identified several noteworthy interacting SNP pairs, although none reached genome-wide significance. We highlight a Stage II interaction between the major CaP susceptibility locus in the subregion of 8q24 that contains POU5F1B and an intronic SNP in the transcription factor EPAS1, which has potentially important functional implications for 8q24. Another noteworthy result involves interaction of a known CaP susceptibility marker near the prostate protease genes KLK2 and KLK3 with an intronic SNP in PRXX2. Overall, the interactions we have identified merit follow-up study, particularly the EPAS1 interaction which has implications not only in CaP but also in other epithelial cancers that are associated with the 8q24 locus.
Recent genome-wide association studies (GWAS) have identified multiple single nucleotide polymorphisms (SNPs) associated with risk of prostate cancer (CaP) (1–8). Functional variants in linkage disequilibrium (LD) with the established markers may directly contribute to CaP risk through interactions with other, yet undetected, susceptibility loci. Large datasets are required for exploration of interactions between known risk alleles and the remainder of the genome. Such large scale studies could give rise to the discovery of novel susceptibility regions and a better understanding of the biology of CaP. In this report, we present the first results from a study of gene-gene interactions in the etiology of CaP using data from Stages I and II of the Cancer Genetics Markers of Susceptibility (CGEMS) Initiative.
To identify SNPs that may interact with established susceptibility regions to affect CaP risk, we conducted a series of conditional genome scans. The susceptibility (conditioning) regions included nine gene regions and four independent regions within the chromosomal region 8q24. All have demonstrated strong associations with CaP in recent GWAS, including but not limited to CGEMS (Table 1).
The region of 8q24 warrants close attention because it contains independent risk markers for prostate and additional cancers (Figure 1). Despite its strong associations with multiple cancers, the physiologic function of 8q24 remains an area of investigation. Molecular studies have focused on Region 3 of 8q24 which is associated with several epithelial cancers, including prostate and colorectal. The most significant marker in that region, rs6983267, is part of a consensus binding sequence for TCF (9), a family of transcription factors that are nuclear targets of WNT signaling. The risk allele has been shown to participate in long-range regulation of the WNT-targeted oncogene MYC that is telomeric to the regions of 8q24 associated with multiple cancers (10–12). To a lesser degree than MYC, POU5F1B (also called POU5F1P1) has drawn attention as a plausible candidate gene to explain the underlying biology of the association signals (13–16). It is the only confirmed gene within the extended 8q24 region, located specifically in Region 3. Until recently POU5F1B was classified as a highly homologous (15) pseudogene of POU5F1 (also called OCT4 or OCT3), which is a central gene in the regulation of stem cell pluripotency (17) (Figure 2a). Recent reports on POU5F1B have demonstrated that it produces a protein with similar function to POU5F1(14) and that it is over-expressed in CaP (13).
We applied three methods that are available for exploring gene-gene interactions in case-control studies. Traditionally, logistic regression has been the most popular method for analysis of case-control data. In recent years, a number of reports have noted that the power for exploring gene-gene interactions from case-control studies can be greatly enhanced by alternative methods that exploit the assumption of gene-gene independence between distant loci (18, 19). These methods, however, can be very sensitive to violation of the underlying gene-gene independence assumption (20). Our analysis provides the first empirical assessment of the performance of these alternative methods for large scale exploration of gene-gene interactions in a GWAS setting.
The details of CGEMS Stage I have been published previously (6). Briefly, the subjects available for this study included 1175 cases and 1100 controls of European ancestry from the Prostate, Lung, Colon and Ovarian Screening Trial. They were genotyped using two Illumina chips (HumanHap 300 and HumanHap240) that constituted 523,841 autosomal SNPs.
The details of CGEMS Stage II have been published previously (5). Briefly, the subjects included an additional 3941 cases and 3964 controls of European ancestry. They represent five studies (case/control): Alpha-Tocopherol, Beta-Carotene Cancer Prevention Study in Finland (929/921), Health Professionals Follow-up Study in America (596/611), American Cancer Society Cancer Prevention Study II Nutrition Cohort (1760/1775), and CeRePP French Prostate Case-Control Study (656/657). Subjects were genotyped on 27,383 autosomal SNPs that showed evidence of association in single-SNP Stage I analyses (p<0.05).
We performed a series of conditional genome scans, using data from CGEMS Stage I or Stage II. We chose to study the 13 regions that have already been shown to be strongly associated with risk of CaP in CGEMS or other recent GWAS (Table 1). For each region, except EEFSEC we used the SNP from the original publication. For EEFSEC, we chose the most significant SNP within the gene in the combined CGEMS Stage I and II analysis.
In each scan, we “conditioned” on the most notable SNP from each of 13 well-established susceptibility regions and tested for interaction of the conditioning SNP with all the remaining “scan” SNPs across the genome. Tests for interaction for each scan SNP used a logistic regression model that included the main effect of the conditioning SNP, the main effect of a scan SNP, an interaction term that captured any non-multiplicative effect between the two markers, and adjusting covariates for study and DNA extraction. We assumed alleles within a locus affect risk of the disease in an additive fashion (on the logistic scale) and, thus, coded genotype data for each SNP as a continuous variable and the interaction as the product of the allele counts. In addition to performing standard tests for interactions, we also conducted “omnibus” (21) tests that simultaneously evaluated the significance of the main- and interaction- effect of the scan SNP.
For statistical inference under the logistic model, we used three alternative methods: (a) Unconstrained maximum-likelihood (UML), (b) Constrained maximum-likelihood (CML) and (c) Empirical Bayes (EB). The UML method corresponds to the traditional prospective logistic regression analysis that obtains maximum-likelihood estimates of the odds ratio parameters with no constraint on the joint distribution of scan and conditioning SNPs in the underlying population. The CML method obtains the maximum-likelihood estimates of the same odds ratio coefficients assuming gene-gene independence in the underlying population between the scan and conditioning SNPs (22). While the advantage of the UML method is that its validity does not require any assumption of joint genotype distribution in the underlying population, the CML and analogous methods are known to be much more powerful when the underlying assumptions of independence are valid (19, 23). For the CML analysis, we excluded scan SNPs within 500Kb of the conditioning SNP to minimize gene-gene dependence due to physical proximity. For assessment and estimation of odds ratio interactions, the CML method essentially produces inferences very similar to the popular case-only method (18, 19). Unlike the case-only approach, however, the CML method yields both interaction coefficients and main effects, which are needed for performing omnibus tests as well as for contextual interpretation of interaction.
The third approach we considered involved the recently proposed EB method which exploits the gene-gene or gene-environment independence in a more data-adaptive fashion so that bias can be avoided when the assumptions of independence are violated in the underlying population (24). The method obtains parameter estimates by taking weighted averages of those from the UML and CML methods where the weights depend on two key quantities: (a) the bias of the CML method which could be estimated by the difference of the estimates it produces from those obtained from the UML method and (b) the variance of the parameter estimates from the UML method. As the magnitude of the bias of the CML method increases, the EB method puts more weight on the UML method. Previous simulation studies have suggested that the EB method can strike a good balance between bias and efficiency in large scale studies where the assumptions of independence are likely to be satisfied for most, but not all, combinations of gene-gene and gene-environment factors under study (25). The parameter estimates and standard errors for logistic regression coefficients were obtained in the R package called Case Control. Genetics (26) that implements all three methods.
For each conditional region, we ran two separate scans, one for all 523,841 Stage I SNPs and the other for the subset of 27,383 SNPs that were available in both Stages I and II. We performed the scan for Stage II SNPs using only the Stage II data to make the analysis independent of the selection effect from Stage I. For each scan, we considered Bonferroni adjustment for the appropriate number of SNPs to declare genome-wide significance at p-value<0.05 level. To give higher priority for potential “cis” effects, we separately examined the significance of interaction for scan SNPs within 500Kb of the conditioning gene(s) by adjusting for multiple testing only within that region, using the EB method.
As we aimed to generate hypotheses about the functionality of susceptibility regions in the etiology of CaP, we conducted literature reviews and bioinformatics searches to investigate potential biological mechanisms that could underlie the interactions we observed. For SNPs in or near gene regions, we considered whether those genes were known or thought to participate in processes that relate to cancer or to the function of the relevant conditioning region. This work can help prioritize interacting SNP pairs for follow-up study. It guided our follow-up analysis on the top interaction result for 8q24 Region 3. Using Stage II data, we jointly modeled the top SNPs in ESRRB and SALL4 for interaction with 8q24 Region 3 in a single logistic regression that we analyzed using UML. We focused on the transcription factors ESRRB and SALL4 because, like the top-ranking EPAS1 (also known as HIF-2A), they are positive regulators of POU5F1 (27, 28) (Figure 2). They differ from EPAS1 in that they do not require hypoxic stimuli and that they are also target genes of POU5F1 (29). We examined the top SNP from each gene based on Stage I main effects results (ESRRB p-value=0.001, SALL4 p-value=0.28). Both SNPs are located in an intron of their respective gene. rs7155416 in ESRRB on chromosome 14q24 has a minor allele frequency (MAF) of 0.12 and rs6021460 in SALL4 on chromosome 20q13 has a MAF of 0.44.
Inspection of the quantile-quantile plots for omnibus and interaction tests for all conditional genome scans suggest that, in general, none of the methods was affected by large scale systematic bias or overdispersion. However, in some instances, the CML method showed more statistically significant associations than would be expected by chance (see, for example, Figure 3) even when we excluded all scan SNPs on the same chromosome as the conditioning SNP. In contrast, the UML and data adaptive EB methods did not show any evidence of such excess significance. A possible explanation for this phenomenon is population stratification that could cause long range dependence, violating the underlying assumption of gene-gene independence of the CML method (30). In the subsequent sections, we report the main findings of our conditional scans based on the EB method that is known to be more powerful than standard logistic regression analysis and yet, unlike CML type methods, is resistant to bias due to violation of the gene-gene independence assumption due to population stratification or otherwise.
In the Stage I analysis, one SNP, rs2002865, reached genome-wide significance for interaction after Bonferroni correction (p-value=9.14e-10) in the conditional scan of 8q24 Region 4. A second SNP, rs4960563, in strong LD with rs2002865 (r2=0.68), was also highly significant for interaction with the same 8q24 conditioning region (p-value=3.79e-6). Both interactions, however, failed to replicate (p-value>0.05), when the SNP was followed-up with further genotyping in a subset of the Stage II sample. Within the extended JAZF1 conditioning region, one SNP met “region-wide” significance for potential cis-interaction: rs4857841 on chromosome 7p15. It also failed to replicate in follow-up.
In the Stage II conditional scans, no SNP reached genome-wide significance for interaction. A list of top-ranking SNPs (p-value < 1.0e-4) from each conditional scan is shown in Table 2. The most notable finding considering biologic plausibility was an interaction between rs6983267 in 8q24 Region 3 which contains POU5F1B and rs4953347 in the first intron of EPAS1 which upregulates POU5F1 (p-value=9.69e-5, multiplicative odds ratio=1.13). In a follow-up analysis we detected a significant interaction between 8q24 Region 3 and both ESRRB and SALL4 (Table 3), which have functional similarities to EPAS1 (Figure 2). Also of note is the top-ranking SNP for interaction with the KLK2-KLK3 conditioning region: rs1558874 intronic to PRRX2 (p-value=4.80e-5, multiplicative odds ratio=1.33). Those SNPs demonstrated evidence of an interaction in the same direction and of similar magnitude in the independent CGEMS Stage I data (p-value=0.047, multiplicative odds ratio=1.27).
We highlight two region-wide significant results from our investigation of potential “cis” interactions. The first involves rs4314620 in the extended 8q24 region. It showed significant interaction in the conditional scans for Region 2 (p-value=0.003), Region 3 (p-value=0.0004) and Region 4 (p-value=0.02) (Figure 1). An omnibus test for rs4314620 that includes its main effect and interaction with each of the 8q24 conditioning SNPs was highly significant (p-value=2.48e-5). The second result involves rs17714461 for the KLK2-KLK3 region (p-value=7.14e-4). That SNP resides in chromosome 19q13, located ~15Kb from KLK4. It is ~60Kb from the conditioning SNP with which it does not demonstrate LD (r2<0.001).
Our study presents one of the first large scale explorations of gene-gene interactions in the setting of a multi-stage GWAS. Our analysis identifies a list (Table 2) of pair-wise SNP interactions that, through follow-up study, may elucidate the functional relevance of CaP susceptibility SNPs. Our analysis also provides insights into future methodological challenges that large scale studies will face in establishing conclusive interaction with either the primary or a secondary trait. It presents an empirical evaluation of modern methods for interaction analyses using case-control data.
The results we report for the conditional scans were obtained using a recently proposed EB method. We focused on that method because previous simulations demonstrated it is more powerful than standard logistic regression (25) and our empirical evaluation suggests it is more robust than case-only type methods. Our observation that the CML method can suffer bias even when scan and conditioning SNPs are on different chromosomes is particularly cautionary. The robustness of the EB method is expected to be similarly beneficial in studies of gene-environment interactions for which it can be difficult to assess an independence assumption.
Perhaps the most noteworthy result of our study is the top SNP for interaction with 8q24 Region 3 in Stage II: rs4953347 which is intronic to EPAS1. That gene belongs to a family of hypoxia-inducible factors that promote key carcinogenic processes such as angiogensis and metastasis (31). Under hypoxic conditions, which are common in malignant tumors, EPAS1 directly binds and activates POU5F1 (32, 33). By activating POU5F1, EPAS1 has been shown to promote tumorigenesis (34). Both EPAS1 and POU5F1B are over-expressed in CaP, but POU5F1 is not expressed in either healthy or malignant prostate tissue (13, 31). Given these data, we propose that POU5F1B mediates the observed EPAS1-8q24 Region 3 interaction. Our hypothesis involves an assumption that EPAS1 participates in the regulation of POU5F1B, which is currently poorly understood.
Kastler et. al suggested the over-expression of POU5F1B in CaP may mimic the ectopic expression of POU5F1 (13), which has been shown to promote epithelial tumors (35). That hypothesis aligns well with reports that CaP progression involves the reactivation of embryonic pathways (36) because POU5F1 is central to the regulation of stem cell pluripotency (17) and its encoded transcription factor is functionally similar to the protein of POU5F1B (37). Specifically, multipotential progenitor cells are thought to be seeds of tumorigenesis in CaP (37) and ectopic POU5F1 expression is thought to promote epithelial tumorigenesis by inhibiting the differentiation of progenitor cells (35). Given these data we cautiously hypothesize that the CaP association of 8q24 Region 3 involves a type of pluripotency network centered on POU5F1B rather than on POU5F1. Our follow-up analysis of 8q24 Region 3 with ESRRB and SALL4 offers preliminary support of the hypothesis because those genes, which function as both regulators and targets of POU5F1, demonstrate a significant interaction with 8q24 Region 3.
The preceding results should be interpreted cautiously. Future effort is needed to replicate the finding of statistical interaction for EPAS1 and 8q24 Region 3 in independent studies. To obtain conclusive evidence, the sample size for those studies needs to be large due to modest magnitude of the anticipated interaction. Even if our findings replicate, it cannot provide direct evidence for the proposed model underlying the interaction. Additional functional studies would be needed. We believe these future studies should consider not only CaP but rather all epithelial cancers associated with 8q24 Region 3 because, in subsets of those cancers, EPAS1 is over-expressed (31), mRNA transcripts of POU5F1B have been detected (15), and embryonic pathways are implicated (38). Notably, all those features characterize colon cancer.
The Stage II analysis produced other notable results, including two for the conditioning region KLK2-KLK3. The region-wide significant result for rs17714461 is noteworthy because its nearby gene, KLK4, has been shown to stimulate cellular proliferation in CaP in conjunction with KLK2 (39), and additional reports have linked KLK4 to various aspects of CaP progression, including mesenchymal transition, invasion and metastasis (40–42). The top SNP for interaction in the KLK2-KLK3 conditional scan was rs1558875 an intronic SNP to PRRX2 that is also associated with cellular proliferation (43). This interaction replicated in our relatively small, independent CGEMS Stage I analysis. These preliminary results suggest that the KLK2-KLK3 susceptibility region contributes to CaP risk through interaction with genes involved in cellular proliferation. Functional follow-up studies and additional replication efforts are warranted.
A second notable finding from our investigation of potential “cis” interactions involved rs4314620 for the extended 8q24 region. Its pair-wise interactions with three independent known risk alleles for CaP within the 8q24 region suggest different 8q24 susceptibility loci may be related by some common underlying biologic mechanism. It is hard to speculate what that mechanism may be because it is a gene-poor region, but one possible explanation for the observed interactions is long-range gene regulation. rs4314620 resides in a sub-region of 8q24 that is associated with bladder cancer (Figure 1) (44) and contains two regulatory regions for the oncogene MYC that flanks 8q24 Region 4 (45).
In our analysis of CGEMS Stage I data, one SNP-pair exceeded genome-wide significance for interaction in the conditional scan for 8q24 Region 4. The result was unlikely to be due to genotyping error as a second SNP in strong LD with the original signal also showed strong significance. Yet, the interaction failed to replicate when we genotyped the SNPs in an additional 2439 cases and 2241 controls in CGEMS Stage II. This example illustrates the challenge of employing rank p-values for prioritization of interaction as well as of establishing the threshold needed for a conclusive finding. We note that Stage I of CGEMS, which included 1175 cases and 1100 controls, was underpowered for study of interactions (Figure 4). In the future, one way of reducing such false positives would be to consider Bayesian methods (46, 47) that can incorporate both power and biological plausibility into measures of statistical significance. Another strategy to gain power, particularly for initial GWAS stages, is meta-analysis of multiple studies, enabling one to increase sample size while retaining the full array of SNPs.
Due to the scarcity of highly significant findings, we carefully examined the power of CGEMS Stages I and II to detect interactions at genome-wide significance levels (alpha = 1.0e-7 and 1.85e-6 for Stage I and II, respectively; Figure 4). In these calculations, we focused only on quantitative interactions where the effect of one locus can be modified, but not reversed, by the other locus and vice versa. Stage I of CGEMS had virtually no power to detect interaction odds ratios in the scenarios we examined which ranged 1.13–2.05. The larger Stage II, in contrast, had high power for detecting modest to large interaction odds ratio (≥ 1.7) even after accounting for the fact that some power is lost due to selection of the SNPs at Stage I by main effect only. It is notable that under a model of quantitative interaction, a larger interaction odds ratio also corresponds to larger main effects. Thus, the loss of power at Stage I due to the selection by main effect is often small when the interaction odds ratio and MAFs are reasonably large. In contrast, in the presence of qualitative interaction, the main effects of loci could be very weak or even non-existent even when the interaction odds ratio is large. The power for detecting such loci in our analysis is low, as the probability of selecting them for Stage II is low.
We conclude that the susceptibility SNPs we have studied are unlikely to have quantitative interactions of large magnitude with other SNPs in the genome. Theoretical calculations (48), as well as a lack of findings of epistasis for other diseases, also point towards the possibility that large non-multiplicative or non-additive effects may not be abundant in the etiology of complex traits. It is possible that our study has missed qualitative interactions, but the biologic plausibility of the presence of many such extreme types of interaction is questionable. It is also possible that epistasis, if it plays an important role in the etiology of CaP, will have a much more complex form than the pairwise SNP-SNP interactions we studied. Finding such higher order interactions in large scale studies, however, will remain an intrinsically challenging problem because of both the computationally daunting task of exploring all possible multi-locus models and the requirement for extremely large sample sizes that will be necessary to achieve sufficient power while minimizing the chance of false positives.
In the future, detecting evidence of gene-gene interactions through study of statistical interactions between SNP markers will likely require very large sample sizes that are achievable only by sharing individual level data in consortiums of GWAS. For smaller scale studies, the exercise of exploring gene-gene interactions is unlikely to lead to definitive findings, but it can be useful in generating lists of loci that require follow-up in replication studies. Incorporating biological knowledge from reliable pathway and network databases could enhance the power for detection, validation, and interpretation of interaction.
Our exploration of gene-gene interactions in CGEMS identified a list of SNPs that require future replication effort with varying degrees of priority (Table 2). We hope its public availability will motivate replication studies. The EB method we highlight is appropriate for those analyses, as it is both powerful and robust. We consider our most notable finding to be an interaction between SNPs in EPAS1 and 8q24 Region 3 because it generates a preliminary hypothesis about the poorly understood association of 8q24 Region 3 with multiple epithelial cancers that centers on the recently characterized gene POU5F1B. A second result with high priority for follow-up is an interaction between PRRX2 and the KLK2-KLK3 region. It suggests that the functional relevance of the KLK2-KLK3 susceptibility region in the etiology CaP may involve cellular proliferation.
This study utilized the high-performance computation capabilities of the Biowulf Linux cluster at the National Institutes of Health, Bethesda, MD.