The identification and characterization of susceptibility genes for common complex human disease is a difficult challenge. The usual approach of focusing a study on just one or a few candidate genes limits our ability to identify novel genetic effects associated with disease. In addition, many susceptibility genes may exhibit effects that are partially or solely dependent on interactions with other genes and/or the environment. Genome-wide association studies (GWAS) have been proposed as a solution to these problems; however, the analysis of GWAS data is problematic because we must separate the one or few true, but modest, signals from the extensive background noise. GWAS analyses must embrace the abundant clinical and environmental data available to complement the rich genotypic data, with the ultimate goal of revealing the genetic and environmental factors important for disease risk.
The ultimate goal of any disease gene discovery project is to identify all the genomic variations relevant to the phenotype being studied. As technology has advanced, the field has gone from very coarse genomic examination embodied in cytogenetic analyses, to higher resolution linkage analyses, and now to very high resolution association analyses. Methodological advances in the analysis of large-scale or GWA studies and the ability to integrate results across experiments have simply not kept pace with this flood of genotyping data. It is a central fallacy that simply collecting enough data will solve the problem of elucidating disease susceptibility loci. Instead, it is this wealth of data that has made distinguishing true scientific discoveries from the thousands of false discoveries even more challenging. The growing disparity between the development of data collection methods and that of data analysis methods mandates a more concerted effort to develop the analytical tools needed to interpret the genotypic data successfully and thus ultimately improve the prevention, diagnosis, and treatment of common disease. The ultimate utility of our monumental investment in data generation will depend largely on the development of innovative analytical strategies and study designs that allow for the detection of gene-gene and gene-environment interactions.
While GWAS have been extremely successful (Hindorff et al., 2009), there is clear evidence that a large proportion of the genetic heritability for common, complex disease has yet to be uncovered. Maher (2008) explains where the missing heritability may be hiding; one possibility is “underground networks”, in which interactions between genes in a more complex network explain a larger proportion of the heritability (Maher, 2008). However, teasing gene-gene interaction networks apart from the many false positive loci in a GWA study is incredibly difficult. Cantor et al. review prioritization of GWAS results and identify epistasis and pathway analysis as two potential areas for deeper investigation in the quest for the missing heritability (Cantor et al., 2010). This manuscript will focus on the combination of prior biological knowledge and pathway information for the detection of epistasis (gene-gene interactions) in GWAS data.
Epistasis was first described by Bateson as the effect of one gene masking (or literally standing upon) the effect of another (Bateson, 1909). The Bateson view of epistasis has also been described as biological epistasis (Moore and Williams, 2005), where variation in the physical interaction of biomolecules affects a phenotype (Moore, 2003). From a statistical perspective, epistasis was also observed as multi-allelic segregation patterns by Fisher, who mathematically described the phenomenon as deviation from additivity in a linear model of genotypes (Fisher, 1918). Statistical epistasis and biological epistasis eventually converge as scientific understanding progresses. For example, Bridges discovered statistical epistasis in Drosophila eye color (Bridges, 1919). These alleles influence a common set of biochemical pathways controlling eye pigmentation that was elucidated many years later (Lloyd et al., 1998).
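The statistical definition above, deviation from additivity, can be illustrated with a deliberately simplified sketch. It assumes two loci that are each reduced to carrier versus non-carrier status and a 2 x 2 table of mean phenotype values; the function name and the numbers are hypothetical and are not drawn from any of the cited studies. Fisher's formulation is more general than this two-by-two case.

```python
def interaction_contrast(means):
    """Deviation from additivity for two binary loci.

    means[a][b] is the mean phenotype for carrier status a at locus A
    and carrier status b at locus B (0 = non-carrier, 1 = carrier).
    A value of zero means the two loci combine additively.
    """
    baseline = means[0][0]
    effect_a = means[1][0] - baseline  # effect of locus A alone
    effect_b = means[0][1] - baseline  # effect of locus B alone
    additive_prediction = baseline + effect_a + effect_b
    return means[1][1] - additive_prediction


# Hypothetical phenotype means, for illustration only.
additive_table = [[10.0, 12.0], [13.0, 15.0]]   # double carriers match the additive prediction
epistatic_table = [[10.0, 12.0], [13.0, 20.0]]  # double carriers exceed it

print(interaction_contrast(additive_table))   # 0.0
print(interaction_contrast(epistatic_table))  # 5.0
```

In a regression setting the same quantity corresponds to the coefficient of the product term between the two loci, so a non-zero contrast is what a test for statistical epistasis would aim to detect.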
Most rare Mendelian genetic disorders, such as cystic fibrosis, are influenced by the effects of a single gene (although epistasis is being discovered in Mendelian disorders as described below). However, common diseases, such as multiple sclerosis, breast cancer, or diabetes, are likely influenced by more than one gene, some of which may be associated with disease risk primarily through nonlinear interactions (Moore and Williams, 2002; Ritchie et al., 2001). The possibility of complex interactions makes the detection and characterization of genes associated with common, complex disease difficult. Templeton (2000) documents that gene-gene interactions are commonly found when properly investigated. Recent research shows that epistasis is not merely a theoretical argument. Epistasis has been identified as a component of complex phenotypes in a number of studies (Ming and Muenke, 2002). For example, Mendelian disorders such as retinitis pigmentosa (Kajiwara et al., 1994), Hirschsprung disease (Auricchio et al., 1999), juvenile-onset glaucoma (Vincent et al., 2002), familial amyloid polyneuropathy (Soares et al., 2005), and cystic fibrosis (Dipple and McCabe, 2000a; Dipple and McCabe, 2000b) are documented examples of epistasis where modifier genes interact with Mendelian inherited main effect genes. More compelling are studies in model organisms where there is both biological and statistical evidence for epistasis. Three arthritis loci that exhibit epistatic interactions have been identified within a quantitative trait locus (QTL) in mice (Johannesson et al., 2005a; Johannesson et al., 2005b). Epistatic effects have also been documented in a number of other phenotypes in mice, including obesity (Warden et al., 2004) and fluctuating asymmetry of tooth size and shape (Leamy et al., 2005). Similarly, epistasis associated with quantitative traits such as metabolism has been documented in other model organisms such as Saccharomyces cerevisiae (Segre et al., 2005). These model organism studies provide additional evidence that epistasis detected via statistical and computational techniques may be relevant biologically, something that is not easy to assess in human genetic studies (Moore and Williams, 2005).
As epistasis is believed to play an important role in the genesis of complex disease, analysis strategies for detecting epistasis in large-scale data are increasingly important. A major hurdle in discovering epistasis, however, is the variable selection problem. Exhaustively evaluating all two-marker models in whole-genome data is a computational and statistical challenge: processing the approximately 5 × 10^11 possible two-marker models from a set of 1 million SNPs requires extensive computing resources and produces a plethora of statistically significant results with limited biological interpretability. That said, analytic strategies have been developed or adapted specifically for this purpose. Some have argued that epistasis is unlikely to contribute in a significant way to the missing heritability because, in their opinion, epistatic effects may explain even less of the heritability than the independent single-locus effects. They also comment that there are few, if any, convincing, replicated gene-gene interaction models. However, other researchers present the opposing point of view and provide evidence that epistasis is likely to exist and may have even larger effect sizes than its main effect counterparts (Eichler et al., 2010). One of the challenges in identifying convincing epistasis models from large-scale genomic studies is the difficulty of replication. How should “replication” be defined for epistasis models? At one extreme, some argue for a conservative Bonferroni-corrected p-value cut-off for an association involving the same SNP, with the same direction of effect, in the same race/ethnicity; at the other extreme, replication of a particular pathway or set of genes is allowed to satisfy the replication requirement.
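The scale of the exhaustive two-marker search described above can be checked with a few lines of arithmetic. The sketch below counts the unordered SNP pairs for 1 million SNPs and, as an illustration of how conservative a Bonferroni correction becomes at that scale, divides a family-wise alpha by the number of pairs. The alpha of 0.05 and the function names are assumptions made for illustration; they are not values or methods prescribed by the text.

```python
import math


def count_two_marker_models(n_snps):
    """Number of unordered SNP pairs, i.e. n choose 2."""
    return math.comb(n_snps, 2)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


n_models = count_two_marker_models(1_000_000)
# alpha = 0.05 is an assumed family-wise error rate, used only as an example.
threshold = bonferroni_threshold(0.05, n_models)

print(n_models)             # 499999500000, roughly 5 x 10^11
print(f"{threshold:.2e}")   # 1.00e-13
```

Under these assumptions, a single pairwise test would need a p-value near 10^-13 to survive correction, which helps explain why filtering the search space with prior biological knowledge is attractive.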