|Home | About | Journals | Submit | Contact Us | Français|
Conceived and designed the experiments: RW. Performed the experiments: ZL JPH. Analyzed the data: ZW TL. Contributed reagents/materials/analysis tools: ZW TL ZL JPH. Wrote the paper: RW.
Epistasis, i.e., the interaction of alleles at different loci, is thought to play a central role in the formation and progression of complex diseases. The complexity of disease expression should arise from a complex network of epistatic interactions involving multiple genes.
We develop a general model for testing high-order epistatic interactions for a complex disease in a case-control study. We incorporate the quantitative genetic theory of high-order epistasis into the setting of cases and controls sampled from a natural population. The new model allows the identification and testing of epistasis and its various genetic components.
Simulation studies were used to examine the power and false positive rates of the model under different sampling strategies. The model was used to detect epistasis in a case-control study of inflammatory bowel disease, in which five SNPs at a candidate gene were typed, leading to the identification of a significant three-locus epistasis.
The complexity of biological systems arises from the highly interactive relationships of their components , . Thus, it is likely that the metabolic pathways for a phenotypic trait or disease involve multiple interacting gene products and regulatory loci that could generate a complex network of genetic actions and interactions , . Current genome-wide linkage or association studies have been able to detect genetic actions of individual genes involved in the phenotypic diversity of a complex trait –. Given its ubiquitousness in controlling complex traits and diseases, epistasis resulting from interactions between alleles at different genes has now received increasing attention in genetic studies , . However, many of these studies focus on the identification of low-order pairwise epistasis, leaving epistatic interactions of high orders, their frequency and impact on genetic variation, unexplored.
More recently, Stich et al.  developed a linkage mapping approach to uncover three-way interactions among different quantitative trait loci (QTLs) using a mating design. Beerenwinkel et al.  proposed a mathematical approach for describing multi-way genetic interactions and employing it to study the genetic structure of fitness landscapes for Escherichia coli. Based on the analysis of pathway fragments, Imielinski and Belta  used a genome-scale knockout design to detect high-order epistatic relationships between components of large metabolic networks. Hansen and Wagner  showed that higher-order genetic interactions are potentially important if the total genomic mutation rate is large and the interaction density among loci is not too low. With the widespread availability of high-throughpout genotyping technology, there is a pressing need to estimate higher-order epistasis involving any number of genes and assess the role of epistasis in the creation and maintenance of genetic variation for complex traits.
The motivation of this study is to develop a general model for estimating epistasis of any order with multilocus single nucleotide polymorphism (SNP) data in case-control studies. In particular, the model allows the estimation and testing of high-order epistasis. Because of its easy sample collection, a population-based case-control design has been widely used in candidate gene or genome-wide association studies –. By comparing genotype frequencies for a gene in unrelated individuals with the disease and healthy controls, this design has power to test the significance of the association between the gene and disease. However, only a few studies used a case-control design to characterize epistasis  and, also, the epistasis they defined on the basis of logistic regression models presents a computational complexity. The new model described in this article has, for the first time, embedded quantitative genetic principles into a chi-square test framework, allowing the dissection of overall multilocus genetic effects into various components including epistatic interactions of high orders. The model was validated through simulation studies and a real data analysis.
Epistasis was originally defined as the expression of an allele at one locus masked by an allele at another locus . This concept was then explained in a statistical manner by Fisher  as the deviation of genetic action from additivity in a linear model. Fisher's definition allows epistasis to be quantified in different forms based on its biological meaning determined by Bateson . For a two-locus epistasis, all possible forms of epistasis include the interactions between additive effects at the two loci, additive effect at the first locus and dominant effect at the second locus, dominant effect at the first locus and additive effect at the second locus, and dominant effects at the two loci. Each of these epistatic forms contributes differently to the overall genetic value of a two-locus genotype. We used Mather and Jinks' formulation  to partition a genotypic value into its different components including epistasis.
Suppose there are two loci, A with two alleles and and B with two alleles and , which form nine two-locus genotypes. Let denote the genetic value of an arbitrary genotype ( for genotypes , , and ; for genotypes , , and , respectively). We dissect into different components as table 1.
Where is the overall mean, and are the additive effect at genes A and B, and are the dominant effect at genes A and B, respectively, and , , , and are the additive additive, additive dominant, dominant additive, and dominant dominant epistatic interactions between the two genes, respectively.
The dissection of genotypic values is expressed, in matrix form, as
The genetic effect parameters can be solved using
Adding a locus, C with two alleles and , to the two-gene model generates 27 three-locus genotypes, expressed as ( for genotypes , , and , respectively). A three-genotypic value () is dissected into the following components:
Mather and Jinks' theory is used to formulate the relationships between genotypic values and genetic effects, expressed as
The genetic effect parameters are then solved from the genotypic values:
We propose a general model for describing genetic components for a genotype composed of any number of loci. Consider loci which form genotypes. The value of a -locus genotype is composed of the overall mean, the additive and dominant effects for each locus, and epistasis of different kinds and orders among these loci. Let the space of the genetic effects at individual loci be defined as for gene 1, for gene 2, …, for gene . Thus, we can define all possible genetic effects () as
By letting , and (), we express the value of a general multi-locus genotype as
is a logical judgment function that can return 1 if the condition is true otherwise return 0.
The genetic effect parameters can be estimated by solving the linear equations using
Equation (8) gives a general form for main and interaction genetic effects among an arbitrary number of loci. Mathematical algorithms for solving epistatic equations are given in Text S1.
Based on the definitions, we now provide a procedure for testing epistasis of different kinds and orders with multilocus genetic data. Consider a case-control study in which cases (there is a disease) and controls (there is no disease) are selected randomly from a natural population. Case and control groups are matched for demographical factors such as age, race, gender, life style, and body mass. All subjects from the case and control groups are genotyped genome-wide or for particular chromosomal regions of interest, depending on the purpose of the study. Let and denote the observations of a general genotype () derived from three markers A, B, and C. Based on Mather and Jinks' partition of genotypic values , we calculate genetic effect parameters from genotypic values using equation (4). For both cases and controls, the genotypic values used to calculate each effect parameter are dissolved into two groups, plus and minus, which forms a 2 (cases and control)2 (plus and minus) contingency table. For example, the contingency table for testing the additiveadditiveadditive epistatic effect is expressed as table 2.
From the table, the test statistic is calculated and compared with the critical threshold with one degree of freedom. We proved that the test statistics under the null hypothesis calculated from the above contingency table follows a distribution with less than one degree of freedom .
The contingency tables for testing the other parameters can be made similarly. For a particular group (=1 for cases, 2 for controls), the genotypic values used to calculate the three-way epistatic effect parameters are tabulated as table 3.
The thresholds for testing each of these three-locus epistases are derived, which are =3.84, 3.20, 3.20, 3.20, 2.60, 2.60, 2.60, and 2.14, respectively. The genotypic values used to calculate the two-way epistatic effect parameters are tabulated as table 4.
The thresholds for testing each of these two-locus epistases are derived, which are =3.84, 3.84, 3.84, 2.50, 2.50, 2.50, 3.20, 3.20, 3.20, 3.20, 3.20 and 3.20, respectively. The genotypic values used to calculate the main genetic effect parameters are tabulated as table 5.
The thresholds for testing each of these two-locus epistases are derived, which are =3.84, 3.84, 3.84, 2.60, 2.60 and 2.60, respectively. For an arbitrary number of markers, the genotypic values used to calculate the main and epistatic (of different orders) genetic effect parameters can be similarly divided into plus and minus groups, from which the test statistics are calculated.
The model was used to analyze a case-control study aimed to detect genetic variants for inflammatory bowel disease (IBD) with candidate gene approaches . As a member of the membrane associated guanylate kinase family, TDiscs large homolog (DLG5) plays a central role in maintaining cell junctions and cell shape and in clustering channel proteins at the cell surface . Five single nucleotide polymorphisms (SNPs), Arg30Gln, Glu514Gln, Pro979Leu, Gly1066Gly, and Pro1371Gln, genotyped at DLG5 for both cases and controls are hoped to be associated with IBD. The cases include 115 sporadic IBD patients, aged from 22 to 66 years old, from the Milton S Hershey Medical Center, whereas the controls are 172 unrelated healthy individuals, aged from 15 to 81 years, from the Milton S Hershey Medical Center and Philadelphia gift of Life Donor Program. All the human tissues used for pathological studies and genetic analysis were approved by the Human Subjects Protection Offices of The Pennsylvania State University College of Medicine, and were undertaken with the understanding and written consent of each subject.
Because of a modest sample size used, our analysis will focus on a three-SNP analysis, although the model can deal with any number of SNPs. None of the five SNPs displays an additive genetic effect, but Arg30Gln, Pro979Leu, and Gly1066Gly were each found to trigger a significant dominant effect on the disease () (table 6). There are 10 possible pairs for the five SNPs, with each pair subject to a two-locus epistatic analysis. The number and distribution of two-locus epistasis are given in table 7. It is interesting to see that significant two-locus epistasis was observed only between Arg30Gln and other SNPs including Pro979Leu with a significant main dominant effect and two non-significant SNPs (Glu514Gln and Pro1371Gln). The form of significant epistasis is limited to the interactions between the dominant effect at Arg30Gln and the additive/dominant effects at the other SNPs.
The five SNPs produce 10 three-locus combinations which were analyzed by a three-locus epistasis model. Each combination has eight forms of three-SNP epistasis. Table 8 lists the test statistics for all possible combinations and forms of epistasis, with significant epistasis highlighted in boldface. The interactions among the additive effects at any three of the five SNPs were not significant; the same was also observed for the three-way dominant interactions. The significant three-locus epistasis must include both the additive and dominant effect at three SNPs. In general, Arg30Gln have more significant three-locus interactions and display higher three-locus significance level than the other SNPs. Arg30Ln, Glu514Gln, and Pro979Leu produce the most numerous forms of epistasis (3), followed by the combinations of Arg30Ln, Gly1066Gly, and Pro1371Gln (2), Glu514Gln, Gly1066Gly, and Pro1371Gln (2), Arg30Ln, Glu514, and Pro1371Gln (1), Arg30Ln, Pro979Leu, and Pro1371Gln (1), Pro979Leu, Gly1066Gly, and Pro1371Gln (1). The three SNPs with significant main effects (Arg30Ln, Pro979Leu, and Pro1371Gln) do not produce a significant three-locus epistatic interaction. The two SNPs displaying non-significant main effects (Glu514Gln and Pro1371Gln) could generate significant three-locus interactions with SNPs Arg30Ln and Gly1066Gly but not with Pro979Leu (table 8).
After significant high-order epistasis is detected, the next step is to make a biological interpretation of such epistasis. To interpret it, we will use the dominant ()additive ()additive () epistasis among Arg30Ln, Glu514Gln, and Pro979Leu as an example. Table 9 gives the structure of genetic effects for each three-locus genotypic value in terms of the additive, dominant, and epistatic effects of different orders. The epistasis only contributes to the genotypic value of , , , and (table 9). For each of these four genotypes, their values are partitioned into different effect components for both cases and controls (table 10). As can be seen, the epistasis increases, by 9 cases, the incidence of IBD for those with genotype or , but decreases the IBD incidence of those carrying genotype or with the same extent.
Simulation studies were undertaken to examine the statistical behavior of the new model. We will focus on the investigation of the power and false positive rates (FDR) for the detection of three-locus epistasis. Three different simulation schemes will be used with varying numbers of cases vs. controls, 200 vs. 200, 400 vs. 400, and 1000 vs. 1000. The eight possible forms of three-locus epistasis can be sorted into four presentative ones, (1) additiveadditiveadditive (no dominant effect), (2) additiveadditivedominant, additivedominantadditive, and dominantadditiveadditive (no one dominant effect), (3) additivedominantdominant, dominantadditivedominant, and dominantdominantadditive (two dominant effects), and (4) dominantdominantdominant (three dominant effects).
For a real data set, different SNPs may be associated or independent of each other. We will investigate how SNP-SNP associations affect the behavior of the new model. In one data set, three SNPs with the same allele frequency were simulated with pair-wise and three-locus linkage disequilibria. Among the three SNPs, only additiveadditiveadditive, additiveadditivedominant, additivedominantdominant, and dominantdominantdominant were assumed to exist. This can be done by simulating a contingency table with constraints , , , and the test statistics for the other effects the corresponding thresholds. The same parameters, except that there is no linkage disequilibrium, were used to simulate the second data set containing three SNPs.
Table 11, table 12, and table 13 give the power and false positive error rates (FPR) of the three-locus interaction detection by the new epistatic models. The power to detect the three-locus epistasis increase remarkably with sample size in a case-control study. With sample sizes of 200 vs. 200, there is power of about 0.51–0.61, with the additiveadditiveadditive epistasis detected most easily, followed by the additiveadditivedominant epistasis, the additivedominantdominant epistasis, and the dominantdominantdominant epistasis. When sample sizes increase to 400 vs. 400, the power for the three-locus epistasis detection will surpass three quarters. If sample sizes 1000 vs. 1000 are used, the power reaches 0.99 or more. In general, whether the SNPs are associated or independent does not affect the power substantially, although in some cases the power is higher for associated SNPs than independent SNPs.
The power displays a small FPR (tables 11, ,12,12, and and13).13). Even if small sample sizes 200 vs. 200 are used, there is still a small chance that the model provides a false positive result for the three-locus epistasis detection. The FPR was found to be consistent, regardless of sample sizes and the degree of SNP-SNP associations.
The phenotypic variation of a trait or disease is highly complex given its polygenic inheritance and environmental influence. Most original quantitative genetic models generally assume that allelic effects are additive, with the size linearly proportional to the number of alleles. These models are modified by considering that there are genetic interactions between different alleles at the same locus (dominance). It is now recognized that the interactions between different loci (epistasis) within gene networks may play an important role , . More recent evidence shows that high-order epistasis among more than two genes may form a crucial component in genetic interaction networks , , , . In fact, quantitative genetic analyses have detected high-order epistatic effects in plants. For example, high-order epistasis could be correlated with the aggressiveness of the isolate of Phytophthora capsici through influencing double crosses among different loci at meiosis . Wu  used a mating design with clonal replicates to identify the significant contribution of high-order epistasis to genetic variation in stem wood growth traits in poplars.
An increasing availability of high-throughput SNP data has led to the development of various statistical approaches for effectively analyzing epistasis among multiple polymorphisms, including logistic regression, multifactor dimensionality reduction (MDR), Bayesian analysis, and machine learning , , –. In this article, we developed a general model for detecting the episatsis of any order in case-control genetic association studies by integrating traditional quantitative genetic principles. Despite the existence, high-order epistasis may be obscured by metabolic network redundancy . The integration of quantitative genetic principles makes our approach capable to identify high-order epistatic interactions with genetic relevance. The model was tested by simulation studies. It displays adequate power for the detection of high-order epistasis with a modest sample size; for example, 400 cases vs. 400 controls. When sample sizes of cases and controls increase to 1000 vs. 1000, which is currently not a problem for most genetic association studies, the model has almost full power to detect three-locus epistasis of different forms. Even if a small size of samples (say 200 vs. 200), the new model has a low false positive rate for epistatic detection. The practical application of the model is validated by analyzing a real data set for the genetic study of inflammatory bowel disease. The model detected significant three-locus epistatic interactions among different SNPs genotyped from a candidate gene DLG5 .
Our model allows the characterization of epistasis of any order. Its implementation into a practical setting of genome-wide association studies is challenged by an exponentially increasing number of SNP-SNP combinations. To make this tractable, one may incorporate optimization techniques into our model, allowing the selection of the most important combinations. An additional issue is to determine the critical threshold with multiple correlated SNPs in genome-wide association studies. An empirical approach for determining a genome-wide threshold is to employ non-parametric permutation testing (see ref. , , –). Lastly, the model is developed to detect multilocus epistasis at the SNP level, but given recent discoveries for the importance of haplotypes in trait control –, the model should be extended to consider high-order interactions expressed by different haplotypes. In the current model specification, we choose controls that are matched for cases in terms of biological, environmental, or demographical factors. When such matches are not possible, we need to embed these factors as covariates into the model, in which the interactions between genes and these factors can be tested. Third, the model can be extended with multiple diseases to consider the pleiotropic effect of a gene. The results about high-order epistasis detection using the this and extended models could be used for iterative model building and functional annotation of genes. Future applications of these results includes analysis of the metabolic networks of pathogenic organisms and generation of epistatic candidate models for genome-wide association studies.
(0.14 MB PDF)
Competing Interests: The authors have declared that no competing interests exist.
Funding: This work is supported by National Science Foundation (NSF) grant DMS/NIGMS-0540745 and the Changjiang Scholars Award. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.