In this report, we have presented results on an information-theoretic approach for GEI analysis of QT that uses two complementary information-theoretic metrics, the KWII and the PAI. The dependence of these metrics on biological and study design variables was systematically investigated with controlled numerical experiments. We analyzed the GAW15 data set, which was generated by Miller et al. [24] from a complex simulation based on rheumatoid arthritis data, and two GGI data sets generated from QTL mapping studies of HDL levels/atherosclerotic lesion size [25] and UV-induced immunosuppression [32].

The current method assumes that the QT of interest is normally distributed within each stratum of the gene-environment variable combination. The normal distribution is a common assumption in parametric statistics and derives its importance from the central limit theorem. From the information-theoretic standpoint, the normal distribution *N*(*μ*, *σ*) has maximum entropy among all real-valued distributions with specified mean *μ* and standard deviation *σ*. Therefore, if only the mean and standard deviation of a distribution are known, it is often reasonable to assume that the distribution is normal. As we demonstrated, data transformations such as the log and arcsine transformations can sometimes be used to obtain normal distributions when the underlying variable is non-normally distributed. Although the normality requirement for each genotype-environment stratum could be considered a strong assumption, it is possible to deal with mixed distributions or to empirically estimate the distribution of the QT in each stratum, e.g., with Parzen windows [44], and to use the information-theoretic framework and CHORUS in a consistent and analogous manner.
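As a sketch of how such a nonparametric per-stratum estimate could be obtained, the snippet below applies a Gaussian Parzen-window density estimate to a simulated stratum of QT values and computes a plug-in differential entropy from it. The data, bandwidth, and grid are illustrative assumptions, not part of the CHORUS implementation.

```python
import numpy as np

def parzen_density(x_eval, samples, bandwidth):
    """Gaussian Parzen-window estimate of a density at points x_eval."""
    diffs = (x_eval[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diffs**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Illustrative stratum: log-normal QT values, which a log transform would
# render normal; the Parzen estimate requires no such assumption.
rng = np.random.default_rng(0)
qt = rng.lognormal(mean=0.0, sigma=0.5, size=500)

grid = np.linspace(qt.min(), qt.max(), 200)
pdf = parzen_density(grid, qt, bandwidth=0.2)

# Plug-in estimate of the stratum's differential entropy (in nats).
step = grid[1] - grid[0]
entropy = -np.sum(pdf * np.log(pdf + 1e-12)) * step
```

The estimated density (and hence the entropy) can then enter the KWII/PAI computations in place of the normal-theory expressions.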

In the case of a normal distribution, the entropy expression contains only the variance. As a result, the approach can convey the impression of being driven by the variance. We have not addressed standard deviation estimation issues in detail here because our primary focus was to determine whether the underlying method is capable of identifying GEI. Greenwood and Sandomire demonstrated that, at a sample size of 25, the standard deviation estimate is within ± 10% error half the time [45].
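The dependence on the variance alone can be made explicit: the differential entropy of *N*(*μ*, *σ*) is the standard result

```latex
H(X) = \frac{1}{2}\ln\left(2\pi e\,\sigma^{2}\right)
```

which is free of *μ*, so two normally distributed strata with equal variances have equal entropies regardless of their means.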

Derivations of information-theoretic metrics in terms of statistical parameters such as variance result in analytical expressions that are difficult to interpret intuitively. Set-theoretic approaches provide more interpretability because they can account for addition and subtraction of entropies. Unlike the variance or second moment, which measures dispersion around the mean, entropy depends on parameters other than just the second moment, e.g., the shape and scale parameters of the distribution of interest. One advantage of the information-theoretic method is that it is capable of handling mixtures wherein the strata have different distributions.

The KWII definition of an interaction has a strong theoretical foundation in information theory, and the statistical significance of the KWII can be assessed using permutation-based methods. Because the distributions of the KWII and PAI for higher-order combinations have not been characterized, we used independent replicates for the three Simulated Data Sets and the GAW15 data set to directly obtain confidence intervals as well as empirical information on the distributions of the KWII and PAI values. In the case of the GAW15 data set, our approach enabled use of the entire data set. This replicate-based approach is not feasible for real data, and permutations are necessary to assess statistical significance via *p*-values. Permutations, however, provide information only on the null distribution.
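A minimal sketch of the permutation strategy is shown below. The `statistic` here is a deliberately simple stand-in (between-genotype spread of stratum means), not the KWII; shuffling the QT values relative to the genotypes generates draws from the null distribution.

```python
import numpy as np

def permutation_p_value(statistic, qt, genotypes, n_perm=1000, seed=0):
    """Permutation p-value for an association/interaction statistic.

    `statistic` maps (qt, genotypes) to a scalar; permuting the QT
    relative to the genotypes samples the null distribution.
    """
    rng = np.random.default_rng(seed)
    observed = statistic(qt, genotypes)
    null = np.empty(n_perm)
    for i in range(n_perm):
        null[i] = statistic(rng.permutation(qt), genotypes)
    # Add-one correction keeps the p-value away from exactly zero.
    return (1 + np.sum(null >= observed)) / (n_perm + 1)

# Stand-in statistic (NOT the KWII): variance of the per-genotype QT
# means, which grows when genotype predicts the QT.
def stratum_mean_spread(qt, genotypes):
    means = [qt[genotypes == g].mean() for g in np.unique(genotypes)]
    return np.var(means)

rng = np.random.default_rng(1)
geno = rng.integers(0, 3, size=300)
qt = 0.8 * geno + rng.normal(size=300)  # genotype shifts the QT mean
p = permutation_p_value(stratum_mean_spread, qt, geno, n_perm=500)
```

The same scaffold applies with the KWII or PAI substituted for the stand-in statistic.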

As indicated in Methods, the KWII-based definition of interaction yields results that are difficult to interpret for completely redundant variables because, in the presence of an even number of completely redundant variables, the KWII is positive. This quandary can be addressed by retaining only one representative variable from every group of completely redundant variables in a pre-processing step prior to analysis. However, the PAI does not change when a completely redundant variable is added to combinations containing an odd or even number of completely redundant variables. Because the CHORUS search of combinatorial space is directed towards combinations that increase the PAI, our approach is less susceptible to identifying combinations composed of variables that are completely redundant with each other.

The CHORUS algorithm, however, is a heuristic method. CHORUS uses a search strategy rather than a dimensionality reduction approach and is capable of conducting an efficient search of the large combinatorial space because of the unique nature of the PAI metric, which allows greedy identification of the most promising combinations by utilizing marginal effects. As a consequence, however, CHORUS is not capable of detecting pure epistasis. It is possible to develop a "two-locus" variation of CHORUS that utilizes the KWII from all one-variable and two-variable combinations.
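The staged, marginal-effect-driven search can be sketched as a beam-style greedy procedure. This is an illustrative sketch, not the CHORUS implementation: `score` is a toy stand-in for the PAI, and `theta` and `tau` mirror the roles of the input parameters θ (combinations retained per stage) and τ (maximum combination order) discussed below.

```python
def greedy_search(variables, score, theta=5, tau=3):
    """Beam-style greedy search over variable combinations.

    At each stage only the theta best-scoring combinations are retained
    and each is extended by one variable, up to combinations of order tau.
    Returns the best (score, combination) pair encountered.
    """
    beam = sorted(((score((v,)), (v,)) for v in variables), reverse=True)[:theta]
    best = list(beam)
    for order in range(2, tau + 1):
        candidates = {}
        for _, combo in beam:
            for v in variables:
                if v in combo:
                    continue
                new = tuple(sorted(combo + (v,)))
                candidates[new] = score(new)
        beam = sorted(((s, c) for c, s in candidates.items()), reverse=True)[:theta]
        best.extend(beam)
    return max(best)

# Toy score: rewards combinations covering {'A', 'B', 'C'}, with marginal
# effects at each member, so a greedy search can reach the triple.
target = {'A', 'B', 'C'}
toy_score = lambda combo: len(target & set(combo)) - 0.1 * len(combo)
best_score, best_combo = greedy_search('ABCDE', toy_score, theta=3, tau=3)
# best_combo is ('A', 'B', 'C')
```

Note that a score with no marginal effects (pure epistasis) would defeat this greedy strategy, which is exactly the limitation described above.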

For Simulated Data Set 3, we adopted the overall structure, key assumptions and numerical values from previous work on pure epistasis in case-control data by Culverhouse [23]. Our simulations assumed Hardy-Weinberg equilibrium and a MAF of 0.5 at the interacting SNPs, *SNP(1)* and *SNP(2)*. The frequency of each genotype in our sample was representative of the corresponding population frequencies. The QT value for each subject was a random variate drawn from one of two normal distributions, *N*(*μ*_{1}, *σ*_{1}) or *N*(*μ*_{0}, *σ*_{0}). The probability of drawing the QT random variate from *N*(*μ*_{1}, *σ*_{1}) was specified for each combination of *SNP(1)* and *SNP(2)* genotypes by Equation 1, obtained from [23]. The probability of drawing from *N*(*μ*_{0}, *σ*_{0}) was the complement of the probabilities in Equation 1. These assumptions result in a form of QT epistasis because the two distributions *N*(*μ*_{1}, *σ*_{1}) and *N*(*μ*_{0}, *σ*_{0}) can be considered cases and controls, and the model can be viewed as a binary trait to which normally distributed noise has been added.
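A data-generating sketch of this design is shown below. The 3 × 3 penetrance table is a placeholder with no marginal effect for either SNP alone; the actual probabilities come from Equation 1 of Culverhouse [23] and are not reproduced here. The means and standard deviations are likewise illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hardy-Weinberg genotype frequencies at MAF = 0.5: (0.25, 0.5, 0.25).
hw = np.array([0.25, 0.5, 0.25])
snp1 = rng.choice(3, size=n, p=hw)
snp2 = rng.choice(3, size=n, p=hw)

# Placeholder for Equation 1: P(draw from N(mu1, sigma1) | genotypes).
# This XOR-like pattern has identical marginal penetrance (0.25) for
# every genotype of either SNP alone, i.e., pure epistasis.
penetrance = np.array([[0.0, 0.5, 0.0],
                       [0.5, 0.0, 0.5],
                       [0.0, 0.5, 0.0]])

mu1, sigma1 = 1.0, 1.0   # "case" distribution N(mu1, sigma1)
mu0, sigma0 = 0.0, 1.0   # "control" distribution N(mu0, sigma0)

p_case = penetrance[snp1, snp2]
is_case = rng.random(n) < p_case
qt = np.where(is_case,
              rng.normal(mu1, sigma1, size=n),
              rng.normal(mu0, sigma0, size=n))
```

The resulting `qt` is the binary case/control trait blurred by normally distributed noise, as described above.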

CHORUS can be considered complementary to dimensionality reduction methods such as the combinatorial partitioning method (CPM), multifactor dimensionality reduction (MDR) and the restricted partitioning method (RPM), which are computationally more burdensome but are sensitive to pure epistasis interactions. The CPM approach is capable of identifying multilocus genotypes that predict QT levels [46]. The MDR method is applicable to binary phenotypes and uses constructive induction to systematically reduce the dimensionality of the multilocus genotype by pooling genotypes into high- and low-risk groups [3,47-50]. The CPM is computationally very intensive, and Culverhouse et al. advocated the RPM [33], which is applicable to both binary phenotypes and QT. Although the RPM and MDR are computationally more efficient than the CPM, significant computational effort is required for data sets from genome-wide association studies, which can contain tens of thousands to millions of predictor variables. The generalized MDR (GMDR) method handles both discrete phenotypes and continuous traits in population-based study designs and employs the generalized linear model (GLM) framework for scoring, in conjunction with MDR for dimensionality reduction [51].

Unlike exhaustive search algorithms, which can identify the global optimum, all heuristic approaches are potentially vulnerable to entrapment in local optima. CHORUS can be modified with established methods such as simulated annealing to reduce this risk. Within the current CHORUS framework, the input parameter θ, which determines the number of combinations retained at each stage of the algorithm, can also be a determinant of power: if too few combinations are retained at the initial stages of the search, the risk of missing key higher-order interactions with intermediate levels of marginal effects is increased. However, increasing θ increases the computational cost. In principle, the computational effort depends exponentially on the input parameter τ, which determines the order of combinations. In practice, however, the value of τ is constrained by sample size because the genotype contingency tables for combinations rapidly become sparse and contain numerous empty cells as the order of the combinations increases. Although biological pathways are complex, they frequently involve sequentially ordered protein-protein interactions and enzymatic chemical reactions [52]. Such sequential interactions typically involve only a small subset of the molecules in the pathway; the order of the resulting statistical interactions may be limited as a consequence [15].

There are some fundamental differences and unique advantages of CHORUS compared to the widely used GMDR approach. The metric used by GMDR is based on the GLM, a commonly used and versatile statistical analysis method, combined with the dimensionality reduction strategy of MDR. For QTL analysis, GMDR first analyzes the QT and covariates to obtain the GLM score statistic and, in a second stage, determines the interactions of the GLM score statistics with the genetic and environmental variables. In contrast, the CHORUS approach analyzes the underlying interactions between the QT of interest and all variables, including covariates, simultaneously. Another advantage of CHORUS is that it is capable of handling case-only study designs, which are useful for studying the genetic and environmental determinants of important QT such as body weight, height and lifespan. The statistical GLM framework enables GMDR to handle covariates whose distribution follows any of the exponential family distributions (e.g., normal, Poisson or Bernoulli), whereas a limitation of CHORUS that we are working to overcome is that it cannot handle continuous covariates directly; continuous covariates can, however, be used after discretization. Another notable advantage of CHORUS is that it can be applied to very large data sets: we were able to analyze the 100 replicates in the 10K GAW15 data set without difficulty.

We have described the conceptual framework of CHORUS to highlight its strengths and its differences from other methods. We are exploring a range of improvements, including parallel computing, that could further enhance the efficiency and effectiveness of CHORUS.