It is widely accepted that individual variation in risk for complex disorders results from the joint effects of both environmental and genetic factors. Considerable progress has been made in identifying environmental factors associated with specific diseases. However, there remain gaps in knowledge about common genetic variants that also play a role in disease pathogenesis and especially about the extent to which genetic variants interact with each other and with environmental factors to increase disease propensity. There are statistical, computational and methodological challenges associated with discovery of gene–gene and gene–environment interactions. First, there is the computational challenge of analyzing the vast majority of common single nucleotide polymorphisms (SNPs), or SNP subsets identifying common haplotypes, in large numbers of individuals. These datasets, typically with hundreds of thousands of SNPs genotyped for thousands of individuals, are too large for exhaustive evaluation of even moderate-order [e.g. third (SNP–SNP–SNP)] interactions. Even with an initial screening to reduce the SNP pool to less than, e.g. 10 000, it is infeasible for most computational platforms to evaluate all combinations of SNP associations for even moderate interaction orders.
Even if exhaustive evaluations were
possible for large SNP pools, a high proportion of identified associations will be false positives due to chance, as well as due to biases such as admixtures, non-random missing data or genotyping errors. There is also no full consensus on how to set an experiment-wide significance level, consistent with accurate estimation or control of the number of true and false discoveries (Allison et al.
There are also several challenges that may confound ability to detect true interactions. A fundamental problem is that interactions may have undetectable main (single SNP) or even low order (e.g. pairwise SNP) effects
, e.g. consider handedness, attributed to a non-linear XOR interaction Levy and Nagylaki (1972
), which only
has a significant joint
effect. Linear models such as linear support vector machines (SVMs, Guyon et al.
), often applied in past studies, cannot detect interactions without main effects.
Another difficulty is that the mathematical form of interactions is a priori unknown, with many possibilities. Indeed, there is even lack of consensus about what constitutes an interaction. Here, we define an interaction as occurring whenever a combination
of two or more SNP features increases or decreases likelihood of the phenotype more than the features acting singly. Unfortunately, it is difficult to know which forms really occur in biology. The strong prior precedent for main effects between single SNPs and phenotypes argues that techniques should, at a minimum, be proficient at detecting these effects. However, more exotic forms are already known to exist (Levy and Nagylaki, 1972
). Multiplicative interactions, within logistic models, are often assumed. However, model bias associated with this assumed form may reduce power to detect interacting SNPs. Even given a fixed form (e.g. multiplicative), the number of candidate interactions grows quickly with interaction order. Finally, detection power is also affected by low interaction penetrance and by an interaction's subpopulation size. Limitations of generalized linear and generalized additive models motivate the search for alternative methods and more flexible and efficient detection strategies.
One machine learning
strategy is to treat the problem as one of supervised feature selection
within statistical classification, with ‘cases’ and ‘controls’ as the classes. Here, one aims to select the feature subset which, when coupled with a classifier, leads to best classification performance. Standard wrapper-based
feature selection (Guyon and Elisseeff, 2003
involves repeated application of classifier learning for different candidate subsets, with the chosen marker
subset the one maximizing a criterion function closely allied to classification accuracy. A wrapper is thus specified by the subset search algorithm, the classifier and its learning algorithm, and by the feature selection criterion function. Searching can involve greedy forward selection, with ‘informative’ features added; backward elimination, with irrelevant features removed; or more complex algorithms. For SNP marker selection, Bhat et al.
) applied wrappers using multilayer perceptrons (MLPs), while Kim and Kim (2001
) applied SVMs. Several other SNP discovery methods have also been proposed. Ritchie et al.
) proposed multifactor dimensionality reduction (MDR) to encode SNP interactions. Kooperberg et al.
) proposed logic regression and reported promising results. Zhang and Liu (2007
) proposed Bayesian epistasis association mapping (BEAM). In Section 4
, we discuss these and other previous SNP discovery methods and their disadvantages and advantages, compared with the method proposed here.
Unlike most past approaches, the method developed here learns a single
phenotype-predictive model that explicitly specifies all detected markers and marker interactions, and automatically estimates the number of interactions present. Key aspects of this method are: (i) while some methods only detect a marker subset, maximum entropy conditional probability modeling (MECPM) determines this SNP subset, the number of interactions and the specification of the marker SNPs involved in each interaction. For example, from 1000 candidate SNPs, MECPM might determine that there are six interactions—a five-way, three-way and two-way, along with three main-effect SNPs—each involving disjoint marker SNP subsets, and thus together comprising a marker set of 13 SNPs; (ii) MECPM is a posterior model (classifier), predicting phenotype given the genotype, that makes explicit and in fact is defined by its interactions; (iii) it uses a novel heuristic search that evaluates a selected, enriched subset of relatively high-order candidate interactions (up to five-way), while retaining relatively low complexity (worst case, quadratic in the SNP size); (iv) the method allows flexible SNP coding (dominant, recessive) within each interaction; (v) no mathematical interaction form is assumed; (vi) theoretically grounded model structure and order selection, based on the Bayesian Information Criterion
(BIC; Schwarz, 1978
), accounts for the finite sample size in (fairly) comparing candidate interactions at different orders and in determining the number of interactions; thus automatically, via model order selection, the experiment-wide significance level is set. This approach tends to determine markers and interactions with high specificity and
better sensitivity than previous approaches (see Section 3
In Section 2
, we briefly review MECPM modeling and then specialize this framework for interaction discovery. In Section 3
, we experimentally compare MECPM SNP discovery with other methods. Section 4
discusses methodology, capabilities and advantages/disadvantages of previous SNP discovery methods, compared with MECPM.