Genome-wide association studies (GWAS) have been widely applied recently to identify SNPs associated with common human diseases [1
], including cardiovascular diseases [6
], diabetes [6
], lupus [19
], autoimmune diseases [22
], autism [23
], and cancer [24
]. However, with few exceptions [13
], the discovered genetic variants with significant main effects account for only a small fraction of clinically important phenotypic variations for many traits [5
]. While there are multiple causes for missing some well-known genetic risk factors or disease heritability (including e.g., rare variants not genotyped in a GWAS study), a frequently cited reason is that most common diseases have complex mechanisms, involving multi-locus gene-gene and gene-environment interactions [5
]. For detecting interacting loci in high dimensional GWAS data with sufficient power and computational feasibility, some pioneering work, with promising results, has been reported, encompassing: i) real GWAS study papers, as cited above; ii) interaction detection methodology [32
]; iii) theoretical papers that characterize the principle problem (interaction detection) and its challenges [30
]; iv) review and methods comparison papers [29
Novel Methods for Detecting Interacting SNPs
A variety of SNP interaction detection methods have been recently proposed. In particular, multifactor dimensionality reduction (MDR) [33
] measures the association between SNPs and disease risk using prediction accuracy of selected multifactor models. Full interaction model (FIM) [41
] applies logistic regression, 3 usingd
-1 binary variables constructed based on a d
-SNP subset. Information gain (IG) [34
] measures mutual information to assess multi-locus joint effects. Bayesian epistasis association mapping (BEAM) [32
] treats the disease-associated markers and their interactions via a Bayesian partitioning model and computes, via Markov chain Monte Carlo (MCMC), the posterior probability that each SNP set is associated with the disease. SNP harvester (SH) [39
] proposes a heuristic search to reduce computational complexity and detect SNP interactions with weak marginal effects. Random forest (RF) [44
] is an ensemble classifier consisting of many decision trees, each tree using only a subset of the available features for class decision making. Thus, the detected features (SNPs) are the ones most frequently used by trees in the ensemble. Logic regression (LOR) [36
] identifies interactions as Boolean (logical) combinations of SNPs. In [42
], an extension of logic regression was also proposed to identify SNP interactions explanatory for the disease status, with two measures devised for quantifying the importance of these interactions for the accuracy of disease prediction. Treating SNPs and their interaction terms as predictors, penalized logistic regression (PLR) [37
] maximizes the model log-likelihood subject to an L2-norm constraint on the coefficients. Related to FIM and PLR, adaptive group lasso (AGL) [43
] adds all possible interaction effects at first and second order to a group lasso model, with L1-norm penalized logistic regression used to identify a sparse set of marginal and interaction terms. Maximum entropy conditional probability modeling (MECPM) [40
], applying a novel, deterministic model structure search, builds multiple, variable-order interactions into a phenotype-posterior model, and is coupled with the Bayesian Information Criterion (BIC) to estimate the number of interaction models present. Logistic regression with an interaction term (LRIT) has been widely applied to detect interactions [35
]. It treats the multiplicative term between SNPs, along with individual SNP terms, as predictors in the logistic regression model.
Evaluation of Methods to Detect Interacting SNPs
Despite strong current interest in this area and a number of recent review articles [29
], no commonly accepted performance standards for evaluating methods to detect multi-locus interactions have been established. For example, one might choose to evaluate power to detect individual SNPs involved in interactions, or power to precisely detect whole (multi-SNP) interactions. Moreover, the relationship between the power to detect interacting loci and the factors on which it depends (penetrance, minor allele frequency (MAF), main effects, and LD), while considered in some previous studies [32
], has not been fully investigated, either experimentally or analytically. Most importantly, although some assessment and performance comparison was undertaken both in the original papers proposing new methods [32
] and in the comparison papers [49
], it is difficult to draw definitive conclusions about the absolute and relative performance of these methods from this body of studies due to the following: (1) each study was based on a different simulation data set and a different set of experimental protocols (including the detection power definition used, the sample size, the number of evaluated SNPs, and the computational allowance of methods). While use of different data sets and protocols may be well-warranted, as it may allow a study to focus on unique scenarios/application contexts not considered previously, it also makes it difficult to compare the performance of methods, excepting those head-to-head evaluated in the same study. Some methods were found to perform quite favorably in one study but poorly in others. For example, MDR [33
] performed well in the original simulation study and the comparison paper [50
], but poorly in subsequent studies [32
]; (2) often, only simple cases were tested, which may not reflect the realistic application of a method. For example, a common practice is to include only a single
interaction model in the data [32
], whereas common diseases are usually complex, with multiple genetic causes [28
], suggesting that multiple
interaction models should be present. Our previous papers [40
] considered multiple interaction models, but an insufficient number of data set replications to draw definitive conclusions on relative performance of methods [50
]. also evaluated multiple interaction models, but only compared three methods, evaluated only one interaction power definition, and did not comprehensively evaluate the effects of penetrance, MAF, main effects, and linkage disequilibrium (LD) on power; (3) only limited interaction patterns were considered, e.g. 2-way interactions but no higher-order
interactions in [43
]. This is an important limitation, especially considering that data sets with 1000 or fewer SNPs were evaluated in these studies - in such cases, exhaustive evaluation of candidate pairwise interactions is computationally feasible, whereas heuristic
search, which will affect detection power in practice, is necessitated if either higher order interactions or much larger SNP sets are considered. Thus, to more realistically assess detection power, either higher order interactions and/or more SNPs should be considered; (4) Perhaps most critically, methods providing P-value assessments [32
] evaluated power for a given significance threshold, but did not rigorously evaluate the accuracy
of the P-value assessment, i.e. whether the Bonferroni-corrected P-value truly
reflects the family-wise type I error rate [55
]. This evaluation is of great importance for methods that use asymptotic statistics [32
], since it reveals whether or not the asymptotic P-value is a reliable detection criterion. Specifically, the P-value could be too liberal (in which case, more family-wise errors than expected will occur in practice and the estimated detection power is too optimistic) or too conservative (in which case the detection power estimate is too pessimistic). By not performing such assessment, it is unclear even whether use of P-values is providing a fair comparison of detection power between methods (i.e
., for the same family-wise error rate) in [32
]. We further note that although there were efforts to measure the type I error rate in [32
], the evaluations were not based on the commonly used family-wise error rate
, but rather on another definition of type 1 error [32
] that does not directly reflect the Bonferroni-corrected P-value; (5) In most past studies [32
], only a single definition of an interaction detection event (and, thus, a single measure of detection power) was used. However, this does not capture the full range of relevant detection events for some applications of GWAS. In particular, in some works an exact joint detection event is defined, i.e
. detection is successful only if all SNPs involved in the interaction (and only
these SNPs) are jointly detected [43
]. This is a stringent definition that gives no credit to a method that detects a subset of the interacting SNPs (e.g
. 3 of the SNPs in a 5-way interaction), even though such partial detection is clearly helpful if e.g
. one is seeking to identify a gene pathway, or if the remaining SNPs in the interaction can be subsequently detected by applying more sensitive (and computationally heavy) methods. Exact detection is especially stringent when there are multiple
interactions present, with the disease risk effectively divided between the multiple models. Finally, we note that individual methods have their own inductive biases and, thus, may perform better under different detection criteria - one method may find more ground-truth SNPs, while another may be more successful at finding whole interactions. Use of multiple power definitions can reveal these differences between methods; (6) Most of the proposed methods (e.g
. MDR, FIM, BEAM, MECPM, SH) are designed to detect both main effects and interaction effects, while to date they have only been evaluated on data sets containing interactions. It is thus also meaningful to measure how effective they are at detecting SNPs with only main effects, and how many false positive interactions
they detect involving main effect SNPs.
Finally, we note that there are very few true (strict) comparison
papers - most studies have focused on developing new methods, with experimental evaluation not the central paper focus. Two exceptions are [50
] and [49
]. However, they both embedded only a single interaction model in the data and considered data sets with only 100 SNPs; Moreover, [50
] evaluated only 2-way and 3-way interaction detection, while [49
] evaluated only two-way interaction detection.
The aforementioned limitations of previous studies are not surprising because of the following challenges associated with comparison studies: (1) it is impractical to evaluate methods on all of the (numerous possible) interaction models; (2) multiple aforementioned factors (MAF, penetrance, LD) jointly decide interaction effects, which thus entails extensive study design, experimentation, and computational efforts; (3) many replicated data sets are required to accurately estimate power and family-wise type I error rate, further increasing computational burden; (4) computational costs of some methods are inherently high; thus a thorough evaluation of these methods is a difficult hurdle; and (5) fair evaluation criteria are not easily designed because distinct methods have different inductive biases and produce different forms of output (e.g., some give P-value assessments while others only provide SNP rankings); (6) there is no consensus definition of power when seeking to identify multiple sets of predictors that are jointly associated with outcomes of interest.
Addressing the above challenges, a ground-truth based comparative study is reported in this paper. The goals are three-fold: (1) to describe and make publicly available simulation tools for evaluating performance of any technique designed to detect interactions among genetic variants in case-control studies; (2) to use these tools to compare performance of eight popular SNP detection methods; (3) to develop analytical relationships between power to detect interacting SNPs and the factors on which it depends (penetrance, MAF, main effects, LD), which support and help explain the experimental results.
Our simulation tools allow users to vary the parameters that impact performance, including interaction pattern, MAF, penetrance (which together determine the strength of the association) and the sporadic disease rate, while maintaining the normally occurring linkage disequilibrium structure. Also, the simulation tools allow users to embed multiple interaction models within each data set. These tools can be used to produce any number of test sets composed of user specified numbers of subjects and SNPs.
Our comparison study, based on these simulation tools, involves thousands of data sets and consists of three steps, as graphically illustrated in Figure . Step 1 (with no ground-truth SNPs present) measures the empirical family-wise type I error rate, which has not been evaluated in many previous studies, and yet is critically important if the (e.g. P-value based) significance threshold is used as the criterion for detecting interacting SNPs.
A flowchart for the performance evaluation of interaction detection methods.
In particular, foreshadowing our Step 1 results, we will find that most methods (except LR) in this study that produce P-values in fact produce conservative ones, with the degree of conservativeness method-dependent. Thus, using the same P-value threshold for all methods will not ensure the methods are being fairly compared, at a common family-wise error rate. Both for this reason, and because some of the methods do not even produce P-values, in Step 2 we evaluate detection power as a function of the number of top-ranked SNPs, rather than for a specified P-value threshold. Accordingly, note the logical structure in Figure , with the Step 1 results helping us to determine how to evaluate detection power in Step 2.
As aforementioned, Step 2 (with a variety of ground-truth interaction models present) investigates power. We formulate a more challenging, yet more realistic situation than most previous studies by including multiple
ground-truth interaction models in each simulated data set. These simulations are motivated in part by our experience with complex genetic diseases such as autoimmune diseases, diabetes and end-stage renal disease [18
]. In total, ninety different interaction models are investigated in this study, jointly determined by 5 underlying interaction types and 3 parameters, controlling penetrance, MAF, and LD. Step 3 investigates the power to detect main effect SNPs, i.e
. we investigate how the methods (many of which are designed to detect both interactions and main effect SNPs) perform when only main effects are present in the data.
The main contributions and novelty of our comparison study are: (1) comprehensive comparison of state-of-the-art techniques on realistic simulated data sets, each of which includes multiple interaction models; (2) new proposed power criteria, well-matched to distinct GWAS applications (e.g., detection of "at least one SNP in an interaction"); (3) evaluation of the accuracy of (P-value based) significance assessments made by the detection methods; (4) investigation of detection of models with variable order interactions (up to 5th order) in SNP data sets; (5) new analytical results on the relationship between interaction parameters and statistical power; (6) investigation of the flexibility of interaction-detection methods, i.e. whether (and with what accuracy) they can detect both interactions and main effects; (7) discoveries concerning relative performance of methods (e.g., comparative evaluation of the promising recent method, MECPM). Since we are presenting a diversity of results, both experimental and analytical, to assist the reader in navigating our work, Figure gives a graphical summary of our experimental steps, the results produced there from, and the connections between the different results, both experimental and analytical.