Traditional regression-based methods are often criticised ^{8,61,31} for their inability to deal with non-linear models and with HIGH-DIMENSIONAL DATA (containing many potentially interacting predictor variables, leading to sparse contingency tables with many empty cells). For this reason, machine learning or data-mining methods, developed in the field of computer science, are sometimes preferred. The selection of predictor variables, and interactions between them, that predict an outcome variable is a well-known problem in these fields. Data-mining approaches do not fit a single pre-specified model, nor do they attempt exhaustive search; rather, they step through the space of possible models (including potentially large numbers of main effects and multi-way interactions) in some computationally efficient way. Many data-mining approaches are, in fact, equivalent to stepping through a particular sequence of regression models and attempting to find the model that best fits the data; the distinction that is often made between data mining and regression models is therefore, to some extent, a false one. Non-linearity is not an issue when fitting a SATURATED MODEL (although it may be an issue for more restricted models). One common theme in data mining is the use of CROSS-VALIDATION ^{62} to avoid problems of OVERFITTING.

Data-mining methods typically have problems dealing with incomplete and/or unbalanced data sets (e.g. when the numbers of cases and controls are unequal ^{63}). They also do not always deal particularly well with correlated predictors that show collinearity. This has been addressed in the mainstream statistics literature by the introduction of penalized regression approaches ^{64,65} that allow large numbers of predictor variables to be included in a regression model, but with many estimated regression coefficients ‘shrunk’ towards zero. In genetics, use of such techniques is just starting to emerge, including penalized logistic regression ^{66,67} and least angle regression ^{68} for identifying gene-gene interactions ^{69,70} in binary traits.
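As a concrete illustration of coefficient shrinkage, the sketch below fits a ridge (L2) penalized logistic regression by plain gradient descent to simulated SNP genotypes; all names, parameter values and the simulation itself are hypothetical, and a real analysis would use an established package rather than this toy fitter.

```python
import numpy as np

def fit_penalized_logistic(X, y, lam, lr=0.1, n_iter=2000):
    """Ridge (L2) penalized logistic regression by gradient descent;
    a larger penalty lam shrinks the estimated coefficients towards zero."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (prob - y) / n + lam * beta  # penalty term shrinks beta
        beta -= lr * grad
    return beta

rng = np.random.default_rng(0)
n, p = 400, 50                                            # many predictors, modest sample
X = rng.integers(0, 3, size=(n, p)).astype(float) - 1.0   # centred genotype codes
logit = 1.2 * X[:, 0]                                     # only SNP 0 truly affects risk
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

beta_weak = fit_penalized_logistic(X, y, lam=0.01)
beta_strong = fit_penalized_logistic(X, y, lam=10.0)
# the stronger penalty yields uniformly smaller coefficients
print(np.abs(beta_weak).sum(), np.abs(beta_strong).sum())
```

With the weak penalty the causal SNP retains the largest coefficient while the 49 noise SNPs pick up small non-zero estimates; the strong penalty shrinks everything towards zero, which is the behaviour that makes these methods usable with large numbers of predictors.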

A good overview of several machine learning approaches for detecting gene-gene interactions is given by McKinney et al. ^{31}. For the remainder of this section, I will focus on several methods that have become particularly popular and/or appear to show particular promise for detection of gene-gene interactions, or, more precisely, for detection of genes that may interact.

Recursive partitioning approaches

Recursive partitioning approaches (Box 2) have been used as an alternative to traditional regression methods for detecting genetic loci (and their interactions) that influence a phenotypic outcome ^{71,72,73}. These approaches produce a graphical structure (resembling an upside-down tree) that maps possible values of certain predictor variables (e.g. SNP genotypes) to a final expected outcome (e.g. disease status). Each vertex or node of the tree represents a predictor variable, and from each node there are arcs or edges leading down to so-called ‘child’ nodes, where each edge corresponds to a different possible value that could be taken by the variable in the ‘parent’ node. A path through the tree represents a particular combination of values taken by the predictor variables appearing within that path.

Box 2. Recursive partitioning approaches

Single classification tree

Recursive partitioning approaches are based on classification and regression trees (CART) ^{111}. Trees are constructed using rules concerning how well a split at a node (based on the values of a predictor variable, such as a SNP) can differentiate observations with respect to the outcome variable (such as case-control status). A popular splitting rule is to use at each node the variable that maximises the reduction in a quantity known as the Gini impurity ^{111,112}. In the example, SNP 3 maximises the reduction in Gini impurity at the first node and so is chosen for splitting (according to genotype at SNP 3) the original data set of 1000 cases and 1000 controls into two smaller data sets. Once a node is split, the same logic is applied to each child node (hence the recursive nature of the procedure). The splitting procedure stops when no further gain can be made (e.g. when all terminal nodes contain only cases or only controls, or all possible SNPs have been included in a branch), or when some pre-set stopping rules are met. At this stage it is usual to prune the tree back (i.e. remove some of the later splits or branches) according to certain rules ^{111} to avoid over-fitting and to produce a final, more parsimonious model.
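The Gini-based splitting rule can be sketched as follows; the function names and the simulated SNPs are illustrative only, and a full CART implementation would apply this criterion recursively at every node.

```python
import numpy as np

def gini(cases, controls):
    """Gini impurity of a node containing the given case/control counts."""
    total = cases + controls
    if total == 0:
        return 0.0
    p = cases / total
    return 2.0 * p * (1.0 - p)

def gini_reduction(genotypes, is_case):
    """Reduction in weighted Gini impurity obtained by splitting a node
    into the three genotype classes (0, 1, 2) of one SNP."""
    n = len(is_case)
    parent = gini(is_case.sum(), n - is_case.sum())
    children = 0.0
    for g in (0, 1, 2):
        mask = genotypes == g
        children += mask.sum() / n * gini(is_case[mask].sum(),
                                          mask.sum() - is_case[mask].sum())
    return parent - children

rng = np.random.default_rng(1)
n = 2000                                    # e.g. 1000 cases and 1000 controls
is_case = np.repeat([True, False], n // 2)
# a SNP whose genotype distribution differs between cases and controls,
# versus a SNP unrelated to case status
snp_assoc = np.where(is_case, rng.integers(1, 3, n), rng.integers(0, 2, n))
snp_null = rng.integers(0, 3, n)
print(gini_reduction(snp_assoc, is_case), gini_reduction(snp_null, is_case))
```

The associated SNP yields a much larger impurity reduction than the null SNP, so a splitting rule of this kind would select it at the node, just as SNP 3 is selected in the example above.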

Ensemble approaches: Random Forests

Rather than using a single classification tree, significant improvements in classification accuracy can result from growing an ensemble of trees and letting them in some sense ‘vote’ for the most popular outcome class given a set of input variable values. Such ensemble approaches can be used to provide measures of variable importance, a feature that is of considerable interest in genetic studies and that is often lacking in machine learning approaches. Probably the most widely-used ensemble tree approach is random forests ^{75}. A random forest is constructed by drawing (with replacement), from the original sample, several BOOTSTRAP SAMPLES of the same size (e.g. the same number of cases and controls). For each bootstrap sample, an unpruned classification tree is grown, but with the restriction that, at each node, rather than considering all possible predictor variables, only a random subset of the possible predictor variables is considered. This procedure results in a ‘forest’ of trees, each of which will have been trained on a particular bootstrap sample of observations. The observations that were not used in growing a particular tree can be used as ‘out-of-bag’ instances to estimate prediction error. The out-of-bag observations can also be used to estimate variable importance in various different ways, including via use of a permutation procedure ^{77,31,113}.
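The bootstrap-plus-random-subset recipe can be illustrated with a deliberately minimal forest of one-split ‘stumps’ rather than full unpruned trees; everything here (function names, parameter values, the selection-count importance measure) is a toy sketch under those simplifying assumptions, not the algorithm of the randomForest package.

```python
import numpy as np

def best_stump(X, y, feat_subset):
    """From a random subset of SNPs, pick the one whose genotype classes
    best separate cases from controls (a single-split 'tree')."""
    best_acc, best_j, best_rule = -1.0, None, None
    for j in feat_subset:
        rule = {}
        for g in (0, 1, 2):
            in_cell = X[:, j] == g
            if in_cell.any():
                rule[g] = bool(y[in_cell].mean() >= 0.5)  # majority class in cell
        preds = np.array([rule.get(int(g), False) for g in X[:, j]])
        acc = (preds == y).mean()
        if acc > best_acc:
            best_acc, best_j, best_rule = acc, j, rule
    return best_j, best_rule

def random_forest_votes(X, y, n_trees=50, mtry=3, seed=0):
    """Grow an ensemble of bootstrap-trained stumps; out-of-bag (OOB)
    observations estimate prediction error, and 'importance' here is
    simply how often each SNP is selected for a split."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    votes, n_oob = np.zeros(n), np.zeros(n)
    importance = np.zeros(p)
    for _ in range(n_trees):
        boot = rng.integers(0, n, n)                      # bootstrap sample
        oob = np.setdiff1d(np.arange(n), boot)            # out-of-bag rows
        feats = rng.choice(p, size=mtry, replace=False)   # random predictor subset
        j, rule = best_stump(X[boot], y[boot], feats)
        importance[j] += 1
        votes[oob] += np.array([rule.get(int(g), False) for g in X[oob, j]])
        n_oob[oob] += 1
    oob_pred = votes / np.maximum(n_oob, 1) >= 0.5        # majority vote
    oob_error = (oob_pred != y)[n_oob > 0].mean()
    return oob_error, importance

rng = np.random.default_rng(2)
n, p = 600, 10
X = rng.integers(0, 3, size=(n, p))
y = rng.random(n) < np.where(X[:, 0] == 2, 0.8, 0.2)  # SNP 0 drives risk
err, imp = random_forest_votes(X, y)
print(err, imp.argmax())
```

Even with such crude base learners, the causal SNP ends up with the highest importance score, showing how the ensemble recovers a variable ranking without ever fitting one overall model.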

The actual model whereby the important predictor variables act (or interact) to influence phenotype is somewhat obscured because it results from the predictions of many different classification trees, and so one may wish to follow a random forests analysis with another approach. For example, one might choose the top-ranking variables from a random forests analysis as input variables for a simple regression-based search, a standard CART analysis, or for analysis using an alternative data-mining procedure.

See ^{31,113,74} for a good summary of the approach, available R software (the *randomForest*, *cforest* and *party* libraries) and a discussion of some limitations.

Recursive partitioning approaches do not include interaction variables *per se* in the model. Rather, the nature of the trees constructed allows for interaction in the sense that each path through a tree corresponds to a particular combination of values taken by certain predictor variables, thus including potential interactions between them. The aim of tree-based approaches therefore corresponds most closely to testing for association *while allowing for* interaction rather than *testing* for interaction *per se*. One limitation of recursive partitioning is that, since it conditions on main effects of variables at the first stage (and on main effects conditional on previously selected variables at subsequent stages), pure interactions in the absence of main effects can be missed ^{74}.

Rather than using a single tree, significant improvements in classification accuracy can result from growing an ensemble of trees. A popular ensemble tree approach is random forests ^{75} (Box 2). This approach has been used in several genetic studies ^{76,77}. Apart from the classification of future observations (not our focus of interest), the main result of a random forests analysis is a list of variable importance measures. These measure the impact of each predictor variable both individually and via multi-way interactions with other predictor variables, and therefore have an advantage over a list of significance values from single-locus association testing.

Random forests provide a parallelizable and relatively fast algorithm for measuring variable importance, partly because at each split only a small random subset of predictors is used. To allow each predictor the opportunity to enter the model and to produce accurate prediction, one must carefully choose a number of key parameters, such as the number of trees in the forest, the number of randomly-chosen SNPs considered at each node and the number of permutations used to assess variable importance. In an ideal world one might repeat the analysis several times to assess sensitivity to the choice of these parameters. An example of applying random forests to the WTCCC Crohn's disease and control data, using the software package Random Jungle ^{78}, is shown in the accompanying figure.

Multifactor Dimensionality Reduction method

A variety of other data-mining approaches have been used for detection of interactions or potentially interacting variables in genetic association studies, including logic regression ^{79,80}, genetic programming ^{81}, neural networks ^{54,55} and pattern mining ^{82,83}. One particularly popular method is Multifactor Dimensionality Reduction (MDR) ^{8,9,10}. MDR has been used to identify putatively interacting loci in several phenotypes, including breast cancer ^{8}, type 2 diabetes ^{84}, rheumatoid arthritis ^{85} and coronary artery disease ^{86}, although, to date, it is unclear whether any of these identified interactions have been replicated in larger samples.

The MDR algorithm is described in Box 3 and in detail elsewhere ^{8,9,10,11,49}. Rather than testing for interaction *per se*, MDR seeks to identify combinations of loci that influence a disease outcome, possibly via interactions rather than (or in addition to) via main effects. MDR achieves dimension reduction by converting a high-dimensional (multi-locus) model to a one-dimensional model, thus avoiding the issues of sparse data cells and over-parameterised models that can cause problems for traditional regression-based methods. MDR classifies genotypic classes as either ‘high risk’ or ‘low risk’ according to the ratio of cases and controls that are represented in each class. This could be considered overly simplistic: improvements that embed a more traditional regression-based approach into the cell classification step, allowing application to continuous as well as binary traits and adjustment for covariates, have been proposed ^{87,88}.

Box 3. Multifactor Dimensionality Reduction

The MDR method is a constructive induction ^{40} algorithm that proceeds as follows. The observed data is divided into ten equal parts and a model is fit to each 9/10 of the data (the training data) with the remaining 1/10 (the test data) used to assess model fit via 10-fold cross-validation. Within each 9/10 of the data, a set of *n* genetic factors is selected and their possible multifactor classes or cells are represented in *n*-dimensional space. For example, for *n* = 2 diallelic loci, there are nine possible genotype classes or cells (Supplementary Text S1). The ratio of the number of cases to the number of controls is estimated in each cell, and the cell is labelled ‘high-risk’ if the case:control ratio reaches or exceeds some predetermined threshold (e.g. ≥ 1.0) and ‘low-risk’ otherwise. This reduces the original *n*-dimensional model to a one-dimensional model (i.e. one variable with two classes: high-risk and low-risk). The procedure is repeated for each possible *n*-factor combination, and the combination that maximizes the case:control ratio of the high-risk group (i.e. in some sense ‘fits’ the current 9/10 of the data best, giving minimum classification error among all *n*-locus models) is selected. The testing accuracy (= 1 − prediction error) of this best *n*-locus model can be estimated using the remaining 1/10 of the data. The whole procedure is repeated for each of the 9/10 partitions of the data, and the final best *n*-locus model is the model that maximises the testing accuracy or, equivalently, minimizes the prediction error. The cross-validation consistency is defined as the number of cross-validation replicates in which that same *n*-locus model was chosen as ‘best’ (i.e. the number of replicates in which it minimized classification error). The average prediction error is defined as the average of the prediction errors over the 10 cross-validation test data sets. (Note that the prediction error of each individual cross-validation replicate refers to the prediction error of the *n*-locus model chosen as ‘best’ in that replicate, which will not always correspond to the final best *n*-locus model.)

In practice, rather than selecting a single value of *n* in each cross-validation replicate, one may consider all possible values up to a certain maximum e.g. all single-locus genotype combinations (*n* = 1), all two-locus combinations (*n* = 2), all three-locus combinations (*n* = 3) etc. One thus generates a best model within each cross-validation replicate as well as a final best model (with associated cross-validation consistency and average prediction error) for each different value of *n*. The cross-validation consistencies and average prediction errors can be used to determine the ‘best’ value of *n* (that giving the highest cross-validation consistency and/or lowest average prediction error) and thus the resulting overall best model.
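The cell-labelling and cross-validation steps can be sketched, for two-locus (*n* = 2) models only, as follows. This is a simplified toy rather than the MDR software itself: the function names are invented, genotype cells unseen in the training data are simply treated as low-risk, and the final model is chosen by mean prediction error rather than by cross-validation consistency.

```python
import numpy as np
from itertools import combinations

def mdr_high_risk_cells(geno, is_case, threshold=1.0):
    """Label each multi-locus genotype cell high-risk when its
    case:control ratio reaches the threshold."""
    counts = {}
    for g, c in zip(map(tuple, geno), is_case):
        cases, ctrls = counts.get(g, (0, 0))
        counts[g] = (cases + c, ctrls + (1 - c))
    return {g for g, (ca, co) in counts.items()
            if co == 0 or ca / co >= threshold}

def mdr_error(train_geno, train_case, test_geno, test_case):
    """Prediction error of the reduced one-dimensional (high/low risk)
    model; cells unseen in training are treated as low-risk."""
    high = mdr_high_risk_cells(train_geno, train_case)
    pred = np.array([tuple(g) in high for g in test_geno])
    return float((pred != test_case.astype(bool)).mean())

def mdr_best_pair(X, y, n_folds=10):
    """Exhaustive search over all two-locus models, scoring each by its
    mean prediction error over the cross-validation folds."""
    n, p = X.shape
    folds = np.arange(n) % n_folds
    best = None
    for i, j in combinations(range(p), 2):
        errs = [mdr_error(X[folds != k][:, [i, j]], y[folds != k],
                          X[folds == k][:, [i, j]], y[folds == k])
                for k in range(n_folds)]
        err = float(np.mean(errs))
        if best is None or err < best[0]:
            best = (err, (i, j))
    return best

rng = np.random.default_rng(3)
n, p = 1000, 6
X = rng.integers(0, 3, size=(n, p))
# an epistatic pattern between SNPs 0 and 1, with weak marginal effects
interact = (X[:, 0] % 2) == (X[:, 1] % 2)
y = (rng.random(n) < np.where(interact, 0.8, 0.2)).astype(int)
err, pair = mdr_best_pair(X, y)
print(err, pair)
```

On this simulated epistatic pattern, the exhaustive two-locus search recovers the interacting pair (SNPs 0 and 1) even though neither locus has a strong marginal effect, which is exactly the setting MDR was designed for.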

The main problem with MDR (in common with other exhaustive search techniques) is that it does not scale up to consideration of large numbers of predictor variables (e.g. large numbers of loci from a genome-wide association study) ^{8,9}. Because it performs an exhaustive search for the best *n*-locus combination (within each of ten cross-validation replicates), anything more than a two-locus screen on more than a few hundred variables will be computationally prohibitive. An additional problem with early versions of the widely-used Java implementation of the MDR software (although note that other software implementations do exist ^{11,88}) is that it was not designed with genome-wide data sets in mind, and thus could fail due to memory and disc-usage issues; these problems, however, appear to have been addressed in the most recent version of the software.

For investigation of higher-order interactions, MDR is therefore perhaps best suited for use with small numbers of loci (up to a few hundred), perhaps from a candidate gene study or selected from a larger set of potential predictors via a prior pre-processing or filtering step ^{40}. This step could be as simple as using a single-locus significance threshold, but that would seem counter-intuitive if the goal is to detect interactions in the absence of marginal effects. Perhaps more appealing would be to use a measure of variable importance that allows for possible interactions, such as the variable importance measure from a random forests analysis or from one of the alternative filtering methods described below.

ReliefF, Tuned ReliefF and Evaporative Cooling

One promising filtering algorithm that has been proposed ^{40} is ReliefF ^{89}, or its modified version, Tuned ReliefF (TuRF) ^{90}. This approach uses a measure of proximity between observations (individuals) – calculated, for example, on the basis of the genome-wide genetic similarity between individuals – to determine each individual's nearest neighbours from within his or her own phenotype class, and from within the opposite phenotype class. For each predictor variable, its difference in value between pairs of neighbouring individuals, weighted negatively or positively according to whether the individuals come from the same or different phenotype classes, can be used to construct an importance measure for that variable ^{90}. The algorithm is relatively simple and scalable and so should be applicable to large numbers of predictor variables and observations; an in-house C++ implementation was able to analyse 1 million loci in 200 individuals in approximately four minutes ^{90}.
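A stripped-down version of the Relief idea (one nearest ‘hit’ and one nearest ‘miss’ per individual, with Manhattan distance on genotypes) might look like the sketch below; it omits ReliefF's averaging over multiple neighbours and TuRF's iterative removal of low-scoring variables, and all names are illustrative.

```python
import numpy as np

def relief_weights(X, y):
    """Simplified Relief: for each individual, find the nearest neighbour
    in the same phenotype class (hit) and in the other class (miss);
    SNPs that differ at misses and agree at hits gain weight."""
    n, p = X.shape
    X = X.astype(float)
    # pairwise Manhattan distance on genotypes; exclude self-matches
    d = np.abs(X[:, None, :] - X[None, :, :]).sum(-1)
    np.fill_diagonal(d, np.inf)
    w = np.zeros(p)
    for i in range(n):
        same = np.where(y == y[i])[0]
        other = np.where(y != y[i])[0]
        hit = same[np.argmin(d[i, same])]     # nearest neighbour, same class
        miss = other[np.argmin(d[i, other])]  # nearest neighbour, other class
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n

rng = np.random.default_rng(4)
n, p = 200, 20
X = rng.integers(0, 3, size=(n, p))
y = (X[:, 0] >= 1).astype(int)  # phenotype driven entirely by SNP 0
w = relief_weights(X, y)
print(w.argmax())
```

Because neighbours are defined on the full genotype vector, a variable can score highly through joint patterns with other variables, which is why Relief-style measures are attractive as a filter ahead of an interaction search.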

ReliefF and TuRF have both been implemented in the Java version of the MDR software. One problem with ReliefF is that it can be sensitive to large backgrounds of variants that are irrelevant to phenotype ^{74}. This has motivated development of an alternative approach, Evaporative Cooling ^{91,74}, that can be used to combine the strengths of ReliefF with those of random forests ^{74}.

An example of analysis using the Java implementation of TuRF and MDR, applied to the WTCCC Crohn's disease data, is shown in the accompanying figure.