Genome-wide association studies for complex diseases such as asthma, schizophrenia, diabetes, and hypertension will soon produce genotypes on hundreds of thousands of single nucleotide polymorphisms (SNPs). Due to the large number of SNPs tested and the potential for both genetic and environmental interactions, determining which SNPs modify the risk of disease is a methodological challenge. While the number of genotypes produced by candidate gene approaches will be somewhat less daunting, on the order of hundreds to thousands of SNPs, it will still be a considerable challenge to weed out the noise and identify the SNPs contributing to complex traits.
A logical first approach to dealing with massive numbers of SNPs is to conduct univariate association tests on each individual SNP in order to screen out those with no evidence for disease association. The primary goal of such a procedure is not to prove that a particular variant or set of variants influences disease risk, but to prioritize SNPs for further study. Using a univariate test at this stage will result in low power for SNPs with very small marginal effects in the population, even if the SNPs have large interaction effects. Of course, in addition to testing all individual SNPs, all SNP pairs could also be tested for association. However, when dealing with many thousands of SNPs at the outset, such an approach is cumbersome and raises the question of where to stop: why not all sets of three, four, or even five SNPs as well?
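The loss of power described above can be illustrated with a minimal sketch. It assumes a toy data set, invented here for illustration, in which disease status depends only on the combination of two biallelic markers; the function name `chi_square` and the variables `snp_a`, `snp_b`, and `status` are hypothetical. A Pearson chi-square statistic is computed for each SNP alone and then for the pair treated as a single joint genotype.

```python
from collections import Counter

def chi_square(genotypes, status):
    """Pearson chi-square statistic for a genotype-by-status contingency table."""
    n = len(genotypes)
    cell = Counter(zip(genotypes, status))  # observed counts per (genotype, status)
    row = Counter(genotypes)                # genotype margins
    col = Counter(status)                   # case/control margins
    stat = 0.0
    for g in row:
        for s in col:
            expected = row[g] * col[s] / n
            stat += (cell[(g, s)] - expected) ** 2 / expected
    return stat

# Toy data (an assumption, not from the study): 40 individuals, 10 per
# combination of two SNPs; an individual is a case (1) only when the two
# SNP genotypes differ, so neither SNP has a marginal effect.
snp_a = [0] * 20 + [1] * 20
snp_b = ([0] * 10 + [1] * 10) * 2
status = [int(a != b) for a, b in zip(snp_a, snp_b)]

single_a = chi_square(snp_a, status)                 # SNP A tested alone
single_b = chi_square(snp_b, status)                 # SNP B tested alone
pair = chi_square(list(zip(snp_a, snp_b)), status)   # joint genotype of the pair

print(single_a, single_b, pair)
```

In this constructed case each univariate statistic is zero, while the joint test on the pair is as large as it can be for the sample size, so a univariate screen would discard both SNPs.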
Many model-building methods exist for dealing with large numbers of predictors. For example, stochastic search variable selection (SSVS) [1], a form of Bayesian model selection, has been explored as a tool to discover joint effects of multiple loci in the context of genetic linkage studies [2-4]. However, these methods are limited in the number of predictors that can be included at one time, causing some researchers to resort to a two-stage approach, in which only main effects are considered in a first stage, and interactions between loci with strong main effects are considered in a second stage. This approach can lead to the loss of important interactions with only weak main effects.
Multivariate adaptive regression splines (MARS) models have also been explored in the context of genetic linkage and association studies [5,6] with some degree of success. However, these and other model selection methods appear to be limited in the number of predictors that can reasonably be accommodated in one analysis, and the types of possible interactions that are allowed must be specified in advance. They are not suited to the initial task of identifying from a massive set of SNPs a subset for further analyses.
Combinatorial partitioning and multifactor dimensionality reduction [7-10] are closely related methods developed specifically to detect higher-order interactions among polymorphisms that predict trait variation. However, these methods are meant to identify interactions among small sets of SNPs, and have minimal power in the presence of genetic heterogeneity [10]. They are therefore inappropriate for use as a screening tool for searching through thousands of SNPs to identify those contributing to phenotypes in the context of whole-genome association studies. The problem remains: how do we reasonably narrow down from thousands or hundreds of thousands of SNPs to a number that can be used by available modeling methods, without losing the interactions that we hope to model in the first place?
An additional concern to be considered is genetic heterogeneity. We define genetic heterogeneity to mean that there are multiple possible ways to acquire a disease or trait that can involve different subsets of genes. Traditional regression models are limited in their ability to deal with underlying genetic heterogeneity (see, e.g., [11]). If genetic heterogeneity also leads to phenotypic heterogeneity, then methods that classify individuals into phenotypic subgroups for further analysis can be successful. Likewise, if heterogeneity in genetic etiology is primarily due to ethnic background, sub-dividing samples by self-reported ethnicity or genetically defined subgroups can be a powerful antecedent to data analyses for the identification of complex disease genes. However, even in the realm of Mendelian genetic diseases, heterogeneity is rarely so simple. For example, multiple polymorphisms in each of two different genes are responsible for familial breast cancer in the relatively homogeneous sub-population of Ashkenazi Jewish women [12]. When the root of the heterogeneity is not known a priori, traditional regression models, which lump all individuals into a single group and estimate average effects over the entire sample, are unlikely to successfully identify the genetic causes of diseases.
Classification trees and random forests
Tree-based methods consist of non-parametric statistical approaches for conducting regression and classification analyses by recursive partitioning (see, e.g., Hastie et al. [13]). These methods can be very efficient at selecting, from large numbers of predictor variables such as genetic polymorphisms, those that best explain a phenotype. Tree methods are useful when predictors may be associated in some non-linear fashion, as no implicit assumptions about the form of underlying relationships between the predictor variables and the response are made. They are well-adapted to dealing with some types of genetic heterogeneity, as separate models are automatically fit to subsets of data defined by early splits in the tree.
The ease of interpretation of classification trees, along with their flexibility in accommodating large numbers of predictors and ability to handle heterogeneity, has resulted in increasing interest in their application to genetic association and linkage studies. Classification trees have been adapted for use with sibling pairs to subdivide pairs into more homogeneous subgroups defined by non-genetic covariates [14], thus increasing the power to detect linkage under heterogeneity [15]. They have also shown promise for the dissection of complex traits for both linkage and association [16,17], and for exploring interactions [6]. A related adaptive regression method has also shown promise in selecting a small number of predictive SNPs from a set of hundreds of potential predictors [18]. Tree methods have also been used to identify homogeneous groups of cases for further analyses [19], and as an adjunct to more traditional association methods [20].
Classification trees are grown by recursively partitioning the observations into subgroups with a more homogeneous categorical response [21]. At each node, the explanatory variable (e.g., SNP) giving the most homogeneous subgroups is selected. Choosing alternative predictors that produce slightly sub-optimal splits can result in very different trees that have similar prediction accuracy.
The Random Forests methodology [22] builds on several other methods using multiple trees to increase prediction accuracy [23-25]. A random forest is a collection of classification or regression trees with two features that distinguish it from trees built in a deterministic manner. First, the trees are grown on bootstrap samples of the observations. Second, a random selection of the potential predictors is used to determine the best split at each node. For each tree, a bootstrap sample is obtained by drawing a sample with replacement from the original sample of observations. The bootstrap sample has the same number of individuals as the original sample, but some individuals are represented multiple times, while others are left out. The left-out individuals, sometimes called "out-of-bag", are used to estimate prediction error. Because a different bootstrap sample is used to grow each tree, there is a different set of out-of-bag individuals for each tree. With a forest of classification trees, each tree predicts the class of an individual. For each individual, the predictions, or "votes", are counted across all trees for which the individual was out-of-bag, and the class with the most votes is the individual's predicted class. Random forests also produce an importance score for each variable that quantifies its contribution to prediction accuracy. This score can be used to prioritize the variables, much as p-values from test statistics are used.
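The bootstrap and out-of-bag voting steps described above can be sketched in a few lines of Python. This is an illustrative simplification and not the Random Forests software: each "tree" is reduced to a single multiway split on one predictor, and the names `grow_stump`, `oob_predictions`, and `mtry`, as well as the toy data, are assumptions made for this example.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow_stump(X, y, rows, mtry, rng):
    """One-split stand-in for a tree: the best of `mtry` randomly chosen predictors."""
    candidates = rng.sample(range(len(X[0])), mtry)  # random selection of predictors
    best = None
    for j in candidates:
        groups = {}
        for i in rows:
            groups.setdefault(X[i][j], []).append(y[i])
        score = sum(len(g) * gini(g) for g in groups.values()) / len(rows)
        if best is None or score < best[0]:
            leaves = {value: Counter(labels).most_common(1)[0][0]
                      for value, labels in groups.items()}
            best = (score, j, leaves)
    _, split_var, leaves = best
    # Fallback class for a genotype value absent from the bootstrap sample.
    default = Counter(y[i] for i in rows).most_common(1)[0][0]
    return lambda x: leaves.get(x[split_var], default)

def oob_predictions(X, y, n_trees=200, mtry=1, seed=0):
    """Predicted class per individual, voting only over trees where it was out-of-bag."""
    rng = random.Random(seed)
    n = len(X)
    votes = [Counter() for _ in range(n)]
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]  # draw n individuals with replacement
        in_bag = set(boot)
        tree = grow_stump(X, y, boot, mtry, rng)
        for i in range(n):
            if i not in in_bag:                      # individual i is out-of-bag
                votes[i][tree(X[i])] += 1
    # None marks an individual that was never out-of-bag (unlikely with many trees).
    return [v.most_common(1)[0][0] if v else None for v in votes]

# Toy data (hypothetical): predictor 0 determines the class, predictor 1 is noise.
X = [[i % 2, (i // 2) % 2] for i in range(20)]
y = [row[0] for row in X]
oob_class = oob_predictions(X, y, n_trees=500, mtry=1, seed=1)
oob_error = sum(p != t for p, t in zip(oob_class, y)) / len(y)
print(oob_error)
```

Comparing `oob_class` with the true classes gives the out-of-bag estimate of prediction error. A full implementation would grow each tree recursively and would also derive variable importance scores, both of which are omitted here.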
Using ensembles of trees built in this manner increases the probability that some trees will capture interactions among variables with no strong main effect. In contrast to the model selection methods described above, interactions among predictors do not need to be explicitly specified in order to be exploited by a forest of trees. Instead, any interactions between variables serve to increase the importance of the individual interacting variables, making them more likely to be given high importance relative to other variables. Thus, random forests appear to be particularly well-suited to address a primary problem posed by large-scale association studies. In preliminary studies, we have shown the potential of random forests in the context of linkage analysis [26]. Other investigators are beginning to recognize the potential of the Random Forest methodology for studying SNP association [27] and classification [28].
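To make concrete how a tree can exploit an interaction that was never specified, the following sketch uses a constructed two-SNP example with no main effects; the data and the function `weighted_gini` are hypothetical. It compares node impurity before any split, after a split on one SNP, and after a further split on the second SNP within each branch.

```python
from collections import Counter, defaultdict

def weighted_gini(keys, status):
    """Mean Gini impurity of the groups defined by `keys`, weighted by group size."""
    groups = defaultdict(list)
    for k, s in zip(keys, status):
        groups[k].append(s)
    total = 0.0
    for members in groups.values():
        m = len(members)
        impurity = 1.0 - sum((c / m) ** 2 for c in Counter(members).values())
        total += m * impurity
    return total / len(status)

# Toy data (an assumption): status is 1 only when the two SNP genotypes differ.
snp_a = [0] * 20 + [1] * 20
snp_b = ([0] * 10 + [1] * 10) * 2
status = [int(a != b) for a, b in zip(snp_a, snp_b)]

root = weighted_gini([0] * len(status), status)                # no split at all
after_a = weighted_gini(snp_a, status)                         # split on SNP A only
after_a_then_b = weighted_gini(list(zip(snp_a, snp_b)), status)  # then split on SNP B

print(root, after_a, after_a_then_b)
```

In this example the first split yields no gain, so a deterministic greedy tree has no reason to prefer SNP A over an unrelated marker. When the candidate predictors at a node are restricted to a random subset, some trees will split on one of the interacting SNPs anyway, and the second split then separates cases from controls completely.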
To fully understand the basis of complex disease, it is important to identify the critical genetic factors involved, and to understand the complex relationships between genotypes, environment, and phenotypes. The few successes to date in identifying genes for complex disease suggest that despite carefully collected large samples, novel approaches are needed in the pursuit to dissect the multiple and varying factors that lead to complex human traits. Ultimately, the challenge in identifying polymorphisms that modulate the risk of complex disease is to find methods that can seamlessly handle large numbers of predictors, capitalize on and identify interactions, and tease apart the multiple heterogeneous etiologies. Here, we explore the use of the Random Forest methodology [22,29] as a screening tool for identifying SNPs associated with disease in the presence of interaction, heterogeneity, and large amounts of noise due to unassociated polymorphisms.