PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of biostsLink to Publisher's site
 
Biostatistics. Jan 2011; 12(1): 18–32.
Published online Jul 2, 2010. doi:  10.1093/biostatistics/kxq042
PMCID: PMC3006123
Testing SNPs and sets of SNPs for importance in association studies
Holger Schwender* and Ingo Ruczinski
Department of Biostatistics, Johns Hopkins University, Baltimore, MD 21205, USA ; holger.schwender/at/udo.edu
Katja Ickstadt
Department of Statistics, TU Dortmund University, 44221 Dortmund, Germany
*To whom correspondence should be addressed.
Received June 26, 2009; Revised May 16, 2010; Accepted June 1, 2010.
A major goal of genetic association studies concerned with single nucleotide polymorphisms (SNPs) is the detection of SNPs exhibiting an impact on the risk of developing a disease. Typically, this problem is approached by testing each of the SNPs individually. This, however, can lead to an inaccurate measurement of the influence of the SNPs on the disease risk, in particular, if SNPs only show an effect when interacting with other SNPs, as the multivariate structure of the data is ignored. In this article, we propose a testing procedure based on logic regression that takes this structure into account and therefore enables a more appropriate quantification of importance and ranking of the SNPs than marginal testing. Since even SNP interactions often exhibit only a moderate effect on the disease risk, it can be helpful to also consider sets of SNPs (e.g. SNPs belonging to the same gene or pathway) to borrow strength across these SNP sets and to identify those genes or pathways comprising SNPs that are most consistently associated with the response. We show how the proposed procedure can be adapted for testing SNP sets, and how it can be applied to blocks of SNPs in linkage disequilibrium (LD) to overcome problems caused by LD.
Keywords: Feature selection, GENICA, Importance measure, logicFS, Logic regression
One of the major goals of association studies concerned with single nucleotide polymorphisms (SNPs), that is, genetic variations occurring at a single base pair position in the genome, is the identification of SNPs and SNP interactions that increase the risk of developing a disease. Since individual SNPs typically exhibit only a slight to moderate effect—in particular, when considering complex diseases such as cancer—the focus of such studies is on the detection of SNP interactions (Garte, 2001). Several procedures have been proposed to tackle this task—including exhaustive searches based on, for example, multiple testing (Ritchie and others, 2001), (Marchini and others, 2005), (Goodman and others, 2006), methods employing evolutionary algorithms (Nunkesser and others, 2007), and approaches based on classification and regression procedures (Lunetta and others, 2004), (Bureau and others, 2005), (Kooperberg and Ruczinski, 2005), (Schwender and Ickstadt, 2008). Overviews on such methods are, for example, given by Hoh and Ott (2003) and Heidema and others (2006).
Popular examples for discrimination and regression procedures are random forests (Breiman, 2001) and logic regression (Ruczinski and others, 2003), as both can cope with the problem of having many more variables than observations, and are able to detect interactions of higher order. In particular, logic regression, which uses Boolean combinations of logic input variables for the prediction of the response, has shown a good performance in comparison to other methods in their application to SNP data (Kooperberg and others, 2001), (Witte and Fijal, 2001), (Ruczinski and others, 2004).
Several modifications of logic regression have been proposed. Logic regression has been embedded into a Bayesian framework to identify SNP interactions (Kooperberg and Ruczinski, 2005) or to include evolutionary information (Clark and others, 2007). Nunkesser and others (2007) employ genetic programming instead of simulated annealing to search for disease-associated logic expressions, whereas Clark and others (2005, 2008) use an evolutionary algorithm to identify Boolean combinations of haplotypes. Logic regression has also been applied to other types of data. For example, Keles and others (2004) modify logic regression for the search of regulatory motifs in transcription control regions of genes.
For a stabilized detection of SNP interactions that might have an effect on the disease risk, Schwender and Ickstadt (2008) introduce a procedure called logic Feature Selection (logicFS) that employs logic regression as base learner in bagging (Breiman, 1996). For the specification of how relevant the identified interactions are for the development of the disease, logicFS also provides measures for quantifying the importance of each of these interactions for a correct prediction of the case–control status.
Having quantified the importance of an interaction does, however, not give any information on how much the individual SNPs composing this interaction contribute to it, and hence, how relevant they are for the disease risk. Some of the SNPs might have a large influence on the disease risk, while others lead only to a small or moderate improvement of the prediction. It is therefore beneficial to also specify the importance of the individual SNPs, which then results in a ranking of these SNPs.
Typically, this problem is approached by testing each of the variables individually with Pearson's χ2-test or a Cochran–Armitage trend test and by ranking the SNPs by their values of this test statistic (or the corresponding p values). This, however, can generate misleading results, in particular, if SNPs show only an effect in interaction with other SNPs, as these univariate tests only consider marginal effects, and might not be able to detect multivariate structures in the data. Lunetta and others (2004), for example, show that the ranking generated by Fisher's exact test can be improved by employing the variable importance measures (VIMs) of random forests, in particular, if SNP interactions are the relevant risk factors.
In this article, we propose a testing procedure that overcomes this problem by applying logicFS, which—similar to random forests—searches for such multivariate structures, to SNP data, and employing a (standardized) importance measure as test statistic for the SNPs.
Although SNP interactions are more relevant risk factors than individual SNPs, they often do not have a huge impact on the disease risk. It is therefore not only interesting to search for SNP interactions but also to test prespecified biological sets of SNPs (e.g. SNPs belonging to the same gene or pathway) and to identify the sets of SNPs that are most consistently associated with the disease status.
Since the problem of relatively small individual effects has also been noticed in the analysis of gene expression data, several procedures such as gene set enrichment analysis (GSEA; Subramanian and others, 2005) that jointly consider genes belonging to the same GO-criterion (The Gene Ontology Consortium, 2000) have been proposed in recent years to borrow strength across these sets of genes (see Efron and Tibshirani, 2007). Critical overviews on such methods are given by Khatri and Draghici (2005) and Allison and others (2006), and methodological issues are discussed by Goeman and Bühlmann (2007).
Wang and others (2007) and Holden and others (2008) modify GSEA so that it can be applied to SNPs, whereas Chasman (2008) adapts GSEA and another gene set method introduced by Tavazoie and others (1999) to the analysis of SNPs in quantitative trait studies. A related method called SNP ratio test is proposed by O'Dushlaine and others (2009). Chapman and Whittaker (2008) compare several approaches for analyzing sets of SNPs such as Hotelling's T2-statistics (also used by Xiong and others, 2002), a Bayesian score test proposed by Goeman and others (2005), and the method of Fisher (1932) for combining p values. Since Fisher's method is based on the assumption of independent p values, Chai and others (2009) propose a correction of this procedure for linkage disequilibrium (LD).
A drawback of most of the gene set methods is that they are based on gene-specific statistics such as (univariate) p values and thus do not consider the joint expression distribution of the genes from a specific set. As noted by Nettleton and others (2008), such approaches might hence prohibit the detection of effects caused by the corresponding multivariate distribution.
If the analysis, however, is based on set-specific importance measures derived from models generated by procedures such as logic regression and logicFS that take the multivariate structure of the data into account, it is much more likely that such effects are identified. In this article, we therefore show how the proposed method for testing individual SNPs can be adapted to testing of sets of SNPs.
This procedure can also be used when applying logicFS to SNPs in strong LD. These highly correlated SNPs steal from each other's importance leading to a substantial reduction of the importances of the individual SNPs. This problem—which is similar to multicollinearity in linear models—can be overcome by jointly considering SNPs belonging to the same LD block and by computing the importance for each of these LD blocks.
Since this procedure is based on logicFS, which in turn is based on logic regression, the following section contains a brief introduction to these 2 methods. In Section 3, we describe how the importances of individual SNPs can be quantified, and how the resulting importance measures can be employed for testing SNPs. This procedure is extended to sets of SNPs in Section 4. In Section 5, the proposed method is applied to the SNPs from the GENICA study concerned with sporadic breast cancer and compared with other statistics usually employed for evaluating whether SNPs or SNP sets are associated with the disease. These procedures are also considered in a simulation study summarized in Section 6 and described in detail in Section C of the supplementary material available at Biostatistics online (http://www.biostatistics.oxfordjournals.org).
2.1. Logic regression
Logic regression is an adaptive discrimination and regression procedure that employs Boolean combinations of logic variables as predictors for the response in genetic association studies.
For the application of logic regression to SNPs, each SNP Si, i = 1,…,m, is typically coded by 2 logic variables, Si1 and Si2, one coding for a dominant and the other for a recessive effect of Si.
Such logic variables might be negated by the operator C and are combined by the logic operators ∧ (AND) and [logical or] (OR) to form logic expressions such as L = (S11S21C)[logical or]S32, which in turn are either true or false. If a logic expression L is used by logic regression as classification rule in a case–control study, then a subject will be classified as case if L is true and as control otherwise.
In logic regression, each logic expression is represented by a logic tree (see Figure 1). Using this tree representation and a set of 6 moves, simulated annealing (Kirkpatrick and others, 1983) is applied to a data set to search for the best logic regression model, that is, the model that provides the best prediction of the response, where the move set consists of moves for replacing variables or operators in the logic tree, adding logic variables in 2 different ways to the logic tree, and removing them by employing the countermoves of the growing steps (see Figure 1 for the latter 4 moves).
Fig 1.
Fig 1.
The 2 reduction moves “Pruning a branch” and “Deleting a leaf” and their countermoves “Growing a branch” and “Splitting a leaf,” respectively, from the move set of logic regression. While (more ...)
There are 2 types of logic regression models. Either a single tree can be grown to identify the logic expression that provides the best classification rule for a binary response (classification approach) or several trees Lt, t = 1,…,T, can be adaptively grown and combined by a generalized linear model g(E(Y)) = β0 + β1L1 + (...) + βTLT (regression or multiple tree approach). For more details on logic regression, see Ruczinski and others (2003).
2.2. logicFS
In logicFS (Schwender and Ickstadt, 2008), logic regression is employed for the detection of SNP interactions associated with a binary response. To stabilize the search for such interactions, bagging (Breiman, 1996) with base learner logic regression is used in logicFS. Thus, logic regression is not applied once to the whole data set, but B times to B bootstrap samples from the observations. Each of the logic trees in the resulting B logic regression models is then transformed into a disjunctive normal form (DNF), that is, a disjunction (OR combination) of conjunctions (AND combinations), as each of the conjunctions represents an interaction comprised by the logic tree. For each of the so detected SNP interactions Ih, h = 1,…,H, the value of an importance measure is computed to distinguish between interactions exhibiting an impact on the disease risk and other, possibly spurious interactions also found by logic regression.
As the importance of an interaction should not be quantified on the same data on which it has been delineated, the importance measures used in logicFS are based on the out-of-bag (OOB) observations, that is, the observations that are not part of the respective bootstrap sample. For the computation of the importance measures, the logic regression model built in the bth iteration, b = 1,…,B, is thus applied to the OOB observations of this iteration, and the number Nb of correctly classified OOB observations is determined. In the multiple tree case, the interaction Ih is afterward removed from the model, the logic regression model is refitted without Ih, and this new model—which will be the same as the original model if Ih is not part of the original model—is applied to the OOB observations to calculate the number Nb( − Ih) of correctly classified OOB observations. The importance of Ih is then given by
An external file that holds a picture, illustration, etc.
Object name is biostskxq042fx1_ht.jpg
(2.1)
In the classification approach of logicFS, an importance measure slightly different from (2.1) is considered. This measure makes it necessary to add the interaction of interest to the logic tree, which can be unambiguously done by adding the conjunction representing the interaction to the logic tree in DNF. However, neither a single logic variable nor a set of variables can be unambiguously added to a logic tree, as it is not clear whether a variable should be added as a new conjunction or to one of the interactions composing the tree. In the following, we therefore consider only importance measures similar to (2.1) in both the classification and the regression case.
3.1. Importance measures
An approach similar to the one described in Section 2.2 can be employed to quantify the importance of an individual SNP Si, i = 1,…,m. After generating B logic regression models on B bootstrap samples from the data and computing the number Nb of OOB observations correctly predicted by the bth logic regression model, b = 1,…,B, the 2 variables Si1 and Si2 coding for Si are removed from these models, and Nb( − Si) is determined. Finally, the importance of Si is computed as follows:
An external file that holds a picture, illustration, etc.
Object name is biostskxq042fx2_ht.jpg
(3.1)
In contrast to (2.1), the computation of (3.1) does not require the transformation of the logic expressions composing the models into DNFs, as Sij, j = 1,2, can be removed from the logic trees by the pruning moves from the move set of logic regression (see Figure 1). If the sibling of Sij, that is the node or leaf of the tree that exhibits the same parent node as Sij, is also a logic variable, then Sij will be removed by performing the move “Deleting a leaf.” If the sibling is an operator, the move “Pruning a branch” will eliminate Sij from the tree. Typically, these logic expressions only contain one of the variables coding for Si so that rarely 2 consecutive reduction moves on the same tree are required.
The measure (3.1) is similar to one of the importance measures provided by random forests. The main difference is that Breiman (2001) permutes the values of Si, which itself is used as categorical input variable in random forests, once to assess how relevant this variable is for the prediction, whereas Si1 and Si2 are totally removed from the model to determine (3.1) since a logic variable can be deleted from a logic tree without destroying the remaining structure of the tree. By contrast, the hierarchical structure of a CART tree used in random forests might be destroyed after removing a variable from it, as no surrogate variables are retained in random forests.
3.2. Testing procedure
One drawback of importance measures such as (3.1) is that they depend on the size of the data set. To overcome this problem, we follow Breiman (2001) and divide (3.1) by its standard error leading to the standardized importance
An external file that holds a picture, illustration, etc.
Object name is biostskxq042fx3_ht.jpg
(3.2)
of SNP Si, where sImp(Si) is the sample standard deviation over the improvements Impb(Si) = NbNb( − Si), b = 1,…,B, due to Si.
Since (3.2) approximates the statistic of a one-sided paired t-test for determining whether Nb is on average larger than Nb( − Si), it can also be used to specify a natural cutoff for calling Si, i = 1,…,m, important. If, for example, the Bonferroni correction is used to adjust for multiple comparisons, this cutoff might be set to the 1 − 0.05 /m quantile of the t-distribution with B − 1 degrees of freedom.
If the assumptions of normality and independence should be avoided, it is also possible to employ a permutation method to estimate the p values. Since scalability is a concern in the application of logicFS to a sufficient number of permutations, we use the original predictors, that is, the logic trees found in the original application of logicFS, directly in the permutation method. In each iteration a = 1,…,A, of this procedure, the response is randomly permuted. If the multiple tree approach is considered, the B logic regression models are afterward rebuilt using the original logic trees (i.e. the parameters in these models are reestimated). Then, Si is removed from the logic regression models, the models are refitted, and the numbers of correctly classified OOB observations are computed based on the permuted values of the response and the predictions of either the refitted models containing Si or the ones without Si. Finally, VIMNorma(Si), that is, the value of (3.2) for Si in permutation a, is calculated and compared with VIMNorm(Si) to determine the (unadjusted) p value
An external file that holds a picture, illustration, etc.
Object name is biostskxq042fx4_ht.jpg
(3.3)
As shown in Section E of the supplementary material available at Biostatistics online, this approach leads to similar results as the extensive approach of applying logicFS to the A permutations of the response but is much less time-consuming.
Besides employing (3.3) or the p values based on an asymptotic t-distribution for calling variables important, both can also be used to define the p-value-based importance measure
An external file that holds a picture, illustration, etc.
Object name is biostskxq042fx5_ht.jpg
(3.4)
where padj(Si) is the p value of Si that has been adjusted for multiple comparisons. Instead of the actual p value, (3.4) is used since an importance measure should meet the criterion “the larger the value, the more important is the variable.” An alternative to (3.4) is
An external file that holds a picture, illustration, etc.
Object name is biostskxq042fx6_ht.jpg
(3.5)
where padj*(Si) = padj(Si) when considering a t-distribution, and padj*(Si) is computed based on praw* = max{praw(Si),(10A) − 1} when employing (3.3) to avoid an infinitive importance (if praw(Si) = 0). The advantage of using (3.5) is that it emphasizes differences in very small p values (e.g. 10 − 2 compared with 10 − 5) and reduces the weight on distinctions between large p values.
4.1. Testing sets of SNPs
Besides the identification of disease-associated SNPs and SNP interactions, genetic association studies also aim to evaluate which genes or pathways are relevant for the development of a disease. This task is often approached by considering the list of identified SNPs and attempting to determine which biological sets of SNPs are overrepresented in the list.
A more sophisticated procedure based on a SNP-based measure for genes or pathways would enable an automatic detection of important sets. Such a measure should be able to borrow strength across the SNP sets so that sets can be identified that consist of SNPs that might not have significant effects per se but are most consistently associated with the disease. Furthermore, this measure needs to take the multivariate data structure into account so that sets can be detected in which the SNPs might not have a main effect, but interactions composed of some of these SNPs exhibit an important impact on the disease risk.
The procedure described in Section 3 can also be used for this purpose by removing all logic variables from the logic regression models that code for SNPs belonging to a genetic set Sk, k = 1,…,K, and computing
An external file that holds a picture, illustration, etc.
Object name is biostskxq042fx7_ht.jpg
(4.1)
Analogous to (3.1), (4.1) can then be standardized using (3.2) and employed in the testing procedure introduced in Section 3.2 to identify important SNP sets. For the computation of (4.1), it is not necessary that the sets are disjoint such that a SNP can be part of more than one biological set. Thus, SNPs assigned to more than one gene or belonging to more than one pathway can also be considered with (4.1).
4.2. Testing SNPs in LD
When genotyping several or all SNPs from candidate genes or genetic regions, blocks of SNPs will be in strong LD, as they tend to be inherited together. In this case, the importance measures of logicFS (and also those of random forests) suffer from a problem similar to multicollinearity in linear models, as the highly correlated SNPs substantially reduce each other's importance.
In the analysis of genotype data, this problem is typically tackled by either selecting one (or a few) tagSNPs explaining most of the variation in a block, and using these tagSNPs in the actual analysis, or by estimating haplotypes before or (only occasionally) within the statistical procedure, and considering these haplotypes as variables. For the latter approach within the framework of logic regression, see Clark and others (2005, 2005). A third solution is to employ the individual SNPs in the testing procedure and to summarize their effects by one measure.
While logic regression and thus logicFS were actually developed for the first solution, it is also possible to tackle the problem of highly correlated SNPs in the third way by considering the LD blocks as units/sets and by computing the importance (4.1) for each of the LD blocks instead of determining the importances of the individual SNPs. In Section C.5 of the supplementary material available at Biostatistics online, 2 examples for this setting show that the importances of individual SNPs can be lowered substantially, but the blocks containing disease-associated SNPs exhibit large importances.
5.1. The GENICA study
The GENICA study is a case–control study carried out by the interdisciplinary study group on Gene ENvironment Interaction and breast CAncer in Germany. The cases and controls of this age-matched and population-based study have been recruited in the larger Bonn region, Germany, between 2000 and 2002. For more details, see Justenhoven and others (2004) and http://www.genica.de.
In this article, we focus on a subset of the genotypes from the GENICA study. More precisely, the data of 1199 women (592 cases and 607 controls) typed at 39 SNPs belonging to the estrogen, the DNA repair, or the control of cell cycle pathway are considered in the analysis. Missing data (1.3% of the genotypes) are imputed based on weighted nearest neighbors.
In this data set, the SNP ERCC2_6540 (RefSNP ID: rs1799793) from the gene ERCC2 shows an odds ratio of 1.8 when estimating its association with disease. This effect increases to about 3.0, when this SNP is considered simultaneously with ERCC2_18880 (rs1052559), where ERCC2_18880 exhibits no main effect (see Supplementary Figure 1 in Section B of the supplementary material available at Biostatistics online). None of the other SNPs seems to have an effect—neither themselves nor in interaction with the other SNPs—on the breast cancer risk (see Justenhoven and others, 2004; Schwender and Ickstadt, 2008).
5.2. Testing individual SNPs
The classification approach of logicFS with B = 50 iterations is applied to the GENICA data to generate 50 logic regression models. These models are then employed to determine the standardized importances (3.2) of the SNPs and the values of the p-value-based measure (3.4) using A = 10000 permutations of the response.
The resulting importances are summarized in Table 1. This table reveals that only ERCC2_6540 and ERCC2_18880 are identified to be important for an accurate description of the disease risk, and none of the other SNPs seems to have a substantial effect (see also Supplementary Figure 1 in Section B of the supplementary material available at Biostatistics online).
Table 1.
Table 1.
Values of VIMNorm, VIMPval1, Pearson's χ2-statistic, and the MAX-statistics of Hothorn and Hothorn (2009), as well as the posterior probabilities of association (PPA) based on the Bayes factor (BF) of Marchini and others (2007) and the approximate (more ...)
For comparison, the SNPs of the GENICA study are also tested with Pearson's χ2-test and the MAX-test of Hothorn and Hothorn (2009) that takes the maximum over the Cochran–Armitage trend statistics for an additive, a dominant, and a recessive model as test statistic. Since recently several authors have argued that instead of p values Bayes factors should be considered in the detection of disease-associated SNPs (see, e.g. Wakefield, 2007; Stephens and Balding, 2009), posterior probabilities of association are also computed based on both the Bayes factors determined with the software SNPTEST (Marchini and others, 2007) and the approximate Bayes factors introduced by Wakefield (2007).
Since ERCC2_6540 itself exhibits an effect on the disease risk, it is identified as the most important SNP among all SNPs from the GENICA study by all 4 univariate testing procedures. However, only the importance measures of logicFS enable the detection of ERCC2_18880 as important and rank it as the second most important SNP. By contrast, ERCC2_18880, which only shows an effect in interaction with ERCC2_6540 (see Justenhoven and others, 2004), ranks seventh or fifth with an unadjusted p value of 0.106 or 0.087 when testing it univariately with Pearson's χ2-test or the MAX-test, respectively, and exhibits (marginal) posterior probabilities of associations of about 0.1.
To investigate how stable the values of VIMNorm(Si) are and for a comparison with random forests, both logicFS and random forests are applied 100 times to the GENICA data set, where in random forests, the SNPs are used as categorical input variables, and in logicFS, the logic variables coding for the SNPs are considered.
In Figure 2, the distributions of the resulting values of VIMNorm are displayed for the 5 SNPs from Table 1. This figure reveals that ERCC2_6540 and ERCC2_18880 are found as the 2 most important SNPs in each of the applications with similar importances (standard deviations of the importances are 1.57 for ERCC2_6540 and 0.91 for of ERCC2_18880). Besides these 2 SNPs called important in all 100 replications, only CYP1B1_1358 shows in 12 of the applications an importance larger than 3.179, which is the 1 − 0.05/39 quantile of the t49-distribution. None of the other 36 SNPs exceeds this threshold in any of the replications, and their rank varies strongly between applications. If the importance measure (3.4) derived from permutation-based p values is considered, the values of VIMPval1(CYP1B1_1358) range from 0 to 0.78 so that CYP1B1_1358 is never called important, whereas VIMPval1(ERCC2_6540) is always 1 and VIMPval1(ERCC2_18880) ≥ 0.97.
Fig 2.
Fig 2.
Distribution of the values of VIMNorm for each of the 5 SNPs shown in Table 1 over 100 applications of the classification approach of logicFS to the 39 SNPs from the GENICA study using B = 50 iterations. The dashed vertical line marks the 1 − (more ...)
For the determination of VIMPval1, the p values are adjusted by the Bonferroni correction. Other less strict adjustment methods for controlling the family-wise error rate or the false discovery rate lead to only slight increases of VIMPval1.
As in the applications of logicFS, random forests always detects ERCC2_6540 as the most important variable. However, ERCC2_18880 is identified as second most important in only 93 of the 100 analyses. In the other applications, it either ranks third (4 times), fourth (2 times), or fifth (1 time), and other SNPs that actually do not seem to have an effect on the disease risk are found as more important. Hence, logicFS is the only here considered procedure that enables the detection of both ERCC2_6540 and ERCC2_18880 in all 100 applications, although just one parameter setting is considered in logicFS, whereas the results of random forests are optimized over several parameter settings and 2 importance measures.
5.3. Testing Sets of SNPs
The 39 SNPs from the GENICA study belong to 24 genes, which in turn belong to 3 pathways. Therefore, the importances (4.1) of these biological sets of SNPs and the corresponding standardized and p-value-based importances are computed. In Table 2, the results for the 5 most important of these genes are summarized. Not surprisingly, this table shows that ERCC2 is by far the most important gene and the only gene that exceeds the cutoffs for calling a variable important.
Table 2.
Table 2.
Values of SNP-based importance measures for the 5 most important genes found in a logicFS analysis of the 39 SNPs from the GENICA study. For the computation of VIMPval1, the permutation-based p values are adjusted using the Bonferroni correction. Values (more ...)
When considering the 3 pathways, the DNA repair pathway, to which the gene ERCC2 belongs, exhibits the by far largest importance (VIMNorm = 16.14). However, the estrogen pathway also shows an importance (VIMNorm = 2.25) that is a little larger than 2.19, which is the 1 − 0.05/3 quantile of the t49-distribution. This effect is mostly due to the SNPs from the genes HSD17B1, SRD5A2, and EPHX (see Table 2) that belong to this pathway. The individual SNPs thus do not have an impact on the disease risk, but if the SNPs from the estrogen pathway are considered jointly, the prediction slightly improves. However, this effect vanishes when considering the p-value-based measure (3.4) using A = 10,000 permutations. In this case, this pathway shows an importance of 0.747. In contrast to these 2 pathways, the cell cycle pathway has a negative importance (VIMNorm = − 0.32). If thus SNPs from this pathway are included, they will worsen the prediction.
To confirm the general findings from the analyses of the GENICA data, we set up a simulation study composed of 6 scenarios in each of which 50 data sets consisting of 100 SNPs genotyped at 1000 cases and 1000 controls are considered. The SNPs in the first 4 simulations are drawn independently, and blocks of SNPs in very strong LD are simulated for the last 2 scenarios. In each of these simulations, the response is specified by an underlying model based on a 2-way interaction, where one of the SNPs composing this interaction intended to be associated with a disease shows only an effect in interaction with the other SNP.
The procedures considered in Section 5 are also applied to the data sets from Simulations 1–4. Depending on the effect size in the simulations, the univariate measures identify the SNP with a main effect either always or in most of the applications, whereas the other SNP without main effect is never detected. In fact, this SNP typically does not rank under the top 10 of the 100 SNPs. The performance of random forests highly depends on the actual effect size. If the effect size is relatively large, random forests always detects the SNP with main effect as the most important one. If it is relatively small, random forests ranks this SNP first only in 36 of the 50 applications. Similar results are delivered for the SNP exhibiting only an interaction effect. The only measure that always identifies the 2 SNPs intended to be disease associated as important and ranks them as first and second is the importance measure of logicFS.
For the application of the set measure (4.1), we assume that subsets of the SNPs from the first 4 simulations belong to different genes. Both the standardized version of (4.1) and procedures typically used for analyzing sets of SNPs (see Chapman and Whittaker, 2008) are applied to these 4 simulations and to the LD block data in the fifth and sixth simulation scenarios. As in the comparison of the measures for testing individual SNPs, the logicFS-based importance is the only measure that enables the identification of the sets containing the 2 SNPs intended to be disease associated as the 2 most important ones in any of the applications to the data sets from the 6 simulations. The other approaches usually detect the set containing the SNP with a main effect as the most important one, whereas the set containing the SNP having no main effect is very rarely identified by these procedures.
A detailed description of the simulation study is presented in Section C of the supplementary material available at Biostatistics online.
In genetic association studies, SNPs are typically tested univariately to detect the SNPs that are associated with the response, although it is assumed—in particular when considering complex diseases such as cancer—that interactions of several SNPs are actually the relevant risk factors. However, SNPs showing only an effect when interacting with other SNPs cannot be identified with marginal testing. In this article, we have proposed a testing procedure based on logicFS that enables the detection of such SNPs, as logic regression, which underlies logicFS, takes the multivariate structure of the SNP data into account.
In an analysis of the SNPs from the GENICA study, the proposed procedure is not only able to detect ERCC2_6540 (which exhibits a main effect) as important but also ERCC2_18880, which has no impact on the disease risk when considered marginally, but increases the effect of ERCC2_6540 substantially when interacting with this SNP. In this article, we have focussed on the classification approach of logicFS but have not included the results from the logistic regression approach, as both lead to similar results (see Supplementary Figure 2 in Section B of the supplementary material available at Biostatistics online).
For comparison, we have also applied 5 other procedures—4 univariate approaches and random forests—that are typically used in the analysis of genotype data to the SNPs from the GENICA study. While all these methods are able to detect ERCC2_6540, none of the univariate approaches enables the identification of ERCC2_18880, and the parameter settings of random forests must be optimized to allow a detection of this SNP in 93% of the applications. When testing all possible interactions of 2 SNPs from the GENICA study with an extension of the univariate tests, the interaction of ERCC2_6540 and ERCC2_18880 would have also been identified as associated with the disease. However, such tests do not provide statistics on how much the individual SNPs contribute to this association.
The results of the analyses of the GENICA data are in agreement with the results of a simulation study presented in detail in Section C of the supplementary material available at Biostatistics online in which logicFS is the only procedure that always detects the 2 SNPs designed to have an effect on the disease risk.
In this simulation study, we have also considered LD blocks of SNPs. We have focussed on blocks of SNPs in very strong LD, as high correlation most adversely affects the importance measures. However, there is a wide range of different LD structures that has not been considered in this study. It is an interesting remaining question how these LD patterns interfere with the importances.
In Section D of the supplementary material available at Biostatistics online, we also show that the value of the importance measure is influenced by the observed effect size, which can differ from the actual effect size in the reference population. This, however, does not unduly affect the ranking of the SNPs. While bagging can stabilize the search for important interactions, it does not take the variation of the initial sampling from the reference population into account. This sampling variation might be incorporated in logicFS by, for example, employing Bayesian procedures such as Bayesian bootstrap (Efron and Tibshirani, 1993), (Rubin, 1981).
The importance measure of logicFS, which is related to one of the measures of random forests, is not restricted to logic regression, but can also be used in other regression and discrimination procedures such as Genetic Programming for Association Studies (GPAS; Nunkesser and others, 2007), or can be adapted to the haplotype-based logic regression approaches of Clark and others (2005), Clark and others (2008) to quantify the importances of haplotypes. Moreover, this measure can also be applied to other genetic and nongenetic logic variables. In this case, the importances of the input variables might be of interest, which can be computed analogously to VIMSNP. For an application of this measure to the GENICA data, see Section B of the supplementary material available at Biostatistics online.
In contrast to the predictors, the response does not have to be binary in logicFS. In fact, versions of logicFS for multicategorical and quantitative responses already exist and extensions to, for example, survival data are possible. In Section A of the supplementary material available at Biostatistics online, these adaptations are discussed.
While bagging is employed in logicFS, other ensemble methods such as boosting (Freund and Shapire, 1996) with base learner logic regression can also be used to test SNPs for their importance. An interesting question is whether this would improve the detection of important SNPs.
The computation time of logicFS might be reduced from a few minutes to a few seconds by parallelizing the applications of logic regression to the different bootstrap samples and the fitting of the generalized linear models in the determination of the importance measures. Similarly, the time for computing the permutation-based p values might be reduced substantially.
The software for logic regression is currently restricted to 1000 logic variables, as logic regression has initially been developed for candidate SNP studies. However, logic regression and logicFS are able to handle thousands of variables so that both can be applied to candidate gene studies. For this, only the above limit has to be changed in the software, which then needs to be recompiled. Since probabilistic model searches with, for example, simulated annealing are not scalable for genome-wide association studies, the proposed methods here are not applicable in these settings. However, it would be possible to analyze a few ten thousand SNPs with a version of logic regression adapted to this new situation. First attempts with such a modified logic regression have shown promising results.
The importance measures described in this article are implemented in the R package logicFS version 1.18.0 or later available at http://www.bioconductor.org.
FUNDING
Deutsche Forschungsgemeinschaft (SFB 475: “Reduction of Complexity in Multivariate Data Structures,” SCHW 1508/1-1 to H.S.); the National Institutes of Health (HL090577 to I.R.).
SUPPLEMENTARY MATERIAL
Acknowledgments
The authors would like to thank the Editors, the Associate Editor, and 3 anonymous reviewers for very fruitful comments, Tom Louis for valuable discussions, High-Seng Chai for providing the GLOSSI software, and all partners within the GENICA research network for providing the data used in this article. Conflict of Interest: None declared.
  • Allison DB, Cui X, Page GP, Sabripour M. Microarray data analysis: from disarray to consolidation and consensus. Nature Review Genetics. 2006;7:55–65. [PubMed]
  • Breiman L. Bagging predictors. Machine Learning. 1996;26:123–140.
  • Breiman L. Random forests. Machine Learning. 2001;45:5–32.
  • Bureau A, Dupuis J, Falls K, Lunetta KL, Hayward B, Keith TP, van Eerdewegh P. Identifying SNPs predictive of phenotype using random forests. Genetic Epidemiology. 2005;28:171–182. [PubMed]
  • Chai HS, Sicotte H, Bailey KR, Turner ST, Asmann YW, Kocher JP. Glossi: a method to assess the association of genetic loci-sets with complex diseases. BMC Bioinformatics. 2009;10:102. [PMC free article] [PubMed]
  • Chapman J, Whittaker J. Analysis of multiple SNPs in a candidate gene or region. Genetic Epidemiology. 2008;32:560–566. [PMC free article] [PubMed]
  • Chasman DI. On the utility of gene set methods in genomewide association studies of quantitative traits. Genetic Epidemiology. 2008;32:658–668. [PubMed]
  • Clark TG, De Iorio M, Griffiths RC. Bayesian logistic regression using a perfect phylogeny. Biostatistics. 2007;8:32–52. [PubMed]
  • Clark TG, De Iorio M, Griffiths RC. An evolutionary algorithm to find associations in dense genetic maps. IEEE Transactions on Evolutionary Computation. 2008;12:297–306.
  • Clark TG, De Iorio M, Griffiths RC, Farrall M. Finding associations in dense genetic maps: a genetic algorithm approach. Human Heredity. 2005;60:97–108. [PubMed]
  • Efron B, Tibshirani R. On testing the significance of sets of genes. Annals of Applied Statistics. 2007;1:107–129.
  • Efron B, Tibshirani RJ. An Introduction to the Bootstrap. New York: Chapman & Hall; 1993.
  • Fisher RA. Statistical Methods for Research Workers. 4th edition. Edinburgh, UK: Oliver and Boyd; 1932.
  • Freund Y, Shapire RE. Machine Learning: Proceedings of the 13th International Conference. San Francisco, CA: Morgan Kauffman; 1996. Experiments with a new boosting algorithm. pp. 148–156.
  • Garte S. Metabolic susceptibility genes as cancer risk factors: time for a reassessment? Cancer Epidemiology, Biomarkers & Prevention. 2001;10:1233–1237. [PubMed]
  • Goeman JJ, Bühlmann P. Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics. 2007;23:980–987. [PubMed]
  • Goeman JJ, van de Geer SA, van Houwelingen HC. Testing against a high-dimensional alternative. Journal of the Royal Statistical Society. Series B. 2005;68:477–493.
  • Goodman J, Mechanic L, Luke B, Ambs S, Chanock S, Harris C. Exploring SNP-SNP interactions and colon cancer risk using polymorphism interaction analysis. International Journal of Cancer. 2006;118:1790–1797. [PMC free article] [PubMed]
  • Heidema G, Boer J, Nagelkerke N, Mariman E, van de AD, Feskens E. The challenge for genetic epidemiologists: how to analyze large numbers of SNPs in relation to complex diseases. BMC Genetics. 2006;7:23. [PMC free article] [PubMed]
  • Hoh J, Ott J. Mathematical multi-locus approaches to localizing complex human trait genes. Nature Reviews Genetics. 2003;4:701–709. [PubMed]
  • Holden M, Deng S, Wojnowski L, Kulle B. GSEA-SNP: applying gene set enrichment analysis to SNP data from genome-wide association studies. Bioinformatics. 2008;24:2784–2785. [PubMed]
  • Hothorn L, Hothorn T. Order-restricted scores test for the evaluation of population-based case-control studies when the genetic model is unknown. Biometrical Journal. 2009;51:659–669. [PubMed]
  • Justenhoven C, Hamann U, Pesch B, Harth V, Rabstein S, Baisch C, Vollmert C, Illig T, Ko Y, Brüning T, Brauch H. ERCC2 genotypes and a corresponding haplotype are linked with breast cancer risk in a German population. Cancer Epidemiology, Biomarkers & Prevention. 2004;13:2059–2064. [PubMed]
  • Keles S, van der Laan MJ, Vulpe C. Regulatory motif finding by logic regression. Bioinformatics. 2004;20:2799–2811. [PubMed]
  • Khatri P, Draghici S. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics. 2005;21:3587–3595. [PMC free article] [PubMed]
  • Kirkpatrick S, Gelatt CDJ, Vecchi MP. Optimization by simulated annealing. Science. 1983;220:671–680. [PubMed]
  • Kooperberg C, Ruczinski I. Identifying interacting SNPs using Monte Carlo logic regression. Genetic Epidemiology. 2005;28:157–170. [PubMed]
  • Kooperberg C, Ruczinski I, LeBlanc M, Hsu L. Sequence analysis using logic regression. Genetic Epidemiology. 2001;21:626–631. [PubMed]
  • Lunetta K, Hayward L, Segal J, van Eerdewegh P. Screening large-scale association study data: Exploiting interactions using random forests. BMC Genetics. 2004;10:32. [PMC free article] [PubMed]
  • Marchini J, Donnelly P, Cardon R. Genome-wide strategies for detecting multiple loci that influence complex diseases. Nature Genetics. 2005;37:413–416. [PubMed]
  • Marchini J, Howie B, Myers S, McVean G, Donnelly P. A new multipoint method for genomewide association studies by imputation of genotypes. Nature Genetics. 2007;39:906–913. [PubMed]
  • Nettleton D, Recknor J, Reecy JM. Identification of differentially expressed gene categories in microarray studies using nonparametric multivariate analysis. Bioinformatics. 2008;24:192–201. [PubMed]
  • Nunkesser R, Bernholt T, Schwender H, Ickstadt K, Wegener I. Detecting high-order interactions of single nucleotide polymorphisms using genetic programming. Bioinformatics. 2007;23:3280–3288. [PubMed]
  • O'Dushlaine C, Kenny E, Heron EARS, Gill M, Morris DW, Corvin A. The SNP ratio test: pathway analysis of genome-wide association datasets. Bioinformatics. 2009;25:2762–2763. [PubMed]
  • Ritchie M, Hahn L, Roodi N, Bailey L, Dupont W, Parl F, Moore J. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. American Journal of Human Genetics. 2001;69:138–147. [PubMed]
  • Rubin DB. The Bayesian bootstrap. Annals of Statistics. 1981;9:130–134.
  • Ruczinski I, Kooperberg C, LeBlanc M. Logic regression. Journal of Computational and Graphical Statistics. 2003;12:475–511.
  • Ruczinski I, Kooperberg C, LeBlanc M. Exploring interactions in high-dimensional genomic data: an overview of logic regression, with applications. Journal of Multivariate Analysis. 2004;90:178–195.
  • Schwender H, Ickstadt K. Identification of SNP interactions using logic regression. Biostatistics. 2008;9:187–198. [PubMed]
  • Stephens M, Balding DJ. Bayesian statistical methods for genetic association studies. Nature Reviews Genetics. 2009;10:681–690. [PubMed]
  • Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES. and others. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Science of the United States of America. 2005;102:15545–15550. [PubMed]
  • Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM. Systematic determination of genetic network architecture. Nature Genetics. 1999;22:281–285. [PubMed]
  • The Gene Ontology Consortium. Gene ontology: tool for the unification of biology. Nature Genetics. 2000;25:25–29. [PMC free article] [PubMed]
  • Wakefield J. A Bayesian measure of the probability of false discovery in genetic epidemiology studies. American Journal of Human Genetics. 2007;81:208–227. [PubMed]
  • Wang K, Li M, Bucan M. Pathway-based approaches for analysis of genomewide association studies. American Journal of Human Genetics. 2007;81:1278–1283. [PubMed]
  • Witte JS, Fijal BA. Introduction: analysis of sequence data and population structure. Genetic Epidemiology. 2001;21:600–601. [PubMed]
  • Xiong M, Zhao J, Boerwinkle E. Generalized T2 test for genome association studies. American Journal of Human Genetics. 2002;70:1257–1268. [PubMed]
Articles from Biostatistics (Oxford, England) are provided here courtesy of
Oxford University Press