One of the major goals of association studies concerned with single nucleotide polymorphisms (SNPs), that is, genetic variations occurring at a single base pair position in the genome, is the identification of SNPs and SNP interactions that increase the risk of developing a disease. Since individual SNPs typically exhibit only a slight to moderate effect, in particular when considering complex diseases such as cancer, the focus of such studies is on the detection of SNP interactions (Garte, 2001). Several procedures have been proposed to tackle this task, including exhaustive searches based on, for example, multiple testing (Ritchie and others, 2001; Marchini and others, 2005; Goodman and others, 2006), methods employing evolutionary algorithms (Nunkesser and others, 2007), and approaches based on classification and regression procedures (Lunetta and others, 2004; Bureau and others, 2005; Kooperberg and Ruczinski, 2005; Schwender and Ickstadt, 2008). Overviews of such methods are given, for example, by Hoh and Ott (2003) and Heidema and others
Popular examples of discrimination and regression procedures are random forests (Breiman, 2001) and logic regression (Ruczinski and others, 2003), as both can cope with the problem of having many more variables than observations and are able to detect interactions of higher order. In particular, logic regression, which uses Boolean combinations of logic input variables for the prediction of the response, has performed well in comparison to other methods when applied to SNP data (Kooperberg and others, 2001; Witte and Fijal, 2001; Ruczinski and others
Several modifications of logic regression have been proposed. Logic regression has been embedded into a Bayesian framework to identify SNP interactions (Kooperberg and Ruczinski, 2005) or to include evolutionary information (Clark and others, 2007). Nunkesser and others (2007) employ genetic programming instead of simulated annealing to search for disease-associated logic expressions, whereas Clark and others (2005, 2008) use an evolutionary algorithm to identify Boolean combinations of haplotypes. Logic regression has also been applied to other types of data. For example, Keles and others (2004) modify logic regression to search for regulatory motifs in transcription control regions of genes.
For a stabilized detection of SNP interactions that might have an effect on the disease risk, Schwender and Ickstadt (2008) introduce a procedure called logic Feature Selection (logicFS) that employs logic regression as base learner in bagging (Breiman, 1996). To specify how relevant the identified interactions are for the development of the disease, logicFS also provides measures that quantify the importance of each of these interactions for a correct prediction of the case–control status.
Quantifying the importance of an interaction does not, however, give any information on how much the individual SNPs composing this interaction contribute to it and, hence, how relevant they are for the disease risk. Some of the SNPs might have a large influence on the disease risk, while others lead only to a small or moderate improvement of the prediction. It is therefore beneficial to also specify the importance of the individual SNPs, which then results in a ranking of these SNPs.
Typically, this problem is approached by testing each of the variables individually with Pearson's χ²-test or a Cochran–Armitage trend test and by ranking the SNPs by the values of the test statistic (or the corresponding p values). This, however, can generate misleading results, in particular if SNPs show an effect only in interaction with other SNPs, as these univariate tests consider only marginal effects and might not be able to detect multivariate structures in the data. Lunetta and others (2004), for example, show that the ranking generated by Fisher's exact test can be improved by employing the variable importance measures (VIMs) of random forests, in particular if SNP interactions are the relevant risk factors.
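The limitation of univariate tests described above can be illustrated with a small sketch. It is not taken from any study: the counts are hypothetical, each SNP is reduced to a binary indicator (for example, carrier versus non-carrier of the minor allele, a simplifying assumption), and the names `pearson_chi2`, `cross_tabulate`, and `CASES_PER_100` are introduced here purely for illustration. The disease risk is constructed to depend only on whether exactly one of the two SNP indicators is set, so each marginal 2×2 table is perfectly balanced while the Boolean combination is strongly associated with the case–control status.

```python
def pearson_chi2(table):
    """Pearson chi-square statistic of a 2x2 table [[a, b], [c, d]]
    (no continuity correction)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def cross_tabulate(records, feature):
    """Cross-tabulate a binary feature (rows) against case status (columns)."""
    table = [[0, 0], [0, 0]]
    for snp_a, snp_b, is_case, count in records:
        table[feature(snp_a, snp_b)][is_case] += count
    return table


# Hypothetical data: 100 individuals per combination of the two binary SNP
# indicators; the value is the number of cases among these 100 individuals.
# Risk is high only if exactly one indicator equals 1 (a pure interaction).
CASES_PER_100 = {(0, 0): 20, (0, 1): 80, (1, 0): 80, (1, 1): 20}

records = []
for (snp_a, snp_b), n_cases in CASES_PER_100.items():
    records.append((snp_a, snp_b, 1, n_cases))        # cases
    records.append((snp_a, snp_b, 0, 100 - n_cases))  # controls

# Marginal (univariate) tests of each SNP indicator.
chi2_a = pearson_chi2(cross_tabulate(records, lambda a, b: a))
chi2_b = pearson_chi2(cross_tabulate(records, lambda a, b: b))
# Test of the Boolean combination "exactly one of the two indicators is 1".
chi2_xor = pearson_chi2(cross_tabulate(records, lambda a, b: a ^ b))

print(chi2_a, chi2_b, chi2_xor)
```

With these counts, both single-SNP statistics are 0, whereas the statistic for the Boolean combination is 144 on one degree of freedom, so a ranking based on the marginal tests alone would place both SNPs at the bottom.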
In this article, we propose a testing procedure that overcomes this problem by applying logicFS to SNP data and employing a (standardized) importance measure as test statistic for the SNPs. Similar to random forests, logicFS searches for such multivariate structures.
Although SNP interactions are more relevant risk factors than individual SNPs, they often do not have a huge impact on the disease risk. It is therefore not only interesting to search for SNP interactions but also to test prespecified biological sets of SNPs (e.g. SNPs belonging to the same gene or pathway) and to identify the sets of SNPs that are most consistently associated with the disease status.
Since the problem of relatively small individual effects has also been noticed in the analysis of gene expression data, several procedures such as gene set enrichment analysis (GSEA; Subramanian and others, 2005) that jointly consider genes belonging to the same GO-criterion (The Gene Ontology Consortium, 2000) have been proposed in recent years to borrow strength across these sets of genes (see Efron and Tibshirani, 2007). Critical overviews of such methods are given by Khatri and Draghici (2005) and Allison and others (2006), and methodological issues are discussed by Goeman and Bühlmann (2007).
Wang and others (2007) and Holden and others (2008) modify GSEA so that it can be applied to SNPs, whereas Chasman (2008) adapts GSEA and another gene set method introduced by Tavazoie and others (1999) to the analysis of SNPs in quantitative trait studies. A related method called the SNP ratio test is proposed by O'Dushlaine and others (2009). Chapman and Whittaker (2008) compare several approaches for analyzing sets of SNPs such as Hotelling's T²-statistics (also used by Xiong and others, 2002), a Bayesian score test proposed by Goeman and others (2005), and the method of Fisher (1932) for combining p values. Since Fisher's method is based on the assumption of independent p values, Chai and others (2009) propose a correction of this procedure for linkage disequilibrium (LD).
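Fisher's method mentioned above combines k p values into the statistic X = −2 Σ ln pᵢ, which follows a χ² distribution with 2k degrees of freedom under the null hypothesis, provided the p values are independent. The following is a minimal sketch of the uncorrected method only. The function name `fisher_combined_p` is hypothetical, the tail probability is computed from the closed form of the χ² survival function for even degrees of freedom rather than from a statistics library, and the LD correction of Chai and others (2009) is not implemented.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p values with Fisher's method.

    Returns the statistic X = -2 * sum(log(p_i)) and its upper-tail
    probability under a chi-square distribution with 2k degrees of freedom.
    """
    if not p_values:
        raise ValueError("at least one p value is required")
    if any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p values must lie in the interval (0, 1]")

    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # equals X / 2

    # For even degrees of freedom 2k, the chi-square survival function is
    # P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    term = 1.0
    series = 1.0
    for j in range(1, k):
        term *= half_stat / j
        series += term
    return 2.0 * half_stat, math.exp(-half_stat) * series


# Hypothetical p values of three SNPs belonging to the same set.
statistic, combined_p = fisher_combined_p([0.04, 0.20, 0.11])
print(statistic, combined_p)
```

For a single p value, the combined p value equals the input. SNPs in LD tend to yield positively correlated p values, in which case the χ² reference distribution used here no longer applies and the uncorrected combined p value can be too small.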
A drawback of most gene set methods is that they are based on gene-specific statistics such as (univariate) p values and thus do not consider the joint expression distribution of the genes from a specific set. As noted by Nettleton and others (2008), such approaches might hence prevent the detection of effects caused by the corresponding multivariate distribution.
If the analysis, however, is based on set-specific importance measures derived from models generated by procedures such as logic regression and logicFS that take the multivariate structure of the data into account, it is much more likely that such effects are identified. In this article, we therefore show how the proposed method for testing individual SNPs can be adapted to testing sets of SNPs.
This procedure can also be used when applying logicFS to SNPs in strong LD. These highly correlated SNPs steal from each other's importance, leading to a substantial reduction of the importances of the individual SNPs. This problem, which is similar to multicollinearity in linear models, can be overcome by jointly considering SNPs belonging to the same LD block and by computing the importance for each of these LD blocks.
Since this procedure is based on logicFS, which in turn is based on logic regression, the following section contains a brief introduction to these two methods. In Section 3, we describe how the importances of individual SNPs can be quantified, and how the resulting importance measures can be employed for testing SNPs. This procedure is extended to sets of SNPs in Section 4. In Section 5, the proposed method is applied to the SNPs from the GENICA study concerned with sporadic breast cancer and compared with other statistics usually employed for evaluating whether SNPs or SNP sets are associated with the disease. These procedures are also considered in a simulation study summarized in Section 6 and described in detail in Section C of the supplementary material available at Biostatistics