The mixed success of attempts to identify genetic variants that account for a large part of the heritability of common disease has focussed attention on the need to develop new methodological approaches to the analysis of GWAS data. A number of factors that might explain this ‘missing heritability’ have been suggested, including the failure of many current models to capture the presence of gene-gene and gene-environment interactions, of multiple SNPs with small effect and of rare variants (Manolio et al., 2009, Goldstein, 2009). One promising approach uses prior information on functional structure present within the genome to group genes and associated SNPs into gene sets or pathways. The motivation here is that genes do not work in isolation, but instead work together through their effect on molecular networks and cellular pathways. The hope is that by jointly considering the effects of multiple SNPs or genes within a biological pathway, significant associations might be identified that would otherwise be missed when considering markers individually (Wang et al., 2010). First developed in the context of gene expression studies (Mootha et al., 2003), pathways-based methods have more recently been extended to the analysis of GWAS data (Holmans et al., 2009, Luo et al., 2010, Lango Allen et al., 2010, Lambert et al., 2010). This has led to the identification of putative causal pathways for a number of diseases including Parkinson’s Disease (Lesnick et al., 2007), Crohn’s Disease (Wang et al., 2009b) and rheumatoid arthritis (Eleftherohorinou et al., 2011). As well as offering the potential for increased statistical power, pathways-based genetic association studies (PGAS) can aid the biological interpretation of results through the identification of causal pathways, and may also facilitate comparisons between studies genotyping different variants that nonetheless map to common pathways (Ma and Kosorok, 2010, Cantor et al., 2010).
The majority of existing PGAS methods begin with a univariate test of association, in which individual SNPs are scored according to their degree of association with disease status or a quantitative trait. Various techniques are then used to combine these univariate statistics into pathway scores. For example, the GenGen method (Wang et al., 2007) first ranks all genes according to the value of the highest-scoring SNP within 500kb of each gene. Pathway significance is then assessed by determining the degree to which high-ranking genes are over-represented in a given gene set, in comparison with the genomic background. The PLINK toolkit (Purcell et al., 2007) also features a ‘set-based test’, in which pathway significance is measured by taking the average marginal p-value of a pre-determined maximum number of ‘uncorrelated’ SNPs within the pathway. Here, uncorrelated SNPs are defined as those whose pairwise linkage disequilibrium (LD) is below a certain threshold value. As a final step, where more than one pathway is considered, a correction for multiple testing is generally made.
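The set-based scoring idea described above can be sketched in a few lines. This is only an illustration of the description in the text, not PLINK's actual implementation: the function name `set_based_score`, the greedy selection order, and the parameters `r2_threshold` and `max_snps` are assumptions introduced here.

```python
import numpy as np

def set_based_score(p_values, r2, r2_threshold, max_snps):
    """Average the marginal p-values of up to `max_snps` mutually
    'uncorrelated' SNPs in one pathway (hypothetical helper).

    p_values : (m,) array of univariate SNP p-values
    r2       : (m, m) array of pairwise LD (r^2) between the SNPs
    """
    order = np.argsort(p_values)  # most significant SNPs first
    kept = []
    for idx in order:
        if len(kept) == max_snps:
            break
        # Keep a SNP only if its LD with every SNP already kept is below the threshold.
        if all(r2[idx, k] < r2_threshold for k in kept):
            kept.append(int(idx))
    return float(np.mean(p_values[kept])), kept
```

In practice the resulting pathway score would still need to be assessed for significance, for example by permutation, and corrected for multiple testing across pathways.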
In contrast to univariate, ‘one SNP at a time’ methods, multivariate or multi-locus methods allow all SNPs to be considered in the model at the same time, which can aid the identification of weak signals while diminishing the importance of false ones. One such approach consists of fitting a penalised, multivariate regression model, in which a subset of SNPs is selected by imposing a penalty on some suitably selected norm of the regression coefficients, as in Lasso regression (Tibshirani, 1996). This approach has been shown to yield higher statistical power, compared to more common ‘mass univariate linear models’, especially with multivariate and high-dimensional quantitative traits (Vounou et al., 2010). Several other studies have demonstrated the advantages of this approach for the detection of genetic associations. For example, Wu et al. (2009) use penalized logistic regression to select SNPs in a case-control study, and analyse two-way and higher-order SNP-SNP interactions. Hoggart et al. (2008) propose a similar method for SNP selection in a Bayesian context.
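For reference, the Lasso estimate for a quantitative trait can be written as below. The notation is an assumption introduced here for illustration: $y$ is the $n$-vector of phenotypes, $X$ the $n \times p$ matrix of SNP genotypes, $\beta$ the vector of regression coefficients and $\lambda \ge 0$ the regularisation parameter.

```latex
\hat{\beta}^{\text{lasso}}
  = \arg\min_{\beta \in \mathbb{R}^{p}}
    \; \frac{1}{2} \lVert y - X\beta \rVert_2^2
    + \lambda \lVert \beta \rVert_1
```

The $\ell_1$ penalty sets some coefficients exactly to zero, so that only a subset of SNPs is retained in the fitted model.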
A number of penalized regression techniques that allow prior information on the relationship between SNP markers to be incorporated into the model selection process have recently been proposed. For example, Zhou et al. (2010) group SNPs into genes, and utilise a useful property of the group lasso (Yuan and Lin, 2006) to aid the detection of rare variants within genes. The GRASS method (Chen et al., 2010) begins by characterising within-gene variation as ‘eigenSNPs’, obtained by principal component analysis (PCA). A combination of lasso and ridge regression, followed by permutations, is then used to measure significance for a single pathway. Finally, Zhao et al. (2011) use a combination of PCA and lasso regression to identify a subset of genes within a candidate pathway, followed by permutations to measure pathway significance. Once again this method considers one pathway at a time.
The search for SNPs, or quantitative trait loci (QTL), influencing quantitative traits is gaining momentum as a potentially more powerful way to study the underlying causes of complex disease (Plomin et al., 2009). In the emerging field of neuroimaging genetics for example, in which we have a particular interest, quantitative data in the form of MRI or PET scans serve as a type of intermediate phenotype in the study of complex disorders such as Alzheimer’s Disease (AD) or schizophrenia (Bigos and Weinberger, 2010). We use genotype data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset in this analysis.
Our focus here is on the identification of biological pathways associated with a quantitative trait. Our assumption is that where causal SNPs are enriched in a pathway, the use of a regression model that selects SNPs that are grouped into pathways will have increased power, compared to a more traditional approach in which SNPs are considered one at a time. We also seek a true, multivariate model which includes all mapped pathways at the same time. The hope is that this will confer some of the benefits, in terms of detecting weaker signals and diminishing false positives, described earlier. To achieve these ends, we use a modified version of the group lasso (GL) with SNPs grouped into pathways, and develop a fast estimation algorithm applicable to the case of non-orthogonal groups. We then use a bootstrap sampling procedure to rank pathways in decreasing order of importance. We face a number of challenges in applying GL to SNP and pathway data for the identification of implicated pathways. These include the fact that pathways overlap, since many SNPs map to multiple pathways; the problem of selection bias, that is, the tendency of the model to select pathways having specific statistical properties irrespective of their association with phenotype; and the sheer scale of SNP datasets, making efficient estimation a necessity.
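As a point of reference, the standard group lasso of Yuan and Lin (2006), with SNPs partitioned into $L$ pathways, can be written as follows. The symbols are assumptions introduced here for illustration: $X_l$ and $\beta_l$ denote the genotype columns and coefficients belonging to pathway $l$, and $w_l > 0$ is a pathway weight. The modified version used in this work may differ in its details.

```latex
\hat{\beta}^{\text{GL}}
  = \arg\min_{\beta}
    \; \frac{1}{2} \Bigl\lVert y - \sum_{l=1}^{L} X_l \beta_l \Bigr\rVert_2^2
    + \lambda \sum_{l=1}^{L} w_l \lVert \beta_l \rVert_2
```

Because the penalty acts on the unsquared $\ell_2$ norm of each block $\beta_l$, all coefficients in a pathway are set to zero together or retained together, which is what allows selection at the pathway level.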
We have found that the issue of overlapping pathways receives surprisingly little attention in the PGAS literature, given that the presence of overlaps might be expected to have a significant impact on the results of any PGAS analysis. For example, variation in the number and distribution of causal SNPs with respect to genes that overlap multiple pathways will affect the number of pathways defined to be ‘causal’, and different PGAS methods will be affected by such variation in different ways. Additionally, the inclusion of multiple pathways in a single GL regression model presents a particular problem, since GL in its original formulation will not select pathways in the manner that we would wish. To account for this we employ a variable expansion procedure, originally proposed in the context of microarray data analysis by Jacob et al. (2009), that ensures that overlapping SNPs enter the regression model separately, for each pathway that they map to.
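The variable expansion idea can be illustrated with a minimal sketch, in which a SNP column is duplicated once for every pathway it maps to so that the resulting groups no longer overlap. The function name `expand_design` and the dictionary-based pathway mapping are assumptions made for this example, not the implementation used in this work.

```python
import numpy as np

def expand_design(X, pathways):
    """Duplicate overlapping SNP columns so each pathway gets its own block.

    X        : (n, p) genotype matrix
    pathways : dict mapping pathway name -> list of SNP column indices in X
    Returns the expanded matrix and, for each pathway, the column indices
    of its block in the expanded matrix.
    """
    blocks, groups = [], {}
    start = 0
    for name, snp_idx in pathways.items():
        block = X[:, snp_idx]          # copy of this pathway's SNP columns
        blocks.append(block)
        groups[name] = np.arange(start, start + block.shape[1])
        start += block.shape[1]
    return np.hstack(blocks), groups
```

With two hypothetical pathways `{"A": [0, 1], "B": [1, 2, 3]}`, the SNP in column 1 appears twice in the expanded matrix, once in each pathway block, and a group-wise penalty can then be applied to the non-overlapping blocks.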
A number of factors may bias PGAS results, exaggerating pathway significance and giving rise to inflated numbers of false positives. Depending on the methods used, and the underlying disease-causing mechanism, such factors are likely to include pathway size (measured in number of SNPs and/or genes), and the extent and distribution of pathway LD. Common strategies employed by existing methods to reduce this bias include the use of permutation (of genes or phenotypes), and dimensionality reduction techniques such as PCA (Fridley and Biernacka, 2011, Wang et al., 2010). We propose a procedure that reduces bias by adjusting pathway weightings in the regression model according to the empirical bias in pathway selection frequencies obtained by fitting the GL model with a null response.
One potential drawback of using a regression model in the analysis of genetic data is the typically very large number of predictors (here SNPs) that must be analysed. While the use of penalized regression techniques at least makes the problem tractable when the number of predictors vastly exceeds sample size, the very large matrix calculations required can still make model estimation computationally infeasible. To address this, we combine a number of techniques that speed up the estimation process including the use of an ‘active set’ of predictors, a Taylor approximation of the GL penalty and efficient computation of pathway block residuals. The final estimation algorithm, which we call ‘Pathways Group Lasso with Adaptive Weights’ (P-GLAW), is sufficiently fast to obviate the need either to undertake a preliminary stage of dimensionality reduction, or to consider pathways individually.
We evaluate our method’s performance in a Monte Carlo (MC) simulation study, using real genetic and pathway data with quantitative phenotypes simulated under an additive genetic model. We consider a range of scenarios with different causal SNP distributions and effect sizes. We feel the use of real genotype and pathway data is crucial, so as to capture the complex distributions of gene size and number within a pathway, together with SNP LD patterns and overlaps between pathways, all of which may have a significant effect on pathway ranking performance. To our knowledge, this is the first such PGAS power study using GL with real SNP and pathway data. The evaluation of GL pathway ranking performance, however, presents a number of challenges. Firstly, as described above, variation in the number of causal pathways due to overlaps must be taken into account when evaluating performance over multiple MC simulations. Secondly, we require a means of evaluating the degree to which causal pathways are represented amongst the top ranks. Thirdly, since GL performs variable selection, not all causal pathways may be ranked, and ranking performance measures must reflect this. To address these issues we devise a battery of measures that aim to capture different aspects of ranking performance. Finally, we compare our method’s performance with another common PGAS method, derived from univariate SNP statistics.
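The phenotype simulation step mentioned at the start of the preceding paragraph can be sketched as follows. This is a minimal illustration under stated assumptions: genotypes are coded as minor allele counts (0, 1 or 2), all causal SNPs share a single effect size, and the noise is standard normal. The function name `simulate_phenotype` and its parameters are hypothetical, and in the study itself the genotype matrix comes from real data rather than a random generator.

```python
import numpy as np

def simulate_phenotype(X, causal_idx, effect_size, rng):
    """Simulate a quantitative trait under an additive genetic model.

    X           : (n, p) genotype matrix of minor allele counts (0/1/2)
    causal_idx  : indices of the SNPs assumed to be causal
    effect_size : common per-allele effect of each causal SNP (assumption)
    rng         : numpy random Generator used for the noise term
    """
    beta = np.zeros(X.shape[1])
    beta[causal_idx] = effect_size
    noise = rng.standard_normal(X.shape[0])
    # Additive model: each copy of a causal allele shifts the trait by effect_size.
    y = X @ beta + noise
    return y, beta

# Usage with a randomly generated stand-in for a real genotype matrix.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 3, size=(200, 50)).astype(float)
y_demo, beta_demo = simulate_phenotype(X_demo, [3, 10, 27], 0.5, rng)
```

Repeating this step over many random draws of causal SNPs and noise gives the Monte Carlo replicates over which ranking performance can be averaged.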
The article is organised as follows. Section 2 describes the GL model; our strategy for dealing with overlapping pathways, model estimation and speed-ups; our proposed bias-adjusted pathway weighting update procedure; our strategy for ranking pathways using a resampling procedure; and our proposed ranking performance measures. In Section 3 we describe the real biological data sets used in the experiments, the SNP to pathway mapping process, and the simulation framework used to evaluate both methods under consideration. The results from these simulation studies are provided in Section 4, and we conclude in Section 5 with a discussion and final remarks.