Benefiting from recent advances in high-throughput genotyping technology, researchers are now able to evaluate hundreds of thousands of SNPs throughout the genome in search of disease-susceptibility loci. This agnostic single-SNP testing approach enables one to discover disease-associated loci in regions of the genome where scientists have little or no prior knowledge about the biologic function (e.g., the 8q24 “gene-desert” region that has recently been associated with prostate cancer [Yeager, et al. 2007]). A complementary and potentially rewarding strategy for genetic association studies is to study jointly the association between the trait and multiple genetic variants within a pathway defined according to current biological knowledge.
In this report, we proposed a class of pathway analysis approaches based on the adaptive rank truncated product (ARTP) statistic and an associated computationally efficient permutation algorithm for evaluating the significance of the test statistics. Through simulation studies, we compared gene-based and SNP-based strategies for pathway analysis and found that the former has more robust performance. In particular, we found that when a pathway consists of genes of highly variable sizes, the gene-based method can have a major power advantage over its SNP-based counterpart if the causal variants reside in the relatively small genes. In contrast, there was no setting in which the SNP-based approach was clearly superior. Furthermore, we found that an adaptive approach for choosing the truncation point of the rank truncated product statistic can improve the power of the method compared to fixing the truncation point at a pre-defined value. These observations were further reinforced in an application of the considered methods to the study of the association between smoking behavior and the nicotinic receptor pathway.
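As an illustration only, the idea of choosing the truncation point adaptively with a single layer of permutations can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in this report: it assumes the RTP statistic is the product of the K smallest P-values (computed on the log scale), and the function names `rtp_stat` and `artp_pvalue` are hypothetical.

```python
import bisect
import math


def rtp_stat(pvalues, k):
    """Log of the product of the k smallest p-values (smaller = stronger evidence)."""
    smallest = sorted(pvalues)[:k]
    return sum(math.log(p) for p in smallest)


def artp_pvalue(observed, permuted, truncation_points):
    """Adaptive RTP p-value estimated from one layer of permutations.

    observed: list of p-values from the real data.
    permuted: list of lists, the same p-values recomputed on each permuted data set.
    truncation_points: candidate values of K (assumed, e.g. [1, 2, 5]).
    """
    all_sets = [observed] + permuted  # index 0 holds the observed data
    n = len(all_sets)
    min_p = [1.0] * n
    for k in truncation_points:
        stats = [rtp_stat(ps, k) for ps in all_sets]
        ordered = sorted(stats)
        for i, s in enumerate(stats):
            # Empirical p-value of replicate i at this K: fraction of replicates
            # (observed included) whose statistic is at least as small.
            rank = bisect.bisect_right(ordered, s)
            min_p[i] = min(min_p[i], rank / n)
    # The same replicates calibrate the minimum over K, so no nested
    # permutation layer is required.
    return sum(1 for m in min_p if m <= min_p[0]) / n
```

With B permutations the smallest attainable value is 1/(B+1), so B should be chosen large relative to the significance level of interest.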
The ARTP method provides an efficient and flexible way to accumulate association evidence across individual genes within a pathway. In this report, we obtained a gene-level summary of association by combining the single-SNP test statistics within a gene, using an RTP or an ARTP method. Alternatively, one can obtain a gene-level summary by constructing a multi-locus test for association that involves simultaneous analysis of all the SNPs within a gene. A variety of such powerful methods have recently become available in the literature [Gauderman, et al. 2007; Kwee, et al. 2008; Schaid, et al. 2005; Yu, et al. 2004; Yu, et al. 2005; Zaykin, et al. 2006]. The proposed ARTP procedure could easily be adapted to use these alternative multi-locus test statistics. The method is computationally feasible even if the evaluation of the gene-level P-value associated with the chosen multi-locus test requires a permutation procedure, since only a single-level permutation procedure is needed for evaluating the significance of the final test statistic. This offers great flexibility in incorporating a wide range of gene-level or SNP-level summary statistics.
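To make the single-level permutation idea concrete, the following Python sketch shows one way gene-level P-values could be estimated for the observed data and for every permuted replicate from the same set of permutations, so that no nested permutation layer is needed. It is a sketch under assumptions: the gene-level statistic is taken to be any multi-locus statistic for which larger values indicate stronger association, and the function name `gene_pvalues_from_permutations` is hypothetical.

```python
import bisect


def gene_pvalues_from_permutations(stats):
    """Estimate gene-level p-values for every replicate from one permutation set.

    stats[b][g] is the gene-level statistic for gene g in replicate b, where
    replicate 0 is the observed data and larger values mean stronger evidence.
    Returns pvals with the same shape as stats.
    """
    n = len(stats)
    n_genes = len(stats[0])
    pvals = [[0.0] * n_genes for _ in range(n)]
    for g in range(n_genes):
        column = sorted(row[g] for row in stats)
        for b in range(n):
            # Number of replicates whose statistic is >= the one in replicate b.
            at_least = n - bisect.bisect_left(column, stats[b][g])
            pvals[b][g] = at_least / n
    return pvals
```

Row 0 of the result plays the role of the observed gene-level P-values and the remaining rows act as their permutation counterparts, so the whole table can be passed on to a pathway-level combination step.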
Dudbridge and Koeleman [2004] proposed a computationally efficient permutation approach, based on extreme-value distribution theory, for evaluating the significance level of the ARTP statistic. However, their approach is appropriate only when the number of testing units (e.g., the genes in a gene-based pathway analysis) is much larger than the truncation thresholds considered. Their method is therefore most suitable for GWAS, where we expect only a handful of truly disease-associated SNPs among more than 100,000 tested SNPs, but not for pathway analysis, where the number of genes could range from ten to a few hundred. In addition, their approach cannot avoid a multi-level permutation procedure when the evaluation of the gene-level P-value itself requires a separate permutation procedure.
Besides the RTP, there are other types of P-value combination approaches that use a fixed truncation point, such as the one proposed by Zaykin et al. [2002] that combines P-values less than a given threshold. It is possible to develop an adaptive version of this method using the algorithm described in this report.
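A hedged sketch of what such an adaptive version might look like is given below in Python. It assumes the threshold-based statistic is the product of all P-values not exceeding a threshold tau (computed on the log scale) and adapts over a list of candidate thresholds with one layer of permutations. The names `tpm_stat` and `adaptive_tpm_pvalue` and the candidate thresholds are hypothetical choices made for illustration.

```python
import bisect
import math


def tpm_stat(pvalues, tau):
    """Log of the product of all p-values that are <= tau (0 if none qualify)."""
    return sum(math.log(p) for p in pvalues if p <= tau)


def adaptive_tpm_pvalue(observed, permuted, thresholds):
    """Adapt over candidate thresholds using a single layer of permutations.

    observed: p-values from the real data.
    permuted: list of lists, p-values recomputed on each permuted data set.
    thresholds: candidate values of tau (assumed, e.g. [0.01, 0.05, 0.1]).
    """
    all_sets = [observed] + permuted  # index 0 holds the observed data
    n = len(all_sets)
    best = [1.0] * n
    for tau in thresholds:
        stats = [tpm_stat(ps, tau) for ps in all_sets]
        ordered = sorted(stats)
        for i, s in enumerate(stats):
            # Empirical p-value of replicate i at this threshold.
            best[i] = min(best[i], bisect.bisect_right(ordered, s) / n)
    # Calibrate the minimum over thresholds with the same replicates.
    return sum(1 for b in best if b <= best[0]) / n
```

Replicates in which no P-value falls below tau receive a statistic of 0, the weakest possible value, and ties among them are counted conservatively.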
Recently, the gene set enrichment analysis (GSEA) algorithm [Subramanian, et al. 2005] has been proposed for the identification of disease-related pathways by measuring the overrepresentation of disease-gene associations within a given pathway compared to a list of reference genes [Wang, et al. 2007]. The underlying null hypothesis is that the set of genes in a given pathway has no enrichment of association signals compared to the rest. In contrast, in this report, we focus on testing for the effect of a specific pathway without reference to any larger gene list. The underlying global null hypothesis is that there is no association of the disease with any of the genes in the given pathway. We believe that for GWAS, where the vast majority of the reference genes are likely to be unrelated to a particular trait, the “global” and the “enrichment” null hypotheses are approximately the same, and both types of approaches could be valuable for testing and prioritizing candidate disease susceptibility pathways. Many of the statistical and computational issues regarding how to combine evidence of association from SNPs to genes and then to pathways are similar between the two approaches. Thus, some of the tools we utilized, including the efficient permutation algorithm, could be useful for GSEA-type analysis as well.
Several areas of research remain open. For example, once the association between a pathway and an outcome has been established, methods are needed for identifying the specific subset of genes, and the SNPs within those genes, that are actually responsible for the association. This task can be particularly challenging, partly because of LD between physically nearby SNPs and genes.
The proposed P-value combining methods gain efficiency by accumulating marginal association signals across individual testing units. Although a particular disease model was chosen for the simulation studies, we think the general conclusions still hold as long as there is association evidence from individual testing units. This approach, however, may not be very powerful in situations where there are no or only very weak marginal effects from the individual genes, but there are strong epistatic interactions among the genes. In principle, the proposed ARTP procedure can also be adapted to account for gene-gene and/or SNP-SNP interactions within a pathway. For example, one can consider ARTP statistics that accumulate evidence of association over joint analyses of pairs of genes within a pathway. A variety of advanced methods can be used to allow for epistatic interactions in such joint analyses [Chapman and Clayton 2007; Chatterjee, et al. 2006; Chen, et al. 2007; Ritchie, et al. 2001; Ruczinski, et al. 2003; Zhao, et al. 2006].
In summary, the proposed gene-based ARTP procedure, given its power, flexibility and computational efficiency, is a promising approach for pathway-based association analysis. We believe that in the future this efficient single-level permutation algorithm will allow the method to incorporate more complex information, such as epistatic interactions or biologic knowledge about gene networks, into pathway analysis without dramatically increasing the associated computational burden.