|Home | About | Journals | Submit | Contact Us | Français|
Genome-wide association (GWA) study aims to identify the genetic factors associated with the traits of interest. However, the power of GWA analysis has been seriously limited by the enormous number of markers tested. Recently, the gene set analysis (GSA) methods were introduced to GWA studies to address the association of gene sets that share common biological functions. GSA considerably increased the power of association analysis and successfully identified coordinated association patterns of gene sets. There have been several approaches in this direction with some limitations. Here, we present a general approach for GSA in GWA analysis and a stand-alone software GSA-SNP that implements three widely used GSA methods. GSA-SNP provides a fast computation and an easy-to-use interface. The software and test datasets are freely available at http://gsa.muldas.org. We provide an exemplary analysis on adult heights in a Korean population.
Genome-wide association (GWA) study of a large population offers potential genetic causes of complex disease or the traits of interest (1,2). The typical approach assesses the association of each SNP independently with binary phenotypes (case–control) or continuously represented phenotypes (quantitative trait). However, due to the enormous number of SNPs analyzed, such an individual association analysis produces only a handful of significant SNPs from a stringent cutoff. The problem with this approach is that we may not delineate the underlying biological mechanism from the small number of SNPs beyond individual markers or genes. Moreover, many of those prominent SNPs are not reproducible among independent experiments. Another important problem is that many moderate but meaningful associations are lost below the stringent cutoff. In recent years, the gene set analysis (GSA) methods were taken into account in GWA studies which may address these problems.
GSA methods were originally developed for a transcriptome analysis to assess the differential expression of pre-defined gene sets that share common biological functions. They exhibited stronger statistical power than the individual gene analysis, and have revealed many novel gene sets with ‘subtle but coordinated’ expression patterns (3–5). Given that the basic goal of GWA studies is to prioritize the biological networks or processes associated with the trait of interest, it may be reasonable to consider the pre-defined gene sets or pathways as the units of an association analysis. Indeed, by analyzing SNPs on the gene set level, GSA was able to reveal many coordinated association patterns that might be lost by the individual marker analysis.
Several case–control studies employed GSA methods. Wang et al. (6) devised a GSEA framework for SNP arrays. They assigned the most highly associated SNP (best SNP) to each gene to summarize the association of multiple SNPs in each gene. Using the method, they successfully identified the Parkinson’s disease susceptibility pathways. Wang et al. (7) applied the same methods which implicated the molecular mechanism of autism beyond individual genes. Chen et al. (8) employed a gene set score weighted by the network connectivity in KEGG pathways, and analyzed five complex disease data sets. Lesnick et al. (9) showed the significance of the joint effect of weak variants within a candidate pathway for a brain disorder which may predispose to Parkinson’s disease. Askland et al. (10) using their pathway-based analysis tool, prioritized the ion channel gene set as a strong candidate that contributes to the susceptibility of bipolar disorder.
We present a general approach that enables the application of some state-of-the-art GSA methods in GWA studies. Because our approach uses the P-values of each marker, it is applicable to both case–control and quantitative trait studies. We then offer a JAVA-based stand-alone software, called GSA-SNP, that actualizes the methods. We provide an analysis example for adult heights in a Korean population
The GWA analysis software such as PLINK (11) produces a list of P-values for each SNP either case–control or quantitative trait study. We present three different GSA methods that make use of the P-values for GWA analysis.
We take ‘–log’ on each P-value. Because we will use the ‘k-th best P’ (k = 1, 2, 3, 4 or 5) in each gene, this will give a more symmetric distribution to the gene scores. Each SNP is assigned to a gene whose extents with some padding encompass the SNP. We considered ±20 K more bps in the neighborhood of each gene. We chose the second best SNP in each gene as a default option to summarize the information of multiple SNPs instead of the best SNP (6). Random associations can make a SNP highly significant by chance, and can be more problematic in a quantitative trait study where GWA analysis softwares such as PLINK apply a linear regression to approximate the significance of each SNP. Using the second best P-values, we expect to evade many spurious associations. Such an effect is demonstrated in the Supplementary Data. Larger k may yield more conservative predictions, but may reduce the power. Therefore, in general, we recommend using both the best and the second best P-values and compare the biological relevance of the predictions.
We compile the gene sets to be analyzed. When the P-values of each gene set are computed, we apply the Benjamini–Hochberg multiple testing correction (12).
We employ the Z-statistic method (13). Each gene set (GS) is assessed by the Z-statistic:
where is the average of the gene scores  in a gene set, m0 and σ are the mean and the standard deviation of all the gene scores, n is the number of genes in the gene set.
In transcriptome analyses, though simple and sensitive, the Z-statistic method exhibited two drawbacks. First, it assumes that each gene set is a collection of independent samples. This causes some false positives, because many biologically determined gene sets have co-expression patterns. Second, some gene sets may have bi-directional expression changes to maintain homeostasis, but Z-statistic cannot detect such bi-directional patterns because they cancel each other. These problems, however, may be ameliorated in the GWA analysis context. First, the correlations of the gene scores are mostly determined by the linkage disequilibrium (LD); hence the gene scores in a gene set can be at most partially correlated if some of the genes are located within an LD block which is not often the case. Second, only one direction of the scores is meaningful: low P-values, equivalently high – – log(P) scores. If a gene set is associated, we would observe some high gene scores in the set, but the scores of other members are expected to be randomly distributed rather than concentrated at a low score which prevents significant offsets.
We considered two sample-permutation based GSA methods. One is the recently developed restandardization method (14), and the other is GSEA (15) given in the next section. The restandardized GSA has several advantages over existing methods on such as power and reproducibility combined with the maxmean set statistic. Another advantage is that we can compute the P-values accurately from a relatively small number of sample permutations (50–100) if we use the pooled set scores. This saves much computational costs in GWA analysis. See the Supplementary Data for how to compute the P-values of the restandardization GSA (pooled version). We used the maxmean statistic for the gene set score which is defined as follows:
However, only the positive parts are meaningful in GWA analysis; hence we use the non-negative mean in practice for the gene set score.
GSEA (15) is of the most widely used class of GSA method and its R code is developed for the case–control studies (16). Our method is based on the P-values of each SNP; hence is also applicable to quantitative trait studies. We employed the maxmean (non-negative mean) set statistic, which has shown favorable properties over the Kolmogorove–Smirnov type statistic in statistical and computational perspectives (14). We also used the Z-statistic for the set score. While the original GSEA used the normalized set scores (denoted NES) that divides each set score by the mean of the randomized ones, we applied the Z-normalization using the mean and standard deviation of the randomized set scores (6).
We developed a JAVA-based stand-alone software, called GSA-SNP, that implements the three GSA methods described above. The detailed user’s manual for the software is available at our web page.
Two kinds of input data are required: marker association (typically P-values) and gene sets.
(i) The list of P-values for each SNP which is typically obtained from the computation of a GWA anlaysis software such as PLINK. The SNP ID should be dbSNP RefSNP rs numbers. The P-values represent the association levels of each SNP with the given phenotype. For this single column input values, the Z-statistic method (13) is provided. (ii) The program also reads the lists (50–1000 columns) of randomized P-values each column of which can be obtained by permuting the sample labels of SNP arrays and running a GWA analysis software. These randomized P-values should be attached aside the list of original P-values. For this type of dataset, three widely used GSA methods are available: the Z-statistic method, the restandardized GSA with the maxmean statistic (14), and the GSEA based on maxmean or Z-statistic (14,15). (iii) The program also accepts haplotype association data. The input in this case replaces the SNP ID by chromosomal extents, i.e. the combination of chromosome number, chromosomal start and end points. Each SNP is assigned to the genes whose extents with some padding encompass the SNP. For a given haplotype, the overlapping genes are identified and P-values are assigned to each gene. For this type of dataset, the Z-statistic method is applied. The gene IDs (gene symbol) instead of the SNP IDs can also be used when the user wants to apply another method to summarize the association of SNPs in a gene.
We provide the Gene Ontology gene sets for a default analysis. Therefore, the user need not upload gene sets if (s)he wants to use the Gene Ontology gene sets. Otherwise, the user is required to upload their own gene set data in a tab-separated value format. The gmt format of MSigDB is acceptable (http://www.broadinstitute.org/gsea/msigdb/); hence the user can make use of the rich source of gene sets from MSigDB in our software.
If the user uploads the marker association data, the program detects the data type and automatically shows relevant methods and parameter options. If the SNP data are uploaded, the user can choose the padding size for genes among 0, ±10 000 or ±20 000. The user can also choose the kth best P, k = 1, 2,…, 5 to summarize the SNPs for each gene. The padding of ±20 000, and the second best P are the default options. For an input of a gene list, there is no parameter to choose. For a haplotype input, the user is requested to choose the minimum overlap size between haplotype intervals and genes. The user can also determine the range of gene set sizes and the cutoff for q-values.
The list of significantly associated gene sets with P-values, corrected P-values, and their members that are sorted in the descending order of their association strength. The results are given both on the program window and as a csv file. If the user clicks a gene symbol in the csv file, the web-browser will show its information on GeneCards® (17) (http://www.genecards.org).
As the power of GSA in GWA analysis was being conceived, several software tools were developed recently. Holden et al. (16) devised a GSEA method for case–control arrays using R. They used all the SNPs in a gene instead of selecting the best SNP (6). O’Dushlaine et al. (18) developed PERL codes, called SRT, that generate simulated P-values from randomized phenotypes using the PLINK program and perform a sample-randomizing GSA. In this program, they first select a list of significant SNPs from a cutoff threshold and compute the ratio of the significant SNPs in each pathway (gene set). These two methods allow multiple SNPs in a gene to contribute to the set score, which may increase the power of the methods. However, it is possible that significant SNPs in a haplotype block are concentrated in one or two genes of the pathway, which may overly amplify the contribution of a gene. On the other hand, Wang et al.’s method (6) and GSA-SNP choose the kth best P-value to summarize the information of each gene. Therefore, contributions by multiple genes are more emphasized in this approach. Besides, the best P-value is not merely the information of a single SNP because it is determined by comparing the significance of all the SNPs in a gene. One possible concern with using the best P is whether it is from a random association or not. This is why we considered the kth best P-value: the kth P-value can be significant only if all the first k P-values are simultaneously significant which is possible for true positives, but is very difficult to happen by chance.
Because SRT does not pool the set scores of different gene sets, it takes much time to attain a high level of significance. Medina et al. (19) developed a web server, called GeSBAP, that provides a SNP (marker)-randomizing GSA method. It also takes much time for uploading and computing for an SNP input. GSA-SNP provides a fast computation. Several minutes are sufficient on a PC if the input data are properly prepared. It is equipped with an easy GUI that automatically displays relevant options and methods according to the input data types. Because GSA-SNP takes P-values as input data, it is applicable to both case–control and quantitative trait studies. This advantage is also shared with other P-value based methods, SRT and GeSBAP, but GSA-SNP provides three widely used GSA methods if the simulated P-values are prepared. How to choose a GSA method may depend on the preference of the user for each GSA approach, but a general guidance is given in Supplementary Data.
The most time-consuming part for our sample-randomizing methods (restandardized GSA and GSEA) will be to prepare the simulated P-values using a GWA analysis software. Because our methods pool the randomized scores of different gene sets, about a hundred simulations of P-values will be sufficient.
We evaluated the performance of GSA-SNP using a large-scale GWA study dataset, previously reported by the Korea Association Resource (KARE) project (20). The dataset comprised the genotypes of two population-based cohorts recruited in Korea (8842 unrelated individuals), measured using Affymetrix Genome-Wide Human SNP Array 5.0 (352 228 SNPs after QC). Cho et al. (20) reported the results of GWA analyses of eight quantitative traits in the dataset. Among the traits, we focused on height due to the following reasons. Adult human height is a polygenic trait with a highly heritable component (21) and several GWA studies have been reported on this trait (22–27). The comparison of the previous GWA studies shows that the height association signals in a Korean population are weaker than those of European populations: some of the strong association signals in the European populations were also caught in the Korean population, but other signals were lost below the threshold cutoff (see the Supplementary Data). Therefore, it would be interesting to see whether gene-set analysis of the single-marker analyses of Korean dataset would capture biological processes similar to the ones reported by the European studies.
Before applying a GSA, the KARE genotype data were supplemented by imputing SNP genotypes based on the genotypes of the JPT + CHB panel of the International HapMap Phase II (The International HapMap Consortium, 2005). Details of SNP imputation and filtering have been published elsewhere (28) and were summarized in the Supplementary Data. We used this imputed dataset as the input of GSA-SNP. We show the analysis result by the Z-statistic method. A total of 12 Gene Ontology gene sets having corrected P < 0.05 were tabulated. The resulting screen shot is shown in Figure 1. for each of those 12 gene sets, we gathered the P-values of the member genes and compared their distributions with that of all the genes together. As shown in Figure 2, relatively smaller P-values were enriched in the distributions of these 12 gene sets compared to the background distribution. It should be noted that the first quartiles of the gene sets were around 10−2, which implies GSA-SNP successfully identified moderate but consistent association patterns that were not dominated by only a few strong association signals.
Many of the gene sets we identified made good biological senses. For example, the GO term ‘skeletal development’ was identified by Gudbjartsson et al. (26) in their GO-based GWA analysis of European adult heights. Similarly, Weedon et al. (25) reported ‘extracellular matrix’ as one of the key biological functions implicated in height regulation. Since collagens are the most abundant proteins in the extracellular matrix, the term ‘collagen’ may be understood in the context of ‘extracellular matrix’. Although ‘glutamate receptor activity’ has not been implicated through GWA studies to our knowledge, one member of that gene set, GRIA1, was implicated near a loci associated with height in Croatian population (27). Endogenous activation of metabotropic glutamate receptors is known to modulate GABAnergic transmission of gonadotropin-releasing hormone (GnRH) neurons (29). Moreover, treatment with a GnRH agonist in short adolescents increased adult height (30). These imply that glutamate receptors might affect adult height in man. The top ranking genes in ‘anion cation symporter activity’ were SLC17A family members, vesicular glutamate transporters (31). Other GO terms such as ‘transmembrane receptor protein phosphatase activity’, ‘golgi stack’ and ‘phosphoric ester hydrolase activity’, were not supported in our literature survey and may deserve further investigation.
Here, we proposed a P-value based approach that enabled the application of three widely used GSA methods in both case–control and quantitative trait studies. We also suggested using the kth best P to summarize the association of SNPs in a gene to remove randomly associated signals. However, we also provide the option for the best P-value (k = 1) in our software such that the user can perform the analyses for different k’s, and compare the relevance of the results.
As a stand-alone JAVA program, GSA-SNP provides a fast and secure computation as well as an easy-to-use GUI. Unlike other tools, GSA-SNP provides three powerful GSA methods for GWA analysis. Because it employs methods that pool the randomized set scores, it provides a high level of significance from a relatively small number of simulated analyses.
We expect the demand of GSA in GWA analysis will be increasing rapidly in the near future as has been in the transcriptome analyses, and the GSA-SNP provides a useful tool. GSA methods and tools will be further investigated and optimized in the context of GWA analysis.
Supplementary Data are available at NAR Online.
National Institute for Mathematical Sciences, Daejeon, Korea (NIMS); Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Korea government (MEST) (grant no. 2010-0016668 to D.N.); MEST, NRF (grant R11-2008-0062293 to S.K.). Funding for open access charge: MEST, NRF (grant R11-2008-0062293).
Conflict of interest statement. None declared.
The KARE genotype and epidemiological data were gratefully made available by National Institute of Health, Korea Center for Disease Control, which supported this work through the KARE Analysis Consortium. We would like to thank the help and support of all the staff at KNIH. Computing cluster used in this study was kindly granted from Korean Bio-information Center (KOBIC).