|Home | About | Journals | Submit | Contact Us | Français|
A central goal of biology is understanding and describing the molecular basis of plasticity: the sets of genes that are combinatorially selected by exogenous and endogenous environmental changes, and the relations among the genes. The most viable current approach to this problem consists of determining whether sets of genes are connected by some common theme, e.g. genes from the same pathway are overrepresented among those whose differential expression in response to a perturbation is most pronounced. There are many approaches to this problem, and the results they produce show a fair amount of dispersion, but they all fall within a common framework consisting of a few basic components. We critically review these components, suggest best practices for carrying out each step, and propose a voting method for meeting the challenge of assessing different methods on a large number of experimental data sets in the absence of a gold standard.
Understanding of complex polygenetic phenotypes—stage of differentiation, disease state, responsiveness to exogenous perturbations and so on—requires a combination of high performance experimental and analytical methods for identifying related sets of genes (e.g. genes in pathways or functional classifications) associated with phenotypic changes. Identification generally means the discovery of gene sets that were not previously known to be related, as well as the determination of which sets among a known collection (e.g. ). The former, more difficult problem is also known as the pathway reconstruction or pathway annotation problem and discussed elsewhere [2–5]; here we focus on the latter, including the topological structure of the sets.
Early methods for associating gene sets with phenotype changes first identify individual, potentially relevant genes by making a binary decision based on a quantity that measures the extent of differential expression between the phenotypes (e.g. performing a t-test and requiring P-value <0.01), and then use a Fisher's exact test to determine whether a significant number of these genes belong to a prespecified gene set [6, 7]. An alternative approach begins by ranking all genes according to differential expression, and then determines if a prespecified gene set is significantly overrepresented toward the top or bottom of the ranked list. Such a procedure was first introduced by Subramanian et al.  and their particular method was named Gene Set Enrichments Analysis (GSEA). Here we use GSEA broadly to describe all methods for associating gene sets with phenotype changes. We abbreviate prespecified gene sets among a known collection as gene sets or simply sets. We use the terms enrichment and overrepresentation as they are in normal parlance. They do not distinguish method, but frames of reference: the members of a set are said to be overrepresented in a group of genes being examined, and the group is enriched in members of the set.
The GSEA method by Subramanian et al.  consists of the following specific steps: (i) rank all genes by the magnitudes of their differential expression and select a window in the ranked list, i.e. a contiguous run of some number of genes starting at any rank, (ii) define an enrichment score based on a weighted Kolmogorov Smirnov (WKS) test that measures the difference between the number of genes in a prespecified gene set that are observed in the window, and the number of occurrences if the genes in the set were uniformly distributed in the list, (iii) simulate a background distribution of the enrichment score by shuffling samples and estimate the statistical significance of the gene set, (iv) repeat steps i–iii for all prespecified gene sets (hypotheses) and for various window sizes, (v) correct for multiple hypotheses testing.
The strategy for performing GSEA has numerous variants, depending on the method for estimating significance (WKS test, mean test, median test, Wilcoxon rank sum test, etc.); background distribution, which is related to the method for estimating significance, but not always dictated by it; choice of the shuffling method when the background distribution is simulated; the method for multiple hypothesis correction; and the choice of weights to account for auxiliary information such as topology of gene sets [8, 9]. Several excellent reviews compared these variants [10–14]. In particular, Ackermann and Strimmer  established a general modular framework for performing GSEA. They critically assessed a subset of gene set enrichment procedures using 10 simulated data sets and 2 experimental data sets. In addition, they assessed three so-called global procedures, which do not compute a gene-level enrichment score; rather, they test gene sets as the unit. Ackermann and Strimmer concluded that the choice of gene-level statistics was inconsequential, while the performance of gene set-level statistics was more variable. All of the gene set-level statistics rejected the only simulated negative control data set, and based on the percentage of the other nine simulated data sets being detected, Ackermann and Strimmer concluded that the mean test was more sensitive than the median test and Wilcoxon rank sum test, while GSEA was the least sensitive. Furthermore, they concluded that global procedures yielded worse results than the best of the six gene set-level statistics they tested. Their results on the two experimental data sets were inconsistent with each other and inconsistent with the results based on simulated data. They reported that the results varied greatly depending upon the choice of the null hypothesis, and one global procedure called Hotelling's T2 test was most sensitive for one data set using one type of background distribution, while tests based on conditional FDR and enrichment score, but not the mean test, were more most sensitive in other combinations of experimental data set and null hypothesis. Although the test on experimental data sets is far more biologically relevant than on simulated data sets, the results of the former are harder to interpret due to the lack of a gold standard, i.e. which gene sets are true positives and which are true negatives.
Indeed, the lack of a gold standard has greatly hampered the effort of assessing gene set enrichment methods using experimental data sets. Biological systems are so complex that simulated data sets simply cannot substitute for experimental data sets. Furthermore, it can be argued that the ability of rejecting false positive predictions is even more important than detecting lowly ranked true positives, because it is costly to validate a large number of predictions. Therefore, in this article, we focus on experimental data sets. First we review the core components of gene set enrichment methods in the aspects that are important to experimental data sets. Then we use 132 experimental data sets to critically assess six gene set-level statistics and one global test, which received favorable ratings in previous reviews [10, 13]. To tackle the lack of a gold standard, we introduce the concept of mutual coverage (MC), which reflects the extent to which the gene sets predicted by a particular method are reproduced by other methods. Our results suggest that: (i) the Wilcoxon rank sum test and the WKS test as implemented in GSEA provide the most effective gene set-level statistics for obtaining high MC, (ii) simulated background distributions are more accurate than analytic backgrounds, (iii) the mean and median tests achieve the highest sensitivity but poor MC, and (iv) incorporating topology of gene sets in the analysis increases the sensitivity for all six procedures without reducing MC.
In this section, we review the five core components of GSEA methods as illustrated in Figure 1, focusing on the aspects that are particularly important for experimental data sets.
There are two important but frequently overlooked data preprocessing steps. Normalization allows expression values obtained from different experiments to be directly comparable [15, 16]. The expression values of a small, but different set of genes may be missing in different microarray experiments due to technical issues. Imputation of missing data is thus important for maximal data coverage when the results of multiple experiments are compared.
A number of methods are available for normalization [16, 17], yet this critical step is frequently omitted . The most common normalization algorithms—RMA  and MAS 5.0 —are designed for expression levels generated with microarrays that follow a lognormal distribution. Thus it is important to log transform the raw intensity values from microarrays. Failure to do so would bias toward high expression values, reducing statistical power because of increase in variance . Log transform is also applied to RNA-seq data. Expression level determined by RNA-seq is usually quantified in Reads Per Kilobase exon Model per million mapped reads (RPKM; density of reads that map to a gene normalized for the length of its mature transcript and for the sequencing depth of the experiment ), which after log transform correlates well with normalized intensity measured with microarray, also after log transform .
Missing data can be imputed using methods based on K nearest neighbors, singular value decomposition, or least square regression models. Least square regression algorithms were reported to produce lower estimation error than other methods [20, 21]. In this article we use a popular least square regression algorithm, LSimpute_gene , to impute missing values in all 132 experimental data sets.
The first step in GSEA is to compute a gene-level statistic of differential expression, e.g. a t statistic, a signal to noise ratio (mean to standard deviation ratio), a fold change or a Wilcoxon rank sum statistic. Because phenotype change can affect different genes in opposite directions, i.e. increase the expression levels of some genes while decrease the levels of others, and we want to be able to identify the gene sets that contain both types of genes, it is desirable to eliminate the direction of differential expression by taking the absolute or square of the statistic [13, 23]. However, data transformations that eliminate direction—such as absolute values—lead to asymmetrical distributions, and can nullify some analytical estimates of significance based on analytical background distributions such as the χ2 test [24, 25] (see ‘Estimating significance’ and ‘The validity of analytical background distributions’ sections).
The many-to-many correspondence between genes and probe sets on a microarray creates ambiguity in determining expression levels of genes [26, 27]. A common practice is to calculate the mean or median expression levels of the probe sets that correspond to the same gene; however, doing so usually increases the number of false negatives . An alternative is to perform meta analysis , using for example, the method proposed by Fisher  or by Stouffer [30, 31]. Rather than merging the expression values directly, these methods merge probe set-level statistics. A similar problem exists in RNA-seq, where some sequencing reads are mapped to multiple genomic locations. Such multi-mappers originate from paralogs, segmentally duplicated regions and low sequence complexity . Ignoring multi-mappers reduces sensitivity and undercounts some genes . Strategies for assigning multi-mappers are discussed in [19, 32, 33].
The purpose of a gene set-level statistic is to decide whether a gene set is distinct in some statistically significant way. A gene set statistic can be defined in terms of properties of the genes in the set, e.g. the mean, median, variance, etc. of a gene-level statistic (see Table 1 for more details). When a property (and its corresponding statistic) is chosen, the null hypothesis must, of course, also be specified. There are two null hypotheses as defined by Tian et al. . In one case (Q1) the background distribution is obtained by shuffling genes; in the other (Q2), the background distribution is obtained by shuffling phenotypes, i.e. samples (see ‘Estimating significance’ section). The rationale for using Q1 is that a significant gene set should be distinguishable from an equal size set composed of randomly chosen genes. On the other hand, Q2 focuses on a gene set and tests whether its association with the phenotype change is distinguishable from randomly shuffled phenotype changes. Q2 is generally favored because it preserves the relationship of the genes in the set [11, 12, 34] and directly addresses the question of finding gene sets whose expression changes correlates with phenotype changes.
Gene set-level statistics generally ignore clinical covariates—factors such as age, sex and weight—which can also cause differential expression, confounding the impact of phenotype changes . The effect of covariates can be estimated using, for example, a linear regression model . If a t statistic is used in the gene level, it can be generalized using a linear regression model for covariate correction, after accounting for the increased number of variables to avoid over fitting .
Most gene set-level statistics also ignore relationships among genes within the set. For example, if the gene set is a pathway, its topological information is ignored. Including topological information is important for accounting for the effect of genetic buffering , which deduces that if a gene fails to propagate its influence to a pathway neighbor, its biological role is buffered. Conversely, a gene that regulates many of its downstream genes may play a pivotal role in the expression changes of the pathway associated with phenotype changes. Methods for including topological information by weighted gene set-level statistics are discussed in [9, 37, 38].
We use significance in the standard way: the probability that the null hypothesis, evaluated on the background (or null) distribution, is correct. The background distribution can sometimes be written analytically, as in the case of a Gaussian distribution, and it can always be simulated by shuffling experimental data. As noted in the above section, simulated background is dictated by the choice of the null hypothesis (Q1 or Q2), which often leads to different conclusions .
Most frequently the gene set-level statistic, e.g. the mean of the t-statistic values of genes in the set, is assumed to follow a normal distribution when expression change has no association with phenotype change . In such case the significance (P-value) of a gene set can be computed analytically . Such an assumption is in question when the expression levels of genes in a set are dependent on one another, which is common for genes in a pathway . In ‘The validity of analytical background distributions’ section we will discuss analytical backgrounds and empirical corrections  to make them more useful.
To be concrete in illustrating how significance is estimated using a simulated background distribution, suppose we are interested in estimating the probability that the enrichment score obtained for a particular gene set is a chance occurrence of phenotype changes. The procedure would be to shuffle the phenotype labels, calculate the differential expression of each gene, rank all genes and compute an enrichment score for the same gene set. The entire process is repeated multiple times to obtain a distribution of enrichment scores, and the P-value of the actual enrichment score is simply the fraction of shuffles that produce enrichment scores at least as great as observed. Although simulating the background distribution obviates the need of an analytical background, it can be computational demanding—at least N shuffles need to be performed to achieve a P-value resolution of 1/N .
P-value is the appropriate measure of statistical significance when only one gene set is tested. When a large number of gene sets are tested, there can be many false positives among the gene sets that receive seemingly highly significant P-values; this is called the multiple hypothesis testing problems. The simplest procedure is to choose a P-value which, when multiplied by the number of hypotheses, i.e. the total number of tested gene sets, gives a sufficiently low corrected P-value, e.g. <0.05. This Bonferroni correction  is, however, very conservative and sometimes results in an unacceptably large number of false negatives. An alternative is to control the expected fraction of false positives among the predictions, or the false discover rate (FDR), using the method by Benjamini and Hochberg . The original Benjamini–Hochberg procedure assumes a uniform distribution for the P-values . In some cases when there are relatively many ‘non-null’ tests, i.e. when low P-values are prevalent, an FDR variant, positive FDR (pFDR) can be applied [42, 43]. The corrected P-value is called Q-value, defined as the ‘minimum FDR at which a test is called significant’ [42, 44]. The relationship between Q-value and FDR is analogous to that between P-value and type I error . The final significant gene sets are the ones whose Q-values are smaller than an FDR threshold.
Although appropriate usage of a standard analytical background distribution facilitates rapid computation of P-values with high precision, the critical question is how well the analytical background represents the actual background. One way to test the validity of an analytic background is to examine the distribution of P-values under the null hypothesis, which should be uniform [45, 46]. To address this question, we generated 500 null data sets by shuffling the sample tags of an arbitrarily chosen human data set from the Gene Expression Omnibus (GEO; a public database of high throughput gene expression data), GDS2835 . We constructed the histograms of the null P-value distributions using five gene set-level metrics summarized in Table 1, with analytical and simulated backgrounds.
The results indicate that P-values are uniformly distributed when background distributions are generated by shuffling, irrespective of metric (Figure 2A), whereas the distributions obtained using analytical background distributions are highly nonuniform, and metric dependent (Figure 2B). The null P-value distribution of the χ2 test as shown in Figure 2B deviates greatly from a uniform distribution, because taking the absolute value of the gene-level statistic causes the background to no longer follow the χ2 distribution. Taking the absolute value does not violate the analytical background distributions of the mean, median and Wilcoxon rank sum tests; nonetheless, their null P-value distributions deviate from the uniform distribution. The biased null P-value distribution further violates the assumption of the FDR procedure (‘Correction for multiple testing’ section). Thus we conclude that it is more accurate to use simulated backgrounds than analytical backgrounds.
The problem of comparing different methods for gene set enrichment analysis is made difficult by the lack of a gold standard. One can mine the literature to obtain evidence on whether a gene set is associated with the phenotype change, but this can only be done on a small scale. An alternative is to quantify the number of overlapping predictions (gene sets called significant) by several methods [9, 48]. Because each method can capture a piece of the evidence (from the location of mean, shape of distribution, etc.) of biological perturbation, gene sets predicted by multiple methods should be more reliable than gene sets predicted by only one method. In this article we propose a formal way to use the MC of multiple methods for evaluating gene set-level statistics.
We state the concept of MC by multiple methods in precise terms:
Observation 1: given several orthogonal predictors (e.g. gene set-level statistics), a gene set deemed significant by more predictors is less likely to be false than a gene set deemed significant by fewer predictors. The term orthogonal here means that different gene set-level statistics do not use correlated properties to make the prediction. Since ‘mutually supported gene sets’ identified by multiple predictors show statistical significance from multiple distinct properties, they might better reflect the underlying biology of phenotype changes.
Observation 2: if a particular predictor reports a high fraction of ‘mutually supported gene sets’, we assume it has high positive predictive value, especially if the number of predictors is exhaustive. We define MC of a predictor as the ratio of the number of votes from other predictors agreeing with its predictions divided by the maximum number of votes possible (see ‘Methods’ section for details).
The condition that predictors be orthogonal is almost never met. Figure 3 illustrates the Pearson correlation coefficient between the gene set-level statistics in Table 1. For example, mean and median are strongly correlated. If a predictor has many ‘echoes’, its MC can be overestimated. One way to correct for correlations among predictors is to down weight the votes from the echoes. An intuitive approach is to weight each vote according to the probability that two gene set-level statistics agree with each other by chance. In other words, if predictor A has a probability π of voting for all predictions of predictor B by chance, votes from A will be weighted by a function of π; in this article, we simply take it as the inverse of π. Therefore correlated predictors having higher π receive lower weights. To compute π, we generated 500 randomly phenotype-shuffled data sets from the aforementioned experimental data set (GEO GDS2835) and computed the expected frequency that two gene set-level statistics predict the same gene set as significant (see ‘Methods’ section for more details). After weighting and normalization, a controlled MC score (CMC; see ‘Methods’ section) is calculated and the effects of echoes are controlled.
We generated 132 null data sets by randomly shuffling the phenotype tags of an experimental data set (GEO GDS2835), and computed the CMC scores of the 5 original gene set-level statistics and Hotelling's T2 test in one group, and the CMC scores of the 5 topology impact factor (TIF)  weighted gene set-level statistics in another group. All 1l statistics show similarly low CMC scores (the first row of Table 2, labeled as ‘Null data set’), indicating that the contributions from correlated predictors have been substantially diminished. We did not compute CMC for a TIF weighted gene set-level statistic with its original statistic in one group, because they are highly correlated (Figure 3).
We collected and processed 132 human data sets from GEO, in accordance with the procedure in the ‘Components of GSEA’ section (see ‘Methods’ section for details). The CMC scores of the 6 statistics in Table 1 and 5 TIF weighted statistics are listed in Table 2. The Wilcoxon rank sum test and WKS test show significantly higher CMC in both original (CMC=0.4 and 0.35, respectively) and TIF weighted forms (CMC=0.39 and 0.35, respectively) compared with their CMC for the null data sets (0.07). To test whether the CMC score is still biased by dependency, we iteratively removed one statistic at a time and the resulting CMC scores are also listed in Table 2. The results indicate that the Wilcoxon rank sum test and WKS test still perform better than the other statistics, and the order of performance of all predictors does not change (Table 2). We conclude that predictions based on the Wilcoxon rank sum test or WKS test are more likely to be covered by other gene set statistics that use mean, median, χ2 and Hotelling's T2 test, and may imply higher biological inference power. In addition, the TIF weighted WKS test (PWEA ) reports 15% more positive prediction and still retains the higher level of CMC (see ‘Supplementary Materials’), suggesting that it is more sensitive than the original WKS test. Note that TIF weighting was not applied to Hotelling's T2 tests since this method performs principal component analysis which can not be applied to the topological information and moreover, one of the purposes of Hotelling's T2 test is to reduce the influence of correlation structure inside a gene set, thus weighting gene based on the correlation of neighbor genes is against its design.
We reviewed approaches to gene set enrichment analysis and attempted to clarify a number of concepts that are important for application to experimental data sets, such as preprocessing of raw data, imputation of missing data, the choice of null hypothesis, and methods for generating null distributions. Our analysis of null P-value distributions indicates that analytical background distributions are less accurate than simulated background distributions. As shown previously , the choice of gene set-level statistics is the most important component for gene set enrichment methods; however, and it is difficult to compare the performance of different gene set-level statistics when a gold standard is absent.
In order to address this difficulty we propose a new metric, CMC, essentially a positive predictive value where the true positives are determined by weighted overlaps between different methods. The results of testing 132 experimental data sets suggest that the Wilcoxon rank sum test and WKS test can better cover the predictions of the mean test, the median test, the χ2 test and Hotelling's T2 test, but not vice versa. We postulate that it is because WKS covers not just location shift but also shape changes of the observed distribution of differential expression compared with the background distribution and Wilcoxon rank sum test is robust to the extreme values. Since the WKS test reports more gene-sets than the Wilcoxon rank sum test in our experiments, it is likely to have higher sensitivity (see Supplementary Materials).
To further improve the biological utility of gene set enrichment analysis, we believe that the inclusion of additional biological features such as topology or covariates as discussed in the ‘Gene set-level statistics’ section would be more useful than changing statistics. Utilizing more domain knowledge is likely to reveal more insights in the analysis. The concept of gene set enrichment analysis has been applied to biological features in addition to expression, such as SNPs, copy number variation  and protein–protein interaction networks.
We collect all human gene expression data sets based on microarrays from GEO, and split each data set according to their phenotypes. All GEO data sets have proper gene name annotations for each probe-set. Subsets within the same GEO entry are scrutinized and only one subset is chosen to avoid redundancy. The data sets with fewer than 10 samples per phenotype were discarded. We imputed missing values of each test set using the LSimpute_gene algorithm , which construct weighted multiple regression models based on other genes that best correlated with the genes with missing values. We ensured all expression values were log transformed. Data sets that were normalized by approaches other than RMA or MAS 5.0 were also discarded. In total 132 test sets remained. We then perform t-test and use absolute values of t-statistic as the gene-level statistic of each probe-set. Probe-sets that share the same gene name were combined according to Stouffer's method .
We use 201 pathways from KEGG as the collection of gene sets for all analysis. We tested 5 gene set level statistics and one global test, Hotelling T2 test, which were reviewed favorably by Ackermann and Strimmer . A Hotelling T2 statistic for a gene set is calculated as follows:
where and are vectors having k (total number of genes in the gene set) elements, representing the mean expression levels of the genes in the gene set among two phenotypes, with nx and ny samples, respectively, and S−1 is the inverse of the pooled covariance matrix. The problem of Hotelling T2 test in practice is that when k>nx+ny, which is common, the singularity of S makes finding the inverse difficult. Kong et al.  solved the issue by using PCA (principal component analysis) to reduce the dimensionality, and we use the same approach to compute the Hotelling T2 statistic.
For all five gene set-level statistics (Table 1), we applied the method from Hung el al. to weight the gene level statistics by a topological influence factor (TIF). Hotelling's T2 statistic cannot be separated to gene-level and gene set-level, so TIF weighting cannot be performed. The significance levels (P-values) are calculated using simulated backgrounds and corrected for multiple testing using the FDR procedure .
To calculate CMC, we define Wk as the prediction profiles of predictor k under the FDR cutoff of 0.05. W is a two dimensional M by N matrix, where M is the total number of data sets (M=132) and N is the total number of gene sets (N=201). Wk (i,j)=1 if predictor k assigns a Q-value below the FDR cutoff for gene set j in data set i, otherwise Wk (i,j)=0.
The MC of a predictor k is defined as:
where S is the total number of predictors.
The numerator is the total number of votes that predictor k gets from other predictors and the denominator is the maximum votes it can get.
In order to control dependency among different predictors, we first model the probability π that predictor k agrees with predictor l in 500 null data sets, generated by shuffling the sample tags of an arbitrarily chosen human data set from the Gene Expression Omnibus (GEO; a public database of high throughput gene expression data), GDS2835 . Altering the total number of null data sets and the source experimental data set does not change π appreciably. π is defined as:
We use π as the weight of each vote accordingly and define CMC for predictor k as follows:
CMC is then a weighted ratio that indicates the fraction of predictions supported by other predictors.
National Institutes of Health (grant RR022971) (partially); National Science Foundation (grant DBI-0850008) (partially).
Jui-Hung Hung is a post-doc in Zhiping Weng's lab. He develops bioinformatics algorithms and tools.
Tun-Hsiang Yangis a post-doc in Charles DeLisi's lab. He primarily works on statistical genetics.
Zhenjun Hu is a research assistant professor in Charles DeLisi's lab. His research interests are mainly in computational methods for network analyses, integration and visualization.
Zhiping Weng is Director and Professor of Program in Bioinformatics and Integrative Biology at U Mass Medical School. Her lab develops and applies computational methods to study gene regulation, small silencing RNAs and protein docking.
Charles DeLisi is the Metcalf professor of science and engineering at Boston University. His lab focuses on all sorts of topics in systems biology.