|Home | About | Journals | Submit | Contact Us | Français|
Since its first publication in 2003, the Gene Set Enrichment Analysis (GSEA) method, based on the Kolmogorov-Smirnov statistic, has been heavily used, modified, and also questioned. Recently a simplified approach, using a one-sample t-test score to assess enrichment and ignoring gene-gene correlations was proposed by Irizarry et al. 2009 as a serious contender. The argument criticizes GSEA’s nonparametric nature and its use of an empirical null distribution as unnecessary and hard to compute. We refute these claims by careful consideration of the assumptions of the simplified method and its results, including a comparison with GSEA’s on a large benchmark set of 50 datasets. Our results provide strong empirical evidence that gene-gene correlations cannot be ignored due to the significant variance inflation they produced on the enrichment scores and should be taken into account when estimating gene set enrichment significance. In addition, we discuss the challenges that the complex correlation structure and multi-modality of gene sets pose more generally for gene set enrichment methods.
The availability of global transcription profiling with microarrays in the mid 1990s made possible the analysis and interpretation of the activity of the entire transcriptome to provide insights into biological function and the mechanisms of disease. Early on, it became clear that focusing on long lists of differentially expressed genes gave limited understanding of underlying pathways and processes. Several approaches to consider testing for the over-representation of gene categories rather than genes were introduced by [1,2,3,4,5,6]. In  we introduced a knowledge-based approach analysis method, Gene Set Enrichment Analysis (GSEA) to address this problem. Briefly, this initial approach ranked genes by their differential expression between two phenotypes, used the Kolmogorov-Smirnov statistic to score the enrichment of an a priori defined sets of genes that share common biological function, chromosomal location or regulation, and evaluated the significance of the score using an empirical permutation test correcting for multiple hypothesis testing. Thus, GSEA provided a prioritized list of sets of genes for hypothesis generation and further study. In this first application we identified a set of genes involved in oxidative phosphorylation with reduced expression in diabetic patients. None of these genes were down-regulated by more than 20%, but as a group their coordinate down-regulation was significant and, together with subsequent work, lead to a better understanding of the regulation of the oxidative phosphorylation pathway [8,9] as many its components turned out to be controlled by the PCG1A transcription factor, which was itself down-regulated in diabetic patients.
As soon as this original version of GSEA appeared, objections were raised to the approach , some of which were immediately refuted in , and the rest were met by our subsequent improvement of the GSEA methodology. In Subramanian and Tamayo et al.  we introduced a version of GSEA that used a correlation-weighted Kolmogorov-Smirnov statistic, an improved enrichment normalization procedure, and an FDR-based estimate of significance that collectively made GSEA appreciably more sensitive, more general, and more robust. As a result of these improvements, and the public availability of the software and companion Molecular Signatures Database (MSigDB) [www.broadinstitute.org/gsea], GSEA became a widely used method and was applied to numerous problems across many application domains. Notably, since the original release of the software and database in 2005, the number of GSEA user registrations has grown to over 33,000, and the method used and cited in more than 3,100 scientific publications. GSEA and other gene set analysis methods have also motivated the development of general statistical methodologies for large-scale inference for “sets” of variables [13,14].
The specific knowledge-based approach pioneered by GSEA is now standard practice in the analysis of gene expression data and inspired the development of a large and growing family of conceptually similar methods. For example, Huang et al.  identified at least 68 different gene set enrichment methods in their survey. A family of popular methods estimate the over-representation of Gene Ontology (GO) annotations using a hyper-geometric statistic or Fisher’s exact test (e.g., GoMiner , FatiGO , GoSurfer , EasyGo , David ). These methods restrict consideration to the “top” of the list, and may miss more subtle signals. They also assume gene independence and thus produce overly optimistic results [15,21,22,23,24,25]. In addition, several improvements to gene set enrichment analysis itself have been proposed. These include those used in , GSA , SAFE , Catmap , ErmineJ , and SAM-GS , and PROPA . They employ alternative ranking metrics, enrichment statistics, and several variations on significance estimation schemes. Notably,  demonstrates the difficulty of finding a single, optimal statistic due to the complexity, heterogeneity and multi-modal distribution of the expression levels of genes within gene sets. Other somewhat more sophisticated methods (e.g., FunNet , PARADIGM , COFECO ), take a network-based approach, but restrict the analysis to processes where a deeper understanding of gene-gene interactions is already available. The primary advantages of GSEA are that it only requires gene set membership information to compute enrichment scores, considers the entire ranked list of genes, and maintains the gene-gene dependency that reflects real biology. This yields a good compromise between sensitivity, performance and applicability.
Recently, Irizarry et al.  in their “Gene Set Enrichment Made Simple” article proposed a “simpler” approach to gene set expression analysis assuming gene independence and using a one-sample t-test to estimate enrichment. Here we will refer to their method as SEA (Simpler Enrichment Analysis). The rationale for SEA is based on their perception that that gene independence is a reasonable simplifying assumption and thus simpler parametric approaches to gene set analysis have been ignored. Both of these assumptions disregard a large body of literature where many authors have already introduced “simple” parametric methods for gene set analysis [38,39,40,41,42]. Many researchers have demonstrated the unrealistic nature and limitations of the gene independence assumption [21,22,23,24,25,43]. In addition, they criticized GSEA for its use of an empirical null distribution, which they argue is unnecessary and hard to compute; and for using a non-parametric weighted statistic that they believe inherits the lack of sensitivity of the original Kolmogorov-Smirnov statistic. The paper concludes by proposing SEA as a serious contender and argues against the use of GSEA in any of its forms.
The aim of this paper is twofold. First, we present the SEA approach and contrast it with GSEA using the statistical framework of . We then carefully consider Irizarry et al.’s  criticism of GSEA and refute their claims by systematically comparing SEA and GSEA on a large benchmark set of 50 datasets. We show, in agreement with earlier observations, that the gene independence assumption is not realistic because gene correlations are non-trivial and produce a substantial amount of variance inflation in the global statistic that in turn produces a large number of false positives results. Second, we discuss the challenges that the complex correlation structure and multi-modality of gene sets pose for gene set enrichment methods in general and propose that future progress in gene set analysis will result from improving the resolution of the gene sets and better ways to model the complex gene set correlation structure.
Here we review first the SEA approach proposed in  and then the GSEA approach proposed in  using the statistical framework of Barry et al. 2008 . Several other valuable and complementary statistical frameworks for gene set analysis have been introduced in recent years [13,25,44,45,46].
First we will define the quantities relevant for the analysis: the input gene expression dataset X corresponds to N genes and M samples and contains gene expression profiles xij, where i indicates a specific gene (row) and j a specific column (sample). The relevant phenotype of interest is defined as a vector Y of M binary values categorizing the sample in two groups (Y0, Y1). The gene sets are represented as gk.,where k runs from 1 to K, the maximum number of gene sets. We define a gene set enrichment method as a two-stage procedure incorporating i) a local test statistic si (xi, y)that measures the association between a gene expression profile (xi•) and the phenotype (Y); ii) a global test statistic Si to assess the degree of the gene set’s enrichment, and iii) a specific null hypothesis and error rate controlling procedure to assess statistical significance and provide a final sorted gene set enrichment result list. The global statistic can be either a parametric or non-parametric function of the local test statistics corresponding to the gene set in question (“self-contained” statistic) and potentially all other genes (the complement or “competitive” statistic);
The SEA method uses as local test statistic a two-sample t-test statistic ti that quantifies the magnitude of differential gene expression for each gene,
where the expected values and standard deviations (σ) are computed in each phenotypic class (Y0 / Y1). The SEA method can also be used with modified versions of this score  or with a signal to noise ratio as is used in GSEA. SEA’s global test statistic is a one-sample t-test (z-score) that is used to estimate the “enrichment” of the entire gene set,
The SEA method assumes that these global statistics are independent and identically distributed (gene-independence), proposes a theoretical normal-theory null hypothesis,
and estimates FDR q-values following the methodology of Storey 2002 . Realizing that this global statistic only detects changes in location and fails to detect other more complex types of differential gene set behavior, the SEA method adds a second χ2 global statistic,
and, as in the case of the z-score, an associated normal-theory null hypothesis,
with the FDR computed in the same way as for the z-score. The SEA method is therefore based on the assumption that gene-gene correlations and gene set overlaps have negligible effects and therefore both global statistics (z-score and χ2) are independent and identically distributed under normal-theory null hypotheses and . The SEA null hypotheses are instances of the gene-sampling “Class 1” null hypothesis of . We will say more about the applicability of this assumption to real datasets in the next section below.
Before concluding our review of the SEA method, we point out that the second global statistic ironically renders SEA as less “simple,” since the potential user now has to consider two sets of results and their corresponding FDRs. Moreover, no formal procedure is specified by SEA in order to produce a final single list of results. Instead, the authors suggest choosing high scoring gene sets according to either one of the two global statistics.
The GSEA method uses a signal to noise ratio as its local test statistic,
Where σ′ are the intra-class standard deviations thresholded from below at 20% of the class means,
GSEA uses a weighted Kolmogorov-Smirnov global statistic to assess gene set enrichment,
An important fact, not often appreciated, is that this global statistic is weighted using a power of the local statistic |sh|α (typically with α =1) and it is therefore much more sensitive to differences at the top and bottom of the gene list than the standard Kolmogorov-Smirnov statistic. Because this statistic cannot be expressed as a simple function of the local statistic, it posses challenges to formal analysis based on parametric modeling.
The global statistics depend on size of gene set and therefore are not identically distributed. GSEA addresses this issue by normalizing values to factor out the intrinsic gene set size dependence. The relevant normalization is a change or scale using the expected value of the positive (negative) null distribution statistic induced by sample permutation,
This rescaling is motivated by the asymptotic behavior of the Kolmogorov-Smirnov statistic (for details see supplementary information in ). This normalization effectively puts the gene set enrichment scores on the same scale analogous to how the factor does it for the SEA z-score. This makes it possible to define a null distribution for the GSEA global statistics assuming a null distribution induced by sample permutation,
The null distribution corresponds to the Class 2 type subject-sampling null hypothesis of .
Our review of both methods clearly shows that SEA and GSEA differ in two important aspects: the choice of global statistics (z-score combined with χ2 vs. weighted Kolmogorov-Smirnov) and the specific type of null hypothesis being assumed: gene-sampling/Class 1 vs. subject-sampling/Class 2.
The authors of SEA apply the method to the handful of examples used to introduce GSEA in  and make the following claims in favor of the gene independence hypothesis and against GSEA:
We will refute the authors’ claims below by studying the gene independence assumption behind the SEA method and by performing an empirical study of 50 datasets where we focus on the consequences of assuming gene independence.
In their SEA approach Irizarry et al.  justify the gene-independence assumption based on the seemingly “straight” behavior of the gene expression scores in a handful of Q-Q plots (Figure 3, Irizarry et al. ) and conclude that “Barring a few outliers, which are likely associated to differentially expressed genes, the assumption appears appropriate in all datasets.” We believe this is an over-simplification and that the gene independence assumption is not appropriate in general. Here we present the results of an empirical study to systematically evaluate SEA and GSEA and assess the effect of the gene independence assumption in a representative benchmark set of 50 expression datasets, including some from the GSEA paper , as well as many more from GEO (Gene Expression Omnibus), the InSilico DB database of datasets [http://insilico.ulb.ac.be] and others from the literature. The complete list of these publicly available datasets is included in Supplementary Table ST1.
This benchmark set is much more comprehensive and representative of the universe of datasets that may be used in gene set enrichment analysis. Most of the datasets derive from more recently generated data than the original GSEA examples, contain expression levels for greater numbers of genes, and display a much larger variety of phenotypic distinctions.
In order to perform our comparative analysis we ran implementations of GSEA and SEA on all 50 benchmark sets and computed the corresponding results tables, including enrichment scores, p- and q-values for each gene set. In order to make a fair comparison of both methods we computed the GSEA q-values using exactly the same procedure as in SEA (i.e., computing q-values using the nominal p-values as inputs to the q-value R function/package ). For SEA, in order to produce a single score per gene set, we generated both proposed scores (z-score and χ2) and chose the one with smaller p-value as was suggested in .
First we note that SEA uniformly produces many more significant gene sets than GSEA. For example in the Pancreas dataset , featured in Fig. 1B, GSEA produces 121 significant gene sets, out of a total of 1,368, at the suggested threshold (q-value < 0.25). In contrast, SEA produces 570 significant gene sets at the most stringent threshold of q-value < 0.05. This number is almost 5 times more than GSEA and accounts for 42% of the total number of gene sets. Similar remarkably large results sets are produced by SEA in other datasets. This overproduction of significant results is further exacerbated in newer more comprehensive datasets with larger numbers of genes and stronger gene-phenotype correlations. SEA produces are large number of significant sets many of which we suspect are false positives due to the assumption of gene-gene independence. It is, therefore not surprising, that among the SEA sets we find many of the significant results produced by GSEA .
It is difficult to assess in more direct terms the specificity/sensitivity of each method and specifically the exact percentage of false positives in SEA results because in general we do not have validated “ground truth” results for any given dataset. However, in our study we will use the methodology of Gatti et al. 2010  where for each benchmark dataset, besides the observed global statistics corresponding to the relevant phenotype, we produced results for 1,000 randomly permuted phenotypes. Because the phenotypes have been randomized there is no significant correlation structure between the class labels and the gene profiles but the gene-gene correlations are preserved. We will use the observed and random-permutation global statistics to perform two groups of analyses: i) a study of the amount of variance inflation and the ii) inflation of p-values.
One of the effects of gene-gene correlations and dependency structure in microarray datasets is the increase of variance in the global statistics. For example the SEA null hypothesis assumes the global statistics are i.i.d. and normally distributed,
Barry et al 2008  studied the effects of gene dependency and found that for a difference of means statistic, quite similar to the SEA z-score, and gene set sizes that are small compared with the length of the gene list, the true variance of the statistic will differ from that under the i.i.d. class 1 gene-sampling null by a “variance inflation” factor Γ that can be approximated by,
where is the average correlation between the global statistic (z-score) inside a gene set. For global statistics that are linear functions of the local statistic, can be approximated with , the average correlation between gene expression profiles of genes in a dataset. From eq. 12 it is evident that the variance inflation increases with the product of the average intra-gene set correlation with the size of the gene set. Therefore, a small number of positive correlations within a gene set can result in substantial variance inflation if the gene set is large enough. For SEA the assumption of gene independence implies and therefore a variance inflation of one. Gatti et al. 2010  studied about 200 real datasets and found strong variance inflation effects, roughly in the range from 1 to 6, as a result of rather modest positive gene correlations. Their results demonstrated the importance of this effect and the unrealistic nature of the gene sampling (class 1 and SAE) null hypothesis.
In order to further investigate this issue, we analyzed the variance of the distribution of SEA z-scores under 3 null distributions: sample/phenotype-sampling (class 2), gene-sampling (class 1) and SEA (theoretical normal null). In order to produce histograms we computed z-scores generated by 1,000 permutations of the samples and 1,000 permutations of the gene identifiers. Figure 1 shows the histograms of z-scores for P53  and Pancreas  datasets from the benchmark set.
The histograms of z-scores for 1,000 permutations of the samples (shown in grey in Fig. 1) show clearly how the gene correlations noticeably increase the width of the distribution and consequently produce significant variance inflation. For example, in the P53 and Pancreas datasets the variance of the z-score distribution for the sample permutations are 2.9 and 4.12, respectively. In contrast the variances for the gene-sampling distributions are 0.96 and 1.02, which are very close to the SEA null value of 1.
Additional insight can be obtained from the histograms of individual average gene set internal correlations and variance inflation as shown in Figure 2. The gene correlations are on average mostly positive and relatively small, e.g., in the range [0, 0.30] (Figs 2A and 2C), but their effect on variance inflation can be significant and equivalent to a few times the variance of the SEA null distribution as estimated by eq. 12 (Figs 2B and 2D).
Because of the strong effects of the gene correlations on the global statistic distribution the sample/phenotype-sampling class 2 (GSEA) null hypothesis is a much more realistic representative of the situation found in real datasets where there is always a nonnegligible amount of gene correlations.
The variance inflation in the global SEA statistic presented in the previous section is not only of academic interest. It has deleterious consequences in the form of high rates of false positives and inflated p-values. In Figure 3 we show histograms of p- and q-values for SEA and GSEA applied to the P53 (Fig 3A) and Pancreas datasets (Fig 3B) for 1000 random permutations of their phenotype labels. Recall that these permutations are performed to eliminate, to the extent possible, the biological differences between the two resulting groups. SEA produces a spurious over-population of low p-values that can be seen as a spike on the left side of the histogram in panel I of both Fig. 3A and Fig. 3B. These spikes include at least about 15% of all gene sets against the P53 dataset and about 20% of all gene sets against the Pancreas dataset.
The over-population of low p-values translates into a significant number of gene sets with low q-values (panel II of Figs. 3A and 3B). Using a q-value threshold of 0.05 this spurious population includes 11.5% of all gene sets against the P53 dataset and 20.1% of all gene sets against the Pancreas dataset. In contrast GSEA produces a practically flat histogram of p-values and no significant numbers of gene sets with low q-values as expected (Figs. 3A and 3B, panels III–IV). In this case, for example, the number of gene sets with q-values less than 0.05 is only 0.0248% for the P53 dataset and practically zero for the Pancreas dataset. Thus we see that SEA produces many significant gene sets in the absence of a biological signal as a consequence of ignoring gene-gene correlations. This is a major drawback of the method.
In Figure 4 and Table ST3 we summarize the randomized phenotype results for the entire benchmark set by showing the percentage of gene sets with q-values less than 0.05 and 0.25 by SEA and GSEA. SEA uniformly produces a large number of false positives regardless of the choice of FDR threshold. To further demonstrate that this inflationary effect is indeed produced by the gene-gene correlations we used the same 1,000 randomly permuted phenotypes as before, but we also randomly permuted the gene identifiers effectively destroying the gene-gene correlations. In this case, SEA shows similar behavior to GSEA and neither shows inflation of p-values (Fig. 5 A–B).
The above analysis convincingly demonstrates that ignoring gene-gene correlations and using the class 1(gene sampling) theoretical null distribution  have very negative consequences on the final results of SEA, and in fact on any enrichment method that assumes gene-gene independence. These results are very similar to those obtained by Gatti et al. 2010  using different data sets and a similar, but different, statistic.
All these results cast doubt on the validity of Irizarry et al.’s claims (i)–(iv) above.
The heterogeneous distribution of the expression of genes in gene sets poses a technical challenge to properly define enrichment scores. This derives from the fact that most gene sets are noisy and imperfect, or too context-dependent to be modeled with a simple assumption or statistic. Sometimes the genes sets are derived from generic “textbook” descriptions of a biological process, e.g., Biocarta pathways or GO ontologies, and thus have little relevance to coordinately expressed components of these pathways. In other cases, they are defined in a specific cellular context different from the one in which an investigator wishes to assess enrichment or they are a mixture of multiple biological processes that may not occur coherently in any single biological sample. For all these reasons, the distribution of gene scores displays rather complex multi-modal behavior and this it turn makes it difficult to define a single enrichment score that will work well across gene sets and datasets.
Examples of this complex behavior are illustrated in Figure 6, where the behavior of three selected gene sets is shown across three of the benchmark datasets. The different panels illustrate the complex multi-modal behavior of gene sets that may occur. We note that in panels I, III, V and IX, a relatively high enrichment score of the gene set is produced by a subset of the genes, rather than the entire gene set. The genes responsible for the enrichment appear at either the top or bottom of the ranked gene list and are representatives of relevant biological processes that are indeed enriched in the studied phenotype. These examples are typical and show the complexities in modeling that gene set enrichment analyses encounter in practice. They also explain why it is difficult to describe the behavior of gene sets analytically and why overly simplistic assumptions such as those used in  are not likely to work. By taking these complexities into account, one can better appreciate the motivation for the weighted Kolmogorov-Smirnov statistic as a good compromise between expecting all or most genes to be coherent on one side, and overweighting one or a few genes and allowing them to dominate the score on the other. It provides a reasonable distribution-free, but empirically adaptable, way to deal with the limitations and idiosyncrasies of real life gene sets. It may not be the most powerful statistic for any given simplistic circumstance, but the modification we made in  of weighting by phenotype correlation is sensitive where it has to be and deals well with the behavior of real gene sets.
The SEA method justifies the gene independence assumption based on the apparent normality of the local statistics (Q-Q plots), and the assumption that because intra-gene set gene correlations do not impact significantly the distribution of the global statistics. These assumptions are not supported by our empirical results. The strong effect of gene-gene correlations is well known and the need to take into account this dependency structure as part of gene set analysis has been well documented [22,23,24,25]. Of special note is the study of Gatti et al.  who performed a large empirical study of 200 datasets in their suggestively titled article “Heading Down the Wrong Pathway: on the Influence of Correlation within Gene Sets” and also demonstrated the practical impact of strong gene-gene correlations patterns and strongly criticized the use of the gene independence assumption.
One limitation of the SEA study  is that they evaluated their method, and obtained their conclusions, in only a handful of datasets, namely the examples from the GSEA paper . Those datasets were good examples to illustrate the use of the recently revised GSEA method at that time, but they do not constitute a comprehensive benchmark to systematically study the properties of a new method or to perform a comparison between competing methods. The collection of gene sets they used (MSigDB in 2005) is also relatively small (522 gene sets) compared with later releases of the same collection (e.g. 1,893 gene sets in MSigDB v2.5) where the overlap between gene sets is more significant.
The authors of SEA also criticized the use of the Kolmogorov-Smirnov statistic in GSEA based on their view that it lacks sensitivity and is rarely used (their claim (v) above). However they failed to appreciate the fact that the weighted version used in GSEA is not the standard Kolmogorov-Smirnov statistic and was developed specifically to be more sensitive to differences in the tails of the distribution. Non-parametric statistics based on empirical cumulative distribution functions are the basis of new and powerful “goodness of fit” tests .
Irizarry et al.  also listed as one of the motivations for SEA the fact that GSEA is slow and hard to compute (claims (iii) and (iv)). In Table 1 we show typical CPU execution times (Mac book Pro 8.3) for SEA and GSEA applied to two benchmark datasets (P53 and Pancreas). As seen in the table, SEA is indeed faster than GSEA because it avoids the generation of an empirical null distribution. However, a typical GSEA run takes only about 7 minutes and for almost all users this is an acceptable running time. Moreover, SEA may be faster but this speed up comes at the high cost of large numbers of false positives as described above.
We have shown strong empirical evidence that gene-gene correlations cannot be ignored and should be taken into account by gene set enrichment methods. Our results agree with the extensive literature providing theoretical or empirical evidence against the gene independence assumption [22,23,24,25] and counter the chief assumption of SEA .
We benchmarked SEA against GSEA in a collection of 50 expression datasets. By randomizing phenotypes, we demonstrate that gene-gene correlations produce significant variance inflation in SEA results, which also exhibit very high false positive rates and significant numbers of inflated p- and q-values. Based on our empirical results we have refuted the claims of Irizarry et al.  and more broadly we recommend that methods that ignore gene-gene correlations, such as SEA, be avoided.
Finally, we believe that the most important improvements that can be made to gene set enrichment methods are i) the improvement of gene set databases, like the MSigDB, so that their sets are less redundant and have more coherent behavior in actual biological samples; and ii) the development of more sophisticated methodologies that more accurately take into account or model gene correlations and the dependency structure in the data (i.e., the opposite of the SEA approach).
By improving the resolution of gene sets we may overcome many of the limitations we have described above: noisiness, redundancy, multiple-process representation, poor specificity, etc. We are currently experimenting with the creation of a “hallmark” collection of gene sets in MSigDB that will contain more targeted representations of biological processes. In addition, we are investigating the generation of coordinately expressed gene sets derived from the activation or repression of pathways in the laboratory.
Our group, and many others, are currently engaged in efforts to improve gene set analysis through better modeling of the datasets’ correlation structure and by introducing additional information about the behavior of genes, (e.g., on a sample per sample basis and in supplementary datasets) as part of the gene set analysis. There are indeed many alternative approaches to pursue these goals and here we conclude by listing just a few: single-sample gene set enrichment [52,53,54], computational and theoretical methods for assessing size and effect of correlation in large-scale testing , eigenvalue-decomposition of covariance matrixes , rotation-based sampling , correlation-adjusted t-scores , modeling dependency among the genes within and across each gene set  and multiple testing procedures using dependency kernels . Given the complexities of genomic data it is worthwhile to pursue new methodologies such as these; however, we should resist the temptation to over-simplify and remember the admonishments of Pearson and Tukey:
“…it is not enough to know that a sample could have come from a normal population; we must be clear that it is at the same time improbable that it has come from a population differing so much from the normal as to invalidate the use of the ‘normal theory’ tests in further handling of the material.”--E. S. Pearson
“Far better an approximate answer to the right question, than the exact answer to the wrong question, which can always be made precise.”--J. Tukey
Our implementation of GSEA is as described in Subramanian and Tamayo et al. . The implementation of SEA followed the description given in Irizarry et al. . We validated our implementation by exactly reproducing the t-test score and χ2 statistics from the result set of Irizarry et al.  in one of the datasets.
We obtained the differential gene-expression scores using the signal to noise ratio as computed in GSEA and in SEA. We also normalized the differential gene expression scores by subtracting the median and dividing by the median absolute deviation as is done in SEA
To perform the analyses of variance inflation and over-production of false positives and inflated p-values we followed the approach of Barry et al. 2008  and Gatti et al. 2010 . For each dataset in the benchmark we randomized the phenotype labels 1,000 times and ran both algorithms. We also computed enrichment scores by randomizing the gene labels. The p-values are computed using the areas under the empirical null histograms fro GSEA and areas under the normal distribution for SEA. The q-values were computed using the R function qvalue from package qvalue which implements the method described in Storey and Tibshirani .
We thank Helga Thorvaldsdóttir, Aravind Subramanian and other members of the Cancer Program at the Broad Institute for suggestions, and Jon Bistline for help with references and formatting the figures. This project was supported in part by grant from the National Cancer Institute (R01-CA121941).