Gene-set simulation experiment
Using a mouse-microarray kidney-transplant dataset, we assessed if GSEA and SAM-GS satisfy the following simple requisite properties for any methods designed to perform a gene-set analysis:
(a) If the gene set S consists of genes whose expressions are consistently not associated with the phenotype D, the method should not indicate that S is associated with D.
(b) If the gene set S consists of a mix of genes with moderate to strong and weak associations of expressions with the phenotype D, such that an appreciable subset of the genes in S are moderately or strongly associated with the phenotype D, the method should indicate that S is associated with D.
(c) The size of the gene set S should not greatly alter the statistical significance in (a) and (b).
We performed two tests using the mouse-microarray kidney-transplant dataset with simulated gene sets. For Test 1 intended to assess property (a), the gene sets were randomly generated from the null hypothesis region: the expression of genes in these null gene sets were not associated with the phenotype. A sensible gene-set analysis method should not identify these gene sets as statistically significant. For Test 2 intended to assess property (b), the gene sets were generated randomly from an alternative hypothesis region: the expressions of 50% of genes in each gene set were associated with the phenotype either moderately or strongly. A sensible gene-set analysis method should identify these gene sets as statistically significant.
Tables and show the percentages of 100 randomly-generated gene sets whose expressions were found to be associated with the phenotype with a p-value ≤ 0.05, by the gene-set analysis method of interest (GSEA or SAM-GS), under each gene-set sampling region considered. In Test 1, null gene sets were generated by randomly sampling genes from genes with |r| <c, where r denotes the Pearson correlation coefficient of the gene expression with the phenotype and c was 0.01, 0.02, 0.05, 0.1, 0.2, 0.3, 0.4, or 0.5. The results in Table indicate that GSEA does not satisfy requisite property (a), because it identified many of these null gene sets to be associated with the phenotype with a p-value ≤ 0.05. In Test 2, truly-associated gene sets were generated by randomly sampling half of the gene set's genes from genes with |r| = c and remaining half from genes with |r| <c, where c was 0.4, 0.5, 0.6, 0.7, 0.8, or 0.9. The results in Table indicate that GSEA does not satisfy requisite property (b), because it did not identify many of these truly-associated gene sets as statistically significant with a p-value ≤ 0.05. Moreover, GSEA does not satisfy property (c), as the performance of the method varies greatly with gene-set size.
Performance of GSEA and SAM-GS on Test 1. Proportions of randomly generated null gene sets that are identified by each method to be associated with the phenotype (p-value ≤ 0.05) in a mouse-microarray study.
Performance of GSEA and SAM-GS on Test 2. Proportions of randomly generated non-null gene sets that are identified by each method to be associated with the phenotype (p-value ≤ 0.05) in a mouse-microarray study.
The results of these tests illustrate two situations where GSEA fails. One is where genes in a gene set cluster somewhere other than in the strong-association region (e.g., all individual genes could have no or very weak association with the phenotype) and GSEA identifies the gene set to be statistically significantly associated with the phenotype. Figure illustrates this situation with one of the null gene sets of Test 1, where the gene set's genes clustered in the region of no or weak association with the phenotype, and yet GSEA p-value of this gene set was < 0.001. In short, GSEA will indicate that gene sets with any clear clustering are statistically significant, regardless of where the clustering occurs. The other situation is where a gene set has a mixture of moderately or strongly associated genes and weakly associated genes. This mixture within a gene set seems biologically plausible: not all genes in a phenotype-associated pathway will show changes in relation to the phenotype. GSEA has very poor power for detecting a differentially-expressed gene set under such mixed situations (Table ), unless the clustering of some of the gene-set members occur at the moderate-strong association region (e.g., Table with c = 0.9)
A statistically significant GSEA result. An illustration of a statistically-significant GSEA result with 100 genes selected at random from genes with no or weak correlation of expression with the phenotype (|r| < 0.4).
To check whether SAM-GS satisfies the three requisite properties of a gene-set analysis, the same tests were applied as for GSEA, using the same randomly-sampled simulated gene sets. The results of Tests 1 and 2 displayed in Tables and indicate that SAM-GS satisfies properties (a) and (b). Moreover, SAM-GS satisfies property (c), as its performance did not vary with the size of the gene set.
Gene-set analyses of the three datasets with biologically defined gene sets
We compared the performance of the two methods, GSEA and SAM-GS, on the analyses of biologically defined gene sets using three microarray datasets considered in [3
]: the sex, p53
, and leukemia datasets. The sex dataset consists of mRNA expression profiles from lymphoblastoid cell lines derived from 15 males and 17 females (the phenotype is sex). The p53
dataset consists of expressions for 50 cell lines from the NCI-60 collection of cancer cell lines, for which mutational status of the p53
gene has been reported, with 17 being classified as wild-type, and 33 as carrying mutations in the gene (the phenotype is the mutation status of p53
). The leukemia dataset consists of gene expression profiles of cells from 24 acute lymphoid leukemia (ALL) patients and 24 acute myeloid leukemia (AML) patients (the phenotype is ALL vs. AML). The pathways are organized in two catalogs, C1 and C2. The C1 catalog includes gene sets corresponding to human chromosomes and cytogenetic bands, while the C2 catalog includes gene sets that are involved in specific metabolic signaling pathways [3
]. The results by GSEA and SAM-GS are summarized in Table . In the sex-comparison analysis, the two methods agreed on the associations (FDR ≤ 0.01) with the three Y associated gene sets, the testis-expressed gene set (GSEA FDR = 0.02), and the gene set with genes that escape X inactivation. In addition, SAM-GS established an association with the chrXp22 gene set (SAM-GS FDR ≤ 0.16 vs. GSEA FDR = 1.00). In the p53
-comparison analysis, SAM-GS and GSEA agreed on a subset of gene sets with an FDR ≤ 0.01 that included the gene sets of hsp27, p53
_UP (GSEA FDR = 0.013), p53
hypoxia, radiation sensitivity (GSEA FDR = 0.07), and p53
(BioCarta). However, SAM-GS identified additional 31 gene sets with an FDR ≤ 0.01, all of which had an FDR ≥ 0.49 for GSEA. These gene sets are shown in Table . In the leukemia-comparison dataset, the two methods gave even more discrepant results than the p53
-comparison analysis. GSEA identified only five gene sets with an FDR ≤ 0.25 (none with an FDR ≤ 0.01), whereas all of the 182 gene sets were statistically significant (FDR ≤ 0.01) by SAM-GS. Note that the individual-gene analysis showed that 80% of the individual genes in this comparison had an FDR ≤ 0.25, which is in line with the gene-set analysis results of SAM-GS.
Results of the analyses of three datasets by GSEA and SAM-GS.
The 31 gene sets for which SAM-GS and GSEA strongly disagreed (SAM-GS FDR ≤ 0.01, GSEA FDR ≥ 0.49) in the p53 analysis.
These discrepancies between the two methods are summarized along with the sensitivity and specificity of the GSEA p-value ≤ 0.05 and the area under the receiver operating characteristic curve of GSEA p-value in predicting the SAM-GS p-value ≤ 0.05 (Table ). Specifically, sensitivity was calculated as the proportion of gene sets with GSEA p-values ≤ 0.05, out of all gene sets with SAM-GS p-values ≤ 0.05. Specificity was calculated as the proportion of gene sets with GSEA p-values > 0.05, out of all gene sets with SAM-GS p-values > 0.05.
Our Tests 1 and 2 suggest that GSEA does not meet some simple requisite criteria for a gene-set analysis method. In particular, Test 1 results suggest that, in a typical microarray experiment involving genes with different degrees of association with the phenotype, GSEA would frequently identify gene sets as statistically significant when all of its genes have observed expressions completely uncorrelated with the phenotype (e.g., Pearson correlation between -0.1 and 0.1). This is not a logical behavior for a gene-set analysis method. Biologically, if a gene set is identified as having expressions that are significantly associated with a phenotype, the gene set should contain at least some genes whose observed expressions are associated with the phenotype. Statistically, the false discovery rate of GSEA for a truly null gene set, tested among truly non-null gene sets, would be appreciably elevated because the observed correlations of the null-gene-set genes with the phenotype would tend to cluster near zero. Although the gene sets in Tests 1 and 2 are randomly-sampled simulated sets, they are not unrealistic gene sets. For example, a Test 1 situation was encountered in the analysis of the sex dataset, where GSEA gave the "cell-cycle arrest genes" a p-value of 0.015 in association with sex (SAM-GS p-value = 0.84). No gene in this gene set has an absolute value of the Pearson correlation of 0.33 or greater, or the SAM p-value < 0.06: this clustering is thus identified incorrectly by GSEA as showing a significant association, failing Test 1. A Test 2 situation was encountered, for example, in the analysis of the leukemia dataset, where GSEA failed to identify the gene set "chr10q24", even though 13 of the 43 genes in the gene set had absolute values of the Pearson correlation of 0.5 or greater (4 genes greater than 0.7) and the chromosomal location of the gene set is biologically relevant given the role of HOX11 in T-cell ALL. The use of GSEA is subject to appreciable false positive and negative findings, as illustrated by the two tests and the results shown in Table .
Another critical problem of GSEA is its relative ranking of genes in a gene set in relation to the other genes outside of the gene set. The use of a relative measure in GSEA, rather than an absolute measure, means that important information on the degree of association between each gene and the binary phenotype is discarded. For example, the leukemia dataset had 80% of its 10,056 individual genes with an FDR ≤ 0.25. Regardless of whether such clear differences in gene expression across the binary phenotype are determined by biology, or by more mundane (and biologically irrelevant) differences in sample collection or handling, a gene-set analysis of this dataset should find that many gene sets are associated with the phenotype. GSEA, however, found only five gene sets with an FDR ≤ 0.25 in the leukemia-comparison analysis, inconsistent with the individual-gene analysis results. The cause of the inconsistency is the use of the relative ranking in GSEA. In contrast, SAM-GS found all gene sets in the leukemia dataset to have an FDR ≤ 0.01.
A related, perhaps less serious, issue with GSEA is that, when an individual gene set is of biologic interest, the SAM-GS analysis requires measurement only of the expression of the genes in the gene set to construct the test statistic (except the calculation of s0), whereas GSEA requires measurement of the expression of all genes to provide a relative ranking of all genes. The expression levels of the other genes should not affect the inference on an individual gene set of interest, if the individual set is, indeed, the only biologically relevant variable.
Another problematic aspect of GSEA is that its enrichment score considers genes with positive and negative associations with the phenotype separately, even when they have the same degree of associations with the phenotype. Thus, a gene set with a mix of genes with positive and negative associations with the phenotype, although biologically plausible (for instance, due to feedback loops in pathways), is not appropriately evaluated for association with the phenotype by the enrichment score and, therefore, has an improperly low probability of being detected as a phenotype-associated gene set by GSEA.
A gene-set analysis utilizes existing biologic knowledge that maps individual genes into gene sets or pathways. Because of the utilization of existing knowledge in the analysis, a well conducted gene-set analysis can be remarkably powerful. The p53 analysis illustrates this point. Although a very small proportion of individual genes had low p-values in the p53 dataset, SAM-GS indicated larger proportions of gene sets with low p-values. This is because a valid gene-set analysis would take into account a tendency among multiple genes in a gene set. Thus, even if the association of each gene with the phenotype is only moderate, a collection of such genes can be indicated as a phenotype-associated gene set; genes in a gene set need not have the same degree and direction of association with the phenotype for the gene set to be identified as statistically significant by SAM-GS.
In addition to the leukemia-comparison analysis discussed above, which showed an advantage of SAM-GS over GSEA empirically through the consistency of the gene-set analysis results with the individual-gene analysis results, the other two DNA-microarray analyses (sex- and p53
-comparison analyses) provided empirical biologic evidence supporting the advantage of SAM-GS over GSEA. Regarding the sex-comparison analysis, Subramanian et al. [3
] specifically argue that they would not expect to find enrichment for bands on the X chromosome because most X-linked genes are randomly silenced in females and, therefore, are unlikely to show a male-female (gene-dose) difference. This argument has general merit; however, in the specific case of the chrXp22 gene set, it does not hold because, on the distal portion of the short arm of X, there is a cluster of genes that escape X-inactivation. Indeed, the top five genes of the chrXp22 gene set escape inactivation: two of the five are members of the X-inactivation-escape gene set whose FDR was ≤ 0.01 by both methods; and the other three have been reported to escape X-inactivation [6
The differences in the results of the p53
-comparison analysis illuminate biologically relevant performance differences between the two methods. It is appropriate to ask whether the 31 additional pathways identified by SAM-GS over GSEA are plausibly associated with the presence vs. absence of p53
. Of the 31 gene sets, 11 actually involve p53
directly as a member. A further 6 gene sets directly involve the extrinsic and intrinsic apoptosis pathways [9
], 3 involve the cell-cycle machinery, and 3 involve cytokines and/or JAK/STAT signaling [10
]. Each of these 12 gene sets, then, is in a direct, well-established relationship with aspects of p53
signaling. Of the remaining 8 gene sets, 6 have plausible, if less well established, links with p53
. In the Ck1 pathway, cdk5 phosphorylates p53
so the presence vs. absence of p53
is likely to modify profoundly the effectiveness of this pathway [11
]. Ets1 (ets pathway) has been shown to be essential, in mouse embryonic stem cells, to maintain the ability to undergo UV-induced, p53
-dependent apoptosis. Ets1, more broadly, may be necessary for p53
-dependent gene transactivation [12
]. Akt and p53
are, respectively, essential to the transduction of anti-apoptotic and pro-apoptotic pathways. There is an integrated negative feedback loop whereby p53
-dependent down regulation of Akt promotes cell death but cell survival signals will recruit Akt, leading to activation of Mdm2 and the inhibition of p53
-dependent apoptosis [13
]. This may account, in part, for the association between the presence vs. absence of p53
and differences in the SA-TRKA receptor pathway. Proline oxidase is induced by p53
and mediates apoptosis via a calcineurin-dependent pathway [14
]. Coproporphyrinogen oxidase (CPO) is a key compound of the MAP 00860 porphyrin/chlorophyll metabolism gene set. It catalyzes a rate-limiting step in heme biosynthesis and may contribute to mitochondrial redox balance. It has recently been shown to be regulated by p53
]. Finally, the Wnt and p53
pathways have also been shown to be linked via pro-apoptotic Dkk1, a wnt antagonist [16
Regarding the form of SAMGS
is simply the L2
-norm of the t-like-statistic vector d
), the length of the line segment joining the two phenotypes' mean gene-expression vectors of a gene set S
. Our null hypothesis is that the mean vectors of expressions of genes in a gene set S
do not differ by the phenotype of interest (i.e., this line-segment length is zero), a two-sample multivariate mean test in statistics. The classical multivariate statistics method for a two-sample mean test, Hotelling's T2
, addresses this question, but it cannot be applied when |S
| > n1
- 2, where n1
are the sample sizes in the two groups defining the phenotype D
. We would like to emphasize that this condition is often unmet in gene-set analyses of DNA microarray data. Dempster [16
] introduced a test statistic for comparing highly multivariate samples of two groups, an alternative for Hotelling's T2
, when the number of multivariate measurements is large, relative to the sample sizes. Using Dempster's test in the context of microarray data, a potential candidate for a test statistic to be used in Step 2 of SAM-GS, would be the weighted Dempster's (WD) statistic:
in the denominator is the average of (n1
- 2) statistically-independent quantities that have the same mean and variance as the numerator
under the null hypothesis, created by an orthonormal transformation of multivariate gene expressions in the set S
. This test statistic seems to have the advantage of taking into account the multivariate structure of the gene expression measurements in a gene set by dividing the numerator, the L2
-norm of the mean-vector difference, by its approximate expectation. However, since a permutation-based test is used, the denominator of WD statistic is unnecessary: as Dempster [18
] stated, a permutation test based on the numerator only is equivalent to using the quotient. Given the computational simplicity and the use of permutation in SAM-GS, the L2
-norm used in SAMGS
is preferred over WD.
-norm of d
can be considered, similar to Chung and Fraser [19
] who proposed the L1
-norm as an alternative to Dempster's use of the L2
-norm. While one might expect the two norms to give similar performances overall, since the L1
-norm would be less sensitive to extreme values than the L2
-norm, the L1
-norm may be less powerful in detecting a gene-set with a small number of genes being strongly associated with the phenotype. Test 2 simulation above confirmed this point: as the proportion of genes in a gene set that are correlated with the phenotype (|r
| ≥ 0.6) becomes smaller than approximately 30%, the two norms performs differently and the L1
-norm is less powerful in detecting the gene set being associated with the phenotype (data not shown).
To account for multiple comparisons (statistical testing of many hypotheses) when multiple gene sets are to be tested, SAM-GS takes the same approach as SAM, estimating a q-value, an upper limit for the FDR, for each gene set. The q-value of a gene set can be determined solely from the p-values of all gene sets tested [20
]. The collection of p-values of all gene sets contains information, not only on the statistical significance of each gene for its association with the phenotype, but also on the proportion of gene sets that are not associated with the phenotype, the "null gene-set proportion." Note that the null gene-set proportion is determined by biology: the phenotype is either biologically associated or not associated with each gene set. However, the p-value is a function of sample sizes and noise levels in gene-expression measurements as well as the degree of underlying biological associations. Thus, even if a strong biologic association between a gene set and the phenotype exists, because of small sample sizes and/or high measurement noise levels (features of many DNA microarray observations and experiments), the p-value of the gene set can be large. This is another aspect of the p53
analysis discussed above, where many gene sets have low FDR estimates in spite of the fact that the p-values are not correspondingly low: this is due to an estimated small null gene-set proportion which lowers FDR estimates.