Modern high-throughput methods deliver large sets of genes or proteins that can not be evaluated manually. For example, cDNA microarrays are used to measure the expression of a variety of genes under different conditions, e.g. in normal and cancer tissues. Usually, for each gene the expression quotient is computed and the genes are sorted by their expression quotient. The question of interest is whether over-expressed or under-expressed genes accumulate in certain biological categories, as for example biochemical pathways or Gene Ontology categories. To answer this question different approaches can be applied. First, the so-called "Over-Representation Analysis" (ORA) that compares a reference set to a test set of genes by using either the hypergeometric test or Fisher's exact test. Second, "Gene Set Enrichment Analysis" (GSEA) evaluates the distribution of genes belonging to a biological category in a given sorted list of genes or proteins by computing running sum statistics.
Performing GSEA for a biological category C and sorted list L of m genes of which l belong to C means that a running sum statistic RS is computed for L. RS statistics evaluate whether the genes of C are accumulated on top or bottom of the sorted list or whether they are randomly distributed. Hereby, the sorted list is processed from top to bottom. Whenever a gene belonging to C is detected, the running sum is increased by a certain number, otherwise it is decreased. The value of interest is the running sum's maximal deviation from zero, denoted as RSC. An example is provided in Figure for a list containing 8 genes of which 4 belong to C. The black graph corresponds to all possible running sum statistics. The red pathway represents the example where the first three genes and the seventh gene belongs to C. The RSC value of the red path is 12.
Figure 1 Example of possible running sum statistics. The figure shows all possible running sum statistics for an ordered list of 8 genes of which 4 belong to a functional category. The red labeled running sum statistic has a RSC value of 12 and the corresponding (more ...)
Usually, the p-value is computed by nonparametric permutation tests, i.e. RSC
is calculated for permuted gene lists. Two approaches to compute these lists exist. First, the sorted gene list is randomly permuted. Second, if L
is sorted by the median expression quotient of expression values in one group divided by the median expression value in another group, the samples are randomly assigned to the two groups and thereby permuted gene lists are generated. Notably, these methods do not always yield the same results. The permutation procedure is repeated t
times and the running sum statistics together with the corresponding maximal deviations from zero, denoted as RSi
}, are computed. Usually, the p-value computes as the fraction of RSi
values that are larger or equal than RSC
Since its development in 2003 [1
], Gene Set Enrichment Analysis has been enhanced [3
] and integrated in a number of analysis tools [4
]. Among the most popular programs are "ermineJ" [5
] and "GSEA-p" [6
]. These two tools estimate the significance values by using nonparametric permutation tests. However, such tests entail three disadvantages:
First, repeated runs of the permutation test algorithm may lead to different significance values because of the random sampling.
Second, the permutation test procedure causes problems if the significance values are small. Given a running sum statistic whose true p-value is 0.00001. If, as usual, 1000 permutation tests are performed, probably none will have a higher maximal deviation as the original running sum statistics. According to the formula given above, the p-value would compute as
, which may be a bad estimation. Since the next iteration may lead to a higher deviation, a more reasonable estimation would be
Since GSEA is often applied to many biological categories, p-values have to be adjusted for multiple testing by using Bonferroni Hochberg [7
], Benjamini [8
], or similar adjustment approaches. However, given the above estimation and the known multiple testing methods, the p-value cannot be adjusted in an appropriate way.
Third, it is difficult to estimate how many permutations should be performed to obtain a sample of reasonable size. Obviously, if m
= 20000 and l
= 2000, a sample size of 1000 permutations may be by far too small. Remarkably, the number of possible different running sum statistics amounts to
. On the example given above, the number of different running sums adds up to approximately 4·102821
, emphasizing that 1000 permutation represent a very small sample. The example shown in Figure indicates that for a sorted list of length 2000 and a biological category including only 14 genes 1000 permutations do not yield reliable significance values. The required large number of permutation tests leads to an unacceptable computational effort, especially if thousands of biological categories are tested. An alternative, parametric method is the so called Parametric Analysis of Gene Set Enrichment "PAGE" method [9
] that calculates a z-score for a given gene set and infers the significance value of this z-score against standard normal distribution.
Figure 2 p-value as function of permutation test number. For each number of permutation tests we performed 100 runs. The figure shows the mean value of these runs together with the respective standard deviation. The dashed lines represent the maximum and minimum (more ...)
In this study, we address the exact and efficient p-value computation for unweighted Gene Set Enrichment Analysis. Unweighted means that the number by which the running sum statistic is increased if a gene of C
is found and the number by which the running sum statistic is decreased if the gene does not belong to C
are constants. In our case, whenever a gene of C
is found the running sum is increased by m
, and otherwise it is decreased by l
. The dynamic programming method is similar to the "DRIM" approach ([10
]) that computes the optimal partition of a gene set in a target and a background set.
We integrated our dynamic programing algorithm into the gene set analysis tool "GeneTrail" [11
] that is freely available at genetrail.bioinf.uni-sb.de
. GeneTrail tests a wide variety of biological categories, among them Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways [12
], TRANSPATH pathways [13
], transcription factors [14
], Gene Ontology GO, [15
], granzyme B clevage sites [16
], and protein-protein interactions [17
]. GeneTrail relies on the Biological Information System BN++ [21
] that provides easy access to a wide variety of biological data.