|Home | About | Journals | Submit | Contact Us | Français|
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Gene set enrichment analysis (GSEA) is a microarray data analysis method that uses predefined gene sets and ranks of genes to identify significant biological changes in microarray data sets. GSEA is especially useful when gene expression changes in a given microarray data set is minimal or moderate.
We developed a modified gene set enrichment analysis method based on a parametric statistical analysis model. Compared with GSEA, the parametric analysis of gene set enrichment (PAGE) detected a larger number of significantly altered gene sets and their p-values were lower than the corresponding p-values calculated by GSEA. Because PAGE uses normal distribution for statistical inference, it requires less computation than GSEA, which needs repeated computation of the permutated data set. PAGE was able to detect significantly changed gene sets from microarray data irrespective of different Affymetrix probe level analysis methods or different microarray platforms. Comparison of two aged muscle microarray data sets at gene set level using PAGE revealed common biological themes better than comparison at individual gene level.
PAGE was statistically more sensitive and required much less computational effort than GSEA, it could identify significantly changed biological themes from microarray data irrespective of analysis methods or microarray platforms, and it was useful in comparison of multiple microarray data sets. We offer PAGE as a useful microarray analysis method.
High-throughput technologies such as DNA microarrays and proteomics are revolutionizing biology and medicine. Global gene expression profiling using microarrays monitors changes in expression of thousands of genes simultaneously. At the data acquisition level, gene expression profiles from a given system should be reproducible and yield statistically significant changes in gene expression . The large amounts of data acquired must then be reduced or "translated" to a smaller set of genes representing meaningful biological differences between control and test systems and validated in an experimental or clinical setting . Since inception of the microarray technology, significant technological and analytical improvements have been introduced to meet these challenges, from experimental design , probe-level analysis of oligonucleotide chips [3,4], data normalization , statistical analysis , clustering techniques [7-9], to various data mining tools [10-12]. A large number of studies used microarrays successfully to discern changes in gene expression patterns either in well defined cellular populations responding to a specific stimulus in vitro [13,14] or in complex clinical settings such as cancer and neurological diseases [15-17].
While individual microarray studies can be highly informative, it is generally difficult to compare independently obtained data sets addressing the same biological problem , regardless of whether the same or different microarray platform was used [19-23]. The poor congruence of cross-study comparisons was attributed to incorrectly annotated probes, non-sequence overlapping but Unigene-matching probes, variation in experimental conditions, and actual biological variations among different clinical or experimental materials used [19-23]. Several new bioinformatics programs have attempted to circumvent the variability between different published data sets at the data acquisition level by comparing gene expression results for coordinate changes in biological themes , for similarity of significance values for each gene obtained through meta-analysis methods , and for reproducible gene expression patterns revealed by integrative correlation statistics . Each of the methods revealed some congruency among the data sets analyzed [12,15,24] indicating that results from various transcriptional profiling studies can eventually be integrated for better general definition of normal and disease-related processes.
Another challenge of microarray data analysis is that the majority of genes in any genome-wide transcriptional profile data set are excluded from consideration because they show only subtle changes in expression. This problem was recently addressed in a gene expression profiling study of human diabetic muscles, in which no single gene (out of over 20,000) showed significant difference in expression between control and patients groups . Assuming that gene expression changes can be detected at the level of co-regulated gene sets rather than individual genes, the authors devised a new analytical tool, GSEA, that tested predefined gene sets for association with disease phenotypes . GSEA successfully detected oxidative phosphorylation as a biological theme that coordinately changed in diabetic muscles . Although subsequent analysis suggested that the statistical tools used in GSEA may be biased toward assigning higher enrichment scores to gene sets of large size , the program significantly expands the potential for discovery of important process or disease related genes in a given microarray data set.
In the present work we describe a modified gene set enrichment analysis strategy that improves analysis of minimally changed gene expression profiles. PAGE employs fold change between experimental groups or other parametric data to calculate Z scores of predefined gene sets and use normal distribution to infer statistical significance of gene sets. We show here that PAGE has several advantages over GSEA and is useful in comparison of multiple microarray data sets.
According to the Central Limit Theorem in statistics, the distribution of the average of randomly sampled n observations tends to follow normal distribution as the sampling size n becomes larger, even when the parent distribution from which the average is calculated is not normal. The distribution of the average of randomly sampled observations has the same mean as the parent distribution and its variance is equal to the variance of the parent divided by the sampling size . In other words, when the mean and variance of the parent distribution (whether it is normally distributed or not) are μ and σ2 the average of n observations from the parent distribution will follow a normal distribution of mean μ and variance σ2/n when the sampling size n is large enough. In PAGE, the parent distribution is a distribution of any numerical values (also termed parameters here) that describe differential expression of genes among samples in a microarray data set. Usually, the values are a fold change for an individual gene between two experimental groups or they can be a correlation coefficient between clinical indices and individual gene expression values in a microarray data set. In most cases, the distribution of a parameter, i.e., a fold change values for all genes in a gene set between two experimental groups, is not normally distributed. However, as the Central Limit Theorem states, when we sample n observations from the parent distribution of a parameter, the average of the sampled observations tends to follow the normal distribution as our sampling size n becomes larger. Here, we define sampled observations as expression values for randomly chosen individual genes within pre-defined gene sets, which may be any randomly chosen groups of genes, groups of genes representing close family members with similar functions, genes in the same biological pathway, and so on. If we define a gene set of sufficiently large size, we can use the normal distribution to test the statistical significance of that gene set.
To determine the minimal gene set size m, we first examined the distribution pattern of several microarray data sets. We used fold change between two experimental groups as a parameter and observed the distribution of fold change values in a microarray data set. As an example, we show the distribution of fold change values from microarray data set that compared gene expression of diabetic muscles with that of normal control muscles  (Fig. (Fig.1).1). The histogram of fold change values (Fig. (Fig.1A)1A) and quantile-quantile plot of fold change values against standard normal distribution (Fig. (Fig.1B)1B) suggested that fold change values were not normally distributed. The null hypothesis that the distribution of fold change values was normal was rejected by Kolmogorov-Smirnov normality test (D = 0.08, p-value < 2.2e-16).
Subsequently, we sampled m genes from the parent population of fold change values, calculated the average of the sampled m observations, and observed the distribution pattern of the average values. We began from sampling size two, incremented by one until the sampling size m became 50. As expected, as the sampling size m increased, we found that the distribution of the average of sampled observations became closer to normal. As an example, we show the distribution of the average with sampling size of 10 (Fig. (Fig.1C1C and and1D).1D). Both histogram (Fig. (Fig.1C)1C) and quantile-quantile plot against standard normal distribution (Fig. (Fig.1D)1D) suggested that the distribution of the average of 10 observations was close to normal and the null hypothesis that this distribution was normal was not rejected by Kolmogorovo-Smirnov normality test (D = 0.0239, p-value = 0.1783). Based on these observations, we set the minimal gene set size as 10. We performed the same analysis with several microarray data sets using diverse data processing procedures and found similar results (data not shown).
GSEA calculates an enrichment score (ES) for a given gene set using rank of genes and infers statistical significance of each ES against ES background distribution calculated by permutation of the original data set. In contrast, PAGE calculates a Z score for a given gene set from a parameter such as fold change value between two experimental groups and infers statistical significance of the Z score against standard normal distribution.
To compare the statistical sensitivity of PAGE with that of GSEA, we analyzed human diabetic muscle microarray data sets used in the initial description of GSEA . We calculated fold change values between diabetic muscles and control muscles as described above, calculated Z scores and corresponding p-values for each gene set, and compared these parameters with enrichment scores and corresponding p-values available as supplementary information (Table (Table1).1). We found that both PAGE and GSEA detected OXPHOS_HG-U133A as the most significantly changed gene set; however, the statistical significance of PAGE detection of this gene set was much greater, p = 1 × 10-11 versus p = 0.003. PAGE and GSEA ranked the next three gene sets, human_mitoDB_6_2002_HG-U133A, mitochondr_HG-U133A, and MAP00190_Oxidative_Phosphorylation, as the second through fourth significant gene sets and again, the statistical power of PAGE analysis (the p-value) was greater than that of GSEA (Table (Table1).1). Overall, PAGE detected seven gene sets as statistically significant at p < 0.05 in this data set whereas GSEA detected only one gene set at this significance. For all gene sets, p-values obtained by PAGE analysis PAGE were generally smaller than p-values of corresponding gene sets obtained by GSEA.
We extended the comparison of PAGE and GSEA to additional data sets including gene expression profiles of young and aged muscles from males (GDS 287) and females (GDS 472), or dermal fibroblasts subjected or not subjected to oxidative stress (GDS 963n1). Unlike the analysis shown in Table Table1,1, we applied GSEA here as a multiple comparison testing tool  to directly compare the ability of both programs to detect multiple significant gene sets. The results of these analyses are shown in Additional files 1, 2, 3, 4. In data set GDS 472, PAGE detected 15 out of the first 30 gene sets as significantly up-regulated versus 1 out of 30 for GSEA (see Additional file 1). The corresponding numbers in GDS 287 were 26 of 30 for PAGE and 12 of 30 for GSEA (see Additional file 2). In data set GDS 963n1, both PAGE and GSEA detected 14 significant gene set out of 32 (see Additional file 3). It should be noted that the results in Additional files 1, 2, 3 were ranked by the GSEA NE scores. Although in many cases the same gene sets were identified as significant by both programs, PAGE generally detected a larger number of significant gene sets than GSEA across the entire range of GSEA NES. This is illustrated by expressing the results of the analysis shown in Additional file 3 with gene set rankings by the PAGE Z score (see Additional file 4). These results further demonstrate the utility of PAGE for sensitive detection of significantly altered biological pathways in various publicly available microarray data sets.
To further compare statistical robustness of PAGE and GSEA for detection of minimally changed gene sets we performed a simulation study using 10 hypothetical "experimental" and 10 "control" data sets, each containing expression values for 2,000 genes that were randomly chosen from standard normal distribution curve (see Additional file 5). Of the 2000 genes, 20 were designated as a hypothetical gene set of interest. The method and parameters of this simulation are described in the legend to Additional file 5. The simulation indicates that PAGE is more statistically sensitive than GSEA, being able to detect the test gene set as significant when the mean difference between the "experiment" and "control" was as small as 0.25 (see Additional file 5).
For a given microarray data set, use of different methods of data preprocessing, array normalization, and statistical inference can lead to different end results [3,4,29]. We tested the general applicability of PAGE to different Affymetrix probe analysis programs using the Duchenne muscular dystrophy (DMD) data set GDS 563 which contains 29 CEL files available from Gene Expression Omnibus website. Starting with 11 control and 23 DMD sets, we calculated expression values by MAS5 , MBEI , and RMA  programs, logarithm transformed expression values calculated by MAS5 and MBEI by base two, determined fold changes between two groups, and performed PAGE with pathway gene sets. With a cut-off p-value of < 0.05, PAGE identified eight significantly impaired gene sets with MAS5 and RMA platforms and six suppressed gene sets with MBEI platform, although the next three gene sets with lower significance derived from MBEI analysis were the same as those identified with MAS5 or RMA platforms (Table (Table2).2). For significantly induced gene sets, all three methods identified identical five gene sets (Table (Table22).
We then tested whether PAGE could detect common biological themes from microarray data sets produced using different microarray platforms. We analyzed breast cancer cell line experiment GSE 1299 in which three different arrays (U133A, U95A, and Agilent Human cDNA) were employed with the same RNA to analyze platform dependency of microarray data . We programmed PAGE to identify pathway gene sets in each of the data sets obtained by three microarray platforms used in the experiment. Among significantly down-modulated pathways, two pathways (gap_junction proteins-connexins and cholesterol_biosynthesis) were identified by PAGE as common to all three microarray platforms (Table (Table3)3) and for up-regulated gene sets, Krebs_TCA was detected with all three microarray platforms (Table (Table33).
Comparison of different microarray data sets dealing with similar biological questions often poses a problem of poor congruency among data sets when compared at gene level. We tested whether comparing at gene set level was better than at individual gene level to reveal congruency among different microarray data sets.
We compared two microarray data sets, GDS 287 and GDS 472, produced by the same authors using the same microarray platform Affymetrix U133A (Fig. (Fig.22 and Table Table4).4). The data set GDS 287 records differential gene expression of muscles of young and old aged males, and GDS 472 records differential gene expression of muscles of young and old aged females. We analyzed both data sets in the same manner, selected significantly changed genes (|fold change| > 1.5 and t-test p-value < 0.05), and found the percentage of genes common to both data sets. Only 12.4% of significantly up-regulated genes occurred in both data sets and 4.4% of significantly down-regulated genes occurred in both data sets (Fig. (Fig.2A).2A). In contrast, when we compared both data sets at gene set level, 62.5% of significantly up-regulated gene sets occurred in both data sets and 49.6% of significantly down-regulated gene sets occurred in both data sets (Fig. (Fig.2B).2B). Actually, gene set level comparison correctly pointed out that energy metabolism such as electron transport, tricarboxylic acid cycle, and glycolysis was impaired and genes involved in mRNA processing and cell cycle regulation were up-regulated in both old aged male (GDS 287) and old aged female (GDS 472) data sets (Table (Table44).
Our initial objectives in developing PAGE were to increase the statistical power of the existing gene set enrichment program for analysis of subtle changes in microarray data and to simplify the laborious computational process involved. We designed PAGE as a parametric statistical test that uses normal distribution to infer the statistical significance of Z scores calculated from actual numerical parameters such as fold change between two experimental groups. Distribution-free, non-parametric methods such as the ones used in GSEA [25,30] make no assumptions about variability or the form of the population distribution and are useful when the population distribution is not normal or unknown. However, because non-parametric tests use ranks instead of measured values, they tend to be less powerful, informative, and flexible than corresponding parametric tests .
As described earlier, the theoretical basis for using normal distribution in PAGE is Central Limit Theorem, which states that when sampling size n is large enough, distribution of an average of sampled observations is normal regardless of the nature of parent distribution. In statistics, sampling size of 30 is generally sufficient, although the actual sampling size fulfilling Central Limit Theorem is dictated by how parent distribution is close to normal distribution. In our case, sampled observations were fold changes in expression of randomly chosen genes in a microarray data set grouped into pre-defined randomly chosen gene sets. We found that sampling size of 10 was sufficient for demonstrating close to normal distribution of averages of fold changes of constituent genes in gene sets as inferred by several normality tests including Kolmogorov-Smirnov (Fig. (Fig.1C),1C), Anderson-Darling and Cramer-von Mises (data not shown). In our opinion, the reason that normality analysis of microarray data sets can be performed with a much smaller sampling size (10) than generally required is because parent population of parameters, i.e., fold changes of all genes being compared in microarray data sets, is already somewhat close to normal. Indeed, most fold change values lie in the center position of the distribution and the proportion of significantly changed genes decreased along the axis to both directions (Fig. (Fig.1A1A).
It is clear that the statistical tools used in PAGE direct the program to analysis of pre-defined gene sets in microarray data sets rather than individual genes. This design was intentional. Regardless of the experimental paradigm, the majority of the cellular transcripts analyzed for differential expression on genome-wide microarray chips such as the Affymetrix 133A/B show statistically insignificant changes. For example, our analysis of gene expression profile of HIV-1-infected astrocytes on U133A/B detected about 740 different transcripts with fold changes of > 2 or < -2 and p ≤ 0.05 . This result also means that the signals obtained with over 40,000 other probes on the chips in these experiments were not considered as significant. Thus, many potentially relevant but subtle changes in biological systems may not be readily detectable by individual gene analysis of differentially expressed gene lists. PAGE, like GSEA , attempts to resolve this problem by utilizing the phenomenon of gene co-regulation. In complex biological systems, many genes belonging to the same family and performing similar functions or genes acting in the same biological pathway are co-regulated. Conversely, in disease states, these genes may be coordinately dysregulated. Characterization of gene co-regulation (or co-dysregulation) under different physiological and pathological conditions is an important research problem that can now be approached by bioinformatics tools . The assumption behind the gene set enrichment concept is that the statistical significance of coordinated changes in a set of co-regulated genes will be greater than that for individual genes in the set. This assumption was at least in part validated by applications of GSEA and seems to be borne out for PAGE as well (this work). In fact, we consider PAGE (as well as GSEA) not only as a program for detecting correlations between experimental conditions and changes in behavior of known gene sets containing co-regulated genes, but also as a tool for intentional search for novel gene co-regulation (or co-dysregulation) in microarray data sets as part of testable hypotheses.
It may be considered paradoxical to apply a statistical test based on normal distribution to an explicit goal of detecting sets of co-regulated, that is, interdependent genes. The normal distribution paradigm requires that sampled observations are independent and identically distributed, or IID. However, we would like to argue that gene dependency caused by co-regulation in a given microarray data set should be regarded as rare, and thus statistically significant. In developing this program, we started with a basic assumption, a null-hypothesis, that all genes in a given microarray data set are independent of each other and identically distributed, that is, they are not co-regulated. With given gene sets as testable hypotheses, we then tested whether there is a significant shift of behavior of genes as a group. When we observed a significant change in a given gene set, we rejected the null-hypothesis and concluded that those genes in a gene set are co-regulated and dependent on each other. We found that with the statistical tools we used, we matched and in most cases exceeded the ability of GSEA to detect co-regulated genes.
In a direct comparison of PAGE and GSEA using published data bases, the p-values of PAGE were lower than the respective p values obtained by GSEA and as a result, the number of gene sets that can be considered significantly changed was larger (Table (Table11 and see Additional files 1, 2, 3, 4). Similar results were obtained in an extensive simulation study (see Additional file 5). This confirms that similar to other applications  the parametric statistical test is more powerful than the non-parametric method when applied to gene set enrichment analysis Two other features of PAGE facilitate the computational process involved in running the program. First, because PAGE uses standard normal distribution as a background distribution, there is no need for the preceding permutation step required for this calculation in GSEA . This reduces computation time at least 1,000 times when one performs 1,000 permutation of data set to get a background distribution. Secondly, the Z score of PAGE is two-tailed, showing gene sets of both increased and decreased expression in a single analysis. In one-tailed programs [11,12,25], the entire process from ranking gene lists to class permutation and statistical inference must be repeated after analysis in one direction. Thus PAGE is a statistically powerful gene set enrichment analysis tool with features that decrease the computational burden of such programs and increase the amount of information obtained per one analysis.
With wider availability of the gene microarray technology, there is an exponential increase in publicly accessible microarray data bases obtained on different platforms, by different laboratories, and addressing a variety of biological questions. A number of increasingly advanced data analysis tools have been developed to begin to compare and integrate this diverse and often incompatible information, including programs to identify biological themes instead of differentially expressed gene lists [12,25] or programs which identify significant genes displaying consistent changes across biologically different systems [15,24]. Each approach was shown to lead to better congruency among diverse data sets than would be achieved by direct comparison of data sets, whether in demonstrating common transcriptional profiles of prostate cancer , common molecular markers of lung cancer , or a common biological theme in diabetic muscle .
We have found that PAGE also can be applied to integrative data analysis across various microarray platforms and biological systems. As with GSEA  or EASE , the key to PAGE utility for this purpose is the ability of the program to compare microarray data sets for gene sets rather than individual genes. Our results indicate that PAGE works well with different probe level analysis methods (Table (Table2)2) and different microarray platforms (Table (Table3),3), in each case being able to identify several common biological pathways in the same starting material tested irrespective of the platform or primary analytical method used. Gene set analysis by PAGE was also far more discriminatory than individual gene analysis in finding common biological pathway changes in different microarray data sets generated to address the same biological question, the difference between young and aged muscles (Fig. (Fig.22 and Table Table4).4). Another feature of PAGE that is useful for comparison of multiple microarray data sets is the Z score. Z score is a normalized and linear-scale value which is microarray platform independent and which is convenient to use as an input for subsequent analysis. It is possible to generate a data matrix containing Z scores of pre-defined gene sets of multiple data sets obtained on different microarray platforms and then perform cluster analysis to identify gene sets of specific interest or to identify relationships among data sets. We applied this approach recently to cluster analysis of multiple microarray data sets of macrophages infected with bacteria, protozoa, HIV-1, or treated with cytokines, and identified gene sets that were specifically changed in HIV-1-infected cells (S.-Y. Kim and M.J. Potash, unpublished), suggesting that the Z score system of PAGE will be useful for asking broad biological questions.
The increasing use of microarrays comparing a carefully selected baseline to transformed tissue, differentiated tissue, or pathogen-infected tissue among others is creating a truly global database of gene expression profiles. Integrative analysis of this immense amount of information by programs such as EASE , meta-analysis of microarrays , or GSEA  begins to discern general patterns governing fundamental and disease-related biological processes. The program described here, parametric analysis of gene set enrichment analysis or PAGE is statistically sensitive and requires less computation than many other programs. PAGE identified significantly changed biological themes from microarray data set irrespective of microarray data analysis methods or microarray platforms. PAGE identified more common biological themes in different microarray data sets addressing the same biological problem than analysis of individual gene level. Finally, the Z score of PAGE is a normalized, linear scale value that can be used in subsequent meta-analysis. We offer PAGE as a useful microarray analysis tool.
The microarray data set, GSEA results for NGT versus DM2, and probe sets corresponding to gene sets of Mootha et al.  were downloaded from the authors' website . Microarray data sets of the muscle function and aging studies (GDS 287 and GDS 472) [36,37], Duchenne muscular dystrophy (DMD) study (GDS 563) and breast cancer cell line experiment (GSE 1299)  were downloaded from Gene Expression Omnibus  website.
Each Affymetrix microarray data set was normalized to have mean expression value of 1,000. Expression values below 100 were floored to 100 and then all expression values were transformed by logarithm base two. Fold change between two experimental groups was measured to identify differentially expressed genes and unpaired t-test was used to infer statistical significance of differentially expressed genes.
We created predefined gene sets for major mammalian Affymetrix platforms (human: U133A, U95A, HuFL6800; mouse: M74Av2; and rat: U34) from corresponding Affymetrix annotation files  with gene ontology (GO) biological processes, GO cellular components, GO molecular functions, and pathways as the main categories.
Z score for each gene set was calculated as follows. First, from input data containing fold change values for each genes between two experimental groups, mean of total fold change values (μ) and standard deviation of total fold change values (δ) of a given microarray data set were calculated. Then, when the mean of fold change values of genes for a given gene set was Sm and the size of a given gene set was m, the Z score was calculated as Z = (Sm – μ)*m1/2 / δ
The statistical tools within the Microsoft Excel program were used to calculate p-values from Z scores. We used an R package nortest  to test whether a given distribution is different from standard normal distribution or not. The R statistical programming language  was used for general statistical analysis and computing.
• Project name: PAGE
• Project home page: In construction
• Operating system: Platform independent
• Programming language: Python, Windows-executable through py2exe 
• Other requirements: None
• License: None
• Any restrictions to use by non-academics: None
• The Python script PAGE is available from SYK upon request
• The beta version of GSEA is now available at 
SYK designed and implemented PAGE method, performed the bioinformatics analyses, and drafted the manuscript. DJV supervised the development of PAGE method and wrote the manuscript. All authors read and approved the final manuscript.
Comparison of GDS 472 by PAGE and GSEA: Ranking by GSEA.
Comparison of GDS 287 by PAGE and GSEA: Ranking by GSEA.
Comparison of GDS 963n1 by PAGE and GSEA: Ranking by GSEA.
Comparison of GDS 963n1 by PAGE and GSEA: Ranking by PAGE.
Simulation study comparing statistical sensitivity PAGE and GSEA.
The authors thank Drs. Mary Jane Potash and Andrew I. Brooks for critical reading of the manuscript and Mrs. Ilene M. Totillo for secretarial assistance. We also thank anonymous reviewers for critical and helpful comments to improve our manuscript. This work was supported by grants NS31492 and DA17618 from the National Institutes of Health.