Our initial objectives in developing PAGE were to increase the statistical power of the existing gene set enrichment program for analysis of subtle changes in microarray data and to simplify the laborious computational process involved. We designed PAGE as a parametric statistical test that uses the normal distribution to infer the statistical significance of Z scores calculated from actual numerical parameters such as the fold change between two experimental groups. Distribution-free, non-parametric methods such as the ones used in GSEA [
25,
30] make no assumptions about variability or the form of the population distribution and are useful when the population distribution is not normal or is unknown. However, because non-parametric tests use ranks instead of measured values, they tend to be less powerful, less informative, and less flexible than the corresponding parametric tests [
31].
As described earlier, the theoretical basis for using the normal distribution in PAGE is the Central Limit Theorem, which states that when the sample size n is large enough, the distribution of the average of sampled observations is approximately normal regardless of the nature of the parent distribution. In statistics, a sample size of 30 is generally considered sufficient, although the actual sample size needed to satisfy the Central Limit Theorem depends on how close the parent distribution is to the normal distribution. In our case, sampled observations were fold changes in expression of randomly chosen genes in a microarray data set grouped into pre-defined randomly chosen gene sets. We found that a sample size of 10 was sufficient for demonstrating a close to normal distribution of averages of fold changes of constituent genes in gene sets, as inferred by several normality tests including Kolmogorov-Smirnov (Fig. ), Anderson-Darling and Cramér-von Mises (data not shown). In our opinion, normality analysis of microarray data sets can be performed with a much smaller sample size (10) than generally required because the parent population of parameters, i.e., the fold changes of all genes being compared in microarray data sets, is already somewhat close to normal. Indeed, most fold change values lie in the center of the distribution, and the proportion of significantly changed genes decreases along the axis in both directions (Fig. ).
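The sample-size argument above can be illustrated with a small simulation. The following is only a sketch under stated assumptions, not the analysis behind the figures: the parent population is a hypothetical, heavier-tailed mixture of two normal distributions standing in for log fold changes, and the function names (`gene_set_means`, `ks_statistic`) are introduced here for illustration. It draws random gene sets of size m, averages their fold changes, and measures the Kolmogorov-Smirnov distance between those averages and the normal distribution predicted by the Central Limit Theorem.

```python
import math
import random
import statistics


def gene_set_means(fold_changes, set_size, n_sets, rng):
    """Mean fold change of n_sets gene sets drawn at random without replacement."""
    return [statistics.fmean(rng.sample(fold_changes, set_size))
            for _ in range(n_sets)]


def ks_statistic(values, mu, sigma):
    """One-sample Kolmogorov-Smirnov distance between values and N(mu, sigma)."""
    reference = statistics.NormalDist(mu, sigma)
    ordered = sorted(values)
    n = len(ordered)
    distance = 0.0
    for i, x in enumerate(ordered):
        cdf = reference.cdf(x)
        # Compare the reference CDF with the empirical CDF just after and just before x.
        distance = max(distance, (i + 1) / n - cdf, cdf - i / n)
    return distance


rng = random.Random(0)

# Hypothetical parent population (an assumption, not real microarray data):
# most genes change little, a minority change strongly.
parent = [rng.gauss(0.0, 0.3) if rng.random() < 0.9 else rng.gauss(0.0, 1.5)
          for _ in range(20000)]
mu = statistics.fmean(parent)
sigma = statistics.pstdev(parent)

ks_by_size = {}
for m in (1, 10, 30):
    means = gene_set_means(parent, m, 2000, rng)
    # Under the Central Limit Theorem the means are roughly N(mu, sigma / sqrt(m)).
    ks_by_size[m] = ks_statistic(means, mu, sigma / math.sqrt(m))
    print(f"gene set size {m:2d}: KS distance {ks_by_size[m]:.3f}")
```

In this sketch the distance from normality shrinks as the gene set size grows; how quickly it shrinks depends on how far the chosen parent population is from normal, which is the point made in the paragraph above.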
It is clear that the statistical tools used in PAGE direct the program toward the analysis of pre-defined gene sets in microarray data sets rather than individual genes. This design was intentional. Regardless of the experimental paradigm, the majority of the cellular transcripts analyzed for differential expression on genome-wide microarray chips such as the Affymetrix U133A/B show statistically insignificant changes. For example, our analysis of the gene expression profile of HIV-1-infected astrocytes on U133A/B detected about 740 different transcripts with fold changes of > 2 or < -2 and p ≤ 0.05 [
32]. This result also means that the signals obtained with over 40,000 other probes on the chips in these experiments were not considered significant. Thus, many potentially relevant but subtle changes in biological systems may not be readily detectable by individual gene analysis of differentially expressed gene lists. PAGE, like GSEA [
25], attempts to resolve this problem by utilizing the phenomenon of gene co-regulation. In complex biological systems, many genes belonging to the same family and performing similar functions or genes acting in the same biological pathway are co-regulated. Conversely, in disease states, these genes may be coordinately dysregulated. Characterization of gene co-regulation (or co-dysregulation) under different physiological and pathological conditions is an important research problem that can now be approached by bioinformatics tools [
33]. The assumption behind the gene set enrichment concept is that the statistical significance of coordinated changes in a set of co-regulated genes will be greater than that for individual genes in the set. This assumption was at least in part validated by applications of GSEA and seems to be borne out for PAGE as well (this work). In fact, we consider PAGE (as well as GSEA) not only as a program for detecting correlations between experimental conditions and changes in behavior of known gene sets containing co-regulated genes, but also as a tool for intentional search for novel gene co-regulation (or co-dysregulation) in microarray data sets as part of testable hypotheses.
It may be considered paradoxical to apply a statistical test based on the normal distribution to the explicit goal of detecting sets of co-regulated, that is, interdependent genes. The normal distribution paradigm requires that sampled observations are independent and identically distributed (IID). However, we argue that gene dependency caused by co-regulation in a given microarray data set should be regarded as rare, and thus statistically significant. In developing this program, we started with a basic assumption, a null hypothesis, that all genes in a given microarray data set are independent of each other and identically distributed, that is, they are not co-regulated. With given gene sets as testable hypotheses, we then tested whether there is a significant shift in the behavior of genes as a group. When we observed a significant change in a given gene set, we rejected the null hypothesis and concluded that the genes in that gene set are co-regulated and dependent on each other. We found that with the statistical tools we used, we matched and in most cases exceeded the ability of GSEA to detect co-regulated genes.
In a direct comparison of PAGE and GSEA using published databases, the p values obtained by PAGE were lower than the respective p values obtained by GSEA and, as a result, the number of gene sets that can be considered significantly changed was larger (Table and see Additional files
1,
2,
3,
4). Similar results were obtained in an extensive simulation study (see
Additional file 5). This confirms that, as in other applications [
31], the parametric statistical test is more powerful than the non-parametric method when applied to gene set enrichment analysis. Two other features of PAGE facilitate the computational process involved in running the program. First, because PAGE uses the standard normal distribution as a background distribution, there is no need for the preceding permutation step required for this calculation in GSEA [
25]. This reduces computation time by a factor of at least 1,000 compared with performing 1,000 permutations of the data set to obtain a background distribution. Second, the Z score of PAGE is two-tailed, showing gene sets with both increased and decreased expression in a single analysis. In one-tailed programs [
11,
12,
25], the entire process from ranking gene lists to class permutation and statistical inference must be repeated after analysis in one direction. Thus, PAGE is a statistically powerful gene set enrichment analysis tool with features that decrease the computational burden of such programs and increase the amount of information obtained per analysis.
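The two computational points above can be made concrete with a short sketch. It assumes the Z score takes the usual Central Limit Theorem form Z = (Sm - mu) * sqrt(m) / sigma, where mu and sigma are the mean and standard deviation of the fold changes of all genes in the data set, and Sm is the mean fold change of the m genes in a gene set. The function names, the use of the population standard deviation, and the toy numbers are assumptions made for illustration; this is not the PAGE program itself.

```python
import math
import statistics


def page_z_score(all_fold_changes, gene_set_fold_changes):
    """Z = (Sm - mu) * sqrt(m) / sigma for one gene set (assumed form)."""
    mu = statistics.fmean(all_fold_changes)
    sigma = statistics.pstdev(all_fold_changes)  # population SD is an assumption
    m = len(gene_set_fold_changes)
    set_mean = statistics.fmean(gene_set_fold_changes)
    return (set_mean - mu) * math.sqrt(m) / sigma


def two_tailed_p(z):
    """Two-tailed p value of z under the standard normal distribution."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)); no permutation is needed.
    return math.erfc(abs(z) / math.sqrt(2.0))


# Toy fold changes for every gene in a data set (made-up values).
all_fold_changes = [-2.0, -1.0, 0.0, 1.0, 2.0] * 4

# Hypothetical gene sets: one shifted up, one shifted down.
gene_sets = {
    "up_set": [1.0, 2.0],
    "down_set": [-2.0, -1.0],
}

for name, values in gene_sets.items():
    z = page_z_score(all_fold_changes, values)
    print(f"{name}: Z = {z:+.2f}, p = {two_tailed_p(z):.3f}")
```

Because the sign of Z carries the direction of change and the p value comes directly from the standard normal distribution, a single pass over the gene sets reports both increased and decreased sets.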
With the wider availability of gene microarray technology, there is an exponential increase in publicly accessible microarray databases obtained on different platforms, by different laboratories, and addressing a variety of biological questions. A number of increasingly advanced data analysis tools have been developed to begin to compare and integrate this diverse and often incompatible information, including programs to identify biological themes instead of differentially expressed gene lists [
12,
25] or programs which identify significant genes displaying consistent changes across biologically different systems [
15,
24]. Each approach was shown to lead to better congruency among diverse data sets than would be achieved by direct comparison of data sets, whether in demonstrating common transcriptional profiles of prostate cancer [
34], common molecular markers of lung cancer [
15], or a common biological theme in diabetic muscle [
25].
We have found that PAGE can also be applied to integrative data analysis across various microarray platforms and biological systems. As with GSEA [
25] or EASE [
12], the key to the utility of PAGE for this purpose is the ability of the program to compare microarray data sets at the level of gene sets rather than individual genes. Our results indicate that PAGE works well with different probe level analysis methods (Table ) and different microarray platforms (Table ), in each case identifying several common biological pathways in the same starting material irrespective of the platform or primary analytical method used. Gene set analysis by PAGE was also far more discriminatory than individual gene analysis in finding common biological pathway changes in different microarray data sets generated to address the same biological question, the difference between young and aged muscles (Fig. and Table ). Another feature of PAGE that is useful for comparison of multiple microarray data sets is the Z score. The Z score is a normalized, linear-scale value that is independent of the microarray platform and convenient to use as input for subsequent analysis. It is possible to generate a data matrix containing Z scores of pre-defined gene sets of multiple data sets obtained on different microarray platforms and then perform cluster analysis to identify gene sets of specific interest or to identify relationships among data sets. We recently applied this approach to cluster analysis of multiple microarray data sets of macrophages infected with bacteria, protozoa, or HIV-1, or treated with cytokines, and identified gene sets that were specifically changed in HIV-1-infected cells (S.-Y. Kim and M.J. Potash, unpublished), suggesting that the Z score system of PAGE will be useful for asking broad biological questions.
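The Z score matrix described above can be sketched as follows. The data set names, gene set names, and Z scores are invented placeholders, and the correlation-based distance is only one possible choice of metric; this is not the unpublished analysis mentioned above. The sketch builds a small matrix of gene set Z scores per data set and computes pairwise distances, which is the usual first step of a hierarchical cluster analysis.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Columns of the matrix: pre-defined gene sets (placeholder names).
gene_set_names = ["set_A", "set_B", "set_C", "set_D"]

# Rows of the matrix: one list of Z scores per data set (made-up values).
z_matrix = {
    "dataset_1": [4.1, -0.3, 2.2, -3.0],
    "dataset_2": [3.6, 0.1, 1.9, -2.4],
    "dataset_3": [-0.5, 3.8, -2.1, 0.4],
}

# Pairwise distance = 1 - Pearson correlation of the Z score profiles.
names = sorted(z_matrix)
distance = {
    (a, b): 1.0 - pearson(z_matrix[a], z_matrix[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
}

closest_pair = min(distance, key=distance.get)
print("most similar data sets:", closest_pair)
```

Because Z scores are on a common scale, profiles from different platforms can be compared directly in this way; the resulting distances could then be passed to any standard clustering routine.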