A central problem in cell biology is to infer functional molecular modules underlying cellular alterations from high-throughput data such as differential gene, protein or metabolite concentrations. A number of computational techniques have been developed that use expression for class distinction to identify, from among a priori defined sets of functionally or structurally related genes, those that correlate with phenotypic difference (see, for example, Goeman and Buhlmann [1]). More sophisticated approaches have used random forests to capture nonlinear and complex information in expression profiles [2]; applied linear transformations to measure the discriminative information of genes [3]; and combined information from multiple assessments [4].
One of the most widely used methods, gene set enrichment analysis (GSEA) [5], ranks genes according to their differential expression and then uses a modified Kolmogorov-Smirnov statistic (weighted K-S test) as a basis for determining whether genes from a prespecified set (for example, Kyoto Encyclopaedia of Genes and Genomes (KEGG) pathways or Gene Ontology (GO) terms) are overrepresented toward the top or bottom of the list, correcting for false discovery when multiple sets are tested [6]. The central message of this paper is that discovery depends strongly on the type of correlation used, and we illustrate this point by elaborating on the biological implications of two different cancer data sets. GSEA quantifies enrichment with a weighted Kolmogorov-Smirnov statistic (WKS) whose weight is related to the correlation with phenotype, and which therefore essentially omits known network properties of gene sets. Here we take such properties into account, as explained below. We reserve the term WKS for describing GSEA, and refer to our method, which integrates topological information, as pathway enrichment analysis (PWEA), where a pathway is defined as a pair of nodes connected by an uninterrupted set of intervening nodes and edges, such as those found in protein-protein interaction networks, signal transduction networks, and metabolic pathways. In this paper we use KEGG pathways. Just as WKS represents a conceptual and practical improvement over the K-S test, we show in this paper that the inclusion of topological weighting is not only a conceptual change in enrichment analysis, but also a substantial practical improvement.
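To make the weighted K-S idea concrete, the running-sum statistic described above can be sketched as follows. This is a minimal illustration of the commonly described form of the GSEA statistic, not the implementation used in this paper: the function name `enrichment_score`, the exponent `p`, and the input layout are assumptions made for the example, and the permutation-based significance and false discovery correction are omitted.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Weighted K-S running-sum statistic over a ranked gene list.

    ranked_genes: genes ordered by decreasing differential expression
    scores:       dict mapping gene -> correlation with phenotype
    gene_set:     a priori defined set of genes (e.g. one KEGG pathway)
    p:            weighting exponent (assumed name); p = 0 gives the
                  classic unweighted K-S statistic
    """
    in_set = [g in gene_set for g in ranked_genes]
    # Genes in the set contribute |score|^p; genes outside contribute nothing here.
    weights = [abs(scores[g]) ** p if hit else 0.0
               for g, hit in zip(ranked_genes, in_set)]
    total_hit = sum(weights)
    n_miss = len(ranked_genes) - sum(in_set)
    if total_hit == 0 or n_miss == 0:
        raise ValueError("need weighted hits and at least one gene outside the set")

    running = 0.0
    best = 0.0  # signed maximum deviation of the running sum from zero
    for w, hit in zip(weights, in_set):
        # Step up by the normalized weight on a hit, down by 1/n_miss on a miss.
        running += w / total_hit if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


ranked = ["g1", "g2", "g3", "g4"]
corr = {"g1": 2.0, "g2": 1.0, "g3": -1.0, "g4": -2.0}
top_set_score = enrichment_score(ranked, corr, {"g1", "g2"})
bottom_set_score = enrichment_score(ranked, corr, {"g3", "g4"})
```

In this toy example a set concentrated at the top of the list yields a positive score and a set concentrated at the bottom yields a negative one. Any topology-aware weighting would enter through the per-gene weights, which is the point at which the methods discussed here differ.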
Several recently introduced techniques, including ScorePAGE [7], gene network enrichment analysis [8] and Pathway-Express [9], incorporate concepts of gene topology. ScorePAGE uses a topology-weighted cross-correlation of time-dependent (or condition-dependent) gene expression data to assign a significance value to a priori defined KEGG metabolic pathways. Gene network enrichment analysis first identifies a high-scoring transcriptionally affected sub-network from a global network of protein-protein interactions, and then identifies gene sets that are enriched in the sub-network using a Fisher test. Pathway-Express contains in its scoring function a term that increases the scores of genes that are directly connected to other differentially expressed genes, which in turn produces a higher overall score for predefined KEGG signaling pathways in which the differentially expressed genes are localized in a connected sub-graph. Other strategies that extract enriched functional submodules [10] or paths [12] from protein-protein interaction networks or other topological pathways without strict boundaries (that is, strategies that identify only a subset of networks without a priori gene set definition) also take advantage of the topology.
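As an illustration of the Fisher test step mentioned above for gene network enrichment analysis, the over-representation of a gene set in a selected sub-network can be computed as a one-sided hypergeometric tail probability. The sketch below is a generic version of that test under assumed inputs, not the code of the cited method; the function name `fisher_enrichment_p` and its parameters are hypothetical.

```python
from math import comb


def fisher_enrichment_p(n_universe, n_subnetwork, n_gene_set, n_overlap):
    """One-sided Fisher exact p-value for over-representation.

    Returns P(X >= n_overlap), where X is the number of gene-set members
    among n_subnetwork genes drawn without replacement from n_universe
    genes, of which n_gene_set belong to the gene set.
    """
    upper = min(n_subnetwork, n_gene_set)
    denom = comb(n_universe, n_subnetwork)
    # Sum the hypergeometric probabilities of all overlaps at least as large.
    tail = sum(
        comb(n_gene_set, k) * comb(n_universe - n_gene_set, n_subnetwork - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / denom


# Toy numbers: 10 genes in the network, a 3-gene sub-network,
# a 3-gene set, and all 3 set members inside the sub-network.
p_full_overlap = fisher_enrichment_p(10, 3, 3, 3)
p_any_overlap = fisher_enrichment_p(10, 3, 3, 0)
```

Exact integer binomials keep the example dependency-free; for genome-scale counts a library routine working in floating point or log space would be the more practical choice.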
Here we present a new and general method for incorporating disparate data into statistical methods used to infer functional modules from a class distinction metric. In order to fix ideas and compare with the most popular method, we use differential expression to distinguish phenotype and define a topological influence factor (TIF) to weight the K-S statistic. The TIF, however, can just as easily be used with other kinds of class distinctions as data become available, and with other kinds of statistics.
The contributions of this paper are both methodological and biological. The methodological contribution consists of including known correlations among the genes in a gene set in the weighting procedure. When the method is applied to cancer data sets, we find that the inclusion of longer-range correlations substantially improves sensitivity, with little or no loss of specificity. In particular, for colorectal cancer, PWEA and GSEA agree on 24 out of 25 pathways identified by GSEA, but PWEA identifies an additional 10 pathways, 8 of which, including oxidative metabolism of arachidonic acid, are supported by evidence from the literature. For small cell lung carcinoma, PWEA finds all 19 of the pathways identified by GSEA, and an additional 14 highly plausible pathways, including apoptosis, the MAPK signaling pathway, the Jak-STAT signaling pathway, and the GnRH signaling pathway.