Pathway analysis and gene-set enrichment analysis are both widely used methods for identifying significant molecular expression patterns in high-throughput data [1]. Over the last decade, biological pathways have provided natural sources of molecular mechanisms for developing diagnosis, treatment, and prevention strategies for complex diseases [2]. Large and varied functional genomics data sets are analyzed more effectively with gene-set enrichment methods than with individual gene analysis [5]. Pathway analysis and molecular signature discovery continue to reveal associations between genotypes and phenotypes, which are simply called molecular profiling or molecular phenotypes. At present, researchers seek to combine pathway and gene-set enrichment approaches with network module-based approaches to identify crucial relationships among different molecular mechanisms [1].
As sources of prior knowledge about molecular mechanisms, biological pathway databases are heterogeneous, span multiple levels, and lack annotations [3]. Different pathway databases may yield divergent results from the same input data. When different databases yield similar results, applying multiple pathway data sources in a single analysis can provide a measure of validation. Unlike candidate pathway analysis, genome-wide pathway analysis does not require prior biological knowledge. In addition, genome-wide pathway analysis can reveal gene interactions across different diseases [3] and multiple pathways [3]. Other studies based on an online integrated Human Pathway Database (HPD) also provided associations between different pathways of diverse types, sizes, and sources [12] for specific phenotypes. Although these efforts have greatly improved the efficiency of pathway analysis, our knowledge of biological pathways is still far from complete.
Gene signature data at the transcriptome level offer a complementary source of information for completing pathway knowledge. In a recent review, Khatri et al. [1] categorized pathway analysis into three generations of approaches: first-generation "over-representation analysis" (ORA) approaches, second-generation "functional class scoring" (FCS) approaches, and third-generation "pathway topology" (PT) approaches. To overcome the limitations of ORA approaches (gene-level statistics), FCS approaches, such as gene-set enrichment analysis (GSEA) [6], were devised to include overall changes in gene expression in each pathway or gene set (pathway-level statistics). Third-generation approaches also include overall changes in gene expression, but they take pathway topology into account, that is, the upstream/downstream positions of genes within each pathway. Although these third-generation approaches were meant to change our understanding of the underlying mechanisms of pathways, they lack information necessary to achieve this: the interdependence between pathways. Annotated knowledge from the genome, transcriptome, post-transcriptome, and proteome levels can assist pathway and gene-set enrichment analysis.
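To make the first-generation ORA idea concrete, the following is a minimal sketch that assumes a one-sided hypergeometric test, a common choice for over-representation analysis but not one specified in this paper. The function name `ora_p_value` and the toy gene identifiers are hypothetical and serve only as illustration.

```python
from math import comb


def ora_p_value(universe_size, set_size, hits_total, hits_in_set):
    """Upper-tail hypergeometric probability P(X >= hits_in_set).

    universe_size: number of genes measured in the experiment
    set_size:      number of those genes that belong to the pathway/gene set
    hits_total:    number of differentially expressed genes
    hits_in_set:   number of differentially expressed genes inside the set
    """
    max_k = min(set_size, hits_total)
    denominator = comb(universe_size, hits_total)
    numerator = sum(
        comb(set_size, k) * comb(universe_size - set_size, hits_total - k)
        for k in range(hits_in_set, max_k + 1)
    )
    return numerator / denominator


# Toy data (hypothetical identifiers, not real genes).
universe = {f"g{i}" for i in range(20)}
pathway = {"g0", "g1", "g2", "g3", "g4"}
differentially_expressed = {"g0", "g1", "g2", "g10"}

p_value = ora_p_value(
    universe_size=len(universe),
    set_size=len(pathway & universe),
    hits_total=len(differentially_expressed),
    hits_in_set=len(differentially_expressed & pathway),
)
print(f"ORA p-value: {p_value:.4f}")
```

Because this test only counts how many selected genes fall inside a set, it ignores the magnitude of expression changes, which is the gene-level limitation that FCS approaches were designed to address.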
Multi-level, multi-scale, knowledge-guided enrichment analysis can enable molecular phenotype discovery for specific human diseases. Currently, prior knowledge acquisition and systems modeling pose challenges for developing tools that go beyond third-generation pathway analysis for disease-specific molecular profiling. Prior knowledge acquisition requires keeping up with updates and improving the available annotations with descriptive knowledge from multiple levels, especially information on the pathway microenvironment ("condition-, tissue-, and cell-specific functions of each gene") [1]. Systems modeling must take a systems biology view of the data to build multi-scale models, which can be used to generate hypotheses that give detailed and accurate predictions of changes in the system. Both aspects of this challenge can be addressed by building a database that not only contains disease-associated genes, transcription factors, proteins, and microRNAs, but also organizes their relationships within and between pathways, gene signatures, and any gene sets from existing experiments or papers.
To meet the new challenges of molecular phenotype discovery, we developed an integrated online database, the Pathway And Gene Enrichment Database (PAGED), to enable comprehensive searches for disease-specific pathways, gene signatures, microRNA targets, and network modules by integrating gene-set-based prior knowledge as molecular patterns from multiple levels: the genome, transcriptome, post-transcriptome, and proteome. The new database offers the following benefits to biological researchers. First, it includes disease-gene association data, curated and integrated from the Online Mendelian Inheritance in Man (OMIM) [14] database and the Genetic Association Database (GAD) [15]; therefore, it has the potential to assist human disease studies. Second, as of March 2012, it also contains all currently compiled gene signatures in the Molecular Signatures Database (MSigDB) [8] and the Gene Signatures Database (GeneSigDB) [7]. Third, it further integrates microRNA targets from the miRecords [16] database, signaling pathways, protein interaction networks, and transcription factor/gene regulatory networks, partially based on data integrated from the Human Pathway Database (HPD) [12] and the Human Annotated and Predicted Protein Interaction (HAPPI) [17] database. All gene sets or pathways are annotated with molecular interaction details whenever available. We integrated the following versions of the source databases: OMIM [14] (Feb. 2012), GAD [15] (Aug. 2011), GeneSigDB [7] (v. 4.0, Sept. 2011), MSigDB [8] (v. 3.0, Sept. 2010), HPD [12] (2009), HAPPI [17] (v. 1.4), and miRecords [16] (Nov. 2010), which are the latest versions available. An advantage of our work lies in its representation of relationships between pathways, gene signatures, microRNA targets, and/or network modules. These gene-set-based relationships can be visualized as a gene-set association network (GSAN), which provides a "roadmap" for molecular phenotype discovery for specific human diseases. Using colorectal cancer expression data analysis as a case study, we demonstrate how to query PAGED to discover crucial pathways, gene signatures, and gene network modules specific to colorectal cancer functional genomics.
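As a rough illustration of what a gene-set association network might look like computationally, the sketch below links two gene sets whenever their member genes overlap sufficiently. It assumes a Jaccard similarity with an arbitrary cutoff; this section does not state how PAGED scores associations, so the measure, the threshold, the function names `jaccard` and `build_gsan`, and the toy gene sets are all hypothetical.

```python
from itertools import combinations


def jaccard(genes_a, genes_b):
    """Jaccard similarity of two gene sets (0.0 when both are empty)."""
    union = genes_a | genes_b
    return len(genes_a & genes_b) / len(union) if union else 0.0


def build_gsan(gene_sets, min_similarity=0.1):
    """Return edges (name_a, name_b, score) for gene-set pairs that overlap.

    gene_sets maps a gene-set name (pathway, signature, microRNA target
    set, or network module) to the set of its member genes.
    min_similarity is an assumed cutoff, not a value taken from PAGED.
    """
    edges = []
    named_sets = sorted(gene_sets.items(), key=lambda item: item[0])
    for (name_a, genes_a), (name_b, genes_b) in combinations(named_sets, 2):
        score = jaccard(genes_a, genes_b)
        if score >= min_similarity:
            edges.append((name_a, name_b, score))
    return edges


# Toy gene sets with placeholder gene identifiers.
toy_gene_sets = {
    "pathway_A": {"G1", "G2", "G3", "G4"},
    "signature_B": {"G3", "G4", "G5"},
    "mirna_targets_C": {"G9", "G10"},
}

edges = build_gsan(toy_gene_sets)
for name_a, name_b, score in edges:
    print(f"{name_a} -- {name_b}: {score:.2f}")
```

In this toy input only `pathway_A` and `signature_B` share genes, so they form the single edge of the network, while `mirna_targets_C` remains an isolated node.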