An important step in the analysis and interpretation of gene expression, proteomic, metabolomic or transcription factor binding data is answering the question, ‘What biologically related sets of genes are enriched with the interesting genes/proteins/compounds identified in my experiment?’ Such analysis applied to gene expression data is often referred to as gene set, or functional, enrichment testing. Gene sets defined by Gene Ontology (GO) (Ashburner et al.; Harris et al.) or KEGG pathways (Kanehisa et al.) are often employed, and the statistical significance of enrichment can be established using Fisher's exact test, which is based on the hypergeometric distribution. Several web-based or downloadable tools performing this or a similar test have been developed, such as Onto-Express (Draghici et al.; Khatri et al.), DAVID/EASE (Dennis et al.; Hosack et al.), the GOstats package of Bioconductor (Gentleman, 2007), GOMiner (Zeeberg et al.) and FuncAssociate (Berriz et al.). A second research question, often viewed as separate from the first, concerns testing hypotheses of correlated signatures between disparate sources of biological knowledge. For example, are genes targeted by a specific microRNA more likely to be involved in a disease progression process than expected by chance? These two common types of research questions can be answered within the same analysis framework of gene set relation mapping.
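As an illustration of the enrichment test mentioned above, the following minimal Python sketch computes a one-sided hypergeometric p-value (equivalent to a one-sided Fisher's exact test) for the overlap between an experimental gene list and a single gene set. The function name, argument names and toy numbers are hypothetical and are not taken from any of the cited tools.

```python
from math import comb


def hypergeom_enrichment_p(universe_size, set_size, list_size, overlap):
    """Return P(X >= overlap) for a hypergeometric variable X.

    universe_size: number of genes that could have been selected
    set_size:      number of universe genes annotated to the gene set
    list_size:     number of genes in the experimental list
    overlap:       number of list genes that belong to the gene set
    """
    if max(set_size, list_size) > universe_size:
        raise ValueError("gene set and gene list must fit in the universe")
    if not 0 <= overlap <= min(set_size, list_size):
        raise ValueError("overlap is inconsistent with the set sizes")

    # Number of ways to draw list_size genes from the universe.
    total = comb(universe_size, list_size)
    # Sum the probability mass for every overlap at least as large as observed.
    upper = min(set_size, list_size)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy example (made-up numbers): 12,000 measured genes, a gene set of 150,
# an experimental list of 300 genes, 12 of which fall in the gene set.
p_value = hypergeom_enrichment_p(12000, 150, 300, 12)
print(f"enrichment p-value: {p_value:.3g}")
```

In practice the p-values for all tested gene sets would also be adjusted for multiple testing, which this sketch leaves out.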
While there is a plethora of tools for enrichment testing, few offer the level of visualization and interactivity that many biomedical researchers need to explore results. Gene set relation mapping is a technique that extends beyond enrichment testing and can enable wide-spanning exploratory analysis and hypothesis generation by visualizing relationships among concepts. In addition to testing the overlap between an experimental gene list and predefined gene sets (concepts), the significance of the overlap among all predefined gene sets (concepts) is assessed. Two concepts are related when they have significantly more genes in common than expected by chance, and these relationships can form a network. Testing among concepts allows one to visualize the networked relationships among concepts enriched with genes in an experimental dataset. For example, one may observe that the concepts enriched with one's data cluster into three distinct groups, each having previously unsuspected relationships between concepts from diverse concept types. Concept types represent data from different sources of biological knowledge, such as biological processes, microRNA target lists, chromosomal regions, or drug target lists. Another approach to visualizing gene set relations is to cluster genes versus enriched concepts in a heatmap view. This allows one to see at a glance which subset of genes is responsible for the enrichment of which concepts, in addition to visualizing which concepts are closely related (see Section 2.7). Together, the network and heatmap enable a more biologically comprehensible understanding of functional enrichment results.
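The idea of relating concepts through significant gene overlap can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used by any particular tool: the concept names, the universe size, the significance threshold `alpha` and the use of an unadjusted one-sided hypergeometric test are all hypothetical choices made for the example.

```python
from itertools import combinations
from math import comb


def overlap_p_value(universe_size, size_a, size_b, shared):
    """One-sided hypergeometric P(X >= shared) for the overlap of two sets."""
    total = comb(universe_size, size_b)
    tail = sum(
        comb(size_a, k) * comb(universe_size - size_a, size_b - k)
        for k in range(shared, min(size_a, size_b) + 1)
    )
    return tail / total


def concept_network(concepts, universe_size, alpha=0.05):
    """Return {(concept_a, concept_b): p} for every significantly related pair.

    concepts maps a concept name to its set of gene identifiers. Each edge
    of the returned network links two concepts that share more genes than
    expected by chance at the (assumed) threshold alpha.
    """
    edges = {}
    for name_a, name_b in combinations(sorted(concepts), 2):
        genes_a, genes_b = concepts[name_a], concepts[name_b]
        shared = len(genes_a & genes_b)
        p = overlap_p_value(universe_size, len(genes_a), len(genes_b), shared)
        if p < alpha:
            edges[(name_a, name_b)] = p
    return edges


# Toy concepts of different hypothetical types over a 20-gene universe.
toy_concepts = {
    "biological_process_X": {"g1", "g2", "g3", "g4", "g5"},
    "microRNA_targets_Y": {"g1", "g2", "g3", "g4", "g6"},
    "chromosomal_region_Z": {"g10", "g11", "g12", "g13", "g14"},
}
for pair, p in concept_network(toy_concepts, universe_size=20).items():
    print(pair, f"p = {p:.3g}")
```

The resulting edge list can be passed to any graph-drawing library to obtain the network view. A realistic analysis would also correct the pairwise p-values for multiple testing.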
One type of gene set relation mapping, referred to as molecular concept mapping, was implemented and incorporated in the software Oncomine (Rhodes et al.). It allows investigators to easily navigate the complex and diverse public domain gene expression knowledge relating to cancers through the use of data integration, manual curation, statistical analyses and visualization tools. This approach has led to important discoveries, particularly in research related to the progression of prostate cancer (Morris et al.). However, this initial gene set relation mapping lacked some key sources of biological information included here in ConceptGen, such as protein–protein interactions from sources other than HPRD, and metabolite information; it also relied on basic statistics for analysis of gene expression data, and restricted gene expression signatures to those related to cancer because of the program's cancer-related focus (see the comparison of selected functional enrichment testing software). DAVID/EASE (Dennis et al.) offers a different type of gene set relation mapping in which clusters of related gene sets are formed using kappa statistics; however, no visualization is offered, and one cannot observe connections between gene sets in different clusters.
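To make the kappa-based notion of relatedness concrete, the sketch below computes Cohen's kappa between two gene sets, treating membership of each gene in the universe as a binary rating. It is only an illustration of the statistic; the function name and toy data are hypothetical, and the exact procedure used by DAVID/EASE to form clusters is not reproduced here.

```python
def gene_set_kappa(set_a, set_b, universe):
    """Cohen's kappa for the agreement of two gene sets over a gene universe.

    Each gene in the universe is scored as present or absent in each set;
    kappa compares the observed agreement with the agreement expected by
    chance given the two set sizes.
    """
    universe = set(universe)
    a = set(set_a) & universe
    b = set(set_b) & universe
    n = len(universe)
    if n == 0:
        raise ValueError("universe must not be empty")

    both = len(a & b)             # genes present in both sets
    neither = n - len(a | b)      # genes absent from both sets
    observed = (both + neither) / n
    expected = (len(a) * len(b) + (n - len(a)) * (n - len(b))) / n**2
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Toy example: ten genes, two sets sharing three of their four members.
genes = range(10)
kappa = gene_set_kappa({0, 1, 2, 3}, {0, 1, 2, 4}, genes)
print(f"kappa = {kappa:.3f}")
```

Gene sets whose pairwise kappa exceeds a chosen cutoff could then be grouped into clusters; the cutoff and the grouping rule would be further assumptions.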
Comparison of selected functional enrichment testing software
Here we present a new web-based software application, ConceptGen, that may be used as a gene set enrichment and gene set relation mapping tool. It contains several sources of biological knowledge, offers multiple visualizations, and has a convenient user interface. In addition, we have performed gene-to-gene enrichment testing, which identifies closely related genes based on the significance of co-occurring concepts. This provides an additional viewpoint and database for addressing further questions. A similar approach to identifying related genes is used in the new Paralog Hunter tool in GeneDecks, which uses a variety of concept types from GeneCards to identify functional paralogs (Safran et al.; Stelzer et al.).
NCBI's Gene Expression Omnibus (GEO) data repository (Edgar et al.) offers a wealth of experimental data. As one concept type in ConceptGen, we have downloaded, processed and analyzed raw human Affymetrix data from GEO to create gene expression-based concepts covering a wide variety of expression signatures, from reactions to treatments, diseases and exposures to genotype, development and injury/infection. Taking such an unbiased approach to data inclusion allows one to identify previously unsuspected relationships among diverse biological perturbations and to generate novel hypotheses. The datasets can be expanded to utilize epigenomics, proteomics, metabolomics and microRNA inputs.