For a gene set, functional coherence is a measure of how strongly the functions associated with its genes are related; it can be used to differentiate a set of genes performing coherently related functions from a set of randomly grouped genes. It is commonly evaluated by analyzing the genes' functional annotations, which are almost invariably drawn from the controlled vocabulary of the Gene Ontology (GO; Ashburner et al.; Brown et al.; Huang et al.; Khatri and Drăghici, 2005; Mateos et al.). The vocabulary terms are organized as a graph, with concepts ranging from very general to very specific. They are used to annotate gene products by a variety of methods, including human curation, based on evidence from the literature. The annotations capture what is known about the biology. Functional coherence can be reflected in two aspects. The first is whether the genes share similar functions or participate in the same biological processes. For example, if a sufficient number of genes from a set are annotated with a common GO term, the annotation is considered to be ‘enriched’ and, therefore, the genes are deemed functionally coherent. The second is whether the distinct functions are related. The relationship between functional annotations can be either semantic or biological in nature. For example, if one gene is annotated with the term ‘regulation of apoptosis’ and another with the term ‘induction of apoptosis’, their functions can be considered coherent because the terms are semantically alike. Alternatively, annotations can be semantically distinct, e.g. ‘apoptosis’ and ‘electron transport chain’, yet their co-occurrence in a gene set can be biologically meaningful if many genes in the set participate in both processes. Ideally, a measure of functional coherence should take both aspects into account. To date, a method unifying the enrichment and relatedness aspects of functional coherence remains to be developed.
The first aspect of functional coherence, GO term enrichment, is usually evaluated with count-based methods that assess the probability of observing a GO term in a set by random chance, in order to determine whether an individual term is over-represented in the gene set. The widely used count-based methods are based on the hypergeometric distribution or similar probabilistic models (Cho et al.; Huang et al.; Khatri and Drăghici, 2005; Man et al.). The merits and limitations of this family of methods are well documented (Khatri and Drăghici, 2005; Zheng and Lu, 2007). Because count-based methods focus on individual terms, directly using their results, e.g. P-values, to assess overall coherence encounters the following difficulties: (i) schemes (ad hoc or sophisticated) need to be devised to combine the results of individual tests into a unified measure; (ii) the relationships among the terms are ignored because each annotation is treated independently; and (iii) multiple testing potentially leads to false-positive results and thus to a less reliable unified measure.
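To make the count-based approach concrete, the following is a minimal sketch of a hypergeometric over-representation test for a single term, followed by a Bonferroni adjustment as one simple way of handling multiple testing. The function names, parameter names and toy numbers are illustrative assumptions and are not taken from any of the cited tools.

```python
from math import comb


def hypergeom_enrichment_pvalue(N, K, n, k):
    """Upper-tail hypergeometric probability P(X >= k).

    N: number of genes in the background (assumed annotated universe)
    K: number of background genes annotated with the term
    n: number of genes in the gene set
    k: number of genes in the gene set annotated with the term
    """
    upper = min(K, n)
    total = comb(N, n)
    # Sum the probabilities of seeing k, k+1, ..., upper annotated genes.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def bonferroni(pvalues):
    """Bonferroni-adjust a dict of term -> P-value (one of several possible corrections)."""
    m = len(pvalues)
    return {term: min(1.0, p * m) for term, p in pvalues.items()}


# Toy example: all 5 genes of the set carry a term held by 5 of 10 background genes.
p = hypergeom_enrichment_pvalue(N=10, K=5, n=5, k=5)  # 1/252, about 0.004
adjusted = bonferroni({"term_a": p, "term_b": 0.6})
```

Note that each term is tested in isolation here, which is exactly difficulty (ii) above: the sketch carries no information about how `term_a` and `term_b` relate to one another.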
The second aspect, the relatedness among distinct annotations, has been investigated in several studies that utilized the directed acyclic graph (DAG) representation of the GO. A number of studies have used the ontology graph structure in the context of functional analyses; however, their specific purpose or the information they use often differs from those of the methods proposed in this study, making a direct comparison less meaningful. One theme is to find representative summary term(s) using the graph structure. For example, the lowest common ancestor terms have been used to find summarizing GO terms (Lee et al.). Making use of the topology of the GO graph, Alexa et al. devised several algorithms to identify representative GO terms and, further, to reweight the scores of the terms. Another theme is to quantify the semantic relationships among GO terms and derive statistics to assess their similarity. For example, the average of pairwise shortest paths between annotated terms has been used to develop both pairwise and group-level measures of gene set similarity (Ruths et al.; Wang et al.). Other authors have used semantic similarity to summarize the results of enrichment analyses (Xu et al.). Another measure of similarity, the total ancestry measure, was developed by Yu and co-workers (Yu et al.) to summarize the functional similarity of GO terms from a gene set. Furthermore, GO graph-based studies have been carried out to evaluate the functional coherence of gene sets via the integration of multiple data sources. Several methods make use of microarray expression data (Goeman and Mansmann, 2008; Kong et al.), while others use the biomedical literature associated with genes (Raychaudhuri and Altman, 2003; Zheng and Lu, 2007). However, none of these methods has considered both aspects of functional coherence, nor have they explicitly considered the relationships among GO terms co-annotating genes, which provide additional biological information.
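As an illustration of the path-based theme, the sketch below computes the average pairwise shortest-path length between a group of terms on a small toy ontology. The toy terms and their edges are hypothetical and do not reproduce the real GO structure, and treating the DAG as undirected is an assumption made here for simplicity; the cited studies may define distance differently.

```python
from collections import deque
from itertools import combinations

# Hypothetical toy ontology, child -> list of parents (not real GO edges).
TOY_PARENTS = {
    "root": [],
    "cell_death": ["root"],
    "apoptosis": ["cell_death"],
    "regulation_of_apoptosis": ["apoptosis"],
    "induction_of_apoptosis": ["regulation_of_apoptosis"],
    "metabolism": ["root"],
    "electron_transport_chain": ["metabolism"],
}


def undirected_adjacency(parents):
    """Turn child -> parents into an undirected adjacency map."""
    adj = {term: set() for term in parents}
    for child, term_parents in parents.items():
        for parent in term_parents:
            adj[child].add(parent)
            adj.setdefault(parent, set()).add(child)
    return adj


def shortest_path_length(adj, source, target):
    """Breadth-first search; returns the edge count, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in adj[node]:
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def average_pairwise_distance(adj, terms):
    """Mean shortest-path length over all unordered pairs of distinct terms."""
    pairs = list(combinations(sorted(set(terms)), 2))
    if not pairs:
        return 0.0
    distances = [shortest_path_length(adj, a, b) for a, b in pairs]
    if any(d is None for d in distances):
        raise ValueError("some terms are not connected in the graph")
    return sum(distances) / len(pairs)


ADJ = undirected_adjacency(TOY_PARENTS)
close = average_pairwise_distance(
    ADJ, ["apoptosis", "regulation_of_apoptosis", "induction_of_apoptosis"]
)
far = average_pairwise_distance(ADJ, ["apoptosis", "electron_transport_chain"])
```

On this toy graph the three apoptosis-related terms lie closer together on average than the pair ‘apoptosis’ and ‘electron transport chain’, mirroring the semantic examples given earlier. A purely path-based score of this kind says nothing about how many genes carry each term, i.e. it ignores the enrichment aspect.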
In this study, we introduce a novel approach to assess the functional coherence of gene sets by taking into account both the enrichment of GO terms and the relationships among them; a conceptual overview is illustrated in Fig. 1. Our methods offer three novel aspects. First, the genes and their annotations are represented with a subgraph derived from the GO graph, in which genes and GO terms are represented as nodes, and their relationships are represented as quantifiable edges (gene-to-term and term-to-term). By studying the topological properties of the graph with methods from graph theory (Barabási and Oltvai, 2004; Newman, 2003), we have identified a set of metrics that reflect both the enrichment of GO terms and the relationships among them, which makes it possible to differentiate known coherent gene sets from randomly grouped ones. Second, we utilized the co-annotation of genes by a pair of GO terms, a source of information ignored by most contemporary methods, to further enhance the discriminative power of the graph-based metrics. Finally, we have developed a principled framework employing simulation and non-parametric statistical methods, which enables us to directly test the null hypothesis that a gene set consists of randomly grouped genes. When applied to the gene sets from the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (Kanehisa et al.), the metrics were shown to be highly discriminative in differentiating known coherent gene sets from random ones; when tested on gene sets derived from microarray analysis, the metrics provided biologically sensible assessments.
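The idea of testing the null hypothesis of random grouping by simulation can be sketched generically as follows. This is only an illustration under stated assumptions: the metric used here (mean number of GO terms shared per gene pair) is a simple stand-in and not one of the graph-based metrics of this study, and the annotations, gene names and function names are hypothetical.

```python
import random
from itertools import combinations


def mean_shared_terms(gene_set, annotations):
    """Stand-in coherence metric: mean number of terms shared per gene pair."""
    pairs = list(combinations(sorted(gene_set), 2))
    if not pairs:
        return 0.0
    shared = sum(len(annotations[a] & annotations[b]) for a, b in pairs)
    return shared / len(pairs)


def empirical_pvalue(gene_set, annotations, metric, n_sim=1000, seed=0):
    """Fraction of random same-size gene sets scoring at least as high as observed.

    Uses the common (hits + 1) / (n_sim + 1) estimator so the P-value is never zero.
    """
    rng = random.Random(seed)
    background = sorted(annotations)
    observed = metric(gene_set, annotations)
    hits = 0
    for _ in range(n_sim):
        random_set = rng.sample(background, len(gene_set))
        if metric(random_set, annotations) >= observed:
            hits += 1
    return (hits + 1) / (n_sim + 1)


# Hypothetical background: 40 genes, 5 of which share two terms.
TOY_ANNOTATIONS = {f"g{i}": {f"term_{i}"} for i in range(40)}
for i in range(5):
    TOY_ANNOTATIONS[f"g{i}"] = {"apoptosis", "cell_death"}

coherent_set = [f"g{i}" for i in range(5)]
p_coherent = empirical_pvalue(coherent_set, TOY_ANNOTATIONS, mean_shared_terms, n_sim=200)
```

Because the null distribution is built by resampling gene sets of the same size from the background, no parametric assumption about the metric is needed; any set-level score could be passed as `metric`.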
Fig. 1. Conceptual overview of graph-based functional coherence evaluation. (A) A graph representation of the GO is constructed and referred to as a GOGeneGraph, in which a node is a GO term or a gene, and an edge reflects the semantic relationship between a ...
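A graph with two node types (genes and GO terms) and two edge types (gene-to-term and term-to-term) could be assembled along the following lines. This is a minimal sketch under assumptions: the data layout, the function names, the inclusion of all ancestor terms and the toy ontology are illustrative choices, not the GOGeneGraph construction itself, and edges are left unweighted here.

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following child -> parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def build_gene_term_graph(gene_set, annotations, parents):
    """Return (nodes, edges) for the subgraph induced by a gene set.

    nodes: dict mapping node name -> "gene" or "term"
    edges: set of (source, target, kind), kind is "gene-term" or "term-term"
    Assumes gene names and term names do not collide.
    """
    nodes = {}
    edges = set()
    for gene in gene_set:
        nodes[gene] = "gene"
        for term in annotations.get(gene, ()):
            nodes[term] = "term"
            edges.add((gene, term, "gene-term"))
            # Assumption: keep every ancestor so the terms stay connected.
            for ancestor in ancestors(term, parents):
                nodes[ancestor] = "term"
    for node, kind in list(nodes.items()):
        if kind == "term":
            for parent in parents.get(node, ()):
                edges.add((node, parent, "term-term"))
    return nodes, edges


def genes_per_term(edges):
    """Count gene-to-term edges arriving at each term (a simple degree statistic)."""
    counts = {}
    for _source, target, kind in edges:
        if kind == "gene-term":
            counts[target] = counts.get(target, 0) + 1
    return counts


# Hypothetical toy data (not real GO identifiers or gene symbols).
PARENTS = {"root": [], "cell_death": ["root"], "apoptosis": ["cell_death"]}
ANNOTATIONS = {"geneA": {"apoptosis"}, "geneB": {"apoptosis"}, "geneC": {"cell_death"}}
nodes, edges = build_gene_term_graph(["geneA", "geneB"], ANNOTATIONS, PARENTS)
```

Keeping the node type and edge kind explicit makes it straightforward to compute topological statistics separately for gene-to-term and term-to-term relationships.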