PLoS One. 2010; 5(6): e10996.
Published online 2010 June 16. doi:  10.1371/journal.pone.0010996
PMCID: PMC2886832

Statistical Tests for Associations between Two Directed Acyclic Graphs

Fabio Rapallo, Editor

Abstract

Biological data, and particularly annotation data, are increasingly being represented in directed acyclic graphs (DAGs). However, while relevant biological information is implicit in the links between multiple domains, annotations from these different domains are usually represented in distinct, unconnected DAGs, making links between the domains represented difficult to determine. We develop a novel family of general statistical tests for the discovery of strong associations between two directed acyclic graphs. Our method takes the topology of the input graphs and the specificity and relevance of associations between nodes into consideration. We apply our method to the extraction of associations between biomedical ontologies in an extensive use-case. Through a manual and an automatic evaluation, we show that our tests discover biologically relevant relations. The suite of statistical tests we develop for this purpose is implemented and freely available for download.

Introduction

An increasing number of discoveries, particularly in biomedicine, are facilitated by statistical analyses of data annotated to biomedical ontologies [1]. Biomedical ontologies are generally represented as DAGs, and specific domains are usually represented in distinct, separate DAGs [2]–[4].

Statistical tests that utilize a single graph can only consider the given domain. However, entities from different domains are linked via biomedical relations [5]. These relations can be vital for the discovery of novel biomedical knowledge. We have designed a family of novel statistical tests to identify strong associations between nodes from two directed acyclic graphs. The tests combine measures of relevance and specificity.

We evaluated our statistical method through an extensive use-case in which we applied our tests to the detection of strong semantic associations between the Gene Ontology [3] and the Celltype Ontology [6] based on co-occurrence in scientific literature. In this use-case, we annotated the ontologies with occurrence and co-occurrence count data of the ontologies' category labels in full-text scientific articles. The strongest associations identified through our tests are biologically relevant relations.

An implementation of the six novel statistical tests to identify associations between directed acyclic graphs is available as free software from our project webpage at http://bioonto.de/pmwiki.php/Main/ExtractingBiologicalRelations.

State of the art

Our approach to the computation of the strength of the association between two graphs relies on approaches for capturing the semantic similarity between categories in ontologies and for propagating these similarities within DAGs. In the following, we give a brief overview of methods for computing the similarity of categories (a more complete overview can be found in [7]). Most of the existing semantic similarity approaches assume that ontologies contain categories [formula e001] that are annotated with terms [formula e002]. Based on this assumption, the computation of the semantic similarity of two categories [formula e003] and [formula e004] can be carried out by using the structure of the ontology to which [formula e005] and [formula e006] belong (edge-based approaches), the nodes and their properties (e.g., similarity between [formula e007] and [formula e008]) (node-based approaches) or by combining structural knowledge and annotations (hybrid approaches).

The most common edge-based approach consists of using a function of the number of edges between [formula e009] and [formula e010] as a semantic similarity measure [8], [9]. Other approaches combine the previous approach with the length of the path between the most specific common ancestor of [formula e011] and [formula e012] and the root node [10], [11]. Edge-based approaches rely on the nodes being elements of the same graph. Thus, they cannot be utilized when trying to compute the similarity of two nodes from distinct DAGs.
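
As an illustration of why edge-based measures are confined to a single graph, the following minimal sketch counts the edges on the shortest path between two categories. The function name `edge_distance`, the edge-list representation and the toy categories are assumptions made for this example and are not taken from the cited approaches.

```python
from collections import deque

def edge_distance(edges, a, b):
    """Number of edges on the shortest path between a and b, ignoring edge
    direction; returns None if the two nodes are not connected."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    if a == b:
        return 0
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour == b:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

# Two hypothetical, unconnected DAGs given as (parent, child) pairs.
edges = [("cell", "neuron"), ("cell", "T cell"), ("process", "apoptosis")]
same_dag = edge_distance(edges, "neuron", "T cell")
across_dags = edge_distance(edges, "neuron", "apoptosis")
```

Within one DAG the distance is defined, whereas for nodes from two unconnected DAGs no path exists, so a measure of this kind has nothing to work with.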

The second category of approaches, the node-based approaches, uses the properties of the nodes themselves to compute their similarity. One of the central concepts for using annotations to compute similarity is that of information content, which is the negative log-likelihood [formula e013] of a term [formula e014] where [formula e015] is the probability of occurrence of the terms in [formula e016] in a certain corpus. Based on this value, several similarity metrics have been developed including the information content of the most informative common ancestor used in [12], [13] or of the disjoint common ancestors [14].
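
The notion of information content can be made concrete with a small sketch. It assumes that the probability of a term is estimated as its relative frequency in a corpus; the function name `information_content` and the counts are hypothetical and serve only as an example.

```python
import math

def information_content(term_counts):
    """Negative log of the relative frequency of each term.
    Terms with a zero count are skipped because their value is undefined."""
    total = sum(term_counts.values())
    return {
        term: -math.log(count / total)
        for term, count in term_counts.items()
        if count > 0
    }

# Hypothetical occurrence counts of three terms in a corpus.
counts = {"apoptosis": 50, "cell death": 150, "process": 800}
ic = information_content(counts)
```

Rare terms receive a high information content and frequent terms a low one, which is why measures built on this value favour specific categories.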

In recent years, hybrid similarity measures that combine node- and edge-based approaches have been developed. Most of these approaches utilize the information content. For example, [15] utilizes a combination of edge weights based on node depth and node link density and of the difference in information content between the nodes linked by that edge. Other approaches, such as that described in [16], compute edge weights by using a scheme that takes the type of the edge into consideration. The semantic similarity between two terms is set to a function of the maximum of the product of the best path between the terms. Again, these approaches can only compute the similarity of terms from the same DAG.

The aim of our approach is to provide a means for the computation of the association between nodes from two DAGs, which are, in general, distinct. We do not make the same assumptions about the annotation of edges and nodes as other approaches to semantic similarity. Instead, we go beyond current semantic similarity measures by providing a measure of statistical significance in a distribution of arbitrary node and edge annotations. When applying our method to semantic similarity between ontologies, we can compute initial semantic similarity values for categories that do not belong to the same ontology.

Methods

Statistics on graphs

Preliminaries of directed acyclic graphs

Our tests take as input two directed acyclic graphs, [formula e017] and [formula e018] that are disjoint ([formula e019]). From these two graphs, a graph [formula e020] with [formula e021] is constructed. We denote an edge as an ordered pair of vertices. If an edge connects [formula e022] and [formula e023], [formula e024], we call [formula e025] the child of [formula e026] and [formula e027] the parent of [formula e028]. If there is a path from [formula e029] to [formula e030], we call [formula e031] a predecessor of [formula e032] and [formula e033] a successor of [formula e034].

In addition to the two graphs, two functions [formula e035] and [formula e036] are given as input such that [formula e037] and [formula e038]. From these two functions, a graph decoration for [formula e039] is constructed based on the assumption that the two input functions are transitive over the DAG: the decoration [formula e040] of a vertex [formula e041] is the union of [formula e042] and the values of [formula e043] for all successors [formula e044] of [formula e045]. Similarly, the decoration [formula e046] of an edge [formula e047] for [formula e048] is the union of [formula e049] and the values of [formula e050] for all edges [formula e051] between the successors of [formula e052] and [formula e053].
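
The construction of the decorations can be sketched as follows. The sketch rests on several assumptions that are not stated in the text: each DAG is given as a dictionary mapping a vertex to its children, the input functions are set-valued and stored as dictionaries, and the edge decoration is read as the union over all pairs formed from each vertex together with its successors. All function names are hypothetical.

```python
def successors(children, v):
    """All vertices reachable from v in the DAG, excluding v itself."""
    seen = set()
    stack = list(children.get(v, ()))
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(children.get(u, ()))
    return seen

def vertex_decoration(children, vertex_values, v):
    """Union of the value of v and the values of all successors of v."""
    decoration = set()
    for u in {v} | successors(children, v):
        decoration |= vertex_values.get(u, set())
    return decoration

def edge_decoration(children1, children2, edge_values, v1, v2):
    """Union of the edge values over v1, v2 and their successors
    (assumed reading of the definition)."""
    decoration = set()
    for a in {v1} | successors(children1, v1):
        for b in {v2} | successors(children2, v2):
            decoration |= edge_values.get((a, b), set())
    return decoration

# Toy input: sets of hypothetical document identifiers.
dag1 = {"cell": ["neuron", "T cell"]}
dag2 = {"process": ["apoptosis"]}
vertex_values = {"cell": {"d1"}, "neuron": {"d2"}, "T cell": {"d3"}}
edge_values = {("cell", "process"): {"d1"}, ("neuron", "apoptosis"): {"d2"}}

cell_decoration = vertex_decoration(dag1, vertex_values, "cell")
top_edge = edge_decoration(dag1, dag2, edge_values, "cell", "process")
```

In this toy input, the general vertex "cell" collects the values of both of its children, which illustrates why more general vertices carry larger decoration sets.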

The third component of the input is a score function [formula e054]. We assume that the value of the score function between the vertices [formula e055] and [formula e056] depends only on the graph decorations [formula e057] of [formula e058] and [formula e059] of [formula e060] as well as the decoration [formula e061] of the edge [formula e062].

The score function is not symmetric, i.e., it is not necessary that [formula e063]. It is intended to measure the association strength between two vertices from the input graphs. Our method identifies whether the score between two vertices is significantly high. A graphical overview of our test method is shown in Figure 1.

Figure 1
Schematic representation of our method.

Determining the Random Distribution

The score between two vertices [formula e064] and [formula e065] is influenced by the topology of the input DAGs: a vertex [formula e066] that is more general has a larger decoration set [formula e067] due to our basic assumption about transitivity of input graph decorations. Similarly, the cardinality of the decoration set of the edges between nodes from the two input DAGs is larger when the edges connect more general vertices. Therefore, a high score alone is insufficient to consider the score between two vertices significantly high. A random distribution of the scores of each pair of vertices [formula e068] and [formula e069] provides a means for determining the significance of the score between [formula e070] and [formula e071]. This random distribution depends on the functions [formula e072] and [formula e073], the score function and the topology of the input graphs. Hence, we cannot assume any statistical distribution of scores ab initio. Instead, we simulate the random distribution of the scores between each vertex pair through multiple random permutations: the [formula e074]-values that are given as input for our method are randomly swapped with the [formula e075]-values of vertices in the input DAG from which they originate. There are two options for permuting the [formula e076]-values for edges: either they are, mutatis mutandis, permuted similarly to the [formula e077]-values of the vertices, or they are permuted depending on the permutation of [formula e078]-values; in the latter case, when the [formula e079]-values of [formula e080] and [formula e081] are swapped, so are the values of [formula e082] and [formula e083] for any vertex [formula e084].

Because our test is intended to identify associations between vertices, we do not assume that the values of [formula e085] and [formula e086] are independent. We therefore prefer to use the second option, i.e., that the permutation of the [formula e087]-values depends on the permutation of the [formula e088]-values.
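
The preferred option, in which edge values follow the permutation of the vertex values, might be sketched as below. The sketch assumes that vertex values are shuffled separately within each input DAG and that values are stored in dictionaries; the function name `permute_with_dependent_edges` and the toy data are hypothetical.

```python
import random

def permute_with_dependent_edges(vertices1, vertices2, vertex_values, edge_values, rng):
    """Shuffle vertex values within each DAG; edge values move with their vertices."""
    # source1[v] is the vertex whose original value v receives after shuffling.
    source1 = dict(zip(vertices1, rng.sample(vertices1, len(vertices1))))
    source2 = dict(zip(vertices2, rng.sample(vertices2, len(vertices2))))

    new_vertex_values = {v: vertex_values[source1[v]] for v in vertices1}
    new_vertex_values.update({v: vertex_values[source2[v]] for v in vertices2})

    # An edge (a, b) receives the value of the edge between the source vertices.
    new_edge_values = {}
    for a in vertices1:
        for b in vertices2:
            key = (source1[a], source2[b])
            if key in edge_values:
                new_edge_values[(a, b)] = edge_values[key]
    return new_vertex_values, new_edge_values

rng = random.Random(0)  # fixed seed so that the example is reproducible
v1 = ["cell", "neuron", "T cell"]
v2 = ["process", "apoptosis"]
vertex_values = {
    "cell": frozenset({"d1"}), "neuron": frozenset({"d2"}), "T cell": frozenset({"d3"}),
    "process": frozenset({"d4"}), "apoptosis": frozenset({"d5"}),
}
edge_values = {("neuron", "apoptosis"): frozenset({"d2"})}
new_vertex_values, new_edge_values = permute_with_dependent_edges(
    v1, v2, vertex_values, edge_values, rng
)
```

After the permutation, whichever vertices now carry the original values of "neuron" and "apoptosis" also carry the edge value that linked them, so the dependence between vertex and edge values is preserved while their position in the DAGs is randomized.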

Based on these permutations, we first rebuild the graph decorations [formula e089] and [formula e090]. Then, we calculate and record the values of the score function [formula e091] for all pairs of vertices [formula e092] and [formula e093]. In addition, for each vertex [formula e094], such that [formula e095] is a direct successor of [formula e096], we calculate and record the score difference [formula e097]. Further, for each vertex [formula e098] with the direct predecessor [formula e099], we calculate and record the difference [formula e100].

Hence, the results of this step are threefold. First, we approximate the random score distribution for each pair of vertices through multiple random permutations. Second, each triple of vertices [formula e101], [formula e102] and [formula e103] gives rise to a random distribution of score differences between [formula e104] and [formula e105]. Third, each triple [formula e106], [formula e107] and [formula e108] yields a random distribution of score differences between [formula e109] and [formula e110].

Ontologies as graphs

While the tests we develop can be applied to any DAG that satisfies the conditions specified above, their primary application is to test the significance of an association between categories from two ontologies. An ontology is the specification of a conceptualization of a domain [17], [18]. Many biological ontologies are represented as directed acyclic graphs (DAGs) and are available in the OBO flatfile format [2]. In these DAGs, nodes represent categories and edges represent relations between these categories. A category, also called kind, class or universal, is an entity that is general in reality. Examples are dog, apoptosis or red. Categories may have instances, of which some may not be further instantiated. These are called individuals. We call the set of all categories in an ontology [formula e111] [formula e112].

Categories may be related to other categories. The most important relation between two categories [formula e113] and [formula e114] is the [formula e115] relation, [formula e116]. The relation [formula e117] can be defined by using the instantiation relation: when [formula e118], then all instances [formula e119] of [formula e120] are instances of [formula e121] [18]. This definition implies that the [formula e122] relation is reflexive, transitive and antisymmetric.

A set of categories with the [formula e123] relation among them forms a taxonomy. These taxonomies are often the backbone of the OBO ontologies' DAG structure. We call the set of all successors of a category [formula e124] the sub-categories [formula e125] and its predecessors the super-categories [formula e126]. The direct successors of [formula e127] in the taxonomy are called children ([formula e128]), while the direct predecessors are called parents.

In the OBO flatfile format, ontologies are assigned a namespace. Category identifiers are prefixed with the namespace of the ontology to which they belong. Identifiers are therefore unique within the OBO ontologies. In addition to a unique identifier, categories are assigned a name and a set of synonyms. Neither the name nor the set of synonyms needs to be unique.

Results

Statistics on graphs

To identify strong associations, we designed a family of tests for the score of each edge between the two input DAGs that considers a fragment of the path in the DAG. The tests are designed to measure the significance of the score between vertices [formula e129] and [formula e130] based on three criteria: (1) the score [formula e131] for the association should be higher than expected; (2) for each child [formula e132] of [formula e133], [formula e134] should be higher than expected; and (3) for each parent [formula e135] of [formula e136], [formula e137] should be lower than expected.

The first criterion of our tests identifies hypothetical associations between nodes from two graphs. The second and third criteria are used to verify whether the pair is the best selection, or whether a more specific or more general association is preferable. For this purpose, the second and third criteria test for novelty of the association (compared to the child and parent nodes).

Within this section, let [formula e138] and [formula e139] be fixed vertices from the DAGs [formula e140] and [formula e141], respectively. Furthermore, let [formula e142] be the number of permutations that were used to determine the random distributions. The first test we designed, [formula e143], depends on the vertices [formula e144] and [formula e145], the DAG structure and the number of permutations [formula e146]. It tests for the following properties:

  • the score between [formula e147] and [formula e148] is high,
  • the difference between [formula e149] and [formula e150] for every child [formula e151] of [formula e152] is high,
  • the difference between [formula e153] and [formula e154] for every parent [formula e155] of [formula e156] is low.

“Being high” and “being low” are captured using the values of the cumulative distribution functions (CDFs) obtained by the [formula e157] permutations performed in the previous step: one function for each pair of categories [formula e158] and [formula e159], one function for each triple of categories [formula e160], [formula e161] and [formula e162] where [formula e163] is a child of [formula e164], and one for each triple [formula e165], [formula e166] and [formula e167] where [formula e168] is a parent of [formula e169]. We combine the [formula e170]-values of the score differences to children in a single value using their geometric mean. A similar combination of the score differences' [formula e171]-values to the parent categories of [formula e172] is carried out: here, the combined value is the geometric mean of [formula e173], where [formula e174] is the [formula e175]-value in the corresponding CDF.
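
The use of empirical CDFs and the combination by geometric mean can be illustrated with a short sketch. It assumes that the CDF value of an observation is the fraction of permutation samples that do not exceed it, and that the quantity combined for the parents is the complement of that value, which expresses "being low". Both choices, the function names and the sample numbers are assumptions for illustration only.

```python
import bisect
import math

def empirical_cdf(samples, x):
    """Fraction of the permutation samples that are less than or equal to x."""
    ordered = sorted(samples)
    return bisect.bisect_right(ordered, x) / len(ordered)

def geometric_mean(values):
    """Geometric mean of non-negative values; zero if any value is zero."""
    if any(v == 0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical null samples of a score difference and observed differences.
null_differences = [0.0, 0.1, 0.2, 0.3]
child_values = [empirical_cdf(null_differences, d) for d in (0.25, 0.35)]
parent_values = [1.0 - empirical_cdf(null_differences, d) for d in (0.05, 0.15)]

combined_children = geometric_mean(child_values)
combined_parents = geometric_mean(parent_values)
```

A geometric mean is dominated by its smallest factor, so a single child or parent that fails its criterion pulls the combined value towards zero.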

Formally, let [formula e176] and [formula e177] be fixed vertices from the directed acyclic graphs [formula e178] and [formula e179], respectively, and let

  • An external file that holds a picture, illustration, etc.
Object name is pone.0010996.e180.jpg be the number of permutations,
  • An external file that holds a picture, illustration, etc.
Object name is pone.0010996.e181.jpg be the score between An external file that holds a picture, illustration, etc.
Object name is pone.0010996.e182.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0010996.e183.jpg in the An external file that holds a picture, illustration, etc.
Object name is pone.0010996.e184.jpg permutation,
  • An external file that holds a picture, illustration, etc.
Object name is pone.0010996.e185.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0010996.e186.jpg, be the cumulative distribution function (CDF) of An external file that holds a picture, illustration, etc.
Object name is pone.0010996.e187.jpg.
  • An external file that holds a picture, illustration, etc.
Object name is pone.0010996.e188.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0010996.e189.jpg, be the CDF of the difference between the vertex An external file that holds a picture, illustration, etc.
Object name is pone.0010996.e190.jpg and its An external file that holds a picture, illustration, etc.
Object name is pone.0010996.e191.jpg child vertex,
  • An external file that holds a picture, illustration, etc.
Object name is pone.0010996.e192.jpg,
  • An external file that holds a picture, illustration, etc.
Object name is pone.0010996.e193.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0010996.e194.jpg, be the CDF of the score difference between the vertex An external file that holds a picture, illustration, etc.
Object name is pone.0010996.e195.jpg and its An external file that holds a picture, illustration, etc.
Object name is pone.0010996.e196.jpg parent vertex,
  • An external file that holds a picture, illustration, etc.
Object name is pone.0010996.e197.jpg,
  • An external file that holds a picture, illustration, etc.
Object name is pone.0010996.e198.jpg, for all An external file that holds a picture, illustration, etc.
Object name is pone.0010996.e199.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0010996.e200.jpg, be the CDF of the variances An external file that holds a picture, illustration, etc.
Object name is pone.0010996.e201.jpg of the distribution An external file that holds a picture, illustration, etc.
Object name is pone.0010996.e202.jpg, and An external file that holds a picture, illustration, etc.
Object name is pone.0010996.e203.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0010996.e204.jpg for the distributions An external file that holds a picture, illustration, etc.
Object name is pone.0010996.e205.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0010996.e206.jpg, respectively.
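The CDFs in the list above can be estimated empirically from a finite set of scores, for example the scores of one vertex pair over all permutations. The following Python sketch illustrates this under that assumption. The name `empirical_cdf` and the sample values are hypothetical and are not taken from the authors' Java library.

```python
from bisect import bisect_right


def empirical_cdf(samples):
    """Return F where F(x) is the fraction of samples that are <= x."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("at least one sample is required")
    n = len(ordered)

    def cdf(x):
        # bisect_right counts how many sorted samples are <= x.
        return bisect_right(ordered, x) / n

    return cdf


# Hypothetical scores of one vertex pair over five permutations.
permutation_scores = [0.2, 0.5, 0.5, 0.9, 1.4]
score_cdf = empirical_cdf(permutation_scores)
print(score_cdf(0.5))  # fraction of permutation scores <= 0.5
```

The same helper could be applied to score differences to children or parents, since each of these is also a finite sample.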

For each child [formula] of [formula], we calculate the difference in scores [formula]. Then, we compute the geometric mean [formula] of all values [formula]. Similarly, we calculate [formula] for each parent [formula] of [formula], and the geometric mean [formula] of all values [formula]. We then define our first test as

equation image
(1)

All other tests are extensions of the first test. The second test, [formula], uses the minimum function instead of the geometric mean to combine the [formula]-values in the CDFs of the score differences to parents and children.
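The difference between the two ways of combining values can be sketched as follows. This is only an illustration of the geometric mean and the minimum applied to a list of CDF values. It does not reproduce the full test statistics, and the function names and input values are assumptions.

```python
import math


def geometric_mean(values):
    """Geometric mean of non-negative values, computed in log space."""
    values = list(values)
    if not values:
        raise ValueError("at least one value is required")
    if any(v < 0 for v in values):
        raise ValueError("values must be non-negative")
    if any(v == 0 for v in values):
        return 0.0  # a single zero forces the geometric mean to zero
    return math.exp(sum(math.log(v) for v in values) / len(values))


def combine(values, method="geometric"):
    """Combine CDF values by geometric mean or by minimum."""
    values = list(values)
    if method == "geometric":
        return geometric_mean(values)
    if method == "minimum":
        return min(values)
    raise ValueError(f"unknown method: {method}")


# Hypothetical CDF values of the score differences to three children.
child_values = [0.9, 0.4, 0.6]
print(combine(child_values, "geometric"))
print(combine(child_values, "minimum"))
```

For the same non-negative inputs, the minimum never exceeds the geometric mean, so the minimum-based combination is the more conservative of the two.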

The first two tests [formula] and [formula] do not consider the variances of the distributions of scores, differences in scores to children and differences in scores to parents. Therefore, we extend these tests by weighting all three components of the tests with the variances of their corresponding distributions. In these tests, high variance lowers the impact of the result, while lower variance strengthens it.

We define three new distributions for the variances and choose the [formula]-value in the respective CDF as a weight in our tests. We compute the scores for each pair of categories [formula] times, resulting in one distribution of scores for each pair of categories. Each of these distributions has a variance. The score variance distribution is the finite distribution (containing [formula] elements) of the variances of each of these distributions. We define the variance distributions for the score differences to parents and children analogously.
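A score variance distribution of this kind can be sketched in a few lines. The example below computes one variance per category pair and then looks up each variance in the empirical CDF of all variances. Population variance, the function name and the sample data are assumptions made for illustration. How the resulting value enters the tests as a weight is specified in the supplement and is not shown here.

```python
from bisect import bisect_right
from statistics import pvariance


def variance_cdf_values(score_samples):
    """Map each pair to the empirical CDF value of its score variance.

    score_samples maps a category pair to the list of scores computed
    for that pair (one score per repetition).
    """
    if not score_samples:
        raise ValueError("at least one pair is required")
    # One variance per pair; using population variance is an assumption.
    variances = {pair: pvariance(s) for pair, s in score_samples.items()}
    ordered = sorted(variances.values())
    n = len(ordered)
    return {pair: bisect_right(ordered, v) / n for pair, v in variances.items()}


# Hypothetical scores for three category pairs.
samples = {
    ("a", "x"): [1.0, 1.0, 1.0],  # variance 0
    ("a", "y"): [0.0, 2.0],       # variance 1
    ("b", "x"): [0.0, 4.0],       # variance 4
}
weights = variance_cdf_values(samples)
print(weights)
```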

The tests [formula] and [formula] use only the variance distribution of scores, while [formula] and [formula] use all three variance distributions. These tests are one-sided, i.e., they are not symmetric. We define two-sided, symmetric tests [formula] for all vertices [formula] and [formula] as

equation image
(2)

Table 1 lists the combination of properties for all tests. The precise formulation of all six tests can be found in the supplement S1.

Table 1
Elements of the test score of [formula].

Application to biomedical ontologies

Occurrence and co-occurrence count data as graph decoration

To verify whether the tests we designed yield reasonable results, we applied our method to the detection of significant co-occurrences between ontological categories in natural language texts, as a precursor to the detection of relations between ontological categories. For this purpose, we make the following assumptions:

  1. A term occurs in a portion of text if it is an exact substring of this portion of text.
  2. Terms can designate ontological categories; the terms that designate the same category are henceforth called the category's synset. Every occurrence of an element of the category [formula]'s synset is called an occurrence of [formula]. Every co-occurrence of an element of the category [formula]'s synset with an element of the category [formula]'s synset is called a co-occurrence of [formula] and [formula].
  3. If [formula] is a sub-category of [formula], then every co-occurrence of [formula] with [formula] is a co-occurrence of [formula] with [formula]. Additionally, every occurrence of [formula] counts as an occurrence of [formula].
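The first two assumptions can be illustrated with a small Python sketch. The synsets and the sentence are made-up examples and are not the entries of Table 2. Matching is case-sensitive here, which is an assumption of this sketch. The third assumption concerns the category hierarchy and is not covered by this snippet.

```python
def occurs(synset, text):
    """True if any term of the synset is an exact substring of text."""
    return any(term in text for term in synset)


def cooccurs(synset_a, synset_b, text):
    """True if both synsets have at least one term occurring in text."""
    return occurs(synset_a, text) and occurs(synset_b, text)


# Illustrative synsets; the terms are examples only.
erythrocyte = {"erythrocyte", "red blood cell"}
xanthine_transport = {"xanthine transport"}

sentence = "We measured xanthine transport in each red blood cell."
print(occurs(erythrocyte, sentence))
print(cooccurs(erythrocyte, xanthine_transport, sentence))
```

Because a term only has to be a substring, a short term also matches inside longer words, which is one reason why more advanced recognition methods may be preferable in practice.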

To test our method, we used the Gene Ontology (GO) [3] and the Celltype Ontology (CL) [6] as input DAGs. The GO is an ontology specifically designed to describe gene products. It contains three separate ontologies: the biological process, molecular function and cellular component ontologies. Gene products can be tagged with ontology categories to describe and classify them. The CL is an ontology for types of cells. It classifies cells based on criteria such as structure or function.

Based on the input requirements of our test, we constructed synsets from the synonyms attached to each category in the input ontologies, and counted the occurrences and co-occurrences of the categories based on two contexts: single sentences and sentences in documents. The second context refers to whole documents, but co-occurrence is based on single sentences. Therefore, when two terms co-occur in two or more sentences within one document, their co-occurrence is only counted once. The functions that assign the occurrence and co-occurrence count values to a synset of a category for each context are called [formula] and [formula], respectively.
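The two counting contexts can be sketched as follows, assuming that each document has already been split into sentences. The function names and the toy corpus are hypothetical. The sentence count adds one for every sentence in which both synsets occur, while the document count adds at most one per document.

```python
def occurs(synset, text):
    """True if any term of the synset is an exact substring of text."""
    return any(term in text for term in synset)


def count_cooccurrences(documents, synset_a, synset_b):
    """Return (sentence_count, document_count) for two synsets.

    documents is a list of documents, each given as a list of sentences.
    """
    sentence_count = 0
    document_count = 0
    for sentences in documents:
        hits = sum(
            1 for s in sentences
            if occurs(synset_a, s) and occurs(synset_b, s)
        )
        sentence_count += hits
        if hits > 0:
            # Several co-occurring sentences still count as one document.
            document_count += 1
    return sentence_count, document_count


# Toy corpus of three documents (illustrative text only).
documents = [
    [
        "xanthine transport in the erythrocyte was measured.",
        "The erythrocyte showed reduced xanthine transport.",
        "No relevant terms here.",
    ],
    ["The erythrocyte count was normal."],
    ["We studied xanthine transport.", "Each erythrocyte was imaged."],
]
counts = count_cooccurrences(documents, {"xanthine transport"}, {"erythrocyte"})
print(counts)
```

In the third toy document both terms appear, but in different sentences, so it contributes to neither count.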

We used exact string matching to identify terms in text. Our evaluation was conducted using a 2.2 GB text corpus containing 60,143 full-text articles from Open Access journals listed in PubMed Central. The aim of our method is to test for significant co-occurrences between categories.

Text Processing

First, we counted the number of occurrences and co-occurrences of the terms contained in synsets of categories from the input ontologies. Table 2 shows examples for the synsets of categories. We counted the total number of sentences and documents in which at least one element of a synset was found by using exact matching. For each pair of categories, we counted the total number of co-occurrences of elements of their respective synsets in sentences. Furthermore, we counted the number of documents in which they co-occurred within at least one sentence. To evaluate our method, we used exact matching and abstained from more sophisticated methods for recognizing the ontologies' categories in text [19], [20]. Exact matching provides a large dataset for the evaluation of our method. For practical applications such as relationship extraction, more advanced methods should be chosen.

Table 2
Example synsets taken from the GO and the CL.

The text processing yielded, for each category [formula], both its frequency [formula] and the total number of documents in which [formula] occurred, [formula]. Furthermore, for each pair of categories [formula] and [formula], we obtained both the total number of co-occurrences in sentences [formula] and the total number of documents containing these co-occurrences [formula].

Count data over ontologies

The first component in our method implements the assumption that the input graph decorations are transitive over the DAG structure. In the case of ontologies, this implements the assumption that occurrence and co-occurrence between categories are transitive over the [formula] relation between categories.

We assumed that when two categories [formula] and [formula] stand in the [formula] relation, [formula], then every occurrence of [formula] is also an occurrence of [formula]. This means that the synset-closure [formula] of a category [formula] can be constructed as follows:

equation image
(3)
equation image
(4)

For count data, the decoration value of a vertex [formula] in the DAG is equal to the sum of the input value pair [formula] and [formula] and the corresponding input values for [formula]'s successors. Therefore, for all categories [formula], we define [formula] and [formula] to represent the sum of the values [formula] and [formula] over all of [formula]'s sub-categories [formula]. Furthermore, for all categories [formula] and [formula], we compute the cumulated [formula]- and [formula]-values dubbed [formula] and [formula]:

equation image
(5)
equation image
(6)

Again, for count data, co-occurrence values between nodes [formula] and [formula] can be summed up over the successors of [formula] and [formula] to yield the decoration of the edge between [formula] and [formula].
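The summation over successors can be sketched as follows. This is a minimal illustration and not the authors' implementation. It assumes that each DAG is given as a mapping from a vertex to its direct children and that every distinct successor is counted once, even when it is reachable along several paths. All names and numbers are hypothetical.

```python
def descendants_or_self(children, node):
    """Return node plus all of its distinct successors in the DAG."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, ()))
    return seen


def cumulated_count(children, counts, node):
    """Own count of node plus the counts of all distinct successors."""
    return sum(counts.get(n, 0) for n in descendants_or_self(children, node))


def cumulated_cooccurrence(children_1, children_2, pair_counts, x, y):
    """Sum pair counts over x, y and all of their distinct successors."""
    xs = descendants_or_self(children_1, x)
    ys = descendants_or_self(children_2, y)
    return sum(pair_counts.get((a, b), 0) for a in xs for b in ys)


# First toy DAG: "leaf" is reachable from "root" along two paths.
children_1 = {"root": ["a", "b"], "a": ["leaf"], "b": ["leaf"]}
counts = {"root": 1, "a": 2, "b": 3, "leaf": 4}

# Second toy DAG and raw co-occurrence counts between the two DAGs.
children_2 = {"top": ["sub"]}
pair_counts = {("leaf", "sub"): 5, ("a", "top"): 1, ("root", "sub"): 2}

print(cumulated_count(children_1, counts, "root"))
print(cumulated_cooccurrence(children_1, children_2, pair_counts, "root", "top"))
```

Using a set of visited vertices keeps "leaf" from being added twice for "root", which matters in a DAG where a vertex can have more than one parent.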

A score for occurrences and co-occurrences

For all categories [formula] and [formula], we defined the following score function:

equation image
(7)

The first component of the score function implements the natural logarithm of the Pointwise Mutual Information (PMI) [21] score achieved by the categories with respect to their co-occurrence within sentences. PMI has been successfully used in several text mining tools (see, e.g., [22]). To avoid divisions by 0, the denominators of all members of the score function were incremented. The second component measures a similar value using documents as context. The aim of the score function is to ensure that categories that co-occur relatively often are assigned a high score. The range of the score function is between [formula] and [formula].
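As a rough illustration of a single PMI-style component, the sketch below computes the natural logarithm of the ratio between the observed co-occurrence probability and the product of the two occurrence probabilities, with the denominator incremented by one. This is one possible reading of the description and is not the score function of Equation 7. The function name, the argument names, the placement of the increment and the handling of zero co-occurrences are all assumptions.

```python
import math


def log_pmi(co_count, count_x, count_y, total):
    """Natural-log PMI from raw counts, with an incremented denominator.

    co_count: contexts (e.g. sentences) containing both categories
    count_x, count_y: contexts containing each category
    total: total number of contexts
    """
    if co_count == 0:
        return float("-inf")  # assumption: no co-occurrence gives -infinity
    # p(x, y) / (p(x) * p(y)) simplifies to co_count * total / (count_x * count_y);
    # adding 1 to the denominator avoids a division by zero.
    return math.log(co_count * total / (count_x * count_y + 1))


# Hypothetical counts for one pair of categories.
print(log_pmi(co_count=10, count_x=20, count_y=30, total=6010))
```

A document-based component could reuse the same function with document counts in place of sentence counts. How the two components are combined is not shown here.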

Discussion

Evaluation

We applied the tests to the biological process (BP) branch of the GO and the CL. To recognize the categories in text, we used the identifier of the category, the name and all exact synonyms of the category. On average, every category had 2.1 synonyms. Using exact matching, we identified 3,751 out of BP's 14,542 (26%) categories in our text corpus. We found 491 of 754 (65%) categories from the CL. Categories from the BP co-occurred 70,967 times with CL categories.

Using our method, we identified a total of 202,627 co-occurrences between categories. After applying our tests, 157,894 co-occurrences produced test values distinct from [formula]. The remainder obtained a test value of [formula] due to numerical restrictions. They were subsequently excluded, because they were indistinguishable from the absence of co-occurrence. We illustrate the quantiles obtained for different [formula]-values in our six tests, [formula], in Table 3. The distributions of scores for [formula] and [formula] are shown in Figure 2. The remaining plots are included in Supplement S1.

Figure 2
Distribution of test results.
Table 3
[formula]-quantiles for different [formula]-values for all tests.

We found that the tests using the minimum instead of the geometric mean of [formula]-values of score differences to parent and child categories are generally more restrictive, i.e., they include fewer co-occurrences for a given cutoff. Similarly, tests including the variance for scores are generally more restrictive than tests that are not weighted by the variance of score distributions. In this sense, the tests [formula] and [formula] are the most restrictive.

Table 4 shows example associations, and Table 5 shows the kind of relationship between categories that our tests identified for the [formula] top-scoring results with respect to the test [formula]. The has-participant relation is defined in the OBO Relationship Ontology (RO) [5] as a relation that holds between two categories, where every instance of one category participates in some instance of the other. We define the Participates-in relation as a relation between two categories: [formula] Participates-in [formula] [formula], where participates-in is the primitive participation relation between individuals as defined in the RO. We extend the definition of located-in in the RO to a relation Located-in between processes and objects, which holds when all participants of a process are located-in a structure during the entire duration of the process.

Table 4
Association examples.
Table 5
Manually identified ontological relations in the [formula] top-scoring association results with respect to [formula].

In our sample, [formula] associations do not fall under one of the three relations that we investigated. We discovered several kinds of unclassified relations. First, mismatches in granularity lead to strong associations for unrelated categories. For example, xanthine transport and erythrocyte are closely related according to [formula]. Erythrocytes are involved in the transport of xanthine. However, the GO category xanthine transport refers to the inter- and intracellular level of granularity, while erythrocytes transport nutrients between organs. Second, some categories are indirectly related via another category. For example, osteoclasts and lymph node development are related via the protein RANK. Third, when cells have closely related functions, we sometimes identify cell types that are too specific or too generic, as in the case of the association between basophil degranulation and mast cell. Finally, [formula] out of [formula] associations in our sample seem erroneous.

We were not able to compute precision or recall for our method due to the absence of a gold standard. However, we compared our method with the GO-CL crossproducts available from the OBO Foundry. The dataset contains manually verified relations between categories from the GO and the CL that have been extracted using pattern matching on category names [23]. As this method is based on the compositional nature of terms in the GO, it exclusively identifies relations in which one category name (usually a type of cell) is a substring of another category name (usually a GO category).

The GO-CL crossproduct contains 396 relations between GO and CL categories. From these 396, we identified 73 that co-occurred in our text corpus. Table 6 shows the percentage of significant co-occurrences within these 73 relations for different cutoffs in our six tests. Figure 2 shows the distribution of the 73 pairs with respect to [formula] and [formula].

Table 6
Evaluation of our approach with respect to the GO-CL dataset [23].

As our method relies exclusively on the distribution of terms and not on their syntactic structure, it permits the recognition of associations between categories that cannot be recognized using syntactic patterns. An example of such an association is myoepithelial cell (cells located in the mammary gland) and milk ejection.

Important potential applications for our tests arise from the fact that annotations of a large set of biomedical ontologies satisfy the conditions for our tests. Annotations satisfy the True Path Rule [3]: if two categories [formula] and [formula] stand in the is-a or part-of relation, then any annotation of [formula] is also an annotation of [formula]. Therefore, if gene annotations are used as graph decorations for the two input graphs of our method, the conditions for applying our tests are satisfied. For detecting associations between annotations, an appropriate score function must be chosen based on the hypothesis that is to be tested.

Another potential application of our tests lies in the field of relation extraction. The evaluation of our tests with the GO and CL reveals that we are able to detect biologically relevant associations between these ontologies. [formula] of the best [formula] associations retrieved by [formula] have biological meaning, as shown in Table 5. Although our approach is unable to detect the types of the biological relations, the associations provide a good starting point for a more elaborate approach to the extraction of biological relations.

Our method is designed for the detection of associations between two DAGs. However, it can be generalized to test for associations between [formula] graphs. The result of the tests would then be significant [formula]-ary associations between [formula] nodes from [formula] graphs.

Conclusions

We developed a family of novel statistical tests for associations between two directed acyclic graphs. The tests account for the graphs' topologies and test for relevance and specificity of associations. The tests are suitable for the detection of associations between categories from two biomedical ontologies, in particular those which comply with the OBO criteria [24].

In an extensive use-case, we applied our tests to the discovery of associations between categories from the Gene Ontology and the Celltype Ontology that were decorated with the number of occurrences and co-occurrences of the categories' labels in a large corpus of full-text articles. Our results show that a large proportion of the associations discovered by our tests are biologically relevant relations.

The family of tests is implemented in a Java library, which is available as free software from our project webpage at http://bioonto.de/pmwiki.php/Main/ExtractingBiologicalRelations.

Supporting Information

Supplement S1

Statistical tests for associations between two directed acyclic graphs and their application to biomedical ontologies.

(0.14 MB PDF)

Acknowledgments

We would like to thank Leonardo Bubach, Hernán Burbano and Heinrich Herre for helpful discussions and valuable comments, and Christine Green for her help in preparing the manuscript.

Footnotes

Competing Interests: The authors have declared that no competing interests exist.

Funding: The study was funded by the Max Planck Society and the University of Leipzig. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Smith B, Ashburner M, Rosse C, Bard J, Bug W, et al. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotech. 2007;25:1251–1255.
2. Golbreich C, Horrocks I. Golbreich C, Kalyanpur A, Parsia B, editors. The OBO to OWL mapping, GO to OWL 1.1! 2007. Proceedings of the OWLED 2007 Workshop on OWL: Experiences and Directions, Innsbruck, Austria, Jun 6–7. Aachen, Germany: CEUR-WS.org, volume 258 of CEUR Workshop Proceedings.
3. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25:25–29.
4. Beissbarth T, Speed TP. GOstat: find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics. 2004;20:1464–1465.
5. Smith B, Ceusters W, Klagges B, Köhler J, Kumar A, et al. Relations in biomedical ontologies. Genome Biol. 2005;6
6. Bard J, Rhee SY, Ashburner M. An ontology for cell types. Genome Biology. 2005;6:R21.
7. Pesquita C, Faria D, Falco AO, Lord P, Couto FM. Semantic similarity in biomedical ontologies. PLoS Comput Biol. 2009;5:e1000443.
8. Wu Z, Palmer MS. Pustejovsky J, editor. Verb semantics and lexical selection. 1994. pp. 133–138. Proceedings of the 32th Annual Meeting on Association for Computational Linguistics (ACL '94), June 27–30, 1994, New Mexico State University, Las Cruces, New Mexico, USA. Morgan-Kaufman Publishers, San Francisco, CA, USA.
9. Wu H, Su Z, Mao F, Olman V, Xu Y. Prediction of functional modules based on comparative genome analysis and gene ontology application. Nucleic Acids Res. 2005;33:2822–2837.
10. Wu X, Zhu L, Guo J, Zhang DY, Lin K. Prediction of yeast protein-protein interaction network: insights from the Gene Ontology and annotations. Nucl Acids Res. 2006;34:2137–2150.
11. del Pozo A, Pazos F, Valencia A. Defining functional distances over gene ontology. BMC bioinformatics. 2008;9:50+.
12. Resnik P. Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95) San Mateo, CA: Morgan Kaufmann; 1995. Using information content to evaluate semantic similarity in a taxonomy. p. 448–453. URL citeseer.ist.psu.edu/resnik95using.html.
13. Lin D. An information-theoretic definition of similarity. 1998. In: Proceedings of the Fifteenth International Conference on Machine learning (ICML-98). Madison, Wisconsin.
14. Couto FM, Silva MJ, Coutinho PM. CIKM '05: Proceedings of the 14th ACM international conference on Information and knowledge management. New York, NY, USA: ACM; 2005. Semantic similarity over the gene ontology: family correlation and selecting disjunctive ancestors. pp. 343–344. doi: http://doi.acm.org/10.1145/1099554.1099658.
15. Othman RM, Deris S, Illias RM. A genetic similarity algorithm for searching the gene ontology terms and annotating anonymous protein sequences. J of Biomedical Informatics. 2008;41:65–81.
16. Wang JZ, Du Z, Payattakool R, Yu PS, Chen CF. A new method to measure the semantic similarity of GO terms. Bioinformatics. 2007;23:1274–1281.
17. Gruber TR. A translation approach to portable ontology specifications. Knowl Acquis. 1993;5:199–220.
18. Herre H, Heller B, Burek P, Hoehndorf R, Loebe F, et al. General Formal Ontology (GFO) – A foundational ontology integrating objects and processes [Version 1.0]. 2006. Onto-Med Report 8, Research Group Ontologies in Medicine, Institute of Medical Informatics, Statistics and Epidemiology, University of Leipzig, Leipzig, Germany.
19. Doms A, Schroeder M. GoPubMed: exploring PubMed with the Gene Ontology. Nucleic Acids Res. 2005;33:783–786.
20. Gaudan S, Yepes AJ, Lee V, Rebholz-Schuhmann D. Combining evidence, specificity, and proximity towards the normalization of gene ontology terms in text. EURASIP Journal on Bioinformatics and Systems Biology. 2008;2008:9.
21. Manning CD, Schütze H. Foundations of Statistical Natural Language Processing. Cambridge, Massachusetts: The MIT Press; 1999.
22. Pantel P, Lin D. Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM Special Interest Group on Knowledge Discovery in Data. New York, NY, USA: ACM Press; 2002. Discovering word senses from text. pp. 613–619. ISBN:1-58113-567-X.
23. Bada M, Hunter L. Enrichment of OBO ontologies. Journal of Biomedical Informatics. 2007;40:300–315.
24. Smith B, Ashburner M, Rosse C, Bard J, Bug W, et al. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotech. 2007;25:1251–1255.

Articles from PLoS ONE are provided here courtesy of Public Library of Science