Current genome sequencing projects are producing a wealth of data in the form of sequences of biological polymers. For this data to be useful, it has to be interpreted in functional terms. Thus, efficient systems to describe and classify protein function are needed, as well as tools to predict the function of the huge number of new sequences.
There is much evidence for the need for well-defined and structured functional descriptions [1]. However, the main difficulty is that 'function' is not a well-defined concept, nor is it as unequivocal as 'sequence' or 'structure'. Indeed, protein function is a complex and multidimensional phenomenon.
In many cases, functional descriptors are based on the available experimental techniques or exist for historical reasons, and they do not necessarily have any meaning in biological terms (evolution, molecular mechanism). The methods we use to study biological systems require conceptualization and categorization, which are sometimes taken beyond their role as mere tools of the scientific method and are 'imposed' on the cell. One example is the artificial distinction between processes such as 'transmission of information' (for example DNA/RNA processing), 'metabolism' (of small compounds) and 'transport' (communication with the environment). Such disjoint classifications, as used in the first schemes to describe protein function, clearly do not extend to the molecular or evolutionary level. These schemes have been used in the past for classifying proteins into functional classes and for developing systems to assign newly sequenced proteins to them [5].
The current tendency is to use vocabularies and ontologies that allow complex functional descriptions beyond disjoint classes. Among these, the important effort of the Open Biomedical Ontologies (OBO) [7] in developing controlled vocabularies for a wide range of biological and medical applications must be recognised. The OBO ontologies are designed as graph architectures formed by univocal concepts (terms) that are linked together by relationships satisfying predefined formal rules [8]. The Gene Ontology (GO) project [9] has become the de facto standard among biomedical ontologies. Formally, GO is designed as a Directed Acyclic Graph (DAG) based on two unconstrained relationships ('is-a' and 'part-of') that link a vocabulary of functional terms [2]. This graph structure, together with its simple conceptualization, permits comparisons between any two GO terms to assess their functional similarity. In practice, problems such as the function-based search for genes or proteins of interest across multiple annotated databases, and the analysis of high-throughput microarray data, have led to an in-depth exploration of the ontology in order to propose models and criteria for measuring the functional relationships between terms.
In recent years, many studies have addressed this matter [10], although Lord was the first to establish a semantic distance for any two terms in GO [10], adapting the ideas that Resnik [15] developed for general taxonomies. In the model proposed by Lord, the similarity of any two GO terms is determined as a function of the information content of their common ancestors, which is calculated from corpus statistics. Recently, further efforts to identify functionally related gene products in annotated databases based on the distances calculated by Lord [11] have been shown to agree well with homology searches [12]. Nevertheless, using the most informative common ancestor as a proximity reference has some restrictions. First, the depth of the shared parent nodes is not a suitable criterion in the limited cases in which the terms to be compared are close to the root. Furthermore, the information content of a node (which is derived from its probability of occurrence in the corpus) is highly dependent on the annotated database selected and on its release version.
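The information-content idea described above can be illustrated with a minimal sketch. It assumes a Resnik-style measure, in which the similarity of two terms is the information content of their most informative common ancestor; the toy graph, the term names and the annotation counts are hypothetical and are not taken from GO or from any real database.

```python
import math

# Toy DAG: child -> list of parents (hypothetical terms, not real GO identifiers).
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "kinase": ["catalysis"],
    "atp_binding": ["binding"],
    "protein_kinase": ["kinase", "atp_binding"],
}

# Hypothetical number of gene products annotated directly to each term.
DIRECT_COUNTS = {
    "root": 0, "binding": 10, "catalysis": 20,
    "kinase": 30, "atp_binding": 25, "protein_kinase": 15,
}

def ancestors(term):
    """Return the term itself plus every term reachable through parent links."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def information_content(direct_counts):
    """IC(t) = -log p(t); p(t) counts annotations to t and to all its descendants."""
    cumulative = dict.fromkeys(PARENTS, 0)
    for term, count in direct_counts.items():
        for anc in ancestors(term):
            cumulative[anc] += count
    total = cumulative["root"]
    return {t: -math.log(c / total) for t, c in cumulative.items()}

def resnik_similarity(a, b, ic):
    """Information content of the most informative common ancestor of a and b."""
    return max(ic[t] for t in ancestors(a) & ancestors(b))

ic = information_content(DIRECT_COUNTS)
print(resnik_similarity("protein_kinase", "kinase", ic))
print(resnik_similarity("kinase", "atp_binding", ic))
```

In this toy graph, 'kinase' and 'atp_binding' share only the root, so their similarity is zero regardless of how specific each term is. Because the score depends only on `DIRECT_COUNTS`, changing the hypothetical counts (as a different database or release would) changes every similarity value, which is the dependence noted above.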
Models that take into account other aspects of the ontology structure have been developed to overcome these limitations. For example, the distance between two terms may also integrate the density of the terms and the path that links them [13]. Alternatively, a new definition has been used that considers the local relationships in the subgraph generated by the terms, rather than their global positions in the DAG [14].
A common feature of these different approaches is that they rely mainly on the semantic links of the DAG. Unfortunately, this reliance brings inherent problems, owing to the non-homogeneity and uneven distribution of biological knowledge. As a result, some regions of the DAG are more densely populated than others, so that the connections between terms are not comparable. In addition, the depth of a node (which is related to its specificity) cannot be assigned unequivocally. This problem is especially relevant for nodes that are connected to the root by many paths of different lengths.
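The ambiguity of node depth can be seen in a small sketch. The graph below is a hypothetical toy DAG, not a fragment of GO, and `path_lengths` is an illustrative helper that enumerates the length of every path from a term up to the root.

```python
# Toy DAG: child -> list of parents (hypothetical terms).
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["a"],
    "c": ["b"],
    "term": ["a", "c"],  # reachable from the root by a short and a long path
}

def path_lengths(term):
    """Return the sorted lengths (in edges) of every path from term up to the root."""
    if not PARENTS[term]:
        return [0]
    lengths = []
    for parent in PARENTS[term]:
        lengths.extend(n + 1 for n in path_lengths(parent))
    return sorted(lengths)

print(path_lengths("term"))  # two candidate depths for the same node
```

Here 'term' lies at depth 2 along one path and at depth 4 along another, so any measure that uses depth as a proxy for specificity has to choose arbitrarily between the minimum, the maximum or some average.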
In this work, we propose a novel method that associates the Molecular Function GO (MF-GO) terms based on their co-occurrences in a 'curated' set of proteins, enriched by the semantic relationships from the ontology. InterPro is used as the curated database because it integrates protein information from other databases that describe protein families, domains and functional sites, such as PROSITE, PRINTS, Pfam, ProDom, SMART and TIGRFAMs [16].
Conceptually, the method is, to some extent, similar to the way in which similarities between amino acids are 'learnt' from examples (curated structural alignments) rather than obtained from the raw chemical properties of the amino acids. Methodologically, it shares aspects of the algorithm used in the DAVID tool [17] for clustering heterogeneous annotation contents from different resources into annotation groups based on the co-association of the annotated genes in the databases.
The method analyses the mutual occurrences of the MF-GO terms across the InterPro entries. These occurrences are used as the basis for comparing the terms, on the assumption that the persistent coincidence of two terms describes their 'relation' in the general functional space. The analysis of the occurrences provides a useful mathematical tool to quantify the functional similarity between terms. A hierarchical tree linking the MF-GO terms is built from the similarity matrix. We term this tree the 'Functional Tree'; it formally constitutes a Distance Model since it satisfies the ultrametric triangle inequality. In this context, the Functional Distance for a pair of terms, Df, is defined as the height of their least common ancestor in the 'Functional Tree'. In addition, the tree allows the GO terms to be clustered into compact and homogeneous groups with biological meaning.
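The pipeline outlined above (co-occurrences, similarity matrix, hierarchical tree, distance as the height of the least common ancestor) can be sketched as follows. This is only an illustration under stated assumptions: the text here does not specify the similarity measure or the linkage rule, so the sketch assumes a Jaccard distance on the sets of entries in which each term occurs and average-linkage agglomerative clustering. The entries and term names are hypothetical.

```python
from itertools import combinations

# Hypothetical curated entries, each mapped to the set of terms annotated to it.
ENTRIES = {
    "entry1": {"kinase", "atp_binding"},
    "entry2": {"kinase", "atp_binding"},
    "entry3": {"kinase", "atp_binding", "dna_binding"},
    "entry4": {"dna_binding", "transcription_factor"},
    "entry5": {"dna_binding", "transcription_factor"},
}

def term_occurrences(entries):
    """Invert the mapping: term -> set of entries in which it occurs."""
    occ = {}
    for entry, terms in entries.items():
        for term in terms:
            occ.setdefault(term, set()).add(entry)
    return occ

def jaccard_distance(a, b):
    """Assumed dissimilarity: 1 - |A & B| / |A | B| on entry sets."""
    return 1.0 - len(a & b) / len(a | b)

def build_tree(entries):
    """Average-linkage clustering; returns merges as (cluster_a, cluster_b, height)."""
    occ = term_occurrences(entries)
    base = {frozenset(pair): jaccard_distance(occ[pair[0]], occ[pair[1]])
            for pair in combinations(sorted(occ), 2)}

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        total = sum(base[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    clusters = [frozenset([t]) for t in sorted(occ)]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: linkage(*p))
        merges.append((c1, c2, linkage(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return merges

def functional_distance(merges, a, b):
    """Height of the merge that first joins a and b (their least common ancestor)."""
    if a == b:
        return 0.0
    for c1, c2, height in merges:
        if (a in c1 and b in c2) or (a in c2 and b in c1):
            return height
    raise KeyError((a, b))

merges = build_tree(ENTRIES)
print(functional_distance(merges, "kinase", "transcription_factor"))
```

With average linkage the merge heights never decrease, so the resulting distance satisfies d(a, c) <= max(d(a, b), d(b, c)) for every triple of terms, which is the ultrametric property referred to above. Cutting the list of merges at a chosen height would yield the term clusters.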
We describe here how the Functional Tree was built, how it was clustered, and how the resulting groups were analyzed in terms of the functions they describe. To assess the reliability of the tree, the Functional Distance Df derived from it was used to calculate the distances between pairs of yeast proteins. We also compare this new metric with another based on semantic similarities.