Genome annotation relies heavily on bioinformatics methods. The identification of homologous relationships is a powerful and frequently used approach for protein-level annotation [1
], where query protein sequences are compared to sequences of characterized proteins in order to find homologies. Based on this comparison, proteins of unknown function are assigned to characterized protein families, generating testable hypotheses of their molecular function. However, this established annotation approach has several limitations. Devos and Valencia [2
] suggest that up to 30% of the functional annotations made through sequence similarity searches might be erroneous. There is clearly no simple relationship between sequence similarity and function, but some general trends have been observed. The same authors showed that the Enzyme Commission (EC) [4
] number tends to be completely conserved only for proteins with more than 80% sequence identity. They found that it is problematic to assign EC numbers based on a sequence alignment with less than 30% identity.
Complementary to sequence similarity searches, more direct approaches for the functional characterization of gene products have been proposed. In particular, genomic context methods predict which gene products are involved in common biological processes [5
]. Other methods use different protein features or structural information to predict the function of a gene product [7
The Gene Ontology Consortium provides a structured standard vocabulary for describing the function of gene products [10
]. The Gene Ontology (GO) is divided into three orthogonal ontologies: biological process, molecular function, and cellular component. The three ontologies are represented as directed acyclic graphs (DAGs) in which nodes correspond to terms and their relationships are represented by edges. Each node can have several parents and several children. There are two types of relationships: "is-a" indicates that the child is a subclass of the parent, and "part-of" is used when the child is a component of the parent. GO terms are widely used to annotate genes and their products with functional terms [11
New methods can exploit these GO annotations in order to compare gene products on the basis of their function. There are some issues which one has to take into account when GO annotations are compared. One problem is that the depth of a term in the GO graph is not representative of the specificity of the underlying concept. Different terms on the same rank (same depth in the GO graph) usually are not equally specific. In addition, GO is an ongoing project in which new terms are added continuously but many specific functional terms may still be missing. The manual mapping of GO terms to genes is based on results available in the scientific literature or in public databases, but relies on human decision and therefore is considerably subjective [12
]. In addition, a large fraction of gene products has not yet been annotated with GO terms. These problems have to be considered when designing robust measures to assess the similarity of two GO terms.
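The graph structure described above, and the reason why depth is a poor proxy for specificity, can be illustrated with a minimal sketch. The ontology below is a hypothetical toy example: the term names, the `PARENTS` mapping, and the helper functions `ancestors` and `depths` are assumptions made for illustration and do not correspond to real GO identifiers or to any particular GO library.

```python
# Hypothetical toy ontology: child -> list of (parent, relationship type).
# Term names are illustrative only and are not real GO identifiers.
PARENTS = {
    "root": [],
    "binding": [("root", "is-a")],
    "catalytic activity": [("root", "is-a")],
    "kinase activity": [("catalytic activity", "is-a")],
    "ATP binding": [("binding", "is-a")],
    "protein kinase": [("kinase activity", "is-a"), ("binding", "is-a")],
}


def ancestors(term, parents=PARENTS):
    """Return all ancestors of a term (the term itself is excluded)."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent, _relationship in parents[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def depths(term, parents=PARENTS):
    """Return the set of path lengths from the term up to the root."""
    if not parents[term]:
        return {0}
    return {
        depth + 1
        for parent, _relationship in parents[term]
        for depth in depths(parent, parents)
    }


# A term with several parents can be reached by paths of different length,
# so "the" depth of a term is not even uniquely defined in a DAG.
print(sorted(depths("protein kinase")))
print(sorted(ancestors("protein kinase")))
```

In this toy graph, "protein kinase" lies at depth 2 along one path and at depth 3 along another, which is one way to see why a measure based purely on depth is problematic.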
Semantic similarity measures have been proposed for comparing concepts within an ontology. Resnik [13
] developed a measure of semantic similarity for "is-a" ontologies based on the information content of the lowest common ancestor (LCA) of two terms. The more frequently a term occurs, i.e., the higher its probability of occurring, the lower its information content. If the LCA of two terms describes a generic concept, these terms are not very similar and this is reflected in a low information content of their LCA. This measure considers how specific the LCA of the two terms is but disregards how far away the two terms are from their LCA. Lin [15
] developed a related measure that depends on the information content of the LCA and of the two terms that are compared. This measure assesses how close the terms are to their LCA. It does not reflect the level of detail of the lowest common ancestor, though.
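The complementary blind spots of the two measures can be made concrete with a small sketch. It assumes the usual formulations, in which the information content of a term c is -log p(c), Resnik's similarity is the information content of the most informative common ancestor, and Lin's similarity is twice that value divided by the summed information content of the two terms. The ontology, the probabilities in `P`, and all function names are hypothetical values chosen for illustration.

```python
import math

# Hypothetical toy data: each term maps to its ancestors (including itself).
ANCESTORS = {
    "root": {"root"},
    "catalytic": {"catalytic", "root"},
    "kinase": {"kinase", "catalytic", "root"},
    "protein kinase": {"protein kinase", "kinase", "catalytic", "root"},
    "lipid kinase": {"lipid kinase", "kinase", "catalytic", "root"},
    "hydrolase": {"hydrolase", "catalytic", "root"},
}

# Assumed probability of observing a term or one of its descendants
# in an annotation corpus. These numbers are invented, not real GO data.
P = {
    "root": 1.0,
    "catalytic": 0.5,
    "kinase": 0.1,
    "protein kinase": 0.05,
    "lipid kinase": 0.01,
    "hydrolase": 0.2,
}


def ic(term):
    """Information content: frequent terms carry little information."""
    return -math.log(P[term])


def lca(t1, t2):
    """Most informative common ancestor of two terms."""
    return max(ANCESTORS[t1] & ANCESTORS[t2], key=ic)


def sim_resnik(t1, t2):
    """Information content of the LCA; ignores distance to the LCA."""
    return ic(lca(t1, t2))


def sim_lin(t1, t2):
    """LCA information content relative to that of the two terms."""
    denominator = ic(t1) + ic(t2)
    if denominator == 0:
        return 0.0  # convention chosen here for comparing the root with itself
    return 2 * ic(lca(t1, t2)) / denominator


# Resnik: same score whether or not the terms are far below their LCA.
print(sim_resnik("kinase", "kinase"), sim_resnik("protein kinase", "lipid kinase"))
# Lin: identical terms score 1.0 regardless of how generic they are.
print(sim_lin("catalytic", "catalytic"), sim_lin("protein kinase", "protein kinase"))
```

With these toy values, Resnik's measure cannot tell the pair of two distinct kinase subtypes from the comparison of "kinase" with itself, while Lin's measure rates a generic term compared with itself as highly as a specific term compared with itself.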
Protein sequences annotated with GO terms can be compared on the basis of such semantic similarity measures. Lord et al
] were the first to apply a measure of semantic similarity to GO annotations. They implemented GOGraph, a tool for calculating the semantic similarity of protein pairs based on Resnik's measure. The semantic similarity between two proteins is defined as the average similarity of all GO terms with which these proteins are annotated. Each protein pair receives three similarity values, one for each ontology. Cao et al
] integrated a semantic similarity search into the Bio-Data Warehouse. They also use Resnik's measure to define the similarity between two single GO terms. Speer et al
] employed a distance measure based on Lin's similarity for clustering genes on a microarray according to their function. Khatri and Draghici reviewed tools for ontological analysis of gene expression data [19
]. Friedberg and Godzik [20
] used the molecular function annotation of protein structures in the Protein Data Bank (PDB) [21
] to perform a functional comparison of different folds. They define a GO-based fold similarity as the normalized average Resnik term similarity of two folds. Lee and Lee [22
] applied Resnik's semantic similarity measure to MIPS [23
] and GO annotations in order to infer modularized gene networks. They divide the GO annotations into three sets: set 1 contains all GO terms annotated to both genes, while set 2 and set 3 contain the GO terms annotated to only one of them. Then the maximum similarity between any term from set 2 and any term from set 3 is calculated ($max_{2,3}$). Finally, the annotation information score is the sum of all self-similarities of terms in set 1 plus $max_{2,3}$. Shalgi et al. utilized Lord's definition for a subcellular clustering score based on the cellular component ontology. They calculate the similarity of two genes as the maximum similarity of GO terms annotated to one of the genes. Björklund et al
] developed a domain distance score for assessing the similarity of two domain architectures. They showed that the domain distance correlates well with Lord's approach to semantic similarity of proteins. Sevilla et al
] analyzed the correlation between gene expression and Resnik's and Lin's measures of semantic similarity. They concluded that Resnik's measure correlates well with gene expression.
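Two of the protein-level scores described above can be sketched as follows: the average over all term pairs, and the annotation information score built from the three term sets. This is a minimal illustration under stated assumptions, not a reimplementation of the cited tools. The similarity table `SIM`, the term labels, and the function names are hypothetical, and the choice to let the set 2 versus set 3 maximum contribute 0 when either set is empty is an assumption that the text does not specify.

```python
from itertools import product
from statistics import mean

# Hypothetical term-to-term similarities (for example, Resnik scores).
SIM = {
    ("t1", "t1"): 4.0, ("t1", "t2"): 1.0, ("t1", "t3"): 0.5,
    ("t2", "t2"): 3.0, ("t2", "t3"): 2.0, ("t3", "t3"): 5.0,
}


def term_sim(a, b):
    """Symmetric lookup in the toy similarity table."""
    return SIM[tuple(sorted((a, b)))]


def average_similarity(terms_a, terms_b):
    """Average similarity over all term pairs within one ontology."""
    return mean(term_sim(a, b) for a, b in product(terms_a, terms_b))


def annotation_information_score(terms_a, terms_b):
    """Sum of self-similarities of shared terms plus the best
    similarity between the terms exclusive to each gene."""
    set1 = set(terms_a) & set(terms_b)   # annotated to both genes
    set2 = set(terms_a) - set1           # annotated to the first gene only
    set3 = set(terms_b) - set1           # annotated to the second gene only
    shared = sum(term_sim(t, t) for t in set1)
    # Assumption: the maximum contributes 0 if set 2 or set 3 is empty.
    max_23 = max((term_sim(a, b) for a, b in product(set2, set3)), default=0.0)
    return shared + max_23


gene_x = ["t1", "t2"]
gene_y = ["t2", "t3"]
print(average_similarity(gene_x, gene_y))            # 1.625
print(annotation_information_score(gene_x, gene_y))  # 3.5
```

In this toy case the two genes share the term t2, yet the average over all pairs (1.625) is well below the self-similarity of t2 (3.0), because the unrelated term pairs pull the average down.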
Gene products are functionally similar if they have comparable molecular functions and are involved in similar biological processes. These gene products did not necessarily evolve from a common ancestor and therefore do not necessarily show sequence similarity. GO annotations capture the available functional information of a gene product and can be used as a basis for defining a measure of functional similarity between gene products. In this paper, we introduce a new measure of similarity between GO terms that is based on Lin's and Resnik's definitions. The measure simRel takes into account how close terms are to their LCA as well as how detailed the LCA is, i.e., it distinguishes between generic and specific terms. This simRel score is the basis for a new measure, called funSim, for assessing the functional relationship between two gene products. funSim extends the measure of similarity to the comparison of two functional annotations, each composed of sets of GO terms from different ontologies. The funSim score allows for identifying functionally related gene products from different species that have no significant sequence similarity. The measure also allows for partial matches, resulting in a more robust similarity score for the comparison of gene products with incomplete annotation or for the comparison of multi-functional proteins. We used simRel to identify all biological processes from fungi that do not appear in mammals. Furthermore, simRel was used to find molecular functions from Mycobacteria that do not appear in mammals. We compared the funSim score to established sequence similarity approaches. The method was also applied to find the human proteins that are functionally related to yeast proteins. We compared the yeast proteins with each other using funSim, and obtained a functional map using multidimensional scaling. We also applied funSim to the functional comparison of all Pfam families and generated a functional map of the protein families.
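To make the idea of combining closeness to the LCA with the specificity of the LCA more tangible, the following sketch shows one possible way of doing so. The formula is not stated in this section, so the form used here is an assumption for illustration: Lin's ratio is multiplied by (1 - p(LCA)), where p is the probability of the LCA term. The function names and probability values are hypothetical.

```python
import math


def sim_lin(p_lca, p1, p2):
    """Lin-style ratio computed directly from term probabilities."""
    denominator = math.log(p1) + math.log(p2)
    if denominator == 0:
        return 0.0  # both terms have probability 1 (e.g. the root)
    return 2 * math.log(p_lca) / denominator


def sim_combined(p_lca, p1, p2):
    """Assumed combination: Lin's ratio weighted by the specificity
    of the LCA, expressed as 1 - p(LCA)."""
    return sim_lin(p_lca, p1, p2) * (1.0 - p_lca)


# Identical generic terms (p = 0.5) versus identical specific terms (p = 0.001).
print(sim_lin(0.5, 0.5, 0.5), sim_combined(0.5, 0.5, 0.5))
print(sim_lin(0.001, 0.001, 0.001), sim_combined(0.001, 0.001, 0.001))
```

Under this assumed weighting, two identical generic terms no longer receive the maximal score, whereas two identical specific terms score close to 1, which matches the stated goal of distinguishing generic from specific terms.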