Assignment of function to gene products in the absence of direct experimental information is an important challenge of computational molecular biology [1]. In annotating proteins from newly sequenced genomes, it is common practice to transfer functional annotation from a homologous protein [4]. This approach depends on the assumptions that: (1) because homologous proteins have similar sequences and structures, they have similar functions, and (2) the annotation of the source homologue is correct. Often, but certainly not always, these assumptions are valid.
In this study we quantitatively assess the relationship between the divergence of protein function and the divergence of amino acid sequence in families of homologous proteins. In addition to illuminating the process by which proteins evolve altered and novel functions, the results provide guidance about the expected accuracy of transfer of functional annotation among homologous proteins in databases.
The most general evidence for protein homology, and inference of shared function, depends on comparative analysis of sequences and structures. PSI-BLAST [9] and hidden Markov models [10] identify distant homologs from multiple sequence alignments. Other techniques include the training of support vector machines [11] and neural networks [12] on protein features such as charge distribution and hydrophobicity to predict protein function. Structure comparisons improve the accuracy of inference of function in the absence of direct experimental evidence. These include the use of information from domains [13] and motifs [14]. Fleming et al. combined structural and sequence alignments of proteins in an annotation tool named PHUNCTIONER.
Despite the sensitivity of these tools for detecting homologs and predicting function, many authors have pointed out that because closely related proteins can change function, either through divergence to a related function or by recruitment for a very different function, annotations based only on homology can be incorrect [18].
Two problems that have arisen in studying the evolution of protein function and evaluating the expected accuracy of functional annotation transfer have been (1) standardization of terminology in describing function, and (2) defining a measure of the "distance" between functions. The Enzyme Commission classification has been very valuable but deals with only one class of protein functions [29]. In 2000, the Gene Ontology (GO) Consortium formulated a newer and more general classification of protein functions and the relationships among them [30]. Unlike the EC classification, which is a strict hierarchy, the GO scheme has the form of Directed Acyclic Graphs (DAGs), specialized to three domains: Molecular Function, Biological Process, and Cellular Component.
Enzyme Commission identifiers form a strict four-level hierarchy, or tree. For example, isopentenyl-diphosphate Δ-isomerase is assigned EC number 5.3.3.2, where the initial 5 specifies the most general category, 5 = isomerases; 5.3 comprises intramolecular isomerases; 5.3.3 those enzymes that transpose C=C bonds; and the full identifier 5.3.3.2 specifies a particular reaction. Note that the EC classifies reactions, not enzymes. To compare functional assignments of two proteins according to the EC classification, it is conventional to ask at how many levels of the hierarchy the EC numbers agree.
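The level-by-level comparison of EC numbers can be sketched as follows. This is a minimal illustration, not code from the study; the function name `ec_levels_in_common` is hypothetical, and treating a "-" field as an unassigned level is an assumption about how partial EC numbers are written.

```python
def ec_levels_in_common(ec_a: str, ec_b: str) -> int:
    """Count the leading EC levels (0 to 4) on which two EC numbers agree."""
    shared = 0
    for part_a, part_b in zip(ec_a.split("."), ec_b.split(".")):
        # Stop at the first mismatch, or at a level left unassigned ("-").
        if part_a != part_b or part_a == "-":
            break
        shared += 1
    return shared


print(ec_levels_in_common("5.3.3.2", "5.3.3.2"))  # same reaction: 4
print(ec_levels_in_common("5.3.3.2", "5.3.1.1"))  # agree on 5 and 5.3 only: 2
print(ec_levels_in_common("5.3.3.2", "1.1.1.1"))  # different top-level class: 0
```

Because the hierarchy is a tree, agreement is counted from the left only: two numbers that differ at the first level share nothing, even if later fields happen to coincide.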
In contrast, the GO classification is not a tree, but a more general type of graph. Each node is labeled by a general or specific protein function. Edges in the graph correspond to relationships between more general and more specific functions, that is, child-parent relationships. For example, the node "protein binding" is a child of the node containing the more general function "binding". The number of levels – the length of the path from any leaf to the root – is not constant. The structure of the GO DAG induces a measure of distances between functions, which will be used to quantify sequence-function relationships in proteins (see Materials and Methods).
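To make the idea of a graph-induced distance concrete, the sketch below computes a path-length distance between two terms on a small toy DAG. The toy graph and the function names are hypothetical, and the distance used here (the shortest route from one term up to a common ancestor and back down to the other) is only one plausible reading of a minimum-path-length measure; the definition actually used is given in Materials and Methods.

```python
from collections import deque

# Hypothetical toy fragment of a GO-like DAG: each term maps to its parents.
PARENTS = {
    "molecular_function": [],
    "binding": ["molecular_function"],
    "protein binding": ["binding"],
    "ion binding": ["binding"],
    "catalytic activity": ["molecular_function"],
}


def up_distances(term, parents):
    """Breadth-first search upward: fewest edges from term to each ancestor."""
    dist = {term: 0}
    queue = deque([term])
    while queue:
        node = queue.popleft()
        for parent in parents[node]:
            if parent not in dist:
                dist[parent] = dist[node] + 1
                queue.append(parent)
    return dist


def go_distance(a, b, parents):
    """Shortest path from a to b that passes through a common ancestor."""
    dist_a, dist_b = up_distances(a, parents), up_distances(b, parents)
    return min(dist_a[c] + dist_b[c] for c in dist_a.keys() & dist_b.keys())


print(go_distance("protein binding", "binding", PARENTS))             # 1
print(go_distance("protein binding", "ion binding", PARENTS))         # 2
print(go_distance("protein binding", "catalytic activity", PARENTS))  # 3
```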
GO assigns the identifier 0004452 to isopentenyl-diphosphate Δ-isomerase. (The numbers themselves have no specific significance.) Figure 1 shows a minimal-length path from GO:0004452 to the root node of the molecular function DAG, GO:0003674. In this case there are four intervening nodes, progressively more general categories as we move up the figure. Note that the GO description of this enzyme as an oxidoreductase is inconsistent with the EC classification, in which a committed choice between oxidoreductase and isomerase must be made at the highest level of the EC hierarchy.
Figure 1 The minimal-length path from GO:0004452 to the root node in the molecular function ontology.
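A minimal-length path from a term to the root can be found by breadth-first search over child-to-parent edges. The sketch below assumes a toy graph: only the two end identifiers come from the text, while the intermediate node names (`X1` to `X4`, `Y1`, `Y2`) are hypothetical placeholders, not real GO terms.

```python
from collections import deque

# Toy DAG: term -> list of parents. X* and Y* are placeholder names.
PARENTS = {
    "GO:0004452": ["X1", "Y1"],
    "X1": ["X2"],
    "X2": ["X3"],
    "X3": ["X4"],
    "X4": ["GO:0003674"],
    "Y1": ["Y2"],   # a longer alternative route that rejoins at X1
    "Y2": ["X1"],
    "GO:0003674": [],
}


def shortest_path_to_root(term, parents, root):
    """Return one minimal-length list of nodes from term up to root, or None."""
    previous = {term: None}
    queue = deque([term])
    while queue:
        node = queue.popleft()
        if node == root:
            path = []
            while node is not None:   # walk back along recorded predecessors
                path.append(node)
                node = previous[node]
            return path[::-1]
        for parent in parents.get(node, []):
            if parent not in previous:
                previous[parent] = node
                queue.append(parent)
    return None


path = shortest_path_to_root("GO:0004452", PARENTS, "GO:0003674")
print(path)
print(len(path) - 2, "intervening nodes")
```

Breadth-first search visits nodes in order of increasing distance, so the first time the root is reached the recorded path is guaranteed to be one of minimal length, even when a node has several parents.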
Our current work treats the Molecular Function component of the GO classification. The GO Molecular Function graph forms a network that has characteristics in common with other biological networks. In the Gene Ontology DAGs, the average in-degree is 1.36 (that is, on average a node or GO ID has 1.36 parents). The in-degree distribution is intermediate between an exponential and a power function. The out-degree varies widely, from 1 to 298. Three nodes have very high out-degree, with 122, 238 and 298 children. The out-degree distribution follows a power law, showing that there are hubs, or highly connected nodes. The total degree (in-degree + out-degree) distribution for the Molecular Function ontology has a mean of 2.69, and follows a power law.
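Degree statistics of this kind can be computed directly from a child-to-parent table. The sketch below follows the convention of the text (in-degree = number of parents, out-degree = number of children) on a hypothetical five-node DAG; the function name `degree_summary` is illustrative, and averaging over all nodes including the root is an assumption.

```python
from collections import Counter

# Hypothetical toy DAG (not real GO data): each node maps to its parents.
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a", "b"],   # two parents, which a strict tree would not allow
    "d": ["a"],
}


def degree_summary(parents):
    """Summarize degrees: in-degree = parents, out-degree = children."""
    in_degree = {node: len(ps) for node, ps in parents.items()}
    out_degree = {node: 0 for node in parents}
    for ps in parents.values():
        for parent in ps:
            out_degree[parent] += 1
    n = len(parents)
    return {
        "mean_in_degree": sum(in_degree.values()) / n,
        "max_out_degree": max(out_degree.values()),
        "mean_total_degree": sum(in_degree[x] + out_degree[x] for x in parents) / n,
        # Maps an out-degree value to the number of nodes having it.
        "out_degree_histogram": Counter(out_degree.values()),
    }


summary = degree_summary(PARENTS)
print(summary)
```

Every edge adds one to the in-degree of the child and one to the out-degree of the parent, so when both means are taken over the same node set the mean total degree is twice the mean in-degree.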
1.1 Assignment of functions to proteins
Neither the Enzyme Commission nor the GO classifications of protein function constitutes an assignment of function to any particular protein. Both provide only a framework for making such assignments. The PIR database at Georgetown University [31] associates Gene Ontology Identifiers (GO IDs) with individual proteins. The annotation of each protein may include several GO IDs. Indeed, annotation with any function logically implies annotation with all more-inclusive functions, all the way up to the root of the graph. (Note, however, that annotations of proteins by GO terms in databases do not always explicitly contain all the ancestors of every function that appears.) Therefore, for each protein we extracted the distal (= most precise) GO IDs to represent the function of the protein (see Materials and Methods).
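One way to extract distal terms is to discard every annotated term that is an ancestor of another annotated term. The sketch below illustrates this on a hypothetical toy DAG; the function names are illustrative and the procedure is an assumption about the general idea, not the exact method described in Materials and Methods.

```python
# Hypothetical toy fragment of a GO-like DAG: each term maps to its parents.
PARENTS = {
    "molecular_function": [],
    "binding": ["molecular_function"],
    "protein binding": ["binding"],
    "catalytic activity": ["molecular_function"],
}


def ancestors(term, parents):
    """All strict ancestors of a term, following every parent link."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def distal_terms(annotation, parents):
    """Keep only terms that are not implied by a more specific annotated term."""
    implied = set()
    for term in annotation:
        implied |= ancestors(term, parents)
    return set(annotation) - implied


annotation = {"binding", "protein binding", "catalytic activity"}
print(sorted(distal_terms(annotation, PARENTS)))
```

Here "binding" is dropped because it is implied by "protein binding", while "catalytic activity" is kept because no annotated term lies below it.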
1.2 The relationship between sequence divergence and function divergence
Many proteins with similar sequences have similar functions; for example, mammalian hemoglobins transport oxygen and carbon dioxide. For mammalian hemoglobins, transfer of annotation among homologs gives correct results. However, other families of homologs contain proteins with different functions. For example, hen egg white lysozyme and baboon α-lactalbumin have 37% identical residues in optimal sequence alignment, and retain very similar mainchain structures, but have unrelated functions. The contrast between mammalian hemoglobins and lysozyme/α-lactalbumin illustrates a general correlation between divergence of sequence and divergence of function: mammalian hemoglobins have similar sequences and similar functions, whereas lysozyme and α-lactalbumin have more distantly related sequences and dissimilar functions.
However, there are many exceptions to this correlation. In the duck, eye lens crystallins are identical in sequence to liver enolase and lactate dehydrogenase [32]. This is an example of "recruitment": acquisition of an unrelated function with little or even no sequence change. Recruitment threatens to produce incomplete or even erroneous annotations if annotation is passed freely among homologs. Conversely, some proteins very distantly related in sequence nevertheless retain similar function.
Several groups have studied the relationship between sequence similarity and functional similarity based on the Enzyme Commission classification. Those studies were necessarily limited to proteins with enzymatic functions:
In studying the relationship between sequences and EC classifications of proteins, Wilson, Kreychman & Gerstein [33], Todd, Orengo & Thornton [34], and Devos & Valencia [19] reached similar (although not identical) optimistic conclusions. Wilson, Kreychman & Gerstein [33] concluded that for pairs of single-domain proteins, at levels of sequence identity > 40%, precise function is conserved, and for levels of sequence identity > 25%, broad functional class is conserved (according to a functional classification that uses the EC hierarchy for enzymes, and supplements it with material from FLYBASE [35] for non-enzymes). The study of Todd, Orengo & Thornton [34] analyzed only homologous pairs of enzymes and reported that approximately 90% of pairs of proteins with sequence identity > 40% conserve all four EC numbers. Even at 30% sequence identity, Todd, Orengo & Thornton found conservation of three levels of the EC hierarchy for 70% of homologous pairs of enzymes. Devos & Valencia [19] reached very similar conclusions; they also reported the ability to predict correctly the agreement of FSSP categories [36] and SWISS-PROT [37] keywords, as a function of the level of sequence similarity.
Our work pursues the question of the relationship between divergence of sequence and function in homologous proteins, using the Molecular Function DAG of Gene Ontology for the classification of function. Use of the GO classification allows extension of the earlier work to proteins with non-enzymatic functions, permitting a comprehensive study of functions of proteins.
The steps of our analyses were as follows. For each pair of homologous proteins from a PFAM family, we recorded the % identical residues in the optimal alignment as a measure of sequence divergence, and we measured the functional distance between the sets of distal GO IDs associated with the two proteins. We based our definition of the distance between sets of annotations on a generalization of the simple minimum-path-length measure of the distance between two single GO IDs (see Materials and Methods).
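The two quantities recorded for each protein pair can be sketched as follows. Both functions rest on assumptions that are not taken from the paper: `percent_identity` counts identical residues over alignment columns in which neither sequence has a gap (other denominators are in common use), and `set_distance` averages nearest-neighbour distances between the two annotation sets, which is only one possible generalization of a pairwise distance and not necessarily the one defined in Materials and Methods. The distance table `TOY` is hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """% identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


def set_distance(terms_a, terms_b, pair_distance):
    """Assumed generalization: mean distance from each term to the
    nearest term in the other (non-empty) set, taken in both directions."""
    nearest_a = [min(pair_distance(a, b) for b in terms_b) for a in terms_a]
    nearest_b = [min(pair_distance(a, b) for a in terms_a) for b in terms_b]
    return (sum(nearest_a) + sum(nearest_b)) / (len(nearest_a) + len(nearest_b))


# Hypothetical symmetric pairwise distances between three terms p, q, r.
TOY = {("p", "q"): 2, ("p", "r"): 3, ("q", "r"): 1}


def toy_distance(a, b):
    if a == b:
        return 0
    return TOY.get((a, b), TOY.get((b, a)))


print(percent_identity("ACD-EF", "ACDKEY"))         # 4 of 5 ungapped columns match
print(set_distance({"p"}, {"q"}, toy_distance))     # single terms: pairwise distance
print(set_distance({"p", "q"}, {"p", "q"}, toy_distance))  # identical sets
```

With single-term sets the set distance reduces to the pairwise distance, and identical sets are at distance zero, which are the minimal properties any such generalization should have.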
From these data, we mapped the relationship between sequence divergence and function divergence. We distinguished divergence to functions within the same "branch" of the DAG (those for which the lowest common ancestor of the two nodes is not the root node) from divergence to functions in different "branches" of the DAG. We call these similar and dissimilar functions, respectively (Figure 2). We observed that, despite counterexamples of recruitment, there is a general correlation between divergence of sequence and appearance of dissimilar functions within each family. This relationship is made precise by our calculations. Our results also show that there is some variation among different PFAM families, especially for more highly diverged sequences.
Figure 2 Distinction between similar and dissimilar function. We regard "hydrolase activity, acting on ester bonds" and "oleoyl-[acyl-carrier protein] hydrolase activity" as similar functions, because their lowest common ancestor, hydrolase activity, acting on ester ...
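The similar/dissimilar distinction can be sketched as a test for a shared ancestor other than the root. The toy graph and function names below are hypothetical. The sketch also assumes that a term counts as its own ancestor and that "lowest common ancestor" means a common ancestor with no common ancestor beneath it; under that reading, two terms are similar exactly when they share any non-root ancestor.

```python
# Hypothetical toy fragment of a GO-like DAG: each term maps to its parents.
PARENTS = {
    "molecular_function": [],
    "binding": ["molecular_function"],
    "protein binding": ["binding"],
    "ion binding": ["binding"],
    "catalytic activity": ["molecular_function"],
}
ROOT = "molecular_function"


def ancestors_inclusive(term, parents):
    """The term itself plus all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def functions_similar(a, b, parents, root):
    """True if a and b share a common ancestor other than the root."""
    common = ancestors_inclusive(a, parents) & ancestors_inclusive(b, parents)
    return bool(common - {root})


print(functions_similar("protein binding", "ion binding", PARENTS, ROOT))         # True
print(functions_similar("protein binding", "catalytic activity", PARENTS, ROOT))  # False
```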