We set out to compare and contrast the encoded protein complements by identifying both orthologous proteins (7), and shared and novel protein domains in yeast and worm. Distinguishing orthologs, which have evolved by vertical descent from a common ancestor and are presumed to carry out the same function (8), from paralogs, which arise by duplication and domain shuffling within a genome and hence may have divergent functions, is paramount when carrying out whole genome comparisons (9). Failure to do so can result in functional misclassification (10) and inaccurate molecular evolutionary reconstructions (11). In this part of our analysis, we did not attempt to detect distant homologs, which may be found by using less stringent criteria and more sensitive techniques (12).
We compared the predicted proteins of yeast and worm by first carrying out reciprocal WU-BLASTP (13) comparisons (that is, each predicted yeast protein against all the predicted proteins of the worm and vice versa). In every case in which a high-scoring pair (HSP) was detected, we collected all members of a group from both organisms by using several BLAST P-values as thresholds, as described in . The ORFs within each group were then ordered by similarity clustering with the CLUSTALW program (14) and displayed as multiple sequence alignments, rooted cluster dendrograms, and unrooted trees. Each of these displays for every comparison can be found on our Web site. The numbers of worm and yeast ORFs that fall into these clusters at various similarity thresholds are given in . Fig. 1 graphically depicts the distribution of the sequences from the worm-yeast clusters within functional categories. The first significant (and somewhat unexpected) observation is that the absolute number of ORFs for which we find worm and yeast homologs is about the same in each organism. At the highest level of similarity (P
), approximately equal numbers of yeast and worm ORFs are present. This trend generally holds even at the lowest threshold we studied (P
), where there are 2497 yeast ORFs (40% of total yeast ORFs) and 3653 worm ORFs (19% of total worm ORFs).
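The grouping step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the pipeline used for the analysis: it assumes the reciprocal BLAST hits are already available as `(yeast_id, worm_id, p_value)` tuples, links two sequences whenever their P-value falls below the chosen threshold, and merges linked sequences into groups by single linkage. The function name, the tuple format, and the toy identifiers and P-values are hypothetical, and the aligned-length constraint mentioned in the Fig. 1 legend is left out.

```python
from collections import defaultdict


def group_hits(hits, threshold):
    """Single-linkage grouping of sequences joined by hits with P < threshold.

    `hits` is an iterable of (yeast_id, worm_id, p_value) tuples (assumed format).
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for yeast_id, worm_id, p_value in hits:
        if p_value < threshold:
            root_y, root_w = find(yeast_id), find(worm_id)
            if root_y != root_w:
                parent[root_y] = root_w

    groups = defaultdict(set)
    for seq in list(parent):
        groups[find(seq)].add(seq)
    # Sort members and groups so the output is deterministic.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Toy data: identifiers and P-values are made up for illustration.
toy_hits = [
    ("Y1", "W1", 1e-60),
    ("Y1", "W2", 1e-30),
    ("Y2", "W3", 1e-80),
    ("Y3", "W2", 1e-12),
    ("Y4", "W4", 1e-5),
]
strict_groups = group_hits(toy_hits, 1e-50)
relaxed_groups = group_hits(toy_hits, 1e-10)
```

With the toy data, the stricter threshold yields two yeast-worm pairs, whereas the more permissive one merges `Y1`, `Y3`, `W1`, and `W2` into a single four-member group. This illustrates why group membership is reported separately for each threshold.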
Fig. 1 Distribution of core biological functions conserved in both yeast and worm. Yeast and worm protein sequences were clustered into closely related groups (BLASTP P < 1 × 10⁻⁵⁰, with the >80% aligned length constraint).
These observations suggest that the core biological processes of the two organisms are carried out by a similar number of proteins. They further suggest that the very large difference in the total number of different proteins encoded by the two organisms (~3.1-fold higher in worm) is not accounted for by endless close variations in the clusters found among the shared set, but instead by proteins that are substantially different in sequence [compare with (15)] and thus are likely to perform tasks that are specific to each organism. A subset of such organism-specific proteins, those associated with regulation and signal transduction, was investigated and found to support this idea [(16); and see below].
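The figures quoted for the lowest threshold can be checked against the ~3.1-fold difference with a short calculation. The sketch below is a consistency check only; because the percentages in the text are rounded, the implied genome totals are approximate, and the variable names are ours.

```python
# Figures quoted in the text for the lowest similarity threshold studied.
yeast_in_clusters, yeast_fraction = 2497, 0.40  # 40% of total yeast ORFs
worm_in_clusters, worm_fraction = 3653, 0.19    # 19% of total worm ORFs

# Implied totals (approximate, since the percentages are rounded).
yeast_total = yeast_in_clusters / yeast_fraction
worm_total = worm_in_clusters / worm_fraction

# Ratio of total protein complements versus ratio within the shared clusters.
total_fold = worm_total / yeast_total
shared_fold = worm_in_clusters / yeast_in_clusters

print(f"implied totals: yeast ~{yeast_total:.0f}, worm ~{worm_total:.0f}")
print(f"worm/yeast, all ORFs: {total_fold:.2f}; within shared clusters: {shared_fold:.2f}")
```

The implied totals differ by roughly threefold, in line with the figure quoted above, whereas the shared clusters differ by well under twofold.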
If many core biological processes of worm and yeast are indeed carried out by a comparable number of closely related proteins, then it might not be necessary to study the proteins (or the processes) in detail in both organisms. Instead, the annotation for the proteins involved in shared core biology (annotation that exists almost exclusively for yeast) might be transferable to the worm, provided that the orthologs between the two species are easily recognizable by sequence analysis alone. Functional conservation of proteins from different species was first demonstrated experimentally by showing that the mammalian RAS protein can substitute for yeast RAS in a RAS-deficient yeast strain (17). The worm RAS homolog let-60 is involved in a variety of signaling processes (18) and is homologous to two yeast RAS genes (RAS1 and RAS2), as described for many families below (see also the Web site). Although upstream regulators and downstream effectors of RAS may have diverged in the two organisms, it is likely that these orthologs have a core biochemical function that is conserved, a prediction that can be easily tested in genetically tractable model organisms (19). In another example, yeast CDC28 and worm ncc-1 form an orthologous pair in the cyclin-dependent kinase family and have already been shown experimentally to be functionally interchangeable. When expressed in yeast, the protein encoded by ncc-1 complements the G2/M arrest of a cdc28 temperature-sensitive mutation, illustrating functional conservation in vivo (20).
shows that at each level of significance roughly half (611 of 1171 at P < 10⁻¹⁰) of all the sequence similarity groups found by our reciprocal BLASTP procedure contain exactly two members. Because ascertainment of each group began with a yeast-worm HSP, these groups contain one worm and one yeast member. The availability of complete sequences for both worm and yeast makes it unlikely that we are missing large numbers of potential orthologs. It remains possible that the conservative similarity cutoffs used leave fast-evolving orthologs to be identified by more detailed analysis. Thus, most of the proteins contained in these 611 groups will turn out to be authentic orthologs, like the CDC28/ncc-1 pair cited above.
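A minimal sketch of how the two-member groups could be picked out of a list of similarity groups is given below. It assumes each group is a list of `(organism, name)` tuples, which is a hypothetical representation chosen for illustration. The gene names are taken from the text, but the toy grouping itself is invented. The closing lines reproduce the counts quoted above.

```python
def one_to_one_pairs(groups):
    """Return (yeast_name, worm_name) for groups with exactly one member per organism."""
    pairs = []
    for group in groups:
        yeast = [name for org, name in group if org == "yeast"]
        worm = [name for org, name in group if org == "worm"]
        if len(group) == 2 and len(yeast) == 1 and len(worm) == 1:
            pairs.append((yeast[0], worm[0]))
    return pairs


# Toy groups: one two-member group and one three-member group.
toy_groups = [
    [("yeast", "CDC28"), ("worm", "ncc-1")],
    [("yeast", "PRE2"), ("worm", "K05C4.1"), ("worm", "Y105E8A.jj")],
]
pairs = one_to_one_pairs(toy_groups)

# Counts quoted in the text at P < 1e-10.
two_member, total = 611, 1171
remaining = total - two_member   # groups with three or more members
fraction = two_member / total    # share of groups that are simple pairs

print(pairs)
print(f"{two_member} of {total} groups ({fraction:.1%}) have two members; {remaining} have more")
```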
Examination of the CLUSTALW output provides a comparably strong indication of many orthologous relationships within the remaining groups (560 of 1171 at P < 10⁻¹⁰) that contain three or more members. From several hundred such families, six examples are illustrated in a rooted tree display (21) (Fig. 2). The first example illustrates the two clusters of DNA-dependent RNA polymerases. In every case, the yeast and worm proteins form unambiguous pairs. In this instance, most of the cases for pairing are conclusive, because the RNA polymerase I and II subunits were independently identified in yeast and worm (23). In addition, the cluster [here done at P
)] contains the yeast polymerase III subunit paired with its presumed ortholog in the worm.
Fig. 2 Orthologous core biological functions in yeast and worm. Representative sequence groups are shown as rooted CLUSTALW Neighbor-Joining trees, clustered as described in the legend to Fig. 1, at a similarity level indicated after each description.
The second example (Fig. 2) shows the cluster of DNA replication factor C subunits, which act as processivity factors for DNA polymerases δ and ε and load proliferating cell nuclear antigen (PCNA) onto DNA (25). This cluster has 12 members, and the pairing is entirely consistent with the idea that each member of each pair is orthologous to the other. The third example (Fig. 2) shows a similar clustering of proteasome subunits (26). In this case there are 25 members of the cluster, which form 10 clear pairs, with three yeast and two worm sequences apparently unpaired. However, it seems probable that there is an additional orthology: yeast PRE2 with the minimally diverged (recently duplicated?) worm sequences K05C4.1 and Y105E8A.jj. Accepting this, the 25 sequences yield 11 pairs.
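The kind of pairing read off these trees can be approximated programmatically. The sketch below is not the procedure used in the analysis, where trees were inspected directly. It assumes a rooted binary tree written as nested tuples, with leaves as `(organism, name)` tuples, and reports sister leaves that come from different organisms. All names and the tree topology are hypothetical.

```python
def is_leaf(node):
    # A leaf is (organism, name); an internal node is (left_subtree, right_subtree).
    return isinstance(node[0], str)


def cross_species_sisters(tree):
    """Return (yeast_name, worm_name) for sister leaves from different organisms."""
    if is_leaf(tree):
        return []
    left, right = tree
    if is_leaf(left) and is_leaf(right):
        if {left[0], right[0]} == {"yeast", "worm"}:
            yeast, worm = (left, right) if left[0] == "yeast" else (right, left)
            return [(yeast[1], worm[1])]
        return []
    return cross_species_sisters(left) + cross_species_sisters(right)


# Toy tree: one clean yeast-worm pair, plus one yeast sequence whose sister
# is a clade of two closely related worm sequences.
toy_tree = (
    (("yeast", "Y_a"), ("worm", "W_a")),
    (("yeast", "Y_b"), (("worm", "W_b1"), ("worm", "W_b2"))),
)
sister_pairs = cross_species_sisters(toy_tree)
print(sister_pairs)
```

Only the clean pair is reported. The second clade mirrors the PRE2 situation described above: a yeast sequence sits next to two minimally diverged worm sequences, so it is not a simple sister pair, and assigning orthology there requires the kind of judgment applied in the text.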
The worm has 17 tubulin genes, compared to just 4 in yeast (Fig. 2). Because the worm expresses specific tubulins for specific functions, a skewed worm:yeast ratio is to be expected. For instance, worm tba-1 α-tubulin is selectively expressed in a set of mechanosensory and ventral-cord motor neurons during development. Conversely, yeast express almost twice as many hexose transporters as worm, indicating the importance of sugar transport to S. cerevisiae
). Both worm and yeast encode just one γ-tubulin, implying that whereas other tubulins may have become more specialized, γ-tubulin still functions only in a common core process.
The comparisons for actin and actin-like proteins give a quite different result (Fig. 2). Although there are more classical actins in the worm than in yeast, several of the actin-related proteins (ARP genes) of yeast have what appear to be orthologs in the worm. Like γ-tubulin, they appear to carry out a core process shared by the two organisms. The true actins of the worm function in both muscular contraction and as cytoskeletal elements, so that the duplication and divergence of specialized actins was to be expected. Somewhat surprisingly, there is a yeast actin-related protein with no obvious counterpart in the worm, ARP1
), which encodes a nuclear protein related to dynactin and centractin. This lack of orthology may be explained by the relatively unusual chromosome mechanics of C. elegans, whose chromosomes are holocentric and thus lack defined centromeres (29).
In the large cluster of HSP70 heat shock proteins (Fig. 2), five subclusters can be recognized, each containing worm and yeast genes. The subclusters appear to reflect different localization or substrate specificities in yeast. One encodes yeast cytoplasmic HSP70 proteins (SSA genes); another encodes mitochondrial proteins (SSC1). A third encodes yeast cytoplasmic proteins that act on nascent peptides and associate with translating ribosomes (SSB). Notably, the fourth group contains genes encoding chaperones of the endoplasmic reticulum; Kar2p in yeast (31) and hsp-3 and hsp-4 in worm (32) have each been independently shown to have this function.
The nuclear-encoded mitochondrial proteins of worm and yeast provide a compelling example of orthologous pairs but also a remarkable case of the worm apparently missing orthologs for a set of important yeast proteins. Comparisons were performed with PSI-BLAST (33) and validated by demonstrating sequence similarity to Escherichia coli or Methanococcus jannaschii protein sequences. A total of 108 mitochondrial proteins from yeast have highly conserved homologs in worm (P-value scores <10⁻³⁹). These orthologous pairs can be assigned to diverse mitochondrial functions such as the TCA (tricarboxylic acid) cycle, electron transport, lipid metabolism, amino acid biosynthesis, intermediary metabolism, membrane transport, protein processing, RNA metabolism, and protein synthesis. Surprisingly, worm orthologs were identified for only 10 of the approximately 40 unique yeast mitochondrial ribosomal proteins (34). It seems possible that, given the small size of mitochondrial ribosomal RNAs in the nematodes (35), the C. elegans mitochondrial ribosomes could contain a small number of proteins. However, 10 proteins are unlikely to make a functional ribosome. It therefore remains to be determined whether more ribosomal protein genes are encoded in the worm genome but are missing in the currently defined gene complement, or whether some have been displaced in the nematode mitochondrial ribosome by cytoplasmic ribosomal proteins.
Taken together, these observations show that for a substantial fraction of the yeast and worm genes, unequivocal, one-to-one orthologous relationships are readily identifiable. The simplest explanation for these results is that the proteins in this data set carry out core biological processes required by each organism. To test this idea, a functional classification for each of the proteins in this set was abstracted, mainly from the SGD (most of the yeast proteins in this set have some functional annotation) but also from the Web version of ACeDB (www.sanger.ac.uk/Projects/C_elegans). When this was done for the set of proteins at the level of P
, 91% of the proteins could be classified. Of these, 79% could be assigned to rubrics fitting the description of core biological processes (Fig. 1). A more detailed scrutiny of orthologs in different functional categories indicates, however, that certain central metabolic pathways (for example, those for the biosynthesis of several amino acids) that are present in yeast appear to be missing in the worm. This reflects the different nutritional requirements of the two organisms. Many of these functional designations are particularly reliable because they originate from experiments carried out directly with yeast.
Possibly the most important opportunity to emerge from these results is that annotation of protein functions and activities will be reliably transferable between organisms as disparate as yeast and worm by sequence analysis. With well-annotated genomes, the identification of orthologous pairs becomes a powerful analytical approach. Whereas biochemical and biological experiments must be done to unequivocally prove the functions of proteins, the wealth of data from sequence analyses allows researchers to better design experiments and avoid duplication of work done in other systems.