Phyletic patterns and genome-wide demography of protein domains involved in RNA metabolism
We delineated domains involved in RNA metabolism as described above and conducted a survey of their demography across the genomes of representative organisms considered in this study. This overall demographic survey revealed a number of general trends in the evolution of these domains (Fig. A–D). The most notable, if not unexpected, feature was the separation of the three primary kingdoms by specific phyletic patterns of many domains. A large set of enzymes and interaction domains are present universally and, in all likelihood, are part of the LUCA inheritance (Fig. A and C). However, another substantial fraction of the domains involved in RNA metabolism appear to have evolved in a particular superkingdom or lineage, with the greatest number of lineage-specific inventions found in eukaryotes (Fig. D). Many eukaryote-specific domains belong to ancient folds, but acquired their RNA-related function only in the eukaryotic lineage. Examples of such exaptation of ancient domains for functions in RNA metabolism include the mRNA-capping enzyme that was derived, at the onset of eukaryotic evolution, from the more ancient DNA ligases (37–39), and the lariat-debranching enzyme that was derived from the ubiquitous calcineurin-like phosphoesterases (40,41). Similarly, superfamily (SF)-I helicases were recruited for important RNA-related functions, such as nonsense-mediated decay, only in eukaryotes, although several such helicases function in bacterial DNA recombination and repair. Some eukaryote-specific enzymes, such as the RNA-dependent RNA polymerase involved in PTGS and the Kem1/Rat1 family of 5′→3′ nucleases, have large, complex catalytic domains that so far could not be traced to any ancient enzymatic fold. Although structural innovation is less common in prokaryotes than it is in eukaryotes, there are a few enzymes, for example, the RNase domains of the RNaseE/G superfamily, that appear to be innovations of the bacterial lineage.
The interaction domains also show a strong trend of eukaryote-specific innovation, the most prominent one being the RNA recognition motif (RRM), which apparently was derived from a more ancient nucleic acid-binding fold with a characteristic four-stranded core found in diverse DNA- and RNA-binding domains (Table ). Another theme seen in eukaryotes is the recruitment of α-helical superstructures, such as the TPR-like fold (the HAT repeat module found in RNA processing proteins), the pumilio (PUM) repeat (42,43), and the NIC domains (16) for functions in RNA metabolism. This parallels the widespread utilization of these α-helical repeat modules in a number of other contexts in eukaryotes. Many of the distinct, small RBDs that evolved in eukaryotes, such as CCCH, Zn knuckle, C2H2-, LRP1- and C4-Little fingers utilize the common theme of stabilization through metal chelating cysteines and histidines (Fig. D). This type of structure is ancient, with numerous Zn-ribbon modules found in archaea (44), but many of these metal- and RNA-binding domains seem to have evolved de novo in eukaryotes, given that utilization of metal coordination to stabilize the core of a domain requires relatively few evolutionary changes, namely the emergence of a strategically placed set of metal-chelating residues.
Another major pattern in the phyletic distribution is the presence of numerous catalytic and interaction domains that are shared by eukaryotes and bacteria, to the exclusion of archaea (Fig. A–D). Another distinct set of domains is solely shared by archaea and eukaryotes, which supports the chimeric origin of the eukaryotic systems of RNA metabolism. A subset of proteins containing domains shared by eukaryotes and bacteria function in the mitochondria and chloroplasts that have descended from endosymbiotic bacteria. This is reflected in the larger average number of proteins with such a phyletic pattern in plants that have two distinct endosymbiont organelles, mitochondria and chloroplasts. However, several domains with a bacterio-eukaryotic distribution pattern function in non-organellar contexts, such as cytoplasmic RNA degradation. Enzymes of apparent bacterial origin recruited for cytoplasmic functions include several superfamilies of RNases, such as the 3′→5′ exonucleases (45). Of the domains with an archaeo-eukaryotic phyletic pattern, several are involved in core processes, such as RNA maturation, e.g. the tRNA endonucleases, and translation, e.g. PIWI (14), pelota and SUI1 domains (9).
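The phyletic patterns discussed above can be made concrete with a minimal sketch that maps presence or absence in the three primary kingdoms to a pattern label. The function name, the label strings and the boolean presence calls are assumptions introduced here for illustration only; the survey itself had to weigh sporadic occurrences and horizontal transfer, which a simple presence/absence call does not capture.

```python
# Illustrative sketch: presence/absence is reduced to three booleans.
# Names and labels are hypothetical and not taken from the original analysis.
PATTERN_LABELS = {
    (True, True, True): "universal (candidate LUCA inheritance)",
    (True, False, True): "bacterio-eukaryotic",
    (False, True, True): "archaeo-eukaryotic",
    (True, True, False): "prokaryote-specific",
    (True, False, False): "bacteria-specific",
    (False, True, False): "archaea-specific",
    (False, False, True): "eukaryote-specific",
    (False, False, False): "not detected",
}


def classify_phyletic_pattern(in_bacteria, in_archaea, in_eukaryotes):
    """Return a pattern label for one domain from its presence in each kingdom."""
    key = (bool(in_bacteria), bool(in_archaea), bool(in_eukaryotes))
    return PATTERN_LABELS[key]


# Presence calls below paraphrase statements made in the text;
# they are not a curated dataset.
examples = {
    "tRNA endonuclease": (False, True, True),
    "RNaseE/G RNase domain": (True, False, False),
    "RNA-dependent RNA polymerase (PTGS)": (False, False, True),
}

for name, presence in examples.items():
    print(f"{name}: {classify_phyletic_pattern(*presence)}")
```

In practice, a domain with a bacterio-eukaryotic label would still need to be examined for organellar versus cytoplasmic function, as the text notes.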
Most of the domains involved in ancient functions, such as RNA modification enzymes and RBDs associated with RNA modification, translation and transcription (Table and Fig. ), are present in nearly constant numbers in all life forms, except that eukaryotes often have more paralogs, partly owing to the presence of organelles derived from bacteria. Eukaryotes show a striking expansion of ancient SFII RNA helicases and, to a lesser extent, of other ancient catalytic domains, such as SFI helicases, GTPases, Rossmann-fold methylases, 3′→5′ exonucleases, RNase III and deaminases. A corresponding expansion of non-catalytic domains is mainly restricted to those newly invented or recruited in eukaryotes, including RRM, CCCH, Zn-Knuckle and G-patch. The advent of these RBDs correlates with the emergence of eukaryote-specific functional systems, such as pre-mRNA splicing, PTGR, and mRNA editing and modification (Fig. ).
These observations indicate that 40–45 of the approximately 100 principal domains associated with RNA metabolism originated at early stages of evolution, prior to LUCA. These domains were associated with the most ancient and conserved cellular functions, such as translation, transcription and some forms of RNA modification. The next phase of innovation marked the separation of the bacterial and archaeo-eukaryotic lineages and saw the origin of some proteins, which are involved in basic cellular functions, but are specific to one of these lineages. Finally, with the emergence of the chimeric eukaryotic lineage, domains from both the bacterial and the archaeo-eukaryotic precursor were incorporated into the eukaryotic RNA metabolism pathways. In addition, eukaryotes also ‘invented’ several new domains and recruited or expanded preexisting ones, concomitant with the origin of new RNA processing systems that were largely absent in prokaryotes. No archaea-specific domains involved in RNA metabolism were identified. This might reflect the retention of most core archaeal systems in eukaryotes, which makes the corresponding domains archaeo-eukaryotic in distribution. In addition, archaea could possess some distinct domains that were not detectable through homology and remain unknown due to the paucity of experimental studies in archaeal systems.
The surveyed organisms dedicate, approximately, between 3 and 11% of their proteomes to RNA metabolism, with the highest fraction, predictably, seen in parasitic bacteria with small genomes and the lowest fraction in multicellular eukaryotes and complex bacteria. Generally, this seems to reflect (i) the central place that RNA metabolism systems occupy in all cells, compared with the substantially more variable systems of transcription, replication or DNA repair, and (ii) a more or less linear growth of the number of proteins involved in RNA metabolism with the increase of the total number of encoded proteins in free-living organisms. Below we discuss in detail specific trends in evolution of catalytic and interaction domains involved in RNA metabolism.
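One way to read points (i) and (ii) together is as a linear model with a fixed core: if the number of RNA metabolism proteins grows linearly with proteome size but starts from a core set required by every cell, then the fraction devoted to RNA metabolism necessarily declines as the proteome grows. The sketch below illustrates only this arithmetic. The core size and slope are hypothetical values chosen for illustration, not estimates from the survey.

```python
# Hypothetical parameters, chosen only to illustrate the arithmetic;
# they are NOT estimates derived from the surveyed genomes.
CORE_RNA_PROTEINS = 40   # assumed fixed core needed by every cell
SLOPE = 0.03             # assumed RNA-metabolism proteins added per encoded protein


def rna_metabolism_fraction(total_proteins, core=CORE_RNA_PROTEINS, slope=SLOPE):
    """Fraction of the proteome devoted to RNA metabolism under a linear model."""
    if total_proteins <= 0:
        raise ValueError("total_proteins must be positive")
    rna_proteins = core + slope * total_proteins
    return rna_proteins / total_proteins


# Small, medium and large hypothetical proteomes.
for size in (500, 4000, 20000):
    print(size, f"{rna_metabolism_fraction(size):.1%}")
```

Under these assumptions the fraction equals core/N + slope, so it is largest for the smallest proteome and approaches the slope for large ones, which is the qualitative trend described above.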
Evolutionary histories of catalytic domains involved in RNA metabolism
RNA modifying enzymes. Cellular RNAs are subject to a number of post-transcriptional modifications that involve modification of the bases and sugars or synthesis of non-canonical bases or nucleotides (46–48). The direct nucleotide modifications include methylation of bases and sugars on N, C or O atoms, deamination and demethylation, whereas formation of non-canonical bases includes thiouridylation, pseudouridylation, thioadenylation, dihydrouridylation, and synthesis of archaeosine and queuine.
Methylases. The most common among RNA modifications are the numerous methylations of all types of RNA molecules (46). The RNA methylases come in two major classes (Table ): (i) the Rossmann-fold methylases, which include the majority of N-, C- and O-methylases that modify both sugars and bases in RNA, and (ii) the recently described SPOUT (49) superfamily, which consists of the m1G-specific methylase TrmD (50,51), the 2′-O-methylguanosine-specific methylase SpoU (52–54), and several other poorly characterized predicted RNA methylases. The SPOUT superfamily is traceable to LUCA, but the evolution of these methylases is not considered here because it has recently been described in detail elsewhere (49).
The methylases of the Rossmann-fold class share a six-stranded Rossmann-fold core with the dinucleotide-binding dehydrogenases and are distinguished from them by a methylase-specific 7th strand (20,55). This class contains the great majority of the known methylases that participate in almost every conceivable methylation reaction in biological systems, and RNA specificity appears to have emerged on multiple occasions among them. We sought to resolve the evolutionary relationships among Rossmann-fold RNA methylases using a combination of conventional phylogenetic trees and cladistic analysis based on specific shared sequence motifs (Fig. ). Several distinct lineages of dedicated RNA methylases can be detected; some of the corresponding protein families also include related DNA methylases. The RNA methylases, typically, are highly conserved and are often associated with specific RBDs, which distinguish them from the DNA methylases; many of the latter are large proteins occurring in restriction-modification operons with a sporadic phyletic distribution. The largest monophyletic superfamily of nucleic acid methylases is that of the base N-methylases (the BNM superfamily). These methylases are characterized by a shared derived character, the [N/D]PP[Y/F] motif at the end of strand 4, which is associated with base specificity (Fig. ). Phylogenetic analysis helped in identifying several distinct families within the BNM superfamily, and most of these families can be distinguished by specific derived characters in the above motif. Within the BNM superfamily, two families, namely the HemK family (19) and the MJ0438 family of predicted methylases containing the RNA-binding THUMP domain (12), are represented in all three primary kingdoms and are thus traceable to LUCA. Along with several other related families with more restricted phyletic patterns, these families form a large assemblage of (predicted) purine N-methylases with the NPP[Y/F] motif associated with strand 4. Some of the smaller families appear to be more closely related to either the HemK or the MJ0438 family and might have emerged from them through duplications much later in evolution. The RsmC family methylases that methylate G1207 in 16S rRNA (56) and the RsmD, YfiC and YbiN families are bacteria-specific elaborations that are related to the HemK family, whereas the MJ0046 family apparently was derived from the HemK family in the archaeo-eukaryotic lineage. The MJ0438-related elaborations, namely the MJ0710 and MJ0284 lineages, are present in archaea and eukaryotes. The YhhF and MJ1273 families, which are restricted in their distribution to bacteria and archaea, respectively, also belong to this assemblage, but do not show a specific relationship with either the HemK or the MJ0438 family. The functions of the HemK and MJ0438 families are poorly characterized, but their nearly universal conservation pattern suggests a role in purine methylation in rRNA. In Rickettsia, the HemK methylase is fused with another methyltransferase of a different family, MicA (Fig. ). This suggests that these two methylases coordinately function in rRNA methylation.
The next major assemblage within the BNM superfamily is distinguished by the motif DPP followed by a polar residue (typically R) after strand 4. One of the main families within this assemblage is the Trm2 family, which is involved in methylation of U54 in tRNA at the 5 position (57). This family with its pan-bacterial distribution appears to have emerged early in bacterial evolution and apparently was subsequently transferred to the eukaryotic lineage through the mitochondrial symbiosis. Certain bacteria encode an additional methylase family of this assemblage, TrmA, which has the same specificity (58), and appears to have branched off the more widespread Trm2 family. Similarly, eukaryotes have their own, specific methylase family related to the Trm2 family proteins and typified by CG3808 from Drosophila. Another prominent group within this assemblage is the MJ1653 family that shows a fusion to the RNA-binding PUA domain and is widespread in both archaea and bacteria. Families with a more restricted distribution, which are probably more recent offshoots of this lineage, include the YcbY family seen only in some bacteria and the archaeal MJ1233 family (Fig. ).
The last major group of the BNM assemblage are the methylases with a circularly permuted methylase domain. All members of this group that are widespread in prokaryotes are DNA adenine methylases associated with restriction–modification systems. In eukaryotes, this group diversified into three distinct families of adenine mRNA methylases (59) typified by the yeast proteins Kar4p and Ime4p, and Drosophila CG14906 (lost in S.cerevisiae), respectively (Fig. ). In these families, the motif associated with strand 4 assumes the form [D/E]PPW, which is shared with DNA adenine methylases, such as MunI.
The SUN superfamily is the next major assemblage of Rossmann-like fold RNA methylases, which is the sister group of the BNM superfamily (Fig. ) and has the diagnostic motif DAPC associated with strand 4. The Sun family enzymes, which methylate rRNA at the cytosine 5 position (23), are represented in all three primary kingdoms, consistent with their presence in LUCA. The SUN superfamily has undergone extensive radiation in archaea and eukaryotes, giving rise to two distinct families prior to the separation of eukaryotes and archaea and the eukaryote-specific Nop1 family involved in rRNA and snU RNA methylation (60).
The Erm1/KsgA family that has the motif NLP[Y/F] associated with strand 4 is another close sister group of the BNM superfamily (Fig. ). These methylases are conserved in all life forms and are responsible for diadenine 2-methylation in rRNA (61), which suggests the presence of this modification in LUCA. The archaeo-eukaryotic Trm5 tRNA methylase family and the archaea-specific MJ1557 family also have a similar form of the strand-4 motif, suggesting that these families form a monophyletic superfamily with the KsgA family (Fig. ).
Generically related to the BNM, SUN and KsgA-Trm5-like superfamilies are two methylase groups with a more restricted distribution. One of these is the bacterial YqlF family, which has an N-terminal S4 domain and a strand-4 motif of the form D[V/L]DF. Thus, this family shares the conserved D or N followed by two small residues and the predicted base-interacting aromatic or hydrophobic residue with the former superfamilies. The second group, the Uvi22 superfamily, also has a similar strand-4 motif, but has a unique insert of two small amino acids prior to the conserved D at the end of strand 4. While none of the members of this superfamily has been experimentally characterized as an RNA methylase, the presence of the characteristic form of the above-mentioned strand-4 motif supports this function. Additionally, one of the yeast members of this family is fused to an RNA deaminase (see below), suggesting a role in RNA modification (Fig. ). Among bacteria, this superfamily is restricted to the proteobacteria (conserved in all α-proteobacteria), whereas it has vastly expanded into several distinct families in eukaryotes. This pattern, taken together with phylogenetic analysis results (data not shown), suggests an origin from the mitochondrial endosymbiont. Members of this superfamily might represent a major, as yet unexplored group of eukaryotic nucleic acid methylases.
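The strand-4 motifs that serve as diagnostic characters for the groups described above can be written as simple sequence patterns. The sketch below is an illustration only: the dictionary, the function name and the toy sequence are assumptions introduced here, and a raw pattern scan is not the alignment-based cladistic analysis used in this work, because it ignores whether a match actually lies at the end of strand 4. The DPP-plus-polar-residue assemblage is omitted because the set of permitted polar residues is not enumerated in the text.

```python
import re

# Strand-4 motifs as quoted in the text, rewritten as regular expressions.
# The labels are shorthand for this sketch, not formal family definitions.
STRAND4_MOTIFS = {
    "BNM superfamily [N/D]PP[Y/F]": r"[ND]PP[YF]",
    "circularly permuted adenine methylases [D/E]PPW": r"[DE]PPW",
    "SUN superfamily DAPC": r"DAPC",
    "Erm1/KsgA-like NLP[Y/F]": r"NLP[YF]",
    "YqlF family D[V/L]DF": r"D[VL]DF",
}


def scan_strand4_motifs(sequence):
    """Return (label, start, matched text) for every motif hit, sorted by position.

    Positions are 0-based. A hit is only a candidate: it says nothing about
    whether the match sits at the end of strand 4 of a Rossmann-fold domain.
    """
    seq = sequence.upper()
    hits = []
    for label, pattern in STRAND4_MOTIFS.items():
        for match in re.finditer(pattern, seq):
            hits.append((label, match.start(), match.group()))
    return sorted(hits, key=lambda hit: hit[1])


# Toy sequence invented for the example; it is not a real protein.
toy_sequence = "MKTNPPYGAADAPCLL"
for label, start, text in scan_strand4_motifs(toy_sequence):
    print(start, text, label)
```

Such short patterns occur by chance in unrelated proteins, so in practice a match would be meaningful only within an aligned methylase domain.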
Sequence evidence and the distinct form of the strand-4 motif suggest that all methylase superfamilies described above descended from a common RNA-methylating ancestor well before the emergence of LUCA. Structural comparisons reveal even deeper links, suggesting that these methylases, in turn, form a higher-order monophyletic group with the FtsJ superfamily of methylases involved in 2′-O-methylation of uridine in LSU rRNA (62) (Fig. ). The FtsJ/RrmJ family proper is represented in all three primary kingdoms, which points to its presence in LUCA. Several other related families, such as YgdE in bacteria and at least four distinct eukaryotic families, including two animal-specific ones, were derived at various later points in evolution, probably from an FtsJ-like precursor. Some of these, e.g. the Spb1 family, might methylate snoRNAs (63), suggesting that other, unexplored specificities exist within this family of methylases. Structural comparisons indicate that the group of RNA methylases closest to the FtsJ superfamily is the Fibrillarin/Nop1 family, which is involved in snoRNA methylation (64). This family is restricted to the archaeo-eukaryotic lineage and might have been derived from the FtsJ superfamily through extreme divergent evolution. The archaeo-eukaryotic Trm1 methylase family and the MicA family shared by bacteria and eukaryotes appear to comprise another monophyletic group, which appears to be a sister group of all of the rRNA methylases described above (Fig. ). Both these families share a similar form of the strand-4 motif with the signature DP followed by an aromatic and then by a small residue. Trm1 functions as a tRNA N2,N2-dimethylguanosine-26 methyltransferase (65,66) and MicA probably performs a similar, although not identical, role in bacteria and eukaryotic mitochondria. These two families might represent the archaeo-eukaryotic and bacterial branches, respectively, of an ancestral methylase that was represented in LUCA.
All the other groups of RNA methylases appear to have been derived, independently, on more than one occasion in evolution, from within the vast assemblage of small molecule and protein methylases. None of these families is traceable to LUCA; instead, they are restricted in their distribution to only one or two of the primary kingdoms. Two of these families, the Abd1p family that methylates the eukaryotic mRNA cap, and the Yml014w family that is fused, in some cases, to the AlkB domain (see below), have a dyad of aromatic residues in the 4th position after the end of strand 4. This feature suggests their derivation from within the vast class of small molecule methylases. The Yml014w family has additionally lost the polar residue (D/N) at the end of strand 4. Also derived from within this small molecule methylase assemblage is the family typified by the plant Corymbosa2/Hen-1 protein. Predicted methylases of this family are present in the crown group eukaryotes and in some bacteria, such as Streptomyces and Nostoc, and retain a single aromatic residue in the 4th position after the end of strand 4. The plant representatives of this family are fused to an N-terminal RNA-binding LA domain and a double-stranded RNA-binding domain (dsRBD) (Table and Fig. ), which suggests that these proteins are RNA methylases that probably methylate substrates containing double-stranded regions (see below). The GCD14 family of methylases (67,68), which methylate A58 of tRNAs in position 1, was derived in the archaeo-eukaryotic lineage and is more closely related to protein arginine and carboxyl group methylases than to other RNA methylases. These methylases have been sporadically transferred to bacteria, such as M.tuberculosis and A.aeolicus. They are distinguished by the presence of a distinct C-terminal domain similar to the transcript cleavage factor GreA (69). This family appears to have undergone a duplication in eukaryotes, giving rise to a paralog, GCD10, whose methylase domain shows a disruption of the Rossmann-fold loop and the strand-4 region. The RrmA family that methylates G745 in position 1 in LSU rRNA (70) is another family that appears to have been derived from the small molecule methylases late in bacterial evolution, followed by inter-bacterial dispersion via horizontal transfer.
Thus, Rossmann-fold methylases appear to have been recruited for RNA methylation at an early stage of evolution, well before LUCA. From this ancient, ancestral methylase, the significant majority of the RNA methylases, including the five to six aforementioned methylase families that were probably already present in LUCA, were derived. Extensive duplication, later in evolution, particularly in eukaryotes, resulted in the formation of several more families within this large, monophyletic assembly of RNA methylases. Additionally, lineage-specific RNA methylases were apparently derived independently, on multiple occasions, from within the small molecule and protein methylase clade. At early stages of their evolution, RNA methylases formed stable fusions with several distinct RBDs, such as the S4, PUA (9), TRAM (11), THUMP (12), NusB and a potential OB-fold domain (in Trm5) (71) (Fig. ). In addition, in eukaryotes, fusions of RNA methylases to eukaryote-specific RBDs, including RRM and CCCH domains in the TrmA-family methylases and a G-patch domain (18) in the FtsJ family, were detected. These fusions appear to have emerged relatively late in eukaryotic evolution and probably participate in the methylation of eukaryote-specific snRNAs. Most of these pan-bacterial families of methylases appear to have been horizontally transferred to the eukaryotic genomes as a consequence of organellar endosymbiosis, resulting in a bacterial–eukaryotic distribution pattern. The identification of several uncharacterized RNA methylase groups in this analysis (Table ) may help in further investigations of the diversity of this crucial RNA modification.
Pseudouridine synthases. The modified base pseudouridine is synthesized by pseudouridine synthases via in situ isomerization of uridines in tRNAs, rRNAs and eukaryotic snRNAs, such as U5 and U3 (46,72). Pseudouridine synthases belong to two apparently unrelated superfamilies, one of which (Type I PSUS) includes the four principal ancient families, RluD, RsuA, TruB and MJ0041, whereas the other superfamily (Type II PSUS) consists of a single ancient lineage typified by TruA (22,73,74) (Fig. ). Type II PSUS are present in a single copy in all proteomes, except for eukaryotes that have at least three enzymes of this superfamily. Within the Type I PSUS superfamily, the TruB family is traceable to LUCA; several members of this family are fused to a PUA domain, suggesting that this was the ancestral PSUS Type I domain architecture. The RluD and RsuA families originated in bacteria; each family includes several members containing the S4 RBD (9), which was probably present in the ancestor of these families, but was subsequently lost on multiple secondary occasions. Conversely, the THUMP-domain-containing MJ0041 family of PSUS appears to be an innovation specific to the archaeal lineage. The RluD family has been secondarily transferred to the eukaryotes, probably via the pro-mitochondrial endosymbiont. Type I PSUS are predicted to adopt an α+β fold; the crystal structure of the Type II PSUS shows the presence of a core RRM-like fold common to several ancient nucleic acid-binding domains (75). This, taken together with the use of guide RNAs by the eukaryotic PSUS, suggests that Type II PSUS might have evolved from an ancient RBD that functioned in conjunction with a ribozyme, with a gradual shift of the active site from the RNA to the protein component.
Enzymes involved in base thiolation. A variety of thio-bases are represented in cellular RNAs, the most common ones being 2- or 4-thiouridines and their derivatives, and 2-methylthioadenine derivatives. The methylthioadenines are typically additionally modified with bulky adducts, such as threonine or 4-hydroxyisopentene in the N6 position. Recently, the enzyme responsible for adenine thiolation in E.coli, MiaB (76), has been identified and shown to consist of a C-terminal RNA-binding TRAM domain and an N-terminal biotin synthase-like, metal cluster-containing catalytic domain that is predicted to catalyze sulfur insertion via SAM-dependent organic radical generation (11,77,78). MiaB-like proteins are universally present in all life forms, indicating their origin prior to LUCA. Several organisms encode more than one version of this enzyme, which appear to have diversified through early duplications; these multiple forms might differentially function in the synthesis of different 2-methylthioadenine derivatives, such as 2-methylthio-N6-threonyl carbamoyladenosine and 2-methylthio-N6-methyladenosine (46).
Thiouridine synthase (ThioUS; ThiI protein in E.coli) is involved in the synthesis of 4-thiouridine in tRNAs and has a core PP-ATPase domain (79), which catalyzes adenylylation of the 4-carbonyl group of uridine, followed by sulfur insertion catalyzed by a rhodanese-like enzyme (80,81). This rhodanese-like enzyme either comprises a distinct domain of the ThiI protein or functions as a stand-alone protein. 2-Thiouridine is universally present in tRNA, and 2-thiouridine derivatives, typically containing an additional modification of a methyl or aminomethyl group in position 5, are also common. One of the enzymes involved in 2-thiouridine synthesis, TrmU, has been identified (82). This protein contains a PP-ATPase domain with an unusual conserved cysteine dyad inserted after strand 3 in the PP-loop domain. This suggests that syntheses of 2-thiouridine and 4-thiouridine follow similar biochemistry, which involves activation of the carbonyl group by adenylylation. In TrmU-like enzymes, the internal conserved cysteines might directly participate in sulfur insertion as a functional counterpart of the separate rhodanese-like domain, which is required for 4-thiouridine formation.
Previously, we predicted that the MJ0066 family represents a novel family of archaeal ThioUS, on the basis of the fusion of a PP-ATPase domain with a PUA domain (9). Here, we systematically investigated other PP-ATPase families that potentially could be involved in thiouridine or thiocytidine synthesis by examining fusions with RBDs, association with the ribosomal super-operon and conserved phyletic patterns typical of RNA metabolism proteins. As a result, the MTH271-MJ1157 family, which showed fusions with the KH and Zn-ribbon domains, and the MJ0690 family, which is associated with the ribosomal super-operon in different archaeal genomes, emerged as candidates for these functions (Fig. ). Furthermore, the MesJ family, which is closely related to the TrmU family, is universally conserved in all bacteria and potentially also could be involved in base thiolation.
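The three lines of contextual evidence used in this screen (fusions with RBDs, association with the ribosomal super-operon and a conserved phyletic pattern) can be summarized in a small sketch. The class, field and function names are hypothetical, and the rule that a single criterion suffices to flag a candidate is an assumption made for illustration; the text does not state how the criteria were weighted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FamilyEvidence:
    """Contextual evidence for one PP-ATPase family (hypothetical record type)."""
    name: str
    rbd_fusions: tuple = ()                  # RBDs found fused to the catalytic domain
    near_ribosomal_superoperon: bool = False
    conserved_phyletic_pattern: bool = False


def supporting_criteria(family):
    """List which of the three criteria support an RNA-related role."""
    criteria = []
    if family.rbd_fusions:
        criteria.append("fusion with RBDs: " + ", ".join(family.rbd_fusions))
    if family.near_ribosomal_superoperon:
        criteria.append("association with the ribosomal super-operon")
    if family.conserved_phyletic_pattern:
        criteria.append("conserved phyletic pattern")
    return criteria


def is_candidate(family):
    # Assumption for this sketch: any single criterion flags a candidate.
    return len(supporting_criteria(family)) > 0


# The first two records paraphrase the text; the third is an invented negative control.
families = [
    FamilyEvidence("MTH271-MJ1157", rbd_fusions=("KH", "Zn-ribbon")),
    FamilyEvidence("MJ0690", near_ribosomal_superoperon=True),
    FamilyEvidence("hypothetical unlinked PP-ATPase family"),
]

for family in families:
    print(family.name, is_candidate(family), supporting_criteria(family))
```

A stricter screen could require two or more concordant criteria, at the cost of missing families supported by only one type of evidence.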
The ThiI-family proteins contain an N-terminal THUMP domain and are bifunctional proteins that additionally participate in thiamin biosynthesis (80,81,83). These proteins are ubiquitous in archaea, but sporadic in bacteria, suggesting that they originated in archaea, with several subsequent horizontal transfers to bacteria. In contrast, the TrmU family is absent in archaea, but is nearly universal in bacteria and eukaryotes, suggesting origin in the bacterial lineage, followed by an early transfer to the eukaryotes, probably via the pro-mitochondrial route. These phyletic patterns do not seem to be consistent with the universal distribution of the simple 2-thiouridine modification in tRNAs (46,84). The only predicted universal ThioUS are the members of the MJ1157 subfamily (Fig. ) of the MTH271-MJ1157 family containing N- and C-terminal Zn-ribbon domains. The universal distribution, taken together with the distinct bacterial and archaeo-eukaryotic clades detected during the phylogenetic analysis of this family (data not shown), suggests that these enzymes are the 2-thiouridine synthases, whereas TrmU is likely to be specifically involved in 5-methyl-2-thiouridine biosynthesis. The presence of a conserved cysteine dyad insert in the PP-ATPase domain of the MTH271-MJ1157 family, similar to TrmU, might indicate an analogous catalytic mechanism. Archaea and eukaryotes have additional subfamilies of the MTH271-MJ1157 family (Fig. ) that, along with the MJ0066 family, could compensate for the absence of TrmU or ThiI in some of these lineages. The sporadic presence of ThiI in bacteria suggests that it might be substituted by a more widespread, but so far unidentified 4-thiouridine synthase specific to bacteria; the conserved MesJ protein could be one candidate for this function.
Queuosine and archaeosine synthases. The 7-deazaguanosines, queuosine and archaeosine, found, respectively, in bacteria and eukaryotes, and in archaea, are incorporated into tRNA through transglycosylation (46,84). The queuosine transglycosylase, Tgt (85,86), is present in a single copy in most bacteria, whereas eukaryotes, with the exception of yeast, encode two forms of this enzyme. Archaea have two distinct proteins of the archaeosine transglycosylase family, which are distantly related to Tgt (87). The MJ1022 subfamily so far has been found only in Euryarchaea; A.fulgidus, in addition, has a single copy of queuosine transglycosylase, apparently a lineage-specific acquisition from bacteria (Fig. ). The complementary distribution of queuosine and archaeosine transglycosylases suggests that they originally diverged from a common ancestor with a TIM barrel fold (88), concomitantly with the split of the bacterial and archaeo-eukaryotic lineages. In archaea, the catalytic domain was fused with the RNA-binding PUA domain and this form of archaeosine transglycosylase underwent a duplication in Euryarchaea (Fig. ). In eukaryotes, acquisition of the bacterial queuosine synthase through horizontal transfer from the pro-mitochondrion probably resulted in displacement of the ancestral archaeo-eukaryotic archaeosine synthase, with a further duplication leading to the forms involved in modification of organellar and cytoplasmic tRNAs.
RNA deaminases. RNA deaminases are responsible for the synthesis of certain modified nucleosides, such as inosine, and for base conversions during various RNA editing reactions. The cytidine deaminase family includes generic enzymes that catalyze generation of uridine from cytidine. In yeast, these enzymes are responsible for C→U editing (89), suggesting that they might perform a similar function in many, if not all, eukaryotes. Plants show an expansion of a specialized form of this family, with an N-terminal inactive deaminase domain, in addition to the C-terminal active one; conceivably, these proteins might be involved in a plant-specific form of regulated RNA editing. Deaminases of the Tad2p family, which generate uracil from cytosine and inosine from adenosine in the wobble position of tRNAs (90,91), are present in most bacteria and all eukaryotes, but not in archaea (Fig. ). The Tad3p family, which comprises the second subunit of the inosine-generating deaminase, is eukaryote specific. The combination of Tad2p and Tad3p probably confers the specificity that differentiates this enzyme from generic cytosine deaminases. The eukaryote-specific Tad1p family of deaminases (92) is involved in inosine generation at A37 of tRNA(Ala) and in adenine editing of mRNAs in animals (93,94). The animal versions typically have the characteristic dsRBD fused to the catalytic domain, whereas one of the vertebrate paralogs contains a winged helix–turn–helix domain (Fig. ). Cytosine deaminases of the vertebrate-specific APOBEC family are involved in C→U editing and are represented by at least eight paralogs in mammals (95). These enzymes appear to have been recently derived from the cytidine deaminases through rapid divergent evolution. The deaminases related to the RibD protein, which is involved in riboflavin biosynthesis, are fused to a Type I PSUS in S.cerevisiae and to a potential RNA methylase in S.pombe, suggesting that, similarly to cytidine deaminases, they might be involved in specific editing processes (Figs and ).
Specific RNA deaminases of known families are nearly absent in archaea. The corresponding functions might have been taken over by unrelated, still unknown enzymes or, at least in some cases, could be provided by related enzymes of the deoxycytidine deaminase family that are present in some archaea. This phyletic pattern suggests a bacterial origin for at least two of the major deaminase lineages, cytidine deaminases and cytosine deaminases. Following their acquisition by eukaryotes from the bacterial endosymbiont, cytosine deaminase underwent duplication to give rise to the two A→I deaminases involved in wobble-specific inosine synthesis. Additionally, members of both the cytidine and cytosine deaminase lineages were independently recruited for mRNA editing in vertebrates and possibly in other eukaryotic lineages (Fig. ).
Dihydrouridine synthases. Dihydrouridine synthases are poorly characterized enzymes that synthesize dihydrouridine through the reduction of the aromatic ring of uracil. This base is widely found in tRNAs from all three primary kingdoms and in LSU rRNA from prokaryotes (
96,
97). The yeast dihydrouridine synthase Dus1p belongs to the superfamily of FAD-binding TIM barrel oxidoreductases typified by dihydroorotate dehydrogenase (
98). This enzyme is universally represented in eukaryotes and bacteria, but completely missing in archaea. Eukaryotes have four main lineages within this family, which are typified by the yeast proteins Dus1p, Smm1p, Ylr405wp and Ylr401cp. The members of the first three families typically show fusions with the LRP1 Zn-finger, dsRBD and CCCH RBDs, respectively (Fig. ); these RBDs probably target dihydrouridine synthases to specific sites in the substrate RNAs. Bacteria have at least three principal lineages of dihydrouridine synthases typified by the YhdG, YohI and YjbN proteins from
E.coli (Fig. ). The phyletic pattern of dihydrouridine synthases suggests that this enzyme emerged early in bacterial evolution and was transferred to eukaryotes, probably via the endosymbiotic route. The diversification of dihydrouridine synthases into multiple forms apparently occurred independently in bacteria and eukaryotes. Dihydrouridine has been detected in tRNAs of
T.acidophilum and
M.thermoautotrophicum, but appears to be missing in other archaea studied to date (
99,
100). Hence, at least in those archaea that appear to contain this modification, an alternative, as yet undiscovered enzyme is likely to be present.
NTP-dependent enzymes involved in RNA metabolism
In addition to the PP-loop ATPases discussed above in the context of base modification, a variety of ATP- and GTP-utilizing enzymes of the P-loop NTPase fold are involved in RNA modification, processing and splicing and especially in translation itself. In addition, aminoacyl-tRNA synthetases (aaRS), which belong to two other distinct, ancient classes of ATP-utilizing enzymes, are central to the translation process. Evolutionary relationships of aminoacyl-tRNA synthetases have been examined in detail in several recent studies (
10,
36,
101,
102). Here, we briefly summarize the evolutionary history of the vast class of P-loop NTPases in the context of their repeated utilization in RNA metabolism.
GTPases. P-loop GTPases are among the central, most ancient components of RNP complexes and at least nine distinct GTPases associated with different aspects of translation are traceable to LUCA. These include the four translation factors involved in initiation and elongation, two distinct versions of the OBG family of GTPases containing the RNA-binding TGS domain, the circularly permuted YlqF-like GTPases, and two GTPases associated with the signal recognition particle and its receptor. The first seven of these families belong to a large assemblage of GTPases related to the translation factors (the TRAFAC class), whereas the remaining two are members of the signal recognition/MinD/BioD (SIMIBI) class of GTPases and related ATPases (
103). These two classes correspond to the first fundamental split in the evolution of GTPases and, because both classes include proteins involved in translation, it appears likely that the primordial GTPase was a component of an ancient RNP complex that functioned as a generic regulator of translation. Even prior to LUCA, the GTPases had diversified through several duplications to perform more specific, essential functions in translation and secretion. After the radiation of the major lineages of life, many GTPases were recruited for specific functions within the translation system, such as translation termination and RNA modification and processing. The Era family GTPases (
104), which contain a C-terminal domain that is a topologically rearranged version of the KH domain, the PseudoKH domain (
105), and the TrmE (ThdF) family were derived in bacteria within the TRAFAC class of GTPases and participate in rRNA and tRNA modification. TrmE is involved in the synthesis of the modified nucleotide 5-methylaminomethyl-2-thiouridine in tRNAs (
106). The archaeo-eukaryotic Clp1 GTPase family of the SIMIBI class was recruited to participate in polyadenylation site selection (
107). In eukaryotes, a distinct paralogous derivative of the universal translation factor EF-2, typified by Snu114p, acquired a new function in splicing as a component of the U5 RNP (
103). Further details of GTPase evolution are presented elsewhere (
108).
RNA helicases. The next major class of P-loop NTPases that are associated with RNA metabolism are RNA helicases and related ATPases. The known RNA helicases of cellular life forms belong to two major superfamilies, SFI and SFII, that descend from an ancient common ancestor antedating LUCA. This ancestral helicase contained two distinct α/β domains that are present in both SFI and SFII (
109). The N-terminal domain is a classic P-loop ATPase domain that belongs to the RecA-like subclass of P-loop domains (
110,
111). The C-terminal domain appears to represent an extremely divergent P-loop domain that might have evolved through an ancient duplication of the N-terminal domain, followed by extreme sequence divergence, which probably accompanied a functional shift to single-strand nucleic acid binding. The extant lineages of SFI and SFII helicases include both DNA and RNA helicases, and other nucleic acid-dependent ATPases. Among the helicases involved in RNA metabolism, SFII occupies a more prominent position than SFI; SFII helicases are much more prevalent in eukaryotes than in bacteria (Fig. ). Seven major families of SFII helicases have experimentally characterized or clearly predicted roles in RNA metabolism. Two of these, namely the eIF4A-DeaD family (with the classic DEAD motif in the Walker B site) and the Ski2p-Lhr family, are widespread in all three primary kingdoms, which points to their presence in LUCA. Within the eIF4A-DeaD family, the orthologous group typified by the bacterial DeaD protein, which is involved in translation regulation (
112), is widely represented in bacteria and archaea and might be the form closest to the ancestor of this family. In eukaryotes, this family has vastly expanded to include at least 30 distinct lineages, with almost 25 of them traceable to the common ancestor of the crown group (Fig. ). Most members of this expanded helicase subfamily are subunits of pre-mRNA splicing complexes, whereas some others, such as Rrp3p (
113), function in other RNA processing pathways, and Upf1p is involved in mRNA degradation (
114). The pan-eukaryotic translation initiation factor eIF4A appears to be the direct equivalent of the prokaryotic DeaD-like helicases, and its function in eukaryotes might be an extension of the ancient role of these helicases in regulatory unwinding of mRNA secondary structure. Proteobacteria have a lineage-specific expansion of the DeaD lineage, with additional orthologous groups, such as RhlE and RhlB (
115), whereas most of the other bacteria have only a single member.
The Ski2p-LHR family is a much smaller family whose ancestral form probably was involved in RNA degradation and processing (
116,
117). Archaea typically have three distinct helicases of this family, whereas eukaryotes have four members of the Ski2p-Mtr4p-like subfamily, all of which apparently function in conjunction with the exosomal nucleases in RNA degradation (Fig. ). Another eukaryote-specific orthologous group within this family includes Brr2p-like proteins, which contain two helicase and sec63 domains and are involved in both cytoplasmic RNA processing and splicing as a component of U5 snRNP (
118). One orthologous group within the Ski2p-LHR family, which is typified by mus308 of
D.melanogaster and MJ1401 of
M.jannaschii, appears to have been recruited for DNA-related functions in the archaeo-eukaryotic lineage and, in eukaryotes, shows a fusion to a DNA polymerase domain (
119,
120).
The remaining families of SFII helicases involved in RNA metabolism show purely eukaryotic, bacterial or bacterio-eukaryotic distribution. The Suv3 family involved in mitochondrial RNA degradation (
121) and the CAF family involved in PTGS are small groups that are restricted to eukaryotes and appear to function in eukaryote-specific regulatory processes (see below). The Prp2p-Mle subfamily is found in both bacteria and eukaryotes. Eight distinct orthologous groups can be delineated within this family in eukaryotes, with the majority involved in splicing, including Prp2p, Prp16p, Prp43p, Prp22p and Mle (
122). The HrpA/B proteins are bacterial representatives of the Prp2p-Mle family that are present only in proteobacteria, spirochetes and
Deinococcus, which suggests dissemination via horizontal transfer among bacteria, although the initial direction of horizontal transfer responsible for the bacterio-eukaryotic distribution remains uncertain. The SecA family proteins are ubiquitous in bacteria and plants and have been shown to possess RNA helicase activity (
123–
125). However, the role of this activity
in vivo remains unclear because SecA also has a well characterized function as an ATP-dependent translocase involved in protein secretion. The RecQ family of SFII helicases is unusual in that these proteins have functions in both DNA repair and RNA metabolism. This family is represented only in bacteria and eukaryotes, with a single horizontal transfer into the crenarchaeon
A.pernix. This distribution suggests that the RecQ family originally evolved in bacteria and was subsequently acquired by eukaryotes from the pro-mitochondrial endosymbiont, which was followed by extensive diversification into at least five distinct orthologous, eukaryote-specific groups. Many members of this family share a predicted RBD, the HRDC domain (
126), with the RNase D family of nucleases, suggesting that the ancestral function of the RecQ family helicases might have been in RNA metabolism, with a subsequent shift to DNA-related functions. A member of this family from
Neurospora has been shown to have a role in RNA metabolism, in particular PTGS (
127). Orthologs of this protein are present in other eukaryotes; furthermore, fusion of the RecQ family helicases with the Zn-knuckle and the F-box domains in plants and animals (see Figs and ) indicates that this family might have more extensive RNA-related functions than presently conceived.
Several SFI helicases are implicated in RNA-related functions in eukaryotes; they all belong to the Smubp-Sen1p family, which is conserved throughout the archaeo-eukaryotic lineage and in a few bacteria (
128). This family includes both DNA and RNA helicases and probably emerged early during the evolution of the archaeo-eukaryotic clade, rather than in LUCA. All archaeal members of the Smubp-Sen1p family are orthologs of the eukaryotic Smubp, which is a DNA-binding protein (
129). However, the presence of the single-stranded nucleic acid-binding R3H domain (
15) in some of the eukaryotic members of this family might point to an undiscovered role in RNA metabolism (see Figs and ). All known eukaryotic SFI RNA helicases of the Smubp-Sen1p family were derived after the divergence of eukaryotes from the common ancestor with archaea (Fig. ). Five distinct lineages of RNA helicases of this family emerged prior to the divergence of the crown group eukaryotes and include proteins involved in a variety of functions, such as snoRNA maturation [Sen1p (
130)], mRNA degradation [Nam7p (
131)] and PTGS [Sde3 (
132)] (Fig. ). One of the eukaryotic SFI lineages, represented by the
S.pombe SPCC1739.03 and its orthologs, is closely related to the Nam7p subfamily and is an uncharacterized group of predicted RNA helicases, which, on the basis of their phyletic pattern (
133), are likely to participate in PTGS (Fig. ). Another distinct pan-eukaryotic family, typified by the Aquarius protein (
134), is predicted to include inactive helicases as indicated by the disruption of the P-loop and Walker B motifs; these proteins probably function as RNA-binding regulators, rather than as enzymes. Two small, lineage-specific expansions of these helicases were detected in
Arabidopsis and
C.elegans, typified by the F1E22.14 (eight members) and K08D10.5 (six members) proteins, respectively; these might represent specific adaptations for antiviral response or related processes. Most of the other SFI families, such as the RecD family, appear to have evolved in bacteria and are known to be involved only in DNA repair and recombination (
119).
The PhoH family of ATPases (
135) evolved in bacteria, apparently through the loss of the C-terminal α/β domain that was present in the common ancestor of SFI and SFII helicases. A role in RNA metabolism is strongly suggested by the presence of RNA-binding PIN and KH domains in different members of the PhoH family (Fig. ). There are two orthologous groups of PhoH-like ATPases, typified by PhoH and YlaK, respectively, that evolved as a result of an early duplication in the bacterial lineage. The PhoH proteins could either function as helicases or could be involved in ATP-dependent dynamics of as yet uncharacterized RNP complexes in bacteria.
Miscellaneous P-loop NTPases involved in RNA metabolism. In addition to the above, well characterized classes of P-loop NTPases involved in RNA metabolism, several others have less common and less thoroughly understood RNA-related functions. The most notable of such groups includes the PilT ATPases, which form a distinct class within the P-loop fold and appear to be a sister group to the ABC class (D.D.Leipe, E.V.Koonin and L.Aravind, unpublished data). The PilT ATPases implicated in RNA metabolism appear to be predominantly an archaeal innovation and are typified by MJ1533 and its orthologs that are highly conserved in archaea (
136). These proteins combine the PilT ATPase domain with RNA-binding PIN and KH domains. In bacteria, a group of PilT ATPases is present sporadically in
Bacillus and
Synechocystis and forms fusions with the RNA-binding R3H domain. These ATPases might represent a novel class of RNA helicases or could participate in other ATP-dependent reactions of RNA metabolism.
Some kinases of the P-loop fold, such as polynucleotide kinases, also participate in RNA metabolism. A generic polynucleotide kinase that probably acts on both DNA and RNA seems to be conserved in all eukaryotes except for
S.cerevisiae (
137–
139). Additionally, some lineage-specific P-loop kinases are implicated in RNA metabolism on the basis of suggestive domain fusions, including the kinase fused to yeast RNA ligase (
140,
141) and the animal-specific hnRNP-U (SAF-A) proteins, which contain a SAP domain, and might function as chromatin-bound polynucleotide kinases in pre-mRNA splicing (
142,
143). P-loop kinase domains are also fused to the ligase-related nucleotidyltransferase domains of the capping enzyme in trypanosomes (
144).
The P-loop proteins of the MiaA family modify adenines, chiefly in tRNAs, through the addition of bulky adducts, such as isopentene, in position 6, using organic phosphates, e.g. dimethylallyl diphosphate, as donors of the modifying groups (
145,
146). These enzymes are distantly related to the AAA+ class of P-loop ATPases and are nearly ubiquitous in bacteria and eukaryotes, which is consistent with the phyletic pattern of 6-isopentenyl adenines in tRNA. MiaA probably evolved in the common ancestor of bacteria and was acquired by eukaryotes from the pro-mitochondrial endosymbiont. On the basis of operon organization, it can be predicted that, at least in certain bacteria, such as proteobacteria,
Aquifex and
Synechocystis, MiaA utilizes the Hfq protein (the bacterial homolog of the eukaryotic SM proteins) as an RNA-binding subunit.
Evolutionary history and trends of non-catalytic domains involved in RNA metabolism
Approximately 50 major superfamilies of non-catalytic domains, primarily RNA-binding ones, are implicated in RNA metabolism (Fig. A and B and Table ). In addition, several conserved domains are found exclusively in ribosomal proteins. Below we consider some of the general and specific features of the natural history of these domains that emerge from a detailed analysis of their phyletic patterns combined with attempts at evolutionary classification.
Evolutionary mobility of domains. RBDs show remarkable diversity in terms of domain architectures. Several RBDs, such as ribosomal protein L30 and the SRP14-domain, typically occur as stand-alone proteins and in a single copy per genome. At the other end of the spectrum are ‘promiscuous’ domains, such as RRM, which display over 35 distinct multidomain architectures and are found in combination with up to 20 different domains (Figs –). These observations suggest major differences in evolutionary mobility among RBDs. Certain highly conserved, ancient RBDs, such as L30, S6 and SmpB, appear to have largely stabilized in specific functional niches in the ribosome or in lineage-specific RNP complexes and are not typically recruited to roles in more general contexts related to RNA metabolism. In contrast, some other conserved domains found in ribosomal proteins, such as S1 (
158), KOW (
13) and S4 domains (
9), have been recruited for a variety of other functions which involve RNA binding. Some of these domains (KOW, S4), along with other mobile RBDs, such as EMAP, PUA, PIN, TRAM, THUMP, TGS, N-OB, NusB (
9–
12,
71,
136) and several conserved domains found in aaRS (
10), form a group of moderately mobile, ancient domains. The majority of the fusions that involved these domains appear to have evolved close to the origin of one of the superkingdoms or, in some cases, even in or prior to LUCA. Most of these architectures show remarkable parallelism of fusions of different RBDs to various RNA modification and processing enzymes. It appears that these RBDs emerged at early stages of evolution and, shortly after their origin, formed fusions that facilitated the delivery of diverse catalytic activities to RNA and hence were maintained in most lineages. These moderately mobile domains formed lineage-specific fusions on relatively rare occasions, such as those of N-OB and EMAP to the C-termini of plant and vertebrate TyrRS, respectively (
10), or the fusion of TRAM to a FtsJ-like methylase in
Thermoplasma (
11).
The next major phase of domain mobility coincided with the emergence of eukaryotes and continued through the divergence of the major eukaryotic lineages. This burst of mobility correlates with the origin of splicing and other post-transcriptional regulatory mechanisms in eukaryotes. Some of these domains, such as S1, dsRBD (
159) and KH (
160,
161), were already present in LUCA as parts of ubiquitous ribosomal proteins or enzymes. These domains went through an initial phase of moderate evolutionary mobility, but experienced a new spurt of mobility in eukaryotes, each giving rise to several new architectures associated with splicing and other post-transcriptional regulatory processes. However, most of the domain shuffling events in eukaryotes involve relatively new, eukaryote-specific domains, such as RRM, Zn-Knuckle, CCCH, Little Finger, G-patch, SWAP and PWI.
Differential utilization of some ancient RBDs and high mobility of the eukaryote-specific domains point to two distinct evolutionary forces involved in the emergence of the complexity of eukaryotic RNA metabolism. First, it appears that most of the ancient, moderately mobile RBDs were not sufficiently versatile to occupy the new functional niches, such as splicing and PTGR. Exceptions include several ancient mobile domains, such as S1 and KH; proteins containing these domains in eukaryote-specific architectures have undergone lineage-specific expansions, which indicates greater functional versatility and adaptation to some of the new functional niches. These domains, however, largely formed combinations among themselves or with catalytic domains, akin to their more ancient versions, rather than with more recently invented domains. Secondly, the newly invented domains appear to have been recruited en masse to the new, eukaryote-specific functions close to the points of origin of these functions. Thus, through an evolutionary feedback process driven by duplication and repeated selection for the same set of newly derived domains, they started rapidly colonizing the new functional niches to the exclusion of the older, moderately mobile RBDs. This strong selection favoring the proliferation of the recently evolved, mobile domains also appears to have resulted in architectures that most frequently involved combinations among themselves rather than with the less common, ancient RBDs.
A brief history of major families of RBDs. The specific evolutionary histories of the common RBDs are important for understanding the emergence of the functional systems that comprise cellular RNA metabolism. Below we briefly consider the main events in the diversification of major RBD families.
OB-fold and other all-β strand domains. The OB-fold is a five-stranded β-barrel, which is common to several superfamilies of nucleic acid-binding domains. Among the domains involved in RNA metabolism, the S1, S1-like, EMAP, N-OB and thermonuclease domains adopt the OB fold (
55,
71,
158). Most of these domains were already represented in LUCA, which indicates that a major phase of divergent evolution of OB-fold domains took place at even earlier stages of evolution. Several of the OB-fold domains are seen in proteins that have been conserved throughout evolution as central components of the translation system. Ribosomal protein S12 and initiation factor IF1/eIF1A are the most conserved orthologous groups of S1-domain proteins, each traceable to LUCA. In addition, several conserved versions of the S1 domain are present in ribosomal protein S1, RNase E, RNase II, polynucleotide phosphorylase, the circularly permuted GTPases of the YjeQ family, Tex and NusA, all of which are (nearly) ubiquitous in bacteria and probably evolved at the onset of bacterial evolution. Conversely, the forms of the S1 domain present in eIF2-α, RpoE and Rrp4p/Rrp40p exosomal subunits go back to the base of the archaeo-eukaryotic clade. The Rrp5p and Prp22 lineages of S1 domains evolved in eukaryotes, whereas the SPT6p family appears to have evolved in eukaryotes from a Tex-like ancestor that was acquired from bacteria. ‘S1-like’ domains belong to a lineage that is of bacterial origin and is represented by orthologous groups, such as the major cold shock protein (CspA), RNase II and transcription terminator Rho. Another OB-fold domain related to the S1 domains is the C-terminal domain of the universal translation factor EF-P/eIF5A, which appears to have branched off from all the other S1 domains prior to LUCA and has not shown any evolutionary mobility ever since.
The most ancient form of the EMAP domain seems to be the one in methionyl-tRNA synthetase (
10), which is widely distributed throughout all three primary kingdoms. Additionally, a duplication at the base of the bacterial clade gave rise to the EMAP domain in the β-subunit of PheRS. Similarly, the most ancient lineage of N-OB domains (
162) is the one that is present in AspRS; this domain underwent duplications to give rise to the forms present in LysRS and AsnRS in bacteria and eukaryotes, respectively. Other N-OB domains appear to have been recruited widely in various DNA metabolism enzymes, which suggests exaptation of an ancient RBD for DNA binding (
162).
The SH3-like barrel (
163) is another all-β fold, which is present in several non-catalytic domains involved in RNA metabolism, such as the KOW, SM, L21E, L2 and tudor domains. The KOW domain present in the ribosomal protein L24, NusG/Spt5 and EF-P/eIF5A evolved prior to LUCA and the KOW-domain-containing proteins have largely retained their architectures ever since. The eukaryotic ortholog of NusG, Spt5, contains four or five divergent copies of the KOW domain, apparently resulting from a previously undetected amplification. The SM domain (
164–
166) also appears to have been present in LUCA, although it seems to have been subsequently lost in several bacterial lineages. This domain is unusual in that it always occurs as a stand-alone protein, suggesting selection against the formation of multidomain architectures, the underlying cause of which remains unclear. Prokaryotes encode one or two SM-domain proteins, whereas, in eukaryotes, 16 distinct orthologous groups of SM proteins already evolved prior to the radiation of the crown group, which is consistent with large-scale recruitment of this domain to snRNP complexes involved in splicing. The L2 domain seen in the universal ribosomal protein L2 is an orphan version of the SH3-like barrel fold that might have been derived from the ancient KOW superfamily, with subsequent extreme sequence divergence. Similarly, the L21E domain of the archaeo-eukaryotic lineage might be a divergent derivative of the more universal superfamilies of the SH3-like fold. The TUDOR domain (
167) is also related to the SM and L2 domains and appears to have been derived from one of them in eukaryotes. Several members of the tudor superfamily appear to have lost the RNA-binding function and participate in protein–protein interactions in the splicing snRNP complexes (
168); some divergent versions even function in chromatin structure maintenance (L.Aravind, unpublished observations). At least four distinct orthologous groups of proteins containing TUDOR domains with functions in RNA, such as
Drosophila TUDOR itself and its orthologs and the splicing factor SPF30, had evolved prior to the radiation of the eukaryotic crown group. A few additional TUDOR-containing proteins emerged in animals, such as the
Drosophila RNA helicase Homeless and the SMN-like proteins.
The PUA domain and the TRAM domain are two other RBDs that are confidently predicted to have an all-β fold (
9,
11); however, they cannot be classified with any of the folds discussed above in the absence of a 3D structure. Both PUA and TRAM are ancient, moderately mobile domains that typically are associated with various RNA-modifying enzymes (Table and Figs and ). Eukaryotes and bacteria have some lineage-specific architectures of PUA domain-containing proteins, such as the fusion with another RBD, SUI1, in eukaryotes and an unexpected, conserved fusion with glutamate kinase in bacteria. The TRAM domain is principally associated with the ancient MiaB-like enzymes (see above) and is also fused to the predominantly archaeo-eukaryotic TRM2-like methylases.
Major RBDs with α/β and α + β folds. The RRM, which is the most prevalent RBD in eukaryotes and is involved in all aspects of RNA metabolism, is a eukaryotic invention. At least 40 distinct orthologous groups of RRM-containing proteins appear to have emerged prior to the radiation of the eukaryotic crown group and, additionally, several more orthologous groups are confined to animals, plants or fungi. This explosive diversification of the RRM domain is surprising given the absence of this domain in archaea and bacteria, except for a few occurrences, which probably are horizontal transfers from eukaryotes. The RRM domain belongs to an ancient fold of nucleic acid-binding domains, which is present, for example, in ribosomal protein S6 (
75) and also in the catalytic domains of a variety of enzymes, including RNA and DNA polymerases and Type II PSUS (
55) (L.Aravind, unpublished data). It appears most likely that the RRM domain proper has been derived from an S6-like ancestor at an early stage of eukaryotic evolution.
Several other α/β and α + β domains, such as KH, dsRBD and THUMP (Table ), have ancient representatives among ribosomal proteins or RBDs of conserved RNA-modifying enzymes. The lineage-specific orthologous groups of proteins containing these domains appear to have evolved through duplication and diversification of these ancient lineages. The TGS domain and the S4 domain that have a distinct α + β fold, called the α-L fold (
169), appear to have diverged from a common ancestor and become distinct lineages prior to LUCA.
All α-helical domains. A distinct version of the helix–hairpin–helix (HhH) domain, which is typified by the RBD of ribosomal proteins S13/S18, is ubiquitous in all three primary kingdoms and may represent one of the most ancient lineages of the HhH domains (
170). This domain was subsequently sporadically recruited to RNA metabolism, e.g. in the NusA and Tex-Spt6 families, but is far more prevalent in DNA-binding contexts. Thus, this might be another case of an ancient RBD, which diversified extensively only after recruitment for DNA binding.
The PIN domain is another predominantly α-helical domain found in proteins that, in eukaryotes, are associated with PTGR and RNA degradation (
136,
171). Stand-alone PIN-domain proteins are widespread across all three primary kingdoms, with distinct architectures in the form of fusions with PilT and PhoH ATPases conserved, respectively, in archaea and in bacteria. A protein containing a PIN domain and a Zn-ribbon domain (human ART-4 orthologs) is conserved in the archaeo-eukaryotic lineage, whereas eukaryotes additionally have a unique architecture of PIN fused to RNase II and TPR repeats. These domain fusions suggest that PIN domains perform a wide range of functions and experimental analysis of PIN-domain proteins might unravel new facets of RNA metabolism. An enigmatic aspect of the evolution of the PIN domains is the expansion of the stand-alone versions of these domains in archaea, such as
Archaeoglobus and
Aeropyrum, and bacteria, such as
Mycobacterium and
Synechocystis. These PIN domains potentially might be involved in some unusual regulatory mechanism or in defense against RNA viruses. It has been hypothesized, on the basis of limited similarity to 3′-5′ exonucleases of the RNase H fold, that PIN domains, particularly those involved in RNA degradation in eukaryotes, might have exonuclease activity (
171). However, the proposed catalytic residues are not conserved in all PIN domains and a nuclease activity appears unlikely at least for the expanded prokaryotic forms.
The translin domain is an α-helical RBD that is found in a single copy in archaea and in two copies in eukaryotes. The eukaryotic translin protein might be part of a cytoplasmic RNP complex that mediates localization or tethering of mRNAs (
172). Given the conservation of this protein in archaea, it seems that these RNP complexes have an ancient function in maintaining RNA stability. As discussed above, several α-helical superstructure-forming domains, such as PUM, HAT [a specific version of the TPR repeat (
173)] and NIC, have been recruited for functions related to RNA metabolism in eukaryotes.
Metal-chelating domains. Of the large number of mobile metal-chelating domains that are utilized in RNA metabolism, only the Zn-ribbon (
44) (ZNR) is of ancient provenance. The ZNR is a four-stranded domain stabilized by a metal atom typically chelated by four cysteine side chains (sometimes replaced by histidines). The ZNRs function as RBDs and DNA-binding domains and as cofactors in redox reactions, and are also involved in structural stabilization of various proteins (
44). The ZNRs in MetRS, IleRS and ribosomal protein S14 are traceable to LUCA. Several ZNRs in translation-associated proteins, such as L40A, L36AE, S27, eIF5 and eIF5β, are conserved throughout the archaeo-eukaryotic lineage, whereas many others are specific to archaea and some to bacteria. This is indicative of massive recruitment of ZNRs during the emergence of the archaeal clade, which might correlate with the iron respiration typical of archaea (
174).
The Zn-chelating RBDs that evolved in eukaryotes include the Zn-knuckle with a C2HC pattern of metal ligands, the CCCH domain (named after its conserved chelating residues, three cysteines and one histidine), the little finger with a C4 metal-binding pattern and a characteristic conserved tryptophan, the LRP1 finger and the classic C2H2 Zn-finger. There are approximately 12 orthologous groups of proteins containing Zn-knuckles, 13 groups of proteins with CCCH domains and three groups with little fingers that are conserved throughout the eukaryotic crown group. All these domains are highly mobile, and several lineage-specific fusions of ancient or recently derived proteins to these domains were detected. This suggests a burst of proliferation in early eukaryotes resulting in the establishment of the major orthologous groups, followed by sporadic duplications in individual lineages.
The LRP1 finger is a previously undetected domain that we identified as part of this study. LRP1 has a C6H ligand pattern, which suggests chelation of two metal ions. In animals, this domain is fused to the dihydrouridine synthase Dus1p (Fig. ), whereas in plants, it has undergone a lineage-specific expansion, with at least 10 stand-alone members, including the namesake LRP1 protein (
175). The classic C2H2 Zn-finger is typically associated with DNA binding in eukaryotes and is part of numerous transcription factors and chromatin-associated proteins. However, several members of this family are associated with known or predicted RBDs, e.g. in the experimentally confirmed RNA-binding proteins TFIIIA and dsRBP-Zfa (JAZ) (
176–
178). However, no distinct sequence features or specific phylogenetic relationships of the RNA-binding versions of this domain have been detected so far, which makes it impossible to predict the fraction of C2H2 fingers in eukaryotic proteomes that have RNA-related functions. We documented only those occurrences for which the evidence was sufficiently clear from either experimental data or association with other specific RBDs. This set is likely to represent a lower bound on the number of C2H2 fingers involved in RNA metabolism.
Evolutionary history of RNA metabolism systems and reconstruction of their ancestral states
Analysis of evolution of individual domain families, a summary of which is presented above, provided a means of reconstructing the evolutionary history and probable ancestral states of the numerous functional systems, pathways and protein complexes that comprise RNA metabolism. We summarize below the results of this reconstruction, which is based on the data gathered for principal conserved domains involved in RNA metabolism. Figure is a Venn diagram that shows the numbers of conserved orthologous groups of proteins shared by various lineages across the entire phylogenetic spectrum that we sampled, for various functional systems.
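The counts shown in a Venn diagram of this kind can be tallied directly from phyletic patterns. The following is a minimal sketch, not the procedure used in this study: it assumes that each orthologous group has already been reduced to the set of lineages in which it was detected, and the group names and patterns in `PATTERNS` are hypothetical placeholders rather than data from this work.

```python
from collections import Counter

# Hypothetical phyletic patterns (illustrative only, not data from this study):
# orthologous group -> set of lineages in which a member was detected.
PATTERNS = {
    "OG_universal_1": {"Bacteria", "Archaea", "Eukaryota"},
    "OG_universal_2": {"Bacteria", "Archaea", "Eukaryota"},
    "OG_arch_euk": {"Archaea", "Eukaryota"},
    "OG_bact_euk": {"Bacteria", "Eukaryota"},
    "OG_euk_only": {"Eukaryota"},
}


def venn_counts(patterns):
    """Count orthologous groups per exclusive Venn region.

    A region is the exact combination of lineages in which a group occurs,
    so a universal group is counted once, in the three-way intersection only.
    """
    return Counter(frozenset(lineages) for lineages in patterns.values())


counts = venn_counts(PATTERNS)

# Print regions from the largest intersection down to lineage-specific sets.
for region, n in sorted(counts.items(), key=lambda kv: (-len(kv[0]), sorted(kv[0]))):
    print(" & ".join(sorted(region)), n)
```

Counting exclusive regions, rather than all groups present in a lineage, keeps each orthologous group in exactly one cell of the diagram, so the cell counts sum to the total number of groups.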
The ancient core: translation, transcription and RNA modification. Comparative genomics showed that the basic translation apparatus contains the largest number of (nearly) universally conserved proteins. The set of translation-associated proteins whose origin is traceable to LUCA and possibly beyond includes 15 proteins associated with the small subunit of the ribosome, 18 proteins associated with the large subunit, nine class I aaRS, seven class II aaRS, seven GTPases associated with various aspects of translation, and at least two other translation factors. Other ancient proteins associated with translation are the glutamate (aspartate) amidating enzyme subunits, which are necessary for glutamine (and in some cases, asparagine) incorporation into proteins in most bacteria and archaea (
179), the signal recognition particle GTPases that form the link between translation and secretion, possibly a SFII helicase associated with translation regulation or initiation, and a variety of RNA-modifying enzymes (Table ). The modification enzymes that could be confidently traced back to LUCA include two distinct classes of methyltransferases with six to seven representatives altogether, two classes of pseudouridine synthases, and enzymes involved in the synthesis of thiouridine and thioadenine derivatives and 7-deazaguanosines. Thus, LUCA possessed an abbreviated protein core of the modern ribosome and the basic repertoire of accessory proteins required for translation. From this pivotal point, it is possible both to track back the early, pre-LUCA stages in the evolution of RNA metabolism and to examine its elaborations in the major clades of life.
Table 2. Proteins involved in RNA metabolism that are traceable to the LUCA of the extant life forms
As pointed out above for individual domains, many components of the ribosome, translation factors and RBDs of RNA-modifying enzymes, which are traceable to LUCA, descended from even more ancient common ancestors. Numerous ribosomal proteins and other translation/modification-associated RBDs in the ancestral set belong to a small number of folds, such as the OB-fold, the SH3-like barrel and the α-L fold. Thus, prior to the divergence of the S1, N-OB and EMAP domains, or the KOW, SM and L2 domains, or the TGS and S4 domains, their respective ancestors probably functioned as RBDs with generic properties. The same logic applies to enzymes of RNA metabolism. The case is particularly clear for aaRS, which are indispensable components of the modern translation machinery responsible for the specificity and efficacy of amino acid incorporation into protein. Since most of the class I and class II aaRS were already present in LUCA, there is obviously a history of pre-LUCA duplications in each of the classes (
102). The ancestral aaRS of each class, which functioned in the primitive translation system, most likely was a non-specific amino acid-activating enzyme, with the specificity determined by tRNAs themselves. This type of translation system appears to be a transition state between a primordial machinery based entirely on RNA catalysts and the modern, largely protein-based system. Furthermore, the catalytic domains of both classes of aaRS are homologous to certain other NTPases and nucleotidyl transferases, whose functions are unrelated to translation; some of these, for example, are enzymes of coenzyme biosynthesis, such as NAD synthase in the case of class I (
102) and biotin synthase for class II (
180). Thus, the progenitors of the two classes of aaRS, which evolved from within the primitive RNA world, probably were non-specific nucleotidyl transferases, which combined functions in translation with those in other branches of metabolism. Similarly, at this stage of evolution, the individual translation factors and RNA-modifying enzymes, such as methyltransferases, had probably not yet differentiated into their specific versions, but were represented by the corresponding ancestral forms, which functioned in multiple contexts with a low specificity.
Looking forward from LUCA, it is immediately apparent that several major additions to the translation apparatus and its accessories map to the point of divergence of the two principal branches of life, the bacterial and the archaeo-eukaryotic clades. Approximately 28 proteins were added to the ancestral ribosomal core in the archaeo-eukaryotic lineage and, conversely, 21 ribosomal proteins are specific to the bacterial lineage, which results in the profound differences in the ribosomal superstructure between the two clades. The translation termination factors and several initiation factors also were added to the conserved set as these major lineages diverged. Eukaryotes showed a further development in the complexity of the translation initiation system: several new translation regulators emerged in the eukaryotic lineage, some of which consist of the RRM domain or newly derived α-helical domains, such as NIC, MI and W2 (
16), whereas others have new combinations of ancient RNA-binding and enzymatic domains, such as PUA, SUI1 and SFII helicases. The complexity of RNA modification also increased during the post-LUCA phase of evolution as a result of several duplications within various enzyme families and the origin of several new enzymes, such as dihydrouridine synthase and MiaA (Figs and ). Most of the RNA modification enzyme superfamilies, in addition to the highly conserved groups of orthologs, include many smaller groups, which are restricted to a specific lineage or show a sporadic distribution (Figs and ). Thus, a subset of RNA modifications, while not universally essential, is likely to have specific adaptive value for particular organisms in their ecological niches. These adaptations might include tolerance to extreme environmental conditions, such as high temperature or osmolarity, or resistance to anti-translation antibiotics or particular xenobiotics. The relatively late emergence of many RNA modifications suggests that the RNA modification state in LUCA, and especially at earlier stages of evolution, was relatively simple and therefore these modifications might not have been a major factor in modulation of the catalytic activities of primordial ribozymes.
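The reasoning used to map additions to particular points of divergence can be illustrated with a simple presence/absence rule. This is a deliberately naive sketch under stated assumptions, not the reconstruction method of this study: it assumes the two-branch topology discussed here (bacteria versus the archaeo-eukaryotic clade), ignores lineage-specific gene loss, and only flags, rather than resolves, the bacteria-plus-eukaryotes pattern that may reflect horizontal acquisition. The function name `inferred_origin` and the returned labels are hypothetical.

```python
def inferred_origin(lineages):
    """Assign a probable point of origin from a phyletic pattern.

    `lineages` is the set of primary kingdoms in which an orthologous group
    was detected. The rule is naive: presence on both sides of the
    bacterial/archaeo-eukaryotic split is read as inheritance from LUCA,
    except for the bacteria + eukaryotes pattern, which is left ambiguous
    because it may also reflect transfer from bacteria to eukaryotes.
    """
    has_b = "Bacteria" in lineages
    has_a = "Archaea" in lineages
    has_e = "Eukaryota" in lineages

    if has_b and has_a:
        return "LUCA"
    if has_b and has_e:
        return "ambiguous: LUCA with archaeal loss, or bacterial transfer"
    if has_a and has_e:
        return "archaeo-eukaryotic ancestor"
    if has_b:
        return "bacterial lineage"
    if has_a:
        return "archaeal lineage"
    if has_e:
        return "eukaryotic lineage"
    raise ValueError("empty phyletic pattern")


# Illustrative calls with hypothetical patterns.
print(inferred_origin({"Bacteria", "Archaea", "Eukaryota"}))
print(inferred_origin({"Archaea", "Eukaryota"}))
print(inferred_origin({"Bacteria", "Eukaryota"}))
```

In practice, such assignments would need to be checked against phylogenetic trees, because sporadic distributions and horizontal transfer can produce the same presence/absence pattern as vertical inheritance.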
Several RNA-binding proteins contribute to transcription. The best-studied proteins in this category are the transcription elongation/antitermination factors that include the universally conserved NusG-Spt5p family of KOW-domain proteins (
181). Bacteria additionally possess several distinct subunits of the transcription antitermination complex, including NusB, which contains the prototype of the α-helical NusB domain, ribosomal protein S4 and the S1 and KH domain-containing protein NusA (
182–
184). The functionally equivalent eukaryotic transcription elongation complex contains Spt6 (
185,
186), which is the ortholog of the bacterial Tex protein (
187). Similarly to NusA, this protein contains an S1 domain and is likely to be the functional counterpart of NusA. In animals, this complex additionally contains the RRM-containing RD protein (
188). The ancestral form of the transcription elongation/antitermination complex, which was present in LUCA, might have consisted of a single KOW-domain protein and perhaps the ribosomal protein S4. This was followed by accretion of additional subunits, at least in bacteria. Bacteria also evolved transcription antiterminators containing the α-helical AmiR domain that relieve specific mRNAs from termination in response to stimulation of specific signaling pathways that lie upstream of them (Table ) (
189). The corresponding additions in archaea, if any, remain unknown, but in eukaryotes, SPT6, apparently acquired from bacteria via horizontal transfer, was recruited to this complex, followed by other lineage-specific additions.
The archaeo-eukaryotic RNA polymerase E1 subunit containing the S1 domain and eukaryotic transcription factors EWS/TAF68 and TAFII250 containing the Zn-knuckle domain are other transcription-related RNA-binding proteins. Fusion of the SAP domain with RBDs (
143) suggests that eukaryotes might have still uncharacterized RNP complexes, which could couple nuclear RNA processing with transcription. Finally, in animals, several chromosomal RNAs, such as RoX1/2 and XIST, have been described that have a role in regulating chromosomal structure, and thereby transcription, on a global scale. A specific class of chromodomains typified by the MSL proteins (
190) and other proteins, such as the SFII helicase Mle (
122), interact with these RNA molecules.
Polyadenylation and capping. Polyadenylation occurs in all three primary kingdoms. Prokaryotic poly(A) tails are short (~30 nt) compared with the eukaryotic ones, which extend to several hundred nucleotides (
191). Bacterial poly(A) polymerases also have CCA-adding activity and are often fused to HD or DHH phosphohydrolase domains (
149). The eukaryotic poly(A) polymerases are only distantly related to the bacterial versions and, instead, are more closely related to the Trf4/5 family of eukaryotic DNA polymerases and archaeal CCA-adding enzymes (
149), suggesting that these archaeal enzymes probably have a second function as poly(A) polymerases. In eukaryotes, the free 3′ end for the poly(A) polymerase is generated by a predicted nuclease of the metallo-β-lactamase fold, CPSF-I (
192–
194). This enzyme is conserved throughout the archaeo-eukaryotic lineage and is also present in many bacteria. Thus, LUCA probably had a polyadenylation system that consisted, at least, of a CPSF-I-like enzyme that cleaved the transcript and a polymerase β family nucleotidyltransferase that added the adenylates. The reasons for the rapid evolution of the poly(A) polymerases in each of the three primary kingdoms are unclear. It seems plausible that, in eukaryotes, the displacement of the CCA-adding function by a horizontally transferred bacterial enzyme resulted in the divergence of the poly(A) polymerase from the ancestral, bifunctional form seen in the archaea. Eukaryotes additionally recruited to the CPSF complex several new RNA-binding proteins containing eukaryote-specific domains, such as RRM, CCCH and Zn-knuckle. Furthermore, RRM and NIC-domain-containing proteins were recruited to form a eukaryote-specific poly(A) tail-binding complex.
The cap is a unique structure present in eukaryotic mRNAs; the minimal form of the cap is synthesized through the following steps: (i) removal of the terminal phosphate of the triphosphate at the 5′ end of mRNA, (ii) guanylylation of the 5′ diphosphate and (iii) methylation of the guanine at the N-7 position (
195). The first two steps are catalyzed by the capping enzyme, which consists of a triphosphatase and a nucleotidyltransferase, whereas the N-7 methylation is catalyzed by methylases of the Abd1p family (
196). The enzymes that catalyze the latter two capping reactions appear to be conserved throughout the eukaryotes. The capping guanylyl transferase apparently was derived from the more ancient ATP-dependent DNA ligase (
38,
39), whereas the capping methylase probably evolved from within the vast small-molecule methylase class, rather than from the regular, monophyletic RNA N-methylases (see above). The capping triphosphatase, however, shows great variability among eukaryotes. Animals and plants share a triphosphatase of the tyrosine phosphatase superfamily that is fused to the N-terminus of the guanylyl transferase (
197,
198). The fungi and
Plasmodium falciparum contain a distinct phosphoesterase of an all-β fold, which occurs as a stand-alone subunit and is also present in large DNA viruses, such as PBCV (
199). The earlier branching trypanosomes have a phosphoesterase domain of the P-loop-containing adenylate kinase family fused to the N-terminus of the guanylyl transferase (
144). This unusual diversification of the triphosphatase domain suggests that, whereas the capping methylase and guanylyl transferase were derived early in eukaryotic evolution, there was no specific triphosphatase at the corresponding stage of evolution. Instead, the triphosphatase reaction might have been performed by a non-specific phosphatase. Subsequently, in each lineage, an independent triphosphatase appears to have been recruited for this function. We found that the animal-specific CG6379 family of methylases of the FtsJ-like superfamily has a divergent, catalytically inactive version of the capping enzyme nucleotidyltransferase domain fused to the methylase domain. These RNA methylases might function as regulators of the capping process that bind the cap through the inactive capping enzyme domain.
The principal proteins of the nuclear and cytoplasmic cap-binding complexes, CBP80 and eIF4G, respectively, appear to have diverged from an NIC-domain-containing ancestor, which was probably the core subunit of the ancestral cap-binding complex (
16,
17). After the divergence of these central components, new subunits, such as CBP20 (
200), an RRM-domain protein, and eIF4E (
201), appear to have been independently recruited to the respective complexes, at least prior to the divergence of the eukaryotic crown group. eIF4E also has a core RRM-like fold, although no sequence similarity to RRM domains is detectable; this domain might have been derived from a common precursor with the RRM.
Post-transcriptional regulatory mechanisms. Mechanisms of PTGR that act directly on the transcript and affect its stability or association with the ribosome are common in both bacteria and eukaryotes. At the core of these mechanisms are the ribonucleases that mediate RNA degradation; these enzymes are conserved in all three primary kingdoms (
45). Eukaryotes evolved a specific elaboration of this system whereby a whole class of dedicated proteins and RNAs lend specificity to the degradation system with respect to the transcripts that are regulated (
202–
205). This phenomenon has been termed PTGS and, in many eukaryotes, depends on the amplification of small regulatory RNAs by an RNA-dependent RNA polymerase (
153–
156). Additionally, while distinct from the chromatin-level transcriptional silencing, the PTGS system appears to interact with it (
133,
206).
The most ancient PTGR systems consist of RNases and helicases that unwind RNA secondary structures to aid degradation or regulate translation (Fig. ). Many, if not all, of the nucleases implicated in PTGR appear to be involved also in the processing of RNA precursors. The RNA degradation enzymes that can be traced back to LUCA are RNase HII and RNase PH, of which the former is responsible for the removal of the RNA primer during DNA replication and apparently has no direct role in PTGR. In contrast, RNase PH is one of the principal RNA degradation enzymes, along with RNase P. RNase P is present in all extant organisms, but its protein subunits are not homologous in bacteria and archaea-eukaryotes, which suggests that, in LUCA, RNase P existed as a pure ribozyme. RNase PH and the bacterial RNase P protein subunit have a common nucleic acid-binding domain of the S5 fold (
207,
208). This suggests an evolutionary scenario whereby the S5 domain was recruited by a common ribozyme ancestor of RNases PH and P and, during the subsequent evolution, the ribozyme was gradually replaced entirely by a protein catalytic scaffold in RNase PH-like enzymes, whereas RNase P retained the ribozyme and the RNA-binding subunit. This scenario implies that the protein subunit of the bacterial RNase P retains the ancestral state and probably has been displaced by unrelated proteins in the archaeo-eukaryotic lineage. The primitive RNA degradation system of LUCA might also have included a LHR-Ski2p family helicase and, possibly, a generic thermonuclease-like protein of the OB fold and RNA-binding PIN domains. Another component that might have been represented in LUCA is the SM domain. In prokaryotes, SM domain-containing proteins bind numerous specialized small RNAs, such as the DsrA/RprA RNA, and regulate mRNA stability and association with the ribosome (
209). It remains to be seen if any of the small RNAs bound by the SM proteins possess ribozyme activities.
With the separation of the archaeo-eukaryotic and bacterial lineages, several distinct superfamilies of nucleases were independently recruited in each of them for RNA degradation and processing [see the recent detailed evolutionary classification of RNases (
45)]. The most important innovations in bacteria included 3′→5′ exoRNases, RNase E/G, RNase II and RNase III. In the archaeo-eukaryotic lineage, a 3′→5′ RNA degradation and processing complex, the exosome, has evolved. The eukaryotic exosome has been extensively characterized experimentally (
116,
210,
211), whereas the existence of the archaeal counterpart and, by inference, the presence of the exosome in the common ancestor of archaea and eukaryotes, have been postulated through comparative analysis of archaeal genomes. Genes for predicted exosomal components form some of the most conspicuously conserved gene strings (probable operons) in archaea (
212). The exosome consists of Rrp41p- and Rrp42p-like RNase PH family nucleases, RNA-binding proteins containing S1 domains combined with KH or Zn-ribbon domains, such as Rrp4p and Csl4p, PIN domain proteins, a LHR/Ski2p-like helicase and, possibly, also RNase P as predicted during archaeal genome analysis.
The archaea also evolved a distinct RNase of the DHH hydrolase family, which contains S1 and ZNR domains and, as suggested by the comparative genome analysis, might interact with the exosome (
45). In addition to these conserved complexes involved in RNA degradation, other RNA-binding complexes, which might contribute to PTGR by affecting mRNA stability and association with the ribosome, evolved after the split of the primary lineages. Cold shock proteins (CspA) containing S1-like domains are among such bacterial regulatory RNA-binding proteins (
213). Additionally, proteins such as Hsp15, with a stand-alone S4 domain, which bind RNA and regulate translation, point to the existence of diverse PTGR systems in bacteria (
214). Some of the RNA-binding proteins predicted during this study, e.g. a protein that combines a PIN and a TRAM domain, could provide leads for discovery and investigation of poorly understood PTGR systems in prokaryotes (Fig. ).
The emergence of eukaryotes was accompanied by several major elaborations of the PTGR systems, which involved several types of evolutionary processes. One of the major factors was the combination of the archaeal and bacterial inheritances that gave rise to more complex forms of ancient PTGR systems. Cases in point are nucleases, such as 3′→5′ exoRNases (e.g. Rrp6p) and RNase II (e.g. Rrp44p), which apparently were acquired by eukaryotes from bacteria, probably via the pro-mitochondrial endosymbiont, and added to the exosome whose core was inherited from the archaeo-eukaryotic ancestor. The large-scale, intra-familial duplication, e.g. among helicases such as Mtr4p and Ski2p (Fig. ), was the second major evolutionary phenomenon that contributed to the elaboration of the eukaryotic exosome complex. The third trend in the evolution of these complexes was the recruitment of pan-eukaryotic, superstructure-forming domains, such as WD40 and TPR, which probably provided scaffolding for the enlarged eukaryotic complexes.
The eukaryote-specific mRNA degradation system, which destroys both nonsense codon-containing (nonsense-mediated decay or NMD) and normal mRNAs, appears to have been assembled, in part, from various translation-related components. Among these components, NMD3p appears to have emerged in the archaeo-eukaryotic lineage and functions in ribosomal assembly (
215,
216). The other components of this system are eukaryote-specific innovations that mimic the set of similar components that have been added to the exosome. NMD2p (
217) contains an NIC domain and shares a common ancestor with the translation factor eIF4G. NMD4p and its metazoan equivalents, such as SMG6 (
171,
218), contain PIN domains and might ultimately have descended from the stand-alone PIN-domain proteins detected in archaea. NMD5p is a HEAT repeat protein and UPF1p is an SFI RNA helicase (
217). The poly(A)-degrading complex also appears to have emerged prior to the divergence of the major eukaryotic lineages and contains at least three conserved nuclease components, namely Pan2p, Pop2p and DAN-like nucleases, which belong to the 3′→5′ exonuclease family, and CCR4, which is a derivative of the DNase I superfamily (
45,
219,
220).
The eukaryote-specific PTGS system is present throughout the crown group, at least. Recent experimental results combined with computational predictions based on phyletic patterns resulted in the identification of a complex PTGS apparatus that can be traced back to the common ancestor of the eukaryotic crown group. The core of this system includes a SFII helicase–RNaseIII fusion protein of the carpel factory (CAF, also called DICER) family, which generates small, 21–25 nt RNAs [small interfering RNAs (siRNAs)] used as guides to promote degradation of specific RNAs by a nuclease complex (
133,
221–
223). Additionally, the DICER helicase–nuclease appears to be involved in the processing of numerous other small regulatory RNAs, including the stRNAs, such as Lin-4 and Let-7, which regulate specific transcripts through antisense interactions (
224). A LIN-28-like RNA-binding protein containing an S1-like domain and homologous to bacterial Csp (
225), which binds these small RNAs, probably is another ancestral component of the PTGS system. The siRNAs function as primers in an amplificatory degradative PCR-like reaction that generates dsRNA and is catalyzed by a specialized RNA-dependent RNA polymerase that is thus far traceable to the base of the eukaryotic crown group (
153–
156). Proteins of the PIWI-argonaute family, which combine PIWI and PAZ domains (
14), also probably participated in the ancestral PTGS as siRNA-binding components (
226). The actual RNA destruction apparently depends on several other components, including a RecQ-like helicase (
127) and RNase D family 3′→5′ nucleases, such as Mut-7 and Egl (
227). From the time of its emergence, the PTGS system probably closely interacted with the more generic RNA degradation systems, including the exosome, NMD and the poly(A)-tail degradation system.
A substantial part of the PTGS system, including the progenitors of most of the 3′→5′ exonucleases, RNase III, the RecQ-like helicase and the RNA-binding CSP proteins, is part of the bacterial inheritance of the eukaryotes. The 3′→5′ exonucleases and RNase III, after their acquisition by eukaryotes, each underwent a series of duplications to give rise to several distinct groups of orthologs and also formed new architectures through domain fusions. The Mut-7 proteins contain a module, C-terminal to the 3′→5′ exonuclease domain, which consists of a unique α/β domain fused to a Zn-ribbon, which might bind RNA (
45). This Mut-7C module appears as a stand-alone protein in archaea and bacteria and potentially might interact with a 3′→5′ nuclease already in prokaryotes, followed by the fusion in eukaryotes. The Argonaute-like proteins are represented in archaea and
Aquifex; one of the eukaryotic members of this family has been described as translation initiation factor eIF2C (
228). These ancient versions contain only a PIWI domain and their phyletic pattern is typical of translation machinery components, suggesting that their original function was related to translation. Prior to the divergence of the eukaryotic crown group, the PIWI domain combined with a predicted RBD, PAZ, which is also fused to the helicase and nuclease domains in the CAF family proteins (Fig. ). The PAZ domain, which might bind the small RNAs that are generated as part of PTGS, evolved in eukaryotes with the emergence of this system.
Within the crown group, PTGS shows considerable variability, with extensive gene loss completely or partially eliminating the system in various lineages. In the yeast
S.cerevisiae, the entire system appears to have been lost (
133), whereas in
Drosophila and humans, the apparent loss is restricted to the RdRp and the Mut-7 nuclease. However, the detection of a functional PTGS system in
Drosophila (
229) suggests that the role of the RNA polymerase may have been taken over by other enzymes, such as the DNA-dependent RNA polymerase or a reverse transcriptase-like enzyme, which are known to possess similar activities
in vitro. In contrast, plants and
Dictyostelium show expansions of the RdRp family, with at least six and four distinct members, respectively. Furthermore, the architectures of the proteins involved in PTGS show lineage-specific variability, e.g. fusion of RRM domains to the RdRp in plants and a duplication of the RdRp within a single protein in
C.elegans. Several eukaryotic proteins were identified that, on the basis of their domain architectures, seem to be likely candidates for participation in PTGS. Examples include a nuclease of the RNase II family that is fused to a Sen1p-like SFI helicase in humans and a family of plant 3′→5′ exonucleases fused to the RRM domain (
45). Analysis of phyletic patterns and domain architectures also resulted in the identification of several novel candidates, which could be parts of a more extended PTGS network (
133) (Fig. ). The most notable of these include an orthologous group of predicted adenine methylases (the CG14906 group) related to the Kar4-Ime-4 family of mRNA methylases (Fig. ). Another group of predicted RNA methylases with a similar phyletic pattern is the Corymbosa2/Hen1 family of methylases that are predicted to be dsRNA methylases (see above). These enzymes could specifically regulate the stability of dsRNA regions formed by pairing of mRNAs with antisense RNAs (Figs and ). Homologs of the DNA repair protein AlkB fused to the RRM domain might be involved in RNA modification (Fig. ). It has been predicted that this subfamily of AlkB proteins, similarly to their homologs involved in DNA repair, possesses iron- and 2-oxoglutarate-dependent oxidative demethylating activity (
157). Consistent with this prediction, these AlkB homologs, in addition to the RRM domain fusion, also show fusions to a distinct family of methylases. Taken together with the widespread distribution of these enzymes in the crown group, with the exception of
S.cerevisiae [a phyletic pattern typical of other PTGS components (
133)], these observations suggest that an mRNA methylation–demethylation circuit might be another component of PTGS.
Finally, numerous other uncharacterized eukaryotic RNA-binding proteins were predicted, which could point to still unknown PTGR systems and complexes. For example, Ro protein, which shares the RNA-binding ROT domain with telomerase subunits (
230), binds small RNAs called Y RNAs in animals and the resulting RNPs might be involved in several poorly characterized regulatory functions, such as RNA quality control (
231). Ro protein homologs are also present in certain bacteria, such as
Deinococcus and
Streptomyces, probably as a result of horizontal gene transfer from eukaryotes and it has been shown that, in
Deinococcus, the Ro homolog binds several small RNAs and belongs to a PTGR system that regulates radiation resistance (
232).
RNA processing and splicing. In both eukaryotes and prokaryotes, rRNAs and tRNAs are released from larger precursors through RNA processing events mediated by the same nucleases that are involved in RNA degradation, such as RNase PH and RNase P. As discussed above, the presence of distinct nuclease families in the archaeo-eukaryotic and bacterial lineages suggests that many of these processing systems evolved only after the separation of these primary lineages, with the eukaryotic processing machinery combining the archaeal and bacterial inheritances. The archaeo-eukaryotic lineage evolved a specific system of tRNA processing, which removes an intron present in the middle of the tRNA precursor (
233). The tRNA splicing endonuclease is a distinct member of the restriction endonuclease fold (
234), which might have been derived from an ancient, restriction enzyme-like genomic parasite. This is consistent with the mobile parasitic behavior of several members of the restriction endonuclease superfamily (
235,
236). In eukaryotes, this enzyme underwent a tetraplication followed by inactivation of two of the copies, resulting in a heterotetrameric functional complex (
21,
45). The U3 RNP complex is involved in rRNA processing, which involves chiefly rRNA modifications guided by the associated small RNAs (
237). This complex consists of, at least, Imp4p, Prp31p and the methylase fibrillarin and evolved in the common ancestor of archaea and eukaryotes; archaeal genome comparisons suggest that it might functionally interact with the exosome (
212). In eukaryotes, some of the components of this complex, e.g. Prp31p (
238), appear to have been additionally recruited for pre-mRNA splicing.
The most distinctive RNA-processing pathway is mRNA splicing, which, in its entirety, is seen only in eukaryotes (Fig. ). Eukaryotic spliceosomal mRNA introns share with Type II self-splicing introns the intermediate step of lariat formation. This observation prompted the hypothesis that Type II introns, which existed as parasitic retroelements in the genomes of the organellar precursors, invaded the eukaryotic nucleus, giving rise to the spliceosomal introns (
239–
241). The analysis of the spliceosomal components that we present here suggests that a version of this hypothesis is plausible and argues against the competing ‘introns early’ hypothesis, which postulates extensive presence of introns in LUCA (
242–
244).
The eukaryotic splicing apparatus consists of five principal snRNP particles, U1, U2, U4, U5 and U6 (
245–
248), which contain their namesake small RNAs (Fig. ). Many specialized spliceosomal particles, especially in multicellular eukaryotes, contain alternative counterparts of these main U RNAs and are dedicated to the processing of special (non-canonical) splice junctions (
249). The components that are common to all five spliceosomal U RNP particles can be traced back to the common ancestor of the eukaryotic crown group, suggesting that the core of the spliceosomal machinery was firmly established by the time the crown-group eukaryotes radiated. Examination of the inferred domain composition of the ancestral spliceosomal machinery shows marked enrichment of several conserved domains (Fig. ). These include SFII helicases and RBDs, namely RRM, SM, Zn-knuckle, CCCH, G-patch, SWAP and PWI. Thus, the spliceosomal particles are largely made up of paralogous forms of a relatively small set of domains. It appears that the ancestral spliceosome was assembled mainly from eukaryote-specific domains, and that its elaboration, which resulted in the origin of the five principal spliceosomal particles, occurred largely through the proliferation and shuffling of just these few domains, which were represented in the early spliceosome by their common ancestors. Common to all these U snRNPs are small, stand-alone SM proteins, which belong to a class of RNA-binding SH3-fold β-barrel domains (see above); this RBD probably bound small RNAs already at a pre-LUCA stage of evolution.
The expansion of the SM family from a single ancestral form found in archaea to the numerous lineages seen in eukaryotes suggests that the SM protein formed the ancestral core of the splicing complex by acting as a protein cofactor for the self-splicing Type II introns that invaded eukaryotes. This could have increased the efficiency of splicing of the Type II introns and diminished their deleterious effects, thereby contributing to their spread. At this point, proteins containing some of the newly emerged eukaryote-specific domains, such as RRM, Zn-knuckle and CCCH, and RNA helicases of the eIF4A-DEAD and Maleless families, might have been added to the set of protein cofactors of the Type II introns. Additionally, some proteins that were initially associated with exosomal function, such as helicases of the Ski2p-Lhr family, also might have been recruited to the emerging spliceosome. The next stage of evolution probably involved partial degeneration of the introns themselves and the emergence of distinct intron fragments as precursors of the U RNAs, which possess ribozyme activity and appear to be the primary catalysts of splicing (
250). Simultaneous evolution of eukaryotic chromatin allowed the major increase in genome size in eukaryotes and thus provided the niche for selectively neutral or advantageous (although the nature of these potential advantages is not clear) expansion of the introns throughout eukaryotic evolution. This expansion probably was accompanied by a feedback loop that selected for the proliferation and diversification of the original protein cofactors recruited for splicing, causing an explosive expansion of RRM, SFII helicase and other eukaryote-specific domains involved in splicing.
Genome sequences of early-branching eukaryotes might provide the details of the actual temporal order of the duplications in the evolution of the splicing system, but some inferences on the relative branching pattern already can be drawn from the currently available eukaryotic genome sequences. At least 70–80 orthologous lineages of proteins containing one or more of the common RBDs mentioned above, 15 or more lineages of SFII helicases (
249), and several other single-copy proteins with no mobile domains, such as PRP38 or Snu66, are traceable to the ancestor of the crown group. Among the RRM-domain proteins, the most common architectures include the single- and multi-RRM proteins, followed by fusions to the G-patch, CCCH and Zn-knuckle domains (Fig. ). From this ancestral state that existed prior to the radiation of the crown group, several lineage-specific developments ensued, which correlate with the origin of alternative splicing in multicellular eukaryotes. The common ancestor of animals apparently had approximately 40 orthologous groups of splicing-related proteins that evolved after the divergence of the major crown group lineages (Figs and ). However, the most striking development is seen in vertebrates, which have at least 30 distinct RRM-domain proteins with no orthologs in arthropods or nematodes and several vertebrate-specific expansions within other ancient ortholog groups of RRM proteins. This diversity of RRM proteins correlates with and is probably functionally linked to the extensive utilization of alternative splicing as a means of generating protein diversity (
251,
252). A similar situation seems to exist in plants because over 50 plant-specific RRM proteins were detected in
Arabidopsis; however, the exact point of origin of this diversity is currently unclear, given the absence of other plant genomes. In contrast, in yeast, the U2 and, to a lesser extent, U5 snRNPs show extensive degeneration, which correlates with the near-complete elimination of spliceosomal introns (
133,
253).
Links between molecular chaperones, protein degradation, the ubiquitin system and RNA metabolism. Several deep evolutionary links seem to exist between RNA metabolism, protein degradation and ubiquitin signaling pathways, suggesting that these cellular systems have a long history of interactions. The earliest of these links appears to be the potential functional coupling of the RNA-degrading exosome, the protein-degrading proteasome and co-translational protein folding facilitated by prefoldins, as indicated by the juxtaposition of the corresponding genes within a superoperon, which is conserved in most archaeal genomes (
212). Such functional coupling can be rationalized in terms of coupled pre- and post-translational regulation of the protein level through mRNA and protein stability, respectively. This type of interaction appears to have extended into eukaryotes as suggested, in particular, by the presence of the shared Sec63 domain in chaperones involved in endoplasmic protein translocation and degradation and the exosome/splicing-related helicase Brr2p (
118), and by the presence of the Little Finger domain in the animal versions of the Npl4p (suppressor of Sec63p) protein (Fig. ). A pan-eukaryotic SPBC17G9.05-like protein containing a cyclophilin-like PPIase fused to an RRM domain (with an additional Zn-knuckle in plants) might be another component of such a system, coupling protein unfolding to RNA metabolism. Furthermore, animals possess another distinct cyclophilin–RRM fusion (Fig. ) that might perform a similar function.
Another ancient link between RNA metabolism and protein degradation is suggested by the domain architecture of the prokaryotic protease HypF, which is involved in hydrogenase maturation and assembly. The HypF protein consists of a dsRNA-binding Sua5 domain (
254), an OSGP metallo-protease domain of the Hsp70 fold (
38) and an acyl phosphatase domain; this domain architecture is suggestive of complex regulation of specific protein processing events through interaction with RNA (Fig. ).
In eukaryotes, the elaborate ubiquitin signaling system has a central role in targeting proteins for degradation (
255,
256). Ubiquitin also acts as a signaling moiety to direct specific protein–protein interactions. A number of domain architectures seen among eukaryotic proteins involved in RNA metabolism suggest close interactions with the ubiquitin system. These include numerous fusions of RING-finger domains, which function as ubiquitin E3 ligases, with RBDs, such as KH, Little Finger and CCCH; these domain combinations are present in MDM2, Makorin and several other proteins (Fig. ). These proteins might function as E3 ligases that specifically tag certain splicing or other RNP complexes with ubiquitin or ubiquitin-like molecules and thereby target them for degradation or regulate their assembly. Furthermore, fusions of other domains involved in ubiquitin signaling, such as ubiquitin itself, UBA, F-box and CUE with various domains involved in RNA metabolism are also seen in several proteins, such as PRP21, TAF
II250 and TAB2 (Fig. ). These architectures are again suggestive of a role in bringing the ubiquitination machinery to the RNA-binding complexes in which these proteins reside.
A protease of the ubiquitin C-terminal hydrolase family, Sad1p, is required for the assembly of the U4/U6.U5 tri-snRNP and might act as a protease in the processing of some of the subunits of this complex (
257,
258). Eukaryotes have re-used inactive versions of a predicted ancient hydrolytic enzyme, the JAB domain, which is also found in several components of the proteasome/signalosome, in two distinct RNA-associated complexes (
259,
260). Specifically, the JAB-domain protein PRP8 is a subunit of the U5 and U6 snRNP complexes, and the translation-related eIF3 complex also contains several JAB-domain subunits. In the context of potential links between RNA and protein degradation, among the most interesting architectures are the fusions of the Little Finger domain with three distinct protease domains (Fig. ), namely an inactive version of the Otu-A20 family protease (
261) in TRABID, a calpain protease in
Small optic lobes (
262), and a metalloprotease domain of the WSS1p family in
Arabidopsis and
Trypanosoma F14J16.17. This, taken together with the fusion of the Little Finger with E3 ubiquitin ligases in certain proteins, such as MDM2 (
263), suggests that this domain might provide a specific link between RNA metabolism and protein degradation. The exact nature of this connection is unclear, but it seems plausible that still uncharacterized, small RNAs regulate the function of the protein degradation complexes. Alternatively or additionally, the Little Finger might function as a tether to target proteolytic machinery to proteins associated with specific cellular RNAs. Most of these architectures are restricted to a few eukaryotic lineages, suggesting the existence of numerous lineage-specific mechanisms for modulation of RNA metabolism.
The significance of the phyletic patterns of conserved proteins in RNA metabolism systems for inferring evolutionary relationships between major taxa. Examination of the phyletic patterns of the conserved proteins in the RNA metabolism systems potentially could help in testing phylogenetic hypotheses regarding the relationships between major lineages. At the deepest level, the presence of a distinct archaeo-eukaryotic lineage is supported by approximately 60 conserved orthologous groups that are shared exclusively by archaea and eukaryotes. This contrasts with a mere 20 or so orthologous groups common exclusively to archaea and bacteria and approximately 39 bacterial–eukaryotic groups. This pattern is consistent with the domain distribution data and supports a model whereby eukaryotes are a chimeric lineage, which combines archaeal and bacterial inheritances. This massive chimerism in the eukaryotic inheritance most likely reflects the endosymbiotic interaction between the pro-mitochondrial α-proteobacterium and an archaeon. The evidence for the presence of an ancestral mitochondrion in all eukaryotes, including the earliest-branching ones (
264–
267), and the extensive bacterial contribution that can be seen in the available genomic data from early-branching eukaryotes (
268–
270) support this model. There has been a smaller but noticeable gene flow between the two prokaryotic superkingdoms, apparently driven by the regular process of horizontal gene transfer rather than large-scale chimerism; however, in some cases, such as the bacterial hyperthermophiles, this gene transfer probably made a much greater contribution (
271,
272).
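The counts of exclusively shared orthologous groups quoted above come from tallying phyletic patterns, that is, the exact set of lineages in which each group is detected. The following is a minimal Python sketch of such a tally, not the procedure used in this study: the group identifiers (OG1 to OG6), the toy presence data and the function name exclusive_sharing_counts are hypothetical and serve only to illustrate the bookkeeping.

```python
from collections import Counter

# Hypothetical toy data (not from the study): each orthologous group is
# mapped to the set of superkingdoms in which a member was detected.
PATTERNS = {
    "OG1": {"archaea", "bacteria", "eukaryotes"},  # universal
    "OG2": {"archaea", "eukaryotes"},
    "OG3": {"archaea", "eukaryotes"},
    "OG4": {"bacteria", "eukaryotes"},
    "OG5": {"archaea", "bacteria"},
    "OG6": {"eukaryotes"},                         # lineage-specific
}


def exclusive_sharing_counts(patterns):
    """Count orthologous groups by the exact set of lineages they occur in.

    A group found in archaea and eukaryotes but not in bacteria is counted
    only under {archaea, eukaryotes}, which is what "shared exclusively"
    means in the text.
    """
    return Counter(frozenset(lineages) for lineages in patterns.values())


counts = exclusive_sharing_counts(PATTERNS)

# Print the patterns from most to least frequent.
for lineages, n in sorted(counts.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(" + ".join(sorted(lineages)), n)
```

Because each group is keyed by its complete phyletic pattern, universal groups do not inflate the pairwise counts. As discussed in the text, differential gene loss can produce the same patterns as shared ancestry, so such tallies are suggestive rather than decisive.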
Within the eukaryotes, the observed phyletic distribution of domains and proteins involved in RNA metabolism seems to conflict with two well-known phylogenetic hypotheses. The number of orthologous groups of proteins shared exclusively by animals and plants is approximately 41, in contrast to just 15 that are exclusively shared by fungi and animals. At face value, this contradicts the currently accepted phylogeny, in which fungi and animals are sister groups (
273). A possible explanation for this pattern, however, is a massive loss of ancestral genes in the currently available fungal genomes, those of two yeasts. Comparative genomics indeed provides support for large-scale gene loss in the yeasts (
133,
274). However, in some cases, such as the capping enzyme, TAFii250, eIF4G and Whi3p, the yeast versions have domain architectures distinct from those that are shared by their orthologs in animals and plants. Thus, the topology of the primary branches within the eukaryotic crown group probably should be considered unresolved, emphasizing the need for further investigation from the comparative genomics angle, in addition to individual phylogenies of multiple proteins.
The second piece of evidence that contradicts a popular phylogenetic hypothesis is the presence of 24 exclusive orthologous groups shared by arthropods and vertebrates as opposed to only three that are shared by arthropods and nematodes. A similar phyletic pattern has been reported in the case of orthologous groups shared by nematodes and vertebrates in other functional systems, such as chromatin structure and organization and the apoptosis apparatus (
275,
276). These observations are not consistent with the existence of a nematode–arthropod clade, which is favored by the ecdysozoan model of eukaryotic evolution (
277). Although some gene loss in
C.elegans is a possibility, the minimal animal proteomes are of approximately the same size (once lineage-specific family expansions are factored in), and therefore it appears less likely that the specific link between vertebrates and arthropods can be attributed to massive gene loss in nematodes. This suggests that the traditional model of a coelomate clade (
278), as opposed to an ecdysozoan clade (
277), could be a more accurate representation of animal phylogeny.