The E1-like superfamily is central to ubiquitin (Ub) conjugation, biosynthesis of cysteine, thiamine and MoCo and several secondary metabolites. Yet its functional diversity and evolutionary history are not well understood. We develop a natural classification of this superfamily and use it to decipher the major adaptive trends occurring in the evolution of the E1-like superfamily. Within the Rossmann fold, E1-like proteins are closest to NAD(P)/FAD-dependent dehydrogenases and S-AdoMet-dependent methyltransferases. Hence, their phosphotransfer activity is an independent catalytic “invention” with respect to such activities seen in other Rossmannoid folds. Sequence and structure analysis reveals a striking diversity of residues and structures involved in adenylation, sulfotransfer and substrate-binding between different E1-like families, allowing us to predict previously uncharacterized functional adaptations. E1-like proteins are fused to several previously undetected domains, such as a predicted sulfur transfer domain containing a novel superfamily of the TATA-binding protein fold, different types of catalytic domains, a novel winged helix-turn-helix domain and potential adaptor domains related to Ub conjugation. Based on these fusions we develop a generalized model for the linking of E1-catalyzed adenylation/thiolation with further downstream reactions. This is likely to involve a dynamic interplay between the E1 active sites and diverse fused C-terminal domains. We also predict participation of E1-like domains in previously uncharacterized bacterial secondary metabolism pathways, new cysteine biosynthesis systems, such as those associated with archaeal O-phosphoseryl tRNA, metal-sulfur cluster assembly (e.g. in nitrogen fixation) and Ub-conjugation.
Evolutionary reconstructions suggest that the last universal common ancestor (LUCA) contained a single E1-like domain possessing both phosphotransfer and thiolating activities and participating in multiple sulfotransfer reactions. The E1-like superfamily subsequently expanded to include 26 families clustering into three major radiations. These are broadly involved in ubiquitin activation, cofactor and cysteine biosynthesis, and biosynthesis of secondary metabolites. In light of this we present evidence that in eukaryotes other E1-like enzymes, such as Urm1, were independently recruited for Ubl conjugation, probably functioning without conventional E2-like enzymes.
Modification of eukaryotic proteins by ubiquitin (Ub) and ubiquitin-like proteins (Ubls) is a three step cascade catalyzed by the E1, E2 and E3 enzymes 1,2. The E1 enzyme initiates the process via adenylation of the carboxy-terminal glycine residue of the Ub/Ubl polypeptide. Remarkably, a similar reaction occurs in the biosynthesis of thiamine and molybdenum or tungsten cofactors (MoCo/WCo), where a homolog of the E1 enzyme, ThiF or MoeB, adenylates the C-terminus of a ubiquitin-like protein, ThiS or MoaD 3–8. Upon modification, the trajectories of the ubiquitin-like proteins are very different in the conjugation and cofactor biosynthesis pathways. In the ubiquitin modification system, Ub/Ubl is covalently conjugated to a lysine on the protein substrate via trans-thiolation reactions between E1, E2 and in some cases E3 enzymes (i.e. E3s with HECT domains). In contrast, in cofactor biosynthesis pathways, the adenylated C-terminus of the ThiS or MoaD protein is further modified, using a sulfur donor, to a thiocarboxylate, which then serves as a sulfur donor during the biosynthesis of cofactors. E1-like enzymes are also present in other sulfur incorporation steps involved in biosynthesis of certain siderophores (e.g. quinolobactin), peptide antibiotics, small molecule first messengers and cysteine in prokaryotes 9–15.
Recently, there have been several advances in the deciphering of the structure and mechanisms of the E1-like superfamily (hereinafter referring to E1, MoeB, ThiF and all other homologous proteins that are closer to them than to any other superfamily of enzymes) 4,5,7,8,16,17. Site-directed mutagenesis and X-ray crystallography of different members of the E1-like superfamily have supported a comparable role for various conserved residues, albeit pointing to subtle differences in the catalyzed reactions 17,18. Despite their diversity, active E1-like enzymes share overarching biochemical themes related to adenylation and sulfotransfer. This aspect roused our interest in exploring the natural history of the E1-like superfamily of enzymes, both in the context of the Ub-conjugation systems and sulfur metabolism at large. In particular, we wanted to address the following issues: 1) Relationships of the E1-like superfamily to other superfamilies within the Rossmann fold and understanding the key modifications that resulted in the evolution of its extant biochemical activities. 2) Conserved sequence and structural features common to all members of the fold and assessing how variations to this core set of features affect functional properties such as substrate choice and reaction mechanism. 3) The complete set of contextual associations of the E1-like domain, such as domain architectures and conserved gene-neighborhoods. 4) Implications of these contextual associations for the interplay between the adenylation and thiolating activities of E1-like domains and interactions with proteins catalyzing preceding or subsequent reactions. 5) Determining the evolutionary radiations of the E1-like superfamily, establishing its major adaptive trends and inferring any novel biochemical roles they might have acquired.
To address the above issues we performed a comprehensive analysis to systematically identify members of the E1-like superfamily. We first transitively searched the PDB database with available 3D structures of the E1 superfamily by using the FSSP program to detect structurally related modules. We then aligned the modules recovered in these searches with the MUSTANG program to obtain a structural alignment of the E1 superfamily with their related structures. This alignment enabled us to objectively identify the features distinguishing the E1 modules from other related Rossmannoid domains (see below). We used several representative sequences of E1 superfamily proteins as seeds to initiate sequence profile searches against the NCBI NR (non-redundant) database with the PSI-BLAST program (see Materials and Methods for details on searches). Sequences detected in these searches were used to prepare a multiple alignment, and a hidden Markov model (HMM) derived from it was used to further search genomic databases using the HMMer package. As a result we exhaustively recovered E1 proteins from the NR database and classified them by means of phylogenetic tree analysis, uniquely shared sequence motifs and structural features (see Materials and Methods for details). Consequently we delineated 26 distinct families of E1-like domains. We also identified their key structural features, established their phyletic patterns, identified conserved gene-neighborhoods (predicted operons) and domain architectures and collated their biochemical functions where available. The reconstructed evolutionary history and natural classification of the E1-like superfamily based on this information is summarized in Figure 1 and Table 1 (for further details, see the supplementary material).
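The logic of a transitive search, in which every newly recovered protein is itself used as a query until no new members appear, can be illustrated with a minimal sketch. This is not the pipeline used in the study: the `search` callable is a placeholder for a single database search, and the toy hit table, identifiers and `max_rounds` parameter are hypothetical.

```python
from collections import deque

def transitive_search(seeds, search, max_rounds=5):
    """Collect everything reachable from the seeds by repeated searching.

    `search(query)` is a placeholder for one database search (for example a
    single profile search) and must return an iterable of hit identifiers.
    """
    found = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        query, depth = frontier.popleft()
        if depth >= max_rounds:
            continue  # do not expand queries beyond the allowed number of rounds
        for hit in search(query):
            if hit not in found:
                found.add(hit)
                frontier.append((hit, depth + 1))
    return found

# Toy hit table standing in for real search results (entirely hypothetical).
TOY_HITS = {
    "seedA": ["p1", "p2"],
    "p1": ["p3"],
    "p2": ["seedA"],
    "p3": [],
}

recovered = transitive_search(["seedA"], lambda q: TOY_HITS.get(q, []))
```

In practice each hit would also be filtered by a significance threshold before being reused as a query, which this sketch omits.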
The E1-like domain adopts a 3-layered α/β sandwich fold with a central, eight-stranded β-sheet, with strand order 87654123 (hereafter referred to as S1–S8) (Fig. 2) and helices packing against either face of the sheet. The core of the domain is a Rossmann fold composed of the β-α units defined by strands S1 to S5 and corresponding helices labeled H1–H4. E1-like domains possess at least three other well-studied structural elements and several subtler sequence features distinguishing them from other superfamilies containing a Rossmann fold. The first of these structural elements is a dyad of helices N-terminal to the Rossmann fold. This unit often contains a conserved arginine that projects into the active site of the opposite monomer of the homo- or heterodimer and appears to stabilize the hyper-charged pentavalent phosphorus during the phosphotransfer reaction 7. Thus, it is equivalent to the arginine finger seen in the P-loop NTPases; hence, we refer to this feature as the “arginine finger” hereinafter (Table 1) 4,7,19,20. The second distinctive feature is an extended loop between S2 and H2 containing several polar residues (usually D, N, R and K) strongly conserved across most E1 families (Table 1). Along with the arginine finger from the dimerizing partner, these residues are necessary for catalyzing the adenylation reaction 7,17. The third feature unique to E1-like domains is the extension to the core Rossmann fold, which includes strands S6–S8 with S6 and S8 being anti-parallel to the other strands (Figure 2). This unique extension contains a characteristic linker between strands S6 and S7 which has been termed the “crossover loop” 4 (Figure 2). This structure has one or more helical elements and harbors a strongly conserved cysteine which is required for thiolation reactions catalyzed by these enzymes (henceforth thiolating cysteine).
Catalytically active E1-like domains also possess a conserved aspartate in S4 which is involved in coordinating Mg2+ and probably orienting Mg2+-ATP, analogous to the aspartate from the Walker B motif of the P-loop NTPase fold 21. Further, a highly conserved arginine after H4 makes polar contacts with the polypeptide backbone of the Ub/Ubl substrate, perhaps directing the Ub/Ubl tail to the adenylating active site.
A poorly understood feature of several families of E1-like domains is a pair of CxxC motifs that coordinate a zinc ion. One of the CxxC motifs is present in the “crossover loop” and the other in the poorly-structured coil region following S8 (Table 1, Supplementary material). Crystal structures 4,5,8 suggest that the chelated Zn2+ holds the portion of the “crossover loop” downstream of S6 away from the core sheet, thereby forming an arch to allow the C-terminal tail of the substrate (Ub/Ubl) to access the adenylating active site. However, the CxxC motifs are independently and sporadically lost in many families (Table 1). In some cases their loss correlates with absence of catalytic activity (e.g. inactive eukaryotic E1 families; Fig. 1) or absence of a peptide substrate (e.g. FeeI, see below). But in other cases there is no evidence for such a correlation, suggesting that alternative interactions might have taken the role of the structural zinc to stabilize the “crossover loop” region. Even less understood is a strongly conserved ExxK motif in H5 (Supplementary Material). We observed that in available crystal structures the glutamate and lysine residues of the ExxK motif form a salt-bridge and contact the tip of the S7–S8 hairpin from the opposite subunit in the E1-like domain dimers. Hence, these residues might play a role in stabilizing and orienting the interface of subunits in the dimer.
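Short motifs such as CxxC and ExxK are simple enough to locate with a pattern scan. The following is a minimal sketch, assuming an invented toy sequence (not a real E1-like protein) and hypothetical function names. It reports 0-based start positions and uses a lookahead so that overlapping matches are not missed. A pattern hit alone says nothing about whether the residues actually coordinate zinc or form a salt-bridge; that requires the structural context discussed above.

```python
import re

# Motif patterns from the text: 'x' stands for any residue.
MOTIFS = {
    "CxxC": re.compile(r"(?=(C..C))"),  # lookahead so overlapping hits are reported
    "ExxK": re.compile(r"(?=(E..K))"),
}

def find_motifs(sequence):
    """Return {motif name: [(0-based start, matched residues), ...]}."""
    seq = sequence.upper()
    return {
        name: [(m.start(), m.group(1)) for m in pattern.finditer(seq)]
        for name, pattern in MOTIFS.items()
    }

def has_zinc_site_candidate(sequence):
    """Heuristic only: two or more CxxC motifs hint at a possible Zn-binding pair."""
    return len(find_motifs(sequence)["CxxC"]) >= 2

# Hypothetical toy sequence, invented for illustration.
toy = "MAGCAKCLLEVAKGGSCPPCTT"
hits = find_motifs(toy)
```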
The wide conservation of the above structural features across the entire E1-like superfamily suggests that they represent the ancestral condition for this domain. Our sequence and structure comparisons indicated that among the Rossmannoid superfamilies, E1-like domains are closest to NAD(P)/FAD-dependent dehydrogenases and S-AdoMet-dependent methyltransferases (see SCOP: http://scop.mrc-lmb.cam.ac.uk/scop/). These proteins show a congruent structural core spanning the β-α units determined by strands 1 to 5 and display a glycine rich loop between strands S1 and H1 which is involved in nucleotide binding (ATP in the E1-like superfamily). They also share a frequently conserved aspartate residue at the end of strand 2, and a characteristic aspartate or asparagine residue at the end of S4 (Figure 2). Thus, these three superfamilies are distinct from another monophyletic assemblage of Mg2+ chelating Rossmann-fold domains that unites several superfamilies with phosphoesterase activity, namely haloacid dehalogenases (HAD), receiver domains, DHH phosphoesterases, TOPRIM domains and PIN/5′-3′ nucleases. This latter assemblage is characterized by two acidic residues in their active site 19 (Figure 2). The E1-like superfamily, NAD(P)/FAD-dependent dehydrogenases and S-AdoMet-dependent methyltransferases are also distinct from the HUP assemblage, which unites the Rossmannoid catalytic domains of Class-I aminoacyl tRNA synthetases and related nucleotidyl transferases, PP-ATPases, the USPA superfamily, and photolyases 22,23 (Figure 2). The above-described three features specific to the E1-like domains appear to have been central to their acquisition of phosphotransfer/adenylation and thiolation activity (see below, Figure 2, and Table 1). 
Thus, phosphotransfer activity of the E1-like domain represents an “invention” that occurred independently in an ancestral version of the Rossmann fold resembling the nucleotide-binding version in NAD(P)/FAD-dependent dehydrogenases and S-AdoMet-dependent methyltransferases.
A case-by-case consideration showed that several E1-like families have developed unique family-specific features, including modifications of catalytic or substrate-binding sites (Table 1). In contrast to the major prokaryotic versions (ThiF and MoeB) which function as homodimers, eukaryotic E1-like domains often function as heterodimers 24. This appears to have emerged concomitant with a certain “division of labor” between the two subunits of the dimer. Members of the eukaryotic E1 families UBA1-N, AOS1/SAE1 and APPBP1 are by themselves catalytically inactive but supply the arginine finger to the active site. Conversely, the UBA1-C, SAE2/UBA2 and UBA3 families lack an arginine finger, but constitute the rest of the active site of the dimer. The resulting asymmetry in the location of the active site with respect to the dimer interface appears to be critical for positioning the E2 polypeptide (via binding to UFD; see below) during trans-thiolation.
The eukaryotic Apg7/Atg7 family, and the prokaryotic families 6A, 6B, 6D and 6E, HesA, MJ0693-like, and a group of related bacterial families Rv3196, GodD, MccB and PaaA show different variations, each affecting a subset of the residues and structural features influencing phosphotransfer (Table 1 and Fig. 1). Of these a cluster of related families (MJ0693-like, Rv3196, GodD, MccB and PaaA), the prokaryotic group of 6A, 6D and 6E families and the eukaryotic Apg7/Atg7 family lack the N-terminal arginine finger. In the latter case adenylation and trans-thiolation of Apg12 (an Ubl) is experimentally supported 25, suggesting that these catalyze typical E1 reactions despite lacking the conventional arginine finger. Consistent with this, we detected structurally plausible candidates for alternative arginine fingers elsewhere in the same polypeptide in the Apg7/Atg7 family or from other potentially interacting proteins (in the bacterial 6A), which might substitute for the canonical arginine finger (See Table 1 for details).
Emergence of distinct arginine fingers has been previously observed in P-loop NTPases, where arginine fingers have independently evolved on multiple occasions, and are either provided from within the protein or from distinct polypeptides interacting with the NTPases 26,27. The bacterial HesA family lacks the Mg2+-coordinating aspartate, and along with the archaeal MJ0693 family displays substitutions in some of the conserved residues in the loop between S2 and H2. However, the HesA family maintains the arginine finger suggesting that it might still possess catalytic activity, perhaps with lower efficiency or function as a heterodimer.
A spectrum of further structural alterations is seen in the Rv3196 and GodD families (Table 1). In several members of both these families the entire loop between S2 and H2, along with H2 and S3 has been independently lost. Further, in the GodD family, the glycine-rich loop between S1 and H1 and in some cases the entire N-terminal region including S3 has been lost. Despite these alterations, most members of these families retain the Mg2+-coordinating aspartate and the remaining C-terminal portion of the E1-like domain (Table 1, supplementary material). This feature along with certain unique aspects of their domain architecture suggests that, despite their dramatic remodeling, this cluster of related families are likely to perform a catalytic function in conjunction with other fused domains or through dimerization with conventional E1-like domains (see below).
We also detected great diversity in regions that are known or predicted to participate in peptide substrate interaction and dimerization. This includes the region in the vicinity of the thiolating cysteine, between S6 and H5 (the “crossover loop” region), containing the pocket that interacts with Ub/Ubl and E2 in Ub/Ubl adenylating families 28,29. Diverse inserts with different predicted secondary structures are observed in this region in the MccB, Apg7/Atg7, prokaryotic 6B, 6C and 6D and eukaryotic E1 families (Table 1). These inserts are organized around and atop the active sites of E1-like domains, reminiscent of the cap domains inserted into the core Rossmannoid fold of the HAD superfamily. In the HAD superfamily they have been shown to influence substrate recognition, access to the active site and catalytic efficiency 19. By analogy, it is possible that these inserts in the E1-like superfamily correlate with distinct substrate specificities of the respective families. Another region that shows great sequence diversity is the β-hairpin formed by strands S7 and S8 (see supplementary material). Given the role of this hairpin in dimerization (see above), this diversity might correlate with differences in the dimer interface of different families. Crystal structures of eukaryotic E1 proteins show that this region also contacts the exposed face of the Ub/Ubl proteins 30. Hence, this region has also possibly diversified to recognize cognate Ubl substrates 30.
Though the thiolating cysteine is strongly conserved in most catalytically active E1-like families, which transfer Ub/Ubl or thiolate ThiS/MoaD-like substrates, it appears to have been lost in multiple families (Table 1, Fig. 1 and supplementary material). Known or predicted Ub/Ubl-interacting E1-like families lacking the thiolating cysteine (Table 1) are the bacterial 6E family, a small prokaryotic E1-like family with an N-terminal fusion to a ThiS-like Ubl domain and the E1-like family that is functionally associated with tungsten-cofactor-utilizing aldehyde-ferredoxin oxidoreductases. Additionally, several archaeal E1-like proteins lack the thiolating cysteine. These are encoded in conserved gene neighborhoods along with genes for molybdopterin, thiamine and cysteine biosynthesis enzymes (Supplementary material). Site-directed mutagenesis studies on Escherichia coli MoeB showed that MoaD can be thiolated despite disruption of the thiolating cysteine, if other sulfur abstracting functional partners such as IscS or cysteine sulfinate desulfinase are available 6,31. Thus, in some of the above instances such extrinsic partners might provide an alternative to the thiolating cysteine. Two related uncharacterized families, namely the bacterial YgdL-like family and the eukaryotic YKL027W-like family possess an intact adenylating active site but have a divergent C-terminus lacking the structural Zn-chelating motifs and thiolating active site. Hence, they are also predicted to only catalyze the adenylation step. However, they might cooperate with other proteins in subsequent sulfotransfer reactions, possibly in conjunction with Ubls (see below).
The remaining families lacking the thiolating cysteine show no evidence for interaction with Ub/Ubl proteins and appear to be either purely adenylating enzymes or catalytically inactive (Fig. 1 and Table 1). Chief among these are the FeeI, MccB, GodD, Rv3196, and PaaA assemblage of bacterial families known or predicted to participate in biosynthesis of secondary metabolites, polypeptide antibiotics and small signaling peptides. The eponymous prototype of the FeeI family apparently adenylates a fatty acid in the formation of N-acyl tyrosine, a potential signal released by soil bacteria 32,33. Here, instead of the thiolating cysteine, the thiol group of a phosphopantetheinyl moiety attached to the acyl carrier protein FeeL forms a thiocarboxylate with the adenylated fatty acid. The MccB family is involved in biosynthesis of microcin C7-like peptides and appears to be the enzyme which adenylates the carboxy terminus of the microcin 14,15. GodD is involved in biosynthesis of goadsporin, an actinobacterial signaling peptide 13, and other members of this family might be involved in synthesis of other thiazole-, oxazole- and lanthionine-containing peptides. A subset of proteins of the GodD family is predicted to be catalytically active and participate in adenylating steps in the synthesis of these modified peptides. Likewise, members of the uncharacterized PaaA family (which is closer to MccB) and a subset of the Rv3196 family are predicted to catalyze similar adenylation reactions in biosynthesis of peptide secondary metabolites.
Four forms of contextual information are valuable in uncovering functional linkages and predicting biochemical interactions of uncharacterized proteins: 1) conserved gene neighborhoods or predicted operons; 2) domain architectures; 3) phyletic distribution profiles; 4) information regarding interacting partners gleaned from large-scale protein interaction maps. Gene neighborhoods and phyletic profiles are particularly useful in prokaryotes in determining the biochemical pathways to which the E1-like superfamily has been recruited 34,35 (Figures 1 and 3; Table 2).
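The idea behind a conserved gene neighborhood can be sketched as a simple count: a gene family is a candidate functional partner if it lies near the gene of interest in several genomes. The example below is a minimal illustration under stated assumptions, not the procedure used in this study. The genome names, family labels, `window` and `min_genomes` thresholds are all hypothetical, chromosomes are treated as linear, and strand orientation and phylogenetic distance between genomes are ignored.

```python
from collections import Counter

def neighborhood_counts(genomes, anchor, window=2):
    """Count, per family, the genomes where it lies within `window` genes of `anchor`.

    `genomes` maps a genome name to its gene families in chromosomal order.
    """
    counts = Counter()
    for order in genomes.values():
        neighbors = set()
        for i, family in enumerate(order):
            if family != anchor:
                continue
            lo, hi = max(0, i - window), i + window + 1
            neighbors.update(order[lo:hi])
        neighbors.discard(anchor)
        counts.update(neighbors)  # each genome contributes at most once per family
    return counts

def conserved_neighbors(genomes, anchor, window=2, min_genomes=2):
    """Families found near `anchor` in at least `min_genomes` genomes, sorted by name."""
    counts = neighborhood_counts(genomes, anchor, window)
    return sorted(f for f, n in counts.items() if n >= min_genomes)

# Hypothetical gene orders; labels are illustrative, not real annotations.
toy_genomes = {
    "genomeA": ["x1", "Ubl", "E1", "JAB", "CysSynth", "x2"],
    "genomeB": ["y1", "JAB", "Ubl", "E1", "y2", "y3"],
    "genomeC": ["E1", "z1", "z2", "Ubl", "z3"],
}
linked = conserved_neighbors(toy_genomes, "E1")
```

A real analysis would additionally require that the supporting genomes be phylogenetically distant, since gene order is trivially shared between close relatives.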
In prokaryotes the primary E1-like enzyme (i.e. the ThiF/MoeB ortholog) is usually embedded in a gene neighborhood encoding thiamin biosynthesis genes, and less frequently in one containing molybdopterin biosynthesis genes 36. Recent characterization of cysteine biosynthesis in actinobacteria showed that the E1-like enzyme MoeZ adenylates a ThiS/MoaD-like Ubl which is then thiolated and used as a sulfur donor for the reaction catalyzed by cysteine synthase. This results in formation of a cysteine at the C-terminus of the Ubl which is then released by a JAB domain peptidase 9. In addition to the previously observed conserved linkage of the genes coding the Ubl, JAB peptidase and cysteine synthase in actinobacteria 36, we uncovered novel gene neighborhood linkages between the ThiF/MoeB ortholog and several genes related to cysteine synthesis in several distant bacterial lineages (Figure 3). The linked genes encode several enzymes related to cysteine and methionine biosynthesis (Figures 1 and 3; Table 2; Supplementary Material). Interestingly, we noted that planctomycetes and several proteobacteria share a conserved gene neighborhood, which displays a PDZ-domain containing C-terminal-processing serine peptidase instead of the JAB peptidase. Thus, two unrelated types of peptidases might be utilized to release the newly synthesized C-terminal cysteine in different bacterial lineages. We also identified conserved gene neighborhoods in archaea linking ThiF/MoeB-like genes with those coding O-acetylserine/O-phosphoserine sulfhydrylase. However, the archaeal proteins consistently lacked a linked gene coding for a JAB peptidase, instead showing linkages with either the cysteinyl tRNA synthetase (e.g. Aeropyrum pernix) or O-phosphoseryl tRNA synthetase (e.g. Methanospirillum hungatei). This suggests that in both euryarchaea and crenarchaea E1-dependent biosynthesis of cysteine is directly linked to tRNA aminoacylation.
Linkage with the O-phosphoseryl tRNA synthetase suggests that in some methanogenic archaea (Methanocaldococcus and Methanospirillum) the E1-like enzyme-dependent mechanism might be active in in situ cysteine synthesis that occurs after charging of Sep-tRNA by the O-phosphoseryl tRNA synthetase 37.
Phyletic patterns suggest that in the majority of prokaryotes a single E1-like protein is utilized in molybdopterin, thiamin and cysteine (if present) biosynthesis pathways. Thus, it appears that all these key adenylation and sulfotransfer reactions can be catalyzed by the same protein. However, on multiple occasions lineage-specific duplications have spawned dedicated paralogs functioning in particular pathways (Figures 1 and 3). These include E1-like enzymes fused to or associated via predicted operons with JAB domain peptidases, E2 homologs and diverse Ubls, which are involved in siderophore biosynthesis, metal-sulfur cluster biogenesis or constitute predicted prokaryotic Ub-conjugation-like systems 12,36. We observed that the YgdL family shows gene neighborhood linkages with proteins predicted to participate in sulfur transfer during biosynthesis of metal-sulfur clusters 38–40 (Figure 1 and Table 2; Supplementary Material). Hence, we predict that in metal-sulfur cluster synthesis YgdL proteins are likely to provide an initial adenylation step, which is followed by thiolation and thio-transfer mediated by SufE and a cysteine sulfinate desulfinase-like enzyme (Table 2). The HesA family is unique to nitrogen-fixing heterocyst-forming cyanobacteria and vesicle-forming actinobacteria. HesA genes are embedded in neighborhoods encoding proteins involved in formation of metal-sulfur clusters in nitrogen fixation complexes (Figure 1, Table 1 and Supplementary Material). Thus, the HesA family too is likely to adenylate substrates prior to sulfotransfer in the biosynthesis of these complexes 41,42.
The FeeI, MccB, PaaA, Rv3196, and GodD families are usually found in bacteria with complex organization or development such as actinomycetes, cyanobacteria and endospore-forming firmicutes. These families never show linkages to genes encoding Ubls or JAB peptidases. Instead, they show gene-neighborhood associations, which are consistent with their role in adenylation steps in biosynthesis of diverse secondary metabolites. FeeI is often in the neighborhood of a gene coding the N-acyl amino acid synthase with which it cooperates in the synthesis of N-acyl tyrosine (Figure 1 and Table 2; Supplementary Material) 32,33. In the MccB, PaaA, Rv3196 and GodD families we discovered several distinct gene-neighborhood associations encoding multiple enzymes reflective of the wide array of additional modifications with which adenylation might combine in secondary metabolite biosynthesis (Table 2). Additionally, a subset of the GodD neighborhoods contain a pair of adjacent E1-like genes, one of which encodes a full length version, while the other codes the N-terminally truncated version lacking S1 to S3 (Figure 1; supplementary material). These are predicted to physically interact to generate a dimer with a single active site.
Experimental studies have suggested an important role for the interplay between different E1-like domains and the respective unrelated C-terminal domains in various reactions catalyzed by them. For example, in MOCS3 the rhodanese domain fused to the C-terminus of the E1-like domain initiates sulfotransfer by forming a persulfide linked to its conserved cysteine. The Ubl (i.e. MoaD) adenylated by the E1-like domain is then attacked by this persulfide to form an acyl-disulfide linkage. This linkage is then reduced by the thiolating cysteine of the E1-like domain of MOCS3 to release a thiolated MoaD 18,43,44. Likewise, many eukaryotic E1 families contain a C-terminal permuted ubiquitin-like domain, the UFD domain, that recruits the E2 enzyme, delivering it to the trans-thiolating active site of the E1 enzyme 16. These observations suggested that interaction between C-terminal domains and the active sites on the E1-like domain might be a general theme required for linking successive reactions that follow the initial adenylation.
Across the three superkingdoms of life MoeB/ThiF-like proteins involved in cofactor and cysteine synthesis and their paralogs involved in siderophore biosynthesis are often fused to a rhodanese domain (Figs. 1 and 3). However, E1-like proteins in low GC Gram-positive bacteria and sporadically in other bacterial and archaeal lineages lack a C-terminal rhodanese domain or even one encoded by a standalone neighboring gene. Interestingly, we found that many of these proteins had another C-terminal domain, which also occurs as a standalone protein in euryarchaea (e.g. Pyrococcus furiosus PF0466). Using transitive sequence profile searches with the PSI-BLAST program and profile-profile comparisons with the HHpred program we established a statistically significant relationship (p-value = 10⁻⁵ in profile-profile comparisons) between this domain and the TATA-box binding protein (TBP) domain 45. A multiple alignment of the domain revealed an absolutely conserved cysteine residue in the N-terminal strand (Fig. 4) and accordingly we term it the CCTBP (cysteine containing TBP-like) domain. Comparisons with the TBP structure suggest that the cysteine is present on the same face of the helix-grip fold of TBP that mediates contact with DNA 46, and is hence likely to be a surface residue available for persulfide bond formation. Thus, it is probable that in E1-like proteins that possess the CCTBP domain, it is functionally equivalent to the rhodanese domain. Consistent with this prediction, the CCTBP domain is found in contexts analogous to the rhodanese domain which is suggestive of a role in sulfur metabolism and metal-sulfur cluster assembly. For example, the CCTBP domain is fused to a PP-loop ATPase domain in some archaea (e.g. MTH990 from Methanothermobacter).
This is reminiscent of the fusion of the rhodanese domain and the PP-loop ATPase domain in the ThiI protein, where the two domains cooperate in successive adenylation and sulfur transfer steps during 4-thiouridine biosynthesis 47. The CCTBP domain is also fused to 4Fe-4S ferredoxins and DNA-binding helix-turn-helix domains in several archaea, and is found in the gene neighborhood of cysteine desulfurases and proteins involved in redox reactions (Fig. 4, Supplementary Material). These observations suggest that the CCTBP domain might also function in assembly of metal-sulfur clusters in ferredoxins as well as a redox sensor of single-component transcription factors.
The C-terminal UFD domain is limited to only three of the active families of eukaryotic E1s (Fig. 1). The Urm1p-activating enzyme UBA4 has a rhodanese domain just like its orthologs from other organisms involved in cofactor biosynthesis (MOCS3/ThiF/MoeB). Extensive genetic screens and biochemical characterization to date have not yielded an E2 in urmylation 48. However, in MOCS3 the C-terminal rhodanese domain interacts with the thiolating active site in a manner comparable to the delivery of E2 by the C-terminal UFD of UBA3 18,43. Hence, we predict that the rhodanese domain functions like the E2 with its active cysteine behaving like the E2 catalytic cysteine during urmylation. However, other catalytically active families, namely UBA5 (Ufm activating enzyme) and Apg7/Atg7 (Apg8, Apg12 activating enzyme) which utilize E2s, also lack an UFD domain. We observed that the UBA5 family contains a conserved C-terminal region, which is predicted to form a distinct globular domain apparently unrelated to the UFD domain. Given the C-terminal location of this domain in the UBA5 proteins it is possible that it represents a functional analog of the UFD domain, which independently emerged in this family. In contrast, the Apg7/Atg7 family displays a large N-terminal domain which might help it recruit its functional partners, such as the two distinct E2 enzymes, Apg3 and Apg10 (Figure 1 and Table 2). The eukaryotic YKL027W family is closely related to the bacterial YgdL family (see above). YKL027W is fused to the recently reported TRS4-C domain and associates with Rpn6p, the PINT domain subunit of the proteasomal lid 49,50 (Figure 1 and Table 2). The TRS4-C domain is also fused to an Ubl domain in the TRS4 family of proteins 30. These associations suggest that YKL027W might be associated with Ub-conjugation despite the absence of the thiolating cysteine.
The TRS4-C domain possibly plays a role analogous to the UFD domain in recruiting a downstream partner after the initial adenylation reaction catalyzed by the E1-like domain.
In the FeeI, MccB, GodD, PaaA and Rv3196 families we identified several domain architectures that predict a close linkage between the E1 catalyzed adenylation and other associated reactions in secondary metabolite biosynthesis (Figure 1 and Table 2; supplementary material). Of particular interest is the frequent fusion to the McbD domain in the GodD family. In the processing of microcin B17, another peptide with heterocyclic modifications, a McbD domain protein forms a complex with a flavin-dependent oxidoreductase (McbC) belonging to the same family as those fused to some FeeI proteins and encoded by predicted operons of the GodD family (Figure 1 and Table 2). These proteins are required for the formation of aromatic heterocyclic thiazole or oxazole rings from cysteine or serine, respectively, and their adjacent residue 51. Formation of both thiazole and oxazole rings involves a dehydrogenase and a dehydratase reaction 52. The flavin-dependent oxidoreductase McbC is likely to catalyze the former reaction. While McbD was earlier claimed to show similarity to GTPases 53, we found that neither sequence profile searches, nor the conservation pattern, nor alignment-based secondary structure predictions support this relationship. Instead, the McbD domain was predicted to adopt an α/β fold that showed a completely different conservation pattern, with a glycine-rich loop and other absolutely conserved polar residues suggestive of a distinct enzymatic role. Hence, McbD probably catalyzes the dehydratase reaction required in these modifications. Additionally, in the case of thiazole formation, there is likely to be an adenylation step catalyzed by the E1-like domain of the GodD-family enzyme prior to carbon-sulfur bond formation. Many McbD domain proteins that lack fusions to the GodD family are instead fused to OsmC-like domains 54 with conserved cysteines that are capable of carrying sulfur atoms.
In these cases the OsmC domains might provide an alternative mechanism for sulfur delivery in conjunction with the heterocyclization catalyzed by McbD domains.
The MccB, GodD, PaaA and Rv3196 families are united by the presence of an N-terminal winged HTH domain (Figure 1 and Table 2), which we established by means of sequence profile analysis (PSI-BLAST e-value 10⁻³). Based on available E1 structures we predict that the N-terminal wHTH domain in these families probably forms a cap over the active site of the adjacent monomer. Thus, it could potentially provide an additional nucleotide-binding interface and also a means to guide the peptide substrate to the active site. Such a role played by the wHTH domain might explain some unusual features observed in these families, such as loss of the arginine finger, the N-terminal divergence and loss of the nucleotide-binding loop between S2 and H2 (Table 1).
The presence of at least one representative of the E1-like superfamily in the three superkingdoms of life suggests that it was present in the last universal common ancestor (LUCA). Based on extant versions we can infer that this ancestral version resembled ThiF/MoeB proteins and functioned as a dimer with a symmetric pair of adenylating and thiolating active sites. Earlier studies on Rossmannoid superfamilies have shown that S-AdoMet dependent methyltransferases and FAD/NAD(P) dependent dehydrogenases had already diversified to spawn multiple lineages by the time of LUCA 22,23. Thus, the evolutionary divergence of E1-like domains and these related Rossmannoid superfamilies occurred prior to LUCA. In contrast to the E1-like domain, the ThiS and MoaD families of Ubls appear to have been distinct in LUCA 30 itself, probably providing the primary determinants of pathway specificity. Hence, it is likely that LUCA possessed a single multi-functional E1-like domain that interacted with both ThiS and MoaD-like proteins. Subsequently, by the time of divergence of the bacterial superkingdom, functional associations between E1-like domains and sulfur-carrying rhodanese and JAB peptidase domains appear to have emerged. Superposition of domain architectures and gene neighborhoods on the phylogenetic tree of the prokaryotic ThiF/MoeB proteins suggests that, at least in bacteria, the rhodanese domain was ancestrally fused to the E1 or associated as a neighboring gene in an operon.
In some bacteria the rhodanese domain was also displaced by the non-homologous CCTBP domain (Figure 3). The TBP domain is universally conserved in the archaeo-eukaryotic lineage as a component of the basal transcription apparatus. Other than the CCTBP domain, the only TBP-like domain found in bacteria is also predominantly present in low GC Gram-positive bacteria and is found fused to the RNase domain in RNase HIII proteins 55,56. These phyletic patterns suggest that the CCTBP- and RNase HIII-associated TBP-like domains were laterally transferred to low GC Gram-positive bacteria from archaea. However, the CCTBP domain appears to have acquired a role in mediating sulfur transfer/redox reactions prior to the transfer. The lack of concordance between the protein tree and the prokaryotic species tree (Figure 3) 44 suggests rampant lateral transfer of the E1 domain between distant lineages in the post-LUCA evolution of the superfamily. Moreover, on multiple occasions the E1 enzyme duplicated to give rise to separate paralogs dedicated to MoCo and thiamine biosynthesis (e.g. independently in γ-proteobacteria and in the low GC Gram-positive bacteria) or cysteine biosynthesis (e.g. in mycobacteria). The independence of these events is also supported by the distinct domain architectures of the E1-like proteins associated with the thiamine and MoCo pathways in each of these lineages (Figure 3).
Another major facet of the post-LUCA evolution of E1-like domains in the bacteria was the emergence of several novel lineage-specific paralogs associated with the innovation of novel metabolic capabilities. For example, E1-like enzymes involved in biosynthesis of siderophores and related protective compounds were derived from the MoeB/ThiF proteins (Figure 3). They were recruited to perform reactions biochemically similar to those of the latter in these new secondary metabolism pathways. The most dramatic adaptation of this type was the origin and radiation of the monophyletic group of FeeI, MccB, PaaA, Rv3196, and GodD families (Figure 1 and Table 3). These families display extraordinary sequence divergence relative to E1-like domains involved in the more conserved primary metabolic systems. Hence, they are possibly under strong selection due to the need to recognize a rapidly diversifying set of secondary metabolite substrates that range from fatty acids to several small peptides with no detectable sequence similarity.
New insights regarding the origin of the E1s of Ub/Ubl conjugation systems emerged from our earlier discovery of potential bacterial cognates of eukaryote-type Ub/Ubl conjugation systems 36. E1-like domains of these systems belong to a cluster of five related families (6A–E in Fig. 1), which are consistently found in operons with, or fused to, E2 domains. Gene-neighborhoods encoding these E1-like proteins never contain any genes for cofactor, cysteine or secondary metabolite biosynthesis. This strongly supports the conjecture that these E1-like domains function in association with their cognate E2s in primitive Ub/Ubl conjugation-type systems. A version of these bacterial systems probably spawned the ancestral E1–E2 pair of all eukaryotic E1s functioning in conjunction with an E2 in the first eukaryotic common ancestor. The abundance of these systems in α-proteobacteria 36, from which the mitochondrial endosymbiont emerged, makes it a plausible source for this ancestral E1–E2 pair. By the time of the last eukaryotic common ancestor (LECA), E1s had radiated into 7 distinct families; this radiation appears to have occurred concomitantly with an even more extensive radiation of Ubls, resulting in a wide range of protein modifiers 30. Prior fixation of robust de-ubiquitination and degradation systems in the form of the proteasome and its lid complex in the eukaryotic progenitor possibly favored this proliferation of modifications by Ub/Ubls.
The first divergence in the pre-LECA evolution of E1s appears to have given rise to the Apg7/Atg7 family that conjugates Apg12/Apg8 to protein and lipid substrates 25. The next divergence resulted in formation of the respective ancestors of the active and inactive subunits of all extant Ub/Ubl-conjugating enzymes. By LECA the ancestral active and inactive monomers had concomitantly diversified into 3 families each (Fig. 1), of which an active and inactive pair fused to give rise to the UBA1 family. These 3 pairs of families constituted the activating enzymes of Ub (UBA1 N- and C-terminal domains), SUMO (UBA2 and AOS1/SAE1 families) and Nedd8 (UBA3 and APPBP1 families). Subsequently, the UBA5 family appears to have emerged just prior to the divergence of the kinetoplastid-heterolobosean lineage, and acquired specificity for Ufm1, a pre-existing Ubl. Further, throughout eukaryotic evolution there were several lineage-specific duplications of E1-like domains. These were most rampant in the UBA1 family, where a duplication in the common ancestor of amoebozoa, fungi and metazoa resulted in UBA6, which might activate Fat10 57. In vertebrates, another duplication in the UBA1 family resulted in UBE1L, the activating enzyme for ISG15 involved in interferon response 58. Similarly, a lineage-specific duplication of the UBA1 family occurred in ciliates along with a fusion to an N-terminal E2 domain of the BRUCE family 59. The UBA1 family also underwent sporadic lineage-specific duplications in stramenopiles and kinetoplastids, suggesting their possible diversification into different functional contexts.
Interestingly, there appear to have been additional independent transitions of other eukaryotic E1-like families to Ubl-conjugation-related roles. Eukaryotic MoeB/ThiF orthologs (e.g. MOCS3) have been shown to function like their prokaryotic counterparts in MoCo biosynthesis along with their Ubl partners 44. However, the yeast ThiS/MoaD ortholog, Urm1p, is conjugated to protein targets by its cognate E1-like enzyme (UBA4, the fungal MOCS3 ortholog). Genetic studies have also implicated a distinct complex of proteins (Table 2), which are additionally required for synthesis of 2-thiouridine at the wobble position of the tRNA 48, in conjugation of Urm1p to some target proteins. Of these, Ncs2p and Ncs6p are PP-loop ATPases, which catalyze adenylating reactions similar to those of the E1-like enzymes 23,60. Hence, Ncs2p and Ncs6p could have independently acquired an E1-like function required for some of the Urm1 conjugation events, possibly functioning in conjunction with Elp2p, a WD40-type β-propeller, and Elp6p, an unusual RecA superfamily P-loop NTPase. In this light it would be of interest to investigate if Urm1p-like ThiS/MoaD orthologs are also involved in tRNA thiobase synthesis as sulfur carriers in conjunction with the above protein complex. The YKL027W family appears to have emerged from an independent lateral transfer of the bacterial YgdL family into eukaryotes after the divergence of diplomonads and parabasalids. It also appears to have independently acquired a Ub-related function in eukaryotes (Table 2; see above).
Here we present a synthetic view of the natural history of the E1-like superfamily by combining all available sequence, structure, biochemical and contextual information. Consequently, we were able to develop a natural classification of the superfamily that allowed us to reconstruct its structural and biochemical diversification. We also clarify the multiple origins and subsequent evolution of different Ub/Ubl-activating versions in eukaryotes. The observations reported here have generated several hypotheses (e.g. Table 2) testable by experimental biochemical studies. We hope that this synthesis provides a resource (see supplementary material) that spurs new directed investigations on the less-studied E1-like families and their functions.
The FSSP program was used for structure similarity searches 61, and the MUSTANG program to generate structural alignments 62. Protein structures were visualized and manipulated using the Swiss-PDB 63 and PyMol (http://pymol.sourceforge.net/) programs. Sequence profile searches were performed against the NCBI non-redundant (NR) protein database (National Center for Biotechnology Information, NIH, Bethesda, MD), and a locally compiled database of proteins from completely or near-completely sequenced genomes. PSI-BLAST searches were performed with an expectation value (E-value) of 0.01 as the threshold for inclusion into the profile 64; searches were iterated until convergence. Alignment-derived HMM searches were performed using the HMMer package 65. Multiple alignments were constructed using the MUSCLE 66 and Kalign 67 programs, followed by manual correction based on PSI-BLAST high-scoring pairs, secondary structure predictions, and available crystal structures. Protein secondary structure was predicted using a multiple alignment as the input for the JPRED2 program, which uses information extracted from a PSSM, HMM, and residue frequencies in alignment columns 68. Pairwise comparisons of HMMs, using a single sequence or multiple alignment as query, against profiles of proteins in the PDB database were performed with the HHPRED program 69. Similarity-based clustering was performed using the BLASTCLUST program [ftp://ftp.ncbi.nih.gov/blast/documents/blastclust.html] with empirically determined length and score threshold parameters. Gene neighborhoods in prokaryotes were obtained by isolating conserved genes immediately upstream and downstream of the gene in question showing separation of less than 70 nucleotides between gene termini. Neighborhoods were determined by searching NCBI PTT tables (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome) with a custom PERL script.
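The gene-neighborhood criterion described above can be illustrated with a minimal sketch. It is not the custom PERL script used in this study. It assumes a linear list of gene coordinates (such as could be read from a PTT table), and the names `Gene`, `intergenic_gap`, `linked` and `gene_neighborhood` are hypothetical. The requirement that linked genes share a strand is also an assumption added for the illustration, and the cross-genome conservation filter is not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Gene:
    name: str
    start: int   # first nucleotide, 1-based inclusive
    end: int     # last nucleotide, 1-based inclusive
    strand: str  # "+" or "-"


def intergenic_gap(left: Gene, right: Gene) -> int:
    """Nucleotides between the end of `left` and the start of `right`
    (negative when the two genes overlap)."""
    return right.start - left.end - 1


def linked(left: Gene, right: Gene, max_gap: int, same_strand: bool) -> bool:
    """True if two adjacent genes are close enough to be grouped together."""
    if same_strand and left.strand != right.strand:
        return False  # assumption of this sketch, not stated in the methods
    return intergenic_gap(left, right) < max_gap


def gene_neighborhood(genes: List[Gene], query: str,
                      max_gap: int = 70, same_strand: bool = True) -> List[Gene]:
    """Walk upstream and downstream from `query`, adding each adjacent gene
    while the separation between gene termini stays below `max_gap`."""
    ordered = sorted(genes, key=lambda g: g.start)
    positions = [i for i, g in enumerate(ordered) if g.name == query]
    if not positions:
        raise ValueError(f"gene {query!r} not found")
    lo = hi = positions[0]
    while lo > 0 and linked(ordered[lo - 1], ordered[lo], max_gap, same_strand):
        lo -= 1
    while hi < len(ordered) - 1 and linked(ordered[hi], ordered[hi + 1], max_gap, same_strand):
        hi += 1
    return ordered[lo:hi + 1]
```

Calling `gene_neighborhood(genes, "geneX")` returns the contiguous run of genes around the hypothetical query `geneX`; passing `same_strand=False` drops the strand assumption and applies only the 70-nucleotide distance rule.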
Phylogenetic analysis was carried out using neighbor-joining and minimum evolution-based methods with gamma distributed rates and a JTT substitution matrix as implemented in the MEGA4 program 70. The shape parameter α was estimated empirically through a series of trials. Maximum likelihood trees were also obtained by first generating the least-squares tree (FITCH program of the PHYLIP package 71) with subsequent local rearrangement using the PROTML program (MOLPHY package 72). The reliability of the tree topology was assessed using the RELL bootstrap method of MOLPHY, with 10 000 replicates 72. All large-scale procedures were carried out using the TASS software package (Anantharaman V, Balaji S, Aravind L, unpublished).
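The idea behind the RELL (resampling of estimated log-likelihoods) bootstrap can be sketched as follows. This is an illustrative outline under stated assumptions and not the MOLPHY implementation: it assumes that per-site log-likelihoods have already been computed for each candidate topology, and the function name `rell_support` and its parameters are hypothetical. Alignment sites are resampled with replacement, the stored per-site values are summed for each topology without re-optimization, and the support for a topology is the fraction of replicates in which it has the highest total.

```python
import random
from typing import List, Sequence


def rell_support(site_lnl: Sequence[Sequence[float]],
                 replicates: int = 10000,
                 seed: int = 0) -> List[float]:
    """Return, for each topology, the fraction of bootstrap replicates in
    which its resampled log-likelihood is the highest.

    site_lnl[t][s] is the log-likelihood of alignment site s under topology t.
    Ties are resolved in favor of the topology listed first.
    """
    n_trees = len(site_lnl)
    n_sites = len(site_lnl[0])
    if any(len(row) != n_sites for row in site_lnl):
        raise ValueError("all topologies must cover the same sites")

    rng = random.Random(seed)
    wins = [0] * n_trees
    for _ in range(replicates):
        # Resample site indices with replacement; the same sample is used
        # for every topology so that totals are directly comparable.
        sample = [rng.randrange(n_sites) for _ in range(n_sites)]
        totals = [sum(row[s] for s in sample) for row in site_lnl]
        best = max(range(n_trees), key=totals.__getitem__)
        wins[best] += 1
    return [w / replicates for w in wins]
```

Because the per-site values are reused rather than re-estimated, each replicate costs only a resampling and a sum, which is what makes a large number of replicates practical.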
The authors acknowledge the Intramural research program of the National Library of Medicine, National Institutes of Health, USA for funding their research.
A comprehensive alignment containing all E1-like families, alignments of different domains fused to E1-like domains, conserved gene neighborhoods and a comprehensive list of gis are provided at ftp://ftp.ncbi.nih.gov/pub/aravind/temp/supplementary_material_E1.html