Inventory of the E2 Enzymes in Seven Species
Our primary goal was to propose a list and classification of the complete set of E2 proteins encoded by the human genome. To obtain a clearer view of the relations within this protein superfamily and of its evolution, we added several other species with fully sequenced genomes distributed across the tree of life. As the other mammal, we chose the mouse because many transgenic animal studies allow functional evaluation of proteins in this species. C. elegans and D. melanogaster are two multicellular organisms representative of distantly related lineages with many available functional genomic data. All of these species are members of Bilateria in the Animalia phylum. Two distantly related yeast species were chosen to evaluate the ancestral set of E2 proteins in eukaryotes, using information from another phylum (Fungi). Finally, we used genes from A. thaliana as the outgroup to design the phylogenetic trees. Prokaryotic homologs of the E2 enzymes have recently been described in bacteria (Iyer et al. 2006); however, we did not include these genes in our study because they are too distantly related.
We chose to work with proteins rather than nucleotide sequences because mutational noise is less important in amino acid sequences (Inagaki and Roger 2006). Indeed, the fast evolution of nucleotides in the third position of the codons, allowed by the degeneracy of the genetic code, produces an accumulation of mutational bias (Jeffroy et al. 2006).
Genbank, Swiss-Prot, TrEMBL (including Pfam and the InterPro database of protein families), Gender and Ensembl were initially used to retrieve a total of 78 RefSeq protein coding sequences in A. thaliana, 16 in S. cerevisiae, 15 in S. pombe, 31 in C. elegans, 36 in D. melanogaster, 181 in M. musculus, and 71 in H. sapiens.
We eliminated redundant sequences and pseudogenes as well as sequences for which PRSS scores were <1. This led us to exclude the TSG101-UEVLD family, the UFC1 family, and the Ub conjugation–like ATG3 and ATG10 enzymes. These families may be considered distantly related or converged to similar 3D structures.
Our final list includes 48 E2 protein sequences in A. thaliana, 14 in S. cerevisiae, 14 in S. pombe, 26 in C. elegans, 32 in D. melanogaster, 36 in M. musculus, and 37 in H. sapiens. Table and Supplementary Table 2 list all 207 E2 proteins, including their main characteristics, species of origin, chromosomal localization, synonyms, name and length of the corresponding deduced protein, identified homologues, and GenBank accession numbers.
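The filtering arithmetic above can be checked with a short tally. This is only a bookkeeping sketch: the counts are copied from the text, and the variable names are illustrative rather than part of the original analysis.

```python
# Per-species counts copied from the text: sequences initially retrieved,
# and sequences retained after removing redundant entries, pseudogenes,
# and low-scoring sequences.
initial_counts = {
    "A. thaliana": 78,
    "S. cerevisiae": 16,
    "S. pombe": 15,
    "C. elegans": 31,
    "D. melanogaster": 36,
    "M. musculus": 181,
    "H. sapiens": 71,
}
final_counts = {
    "A. thaliana": 48,
    "S. cerevisiae": 14,
    "S. pombe": 14,
    "C. elegans": 26,
    "D. melanogaster": 32,
    "M. musculus": 36,
    "H. sapiens": 37,
}

# Total number of E2 proteins retained across the seven species.
total_final = sum(final_counts.values())

# Number of sequences discarded per species during filtering.
removed = {sp: initial_counts[sp] - final_counts[sp] for sp in final_counts}

for sp, n in removed.items():
    print(f"{sp}: kept {final_counts[sp]} of {initial_counts[sp]} ({n} removed)")
print("total retained:", total_final)
```

The retained counts sum to the 207 proteins listed in the tables; most of the discarded sequences come from the mouse and human retrievals.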
List of the 37 human ubiquitin-conjugating enzymes
General Features of the E2 Enzyme Families in the Seven Species
In a first step, to identify the main groups of proteins (families) with maximum confidence, we aligned truncated protein sequences to avoid long-branch attractions and to minimize noise from C- and N-terminal extremities. The definition of the central core was arbitrary in its details; however, we verified that small differences in the definition of the core sequences had no influence on the obtained results (data not shown). In contrast, the large set of studied sequences (n = 207) gave a good idea of the general organization of the primary structures of these proteins. Compared with pair-wise alignments, multiple alignments allowed for better definitions of orthologous sequences. It was suggested that for distant species, a minimum of 20 sequences needed to be aligned to obtain good results (Margulies et al. 2006). The study of several distantly related species facilitated the recognition of the minimum relations inside the families (the core signatures).
We defined the limits of the superfamily by fixing the PRSS score of protein sequences to be <10^-30. The alignment of all protein sequences, the global phylogenetic analysis, and the computing of similarity scores showed the existence of 17 subgroups (see Supplementary Figs. 1, 2, and 3), which we named “families.” Others have classified E2 enzyme proteins into 18 groups by splitting family 3 into three groups (XIII, XIV, and XV) and family 4 into two groups (IV and V) while overlooking families 16 and 17 (Jones et al. 2001).
Usual classes of E2 enzymes were defined by the presence or absence of an N and/or C extension. The most frequent class was class 1, containing only the core domain. Among the 17 families, 5 contained >1 class, suggesting that the notion of class generally has no phylogenetic meaning.
A large part of the gene diversity in the different species can be represented by the 14 genes of S. cerevisiae. We used the known nomenclature of yeast, or Caenorhabditis genes, to classify the families; however, more functional information is necessary to propose a better nomenclature.
All families had at least one member in humans (Fig. ). Chromosomal locations of each E2 coding gene in the human genome are drawn on the karyotype representation in Supplementary Fig. 4. Figure depicts the distribution of the genes in each family in the 7 species. It is possible to distinguish 4 types of E2 enzyme families, taking into account their species distribution. Ten families are present in all species (families 1 to 10); 2 families are present in all species except C. elegans (families 11 and 12); 4 families are only absent in the 2 yeasts (families 13 to 16); and 1 family is present only in Bilateria (family 17).
Simplified phylogenetic tree of the 37 human E2 enzymes, drawn after computational analysis of the phylogenetic tree including proteins of all seven species. Each branch represents a different family, the number of which is located near the root
Species distribution of the 207 E2 genes in each family. Each rectangle represents one gene
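The four distribution types described above can be written as a presence/absence table. The sketch below encodes only what the text states; the species abbreviations, function names, and the assumption that family 17 occurs in all four bilaterian species are illustrative choices, not results taken from the figure.

```python
from collections import Counter

# Two-letter species codes (illustrative abbreviations).
SPECIES = ("At", "Sc", "Sp", "Ce", "Dm", "Mm", "Hs")
YEASTS = {"Sc", "Sp"}
BILATERIA = {"Ce", "Dm", "Mm", "Hs"}


def presence(family: int) -> frozenset:
    """Species in which a family (1 to 17) has at least one member."""
    everyone = set(SPECIES)
    if 1 <= family <= 10:
        return frozenset(everyone)
    if family in (11, 12):
        return frozenset(everyone - {"Ce"})
    if 13 <= family <= 16:
        return frozenset(everyone - YEASTS)
    if family == 17:
        # Assumption: present in every bilaterian species studied.
        return frozenset(BILATERIA)
    raise ValueError(f"unknown family: {family}")


def distribution_type(present) -> str:
    """Classify a presence set into one of the four types in the text."""
    missing = set(SPECIES) - set(present)
    if not missing:
        return "all species"
    if missing == {"Ce"}:
        return "absent in C. elegans"
    if missing == YEASTS:
        return "absent in yeasts"
    if set(present) <= BILATERIA:
        return "Bilateria only"
    return "other"


type_counts = Counter(distribution_type(presence(f)) for f in range(1, 18))
families_per_species = {
    sp: sum(sp in presence(f) for f in range(1, 18)) for sp in SPECIES
}
print(type_counts)
print(families_per_species)
```

Under these assumptions the table reproduces the 10 + 2 + 4 + 1 split of the 17 families, with all 17 families represented in humans and 12 in each yeast.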
The phylogeny of a family of proteins is important to identify the true homology of proteins (orthology) among different species. Using this information, it is then possible to create a 3D structural model of the candidate proteins and/or to assign biologic functions to them. Studying a large protein superfamily through various species is extremely difficult; therefore, it is no surprise that we found several errors in the ortholog nomenclature in the literature (Supplementary Table 3).
As mentioned previously, the primary sequence signature HPN, followed by a tryptophan residue at 16 (up to 29) amino acids, is highly conserved. This tryptophan 95 has not been shown to make contact with the HECT or the RING domain. However, the crystal structure analysis of the complexes with E2 and either RING or HECT E3 proteins reveals that the side chain of Trp95 is positioned close to Pro97 at the tip of loop L7 as well as to Pro65 and Pro66 at the base of loop L4 (Martinez-Noel et al. 2001). Pro65 and Pro66 are found in a strongly conserved motif (Y/FPxxPP) 7 to 11 amino acids from the N-terminal side of the HPN motif. Interaction between Trp95 and the proline residues might stabilize the L7 loop and contribute to the correct positioning of the L4 and L7 loops relative to each other. Ala98 of the L7 loop seems to be important for interaction because it makes direct contact with the HECT or the RING domain; however, it is not necessary and is not conserved (Martinez-Noel et al. 2001).
When present, the active cysteine is found at seven to eight amino acids from the C-terminal side of HPN. The analysis of primary and secondary structures highlights several original features of certain families (see specific results in later text).
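The spacing rules of the core signature can be expressed as a small sequence scan. This is a minimal sketch, not the procedure used in the study: the function name, the toy sequence, and the counting conventions (offsets measured from the asparagine of HPN for the cysteine and the tryptophan, and the number of intervening residues for the Y/FPxxPP motif) are all assumptions made for illustration.

```python
import re

# Proline-rich motif Y/FPxxPP, where x is any residue.
PRO_MOTIF = re.compile(r"[YF]P..PP")


def scan_core(seq: str):
    """Locate the core signatures in a protein sequence (0-based indices).

    Returns None when no HPN tripeptide is found; otherwise a dict with the
    HPN start and the positions of residues that satisfy the spacing rules.
    """
    hpn = seq.find("HPN")
    if hpn < 0:
        return None
    asn = hpn + 2  # index of the asparagine of HPN

    # Active cysteine: 7 or 8 positions C-terminal of the asparagine.
    cys = [asn + d for d in (7, 8)
           if asn + d < len(seq) and seq[asn + d] == "C"]

    # Conserved tryptophan: 16 to 29 positions C-terminal of the asparagine.
    trp = [asn + d for d in range(16, 30)
           if asn + d < len(seq) and seq[asn + d] == "W"]

    # Y/FPxxPP motif: 7 to 11 residues between the motif and the histidine.
    pro = [m.start() for m in PRO_MOTIF.finditer(seq[:hpn])
           if 7 <= hpn - m.end() <= 11]

    return {"hpn": hpn, "cys": cys, "trp": trp, "pro_motif": pro}


# Synthetic toy sequence built to satisfy the rules; not a real protein.
toy = "G" * 10 + "YPGGPP" + "G" * 9 + "HPN" + "G" * 7 + "C" + "G" * 7 + "W" + "G" * 10
print(scan_core(toy))
```

An empty `cys` list would flag a sequence that has the HPN tripeptide but no cysteine at the expected spacing, and a `None` result a sequence without the canonical HPN tripeptide.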
Phylogenetic Analysis for the Classification of E2 Proteins
We used the four main classical algorithms for phylogenetic reconstruction, and the results were mainly coherent. Because one of the four methods gave different results in several cases, we only kept the results of the three convergent methods. The construction of consensus trees allowed for simple and meaningful representation of these results. However, only the nodes of the trees were considered meaningful; the time scale and the relative evolution speeds of branches were lost with such an approach.
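The idea of keeping only the groupings on which the reconstruction methods agree can be sketched as follows. The text does not state the exact consensus rule, so the support threshold, the representation of a tree as a set of clades, and the placeholder leaf names A to E are assumptions made for illustration only.

```python
from collections import Counter


def consensus_clades(trees, min_support):
    """Return the clades found in at least `min_support` of the trees.

    Each tree is given as a set of clades, and each clade as a frozenset
    of leaf names. Branch lengths are ignored, so only the topology
    (the nodes) survives in the consensus.
    """
    counts = Counter(clade for tree in trees for clade in set(tree))
    return {clade for clade, n in counts.items() if n >= min_support}


# Hypothetical topologies over placeholder leaves A to E, one per method.
nj = {frozenset("AB"), frozenset("ABC"), frozenset("DE")}
ml = {frozenset("AB"), frozenset("ABC"), frozenset("DE")}
bi = {frozenset("AB"), frozenset("ABC"), frozenset("DE")}
mp = {frozenset("AC"), frozenset("ABC"), frozenset("DE")}  # discordant method

# Keep clades supported by at least three of the four methods.
kept = consensus_clades([nj, ml, mp, bi], min_support=3)
for clade in sorted(kept, key=lambda c: (len(c), sorted(c))):
    print(sorted(clade))
```

In this toy case the grouping proposed only by the discordant method is dropped, while the clades shared by the other three are retained.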
In the next paragraphs, we provide short descriptions and indicate the main characteristics for each family (Fig. and Supplementary Fig. 6), adding some information on known functions, although a complete review on this subject is beyond the scope of the present work. Depending on the cases, the order of the families was chosen according to the numbering order in S. cerevisiae or C. elegans. Family 10 is an exception and was placed at this position because it belongs to the 10 families with members in all species studied.
Fig. 4 Examples of phylogenetic trees for “simple” families, such as family 7 (a), and complex families, such as families 3 (b) and 4 (c). Each tree represents the consensus of four algorithms (NJ, ML, MP, and BI). Only branches present in at ...
The first six families can be considered “classical E2 enzyme” families. The hallmark of family 1 was an important C-terminal ubiquitin-associated (UBA) supplementary domain linked to the core domain by a flexible tether of approximately 20 amino acids. This UBA domain is important for polyubiquitination by allowing binding to a second subunit of Ub. The MP analysis identified ubc-20_Ce as the closest C. elegans gene to mammals and Drosophila, suggesting that this gene was the ortholog of the other genes. In contrast, ubc-21_Ce, ubc-22_Ce, and ubc-23_Ce, diverged and can be considered paralogs.
Family 2 had a classical structure without particularities; however, it can be observed that in both mammals we found two genes, suggesting a duplication of the UBC2 gene in their common ancestor, whereas mouse had an additional third gene.
Families 3 and 4 are the only families that possessed two members in both yeasts S. cerevisiae and S. pombe. Family 3 was characterized by two specific regions. One single amino acid insertion altered the orientation of the turn between the first two β-strands in the UBC7_Sc crystallized protein (glutamate 31 in sequence PKSENNIF); however, this glutamate was not conserved in the other members of the family. An insertion of 13 extra residues followed the conserved motif of the active site (HxPGDDPxxxExx) and corresponded to the 3/10 “h” helix. We also identified three subfamilies, each containing different human genes: UBE2G1 (Watanabe et al. 1996), UBE2G2 (Katsanis and Fisher 1998), and two UBE2R (Plon et al. 1993) genes. We obtained different results with the MP analysis compared with other methods, so this tree was excluded from the consensus tree of family 3.
With its 40 members, family 4 was the largest E2 enzyme family, and some members of this family were the most difficult to assign to a particular subfamily. The proline of the HPN signature of the superfamily was not conserved in family 4 and was replaced by a cysteine.
Family 5 lacked the canonical tripeptide HPN, which was replaced by TPNGRF or TANGRF. This observation led to the characterization of this family under the NCUBE denomination (non-canonical ubiquitin conjugating enzymes) (Lester et al. 2000). There was an insertion of two amino acids (aspartate-aromatic) between strand 4 and helix 2 on the C-terminal side of the active cysteine, whose structure was unknown. The orientation of helices 3 and 4 was nonclassical, with the C-terminal extremity corresponding to a hydrophobic transmembrane domain for association with the endoplasmic reticulum. These enzymes had electrostatic potentials that were more similar to the small Ub modifier (SUMO)–conjugating family 7 orthologs (Winn et al. 2007). For this family we obtained results discordant with the known phylogeny of the species, although analyses were recomputed several times, changing the order of input sequences and bootstrap values.
In family 6, we found a duplication of the ancestral gene in D. melanogaster and probably two duplications of the ancestral gene in A. thaliana. All other species possessed only one ortholog gene.
Families 7 to 9 are particular because they conjugate ubiquitin-like proteins (Ubl). The proteins in family 7 conjugated SUMO. As in family 3, there were two insertions: one of five residues (positions 32 to 37) between β-strands 1 and 2, and one of two residues near the active cysteine (glu-asp, rather than asp-lys, at positions –2 and –3 from the conserved tryptophan; Tong et al. 1997). The N-terminal helix had a nonclassical electrostatic surface, which may be involved in the recognition of SUMO (Giraud et al. 1998). This surface, similar to that of family 5, was involved in ubiquitination (Winn et al. 2007). We found a duplication of the ancestral gene in C. elegans; however, one of these genes diverged greatly and appeared near the root (MP analysis).
The proteins of family 8 conjugated NEDD8-RUB1. Family 8 had a specific N-terminal extension of 26 residues involved in neddylation (VanDemark and Hill 2004), which was not shown on the 3D model (Supplementary Fig. 5). Like family 5, this family was difficult to analyze because of duplication of the mammalian genes, which seemed to have diverged greatly, creating an apparent subfamily. However, the four mammalian genes appeared at the place most distant from the root in the tree.
The proteins of family 9 conjugated Ub and ISG15. Family 9 had an N-terminal extension not shown in Fig. , which may have been involved in recognition of ISG15.
The particularity of family 10 is that no protein had an active cysteine; therefore, they were named “variants of Ubc” or UEV. However, this nomenclature was also used for other proteins that did not belong to this family, such as UEVLD and TSG101. Proteins of family 10 were devoid of enzymatic activity and had no canonical HPN motif. The two last alpha helices were missing.
In families 11 and 12, all species possessed only 1 ortholog gene, except C. elegans, which had no identifiable member, and A. thaliana, which possessed 2 homologs. Family 12 possessed a specific N-terminal supplementary domain (not shown in Supplementary Fig. 5). For this family, analyses were run several times; however, each time we obtained unexpected results. This was likely because the D. melanogaster gene was anchored near the root and the genes of yeast and mammals diverged. This suggests that the Drosophila gene has evolved rapidly since this lineage separated. A BLAST search for homologs of these two families lost in C. elegans was run in the 31 whole-sequenced species of nematodes using NemaBLAST (http://www.nematode.net/BLAST/). No member was found except for family 11, for which UBE2S_Hs seemed to have high homology with the XI04817 gene from Xiphinema index (e value 8.8e−51). This species belongs to an early clade of the Nematode group (Blaxter et al. 1998), which would indicate that the genes of these 2 families were lost successively during the evolution of Nematodes.
The loss of four families in yeast does not appear to be phylum specific because genes of families 13 to 16 were present in other species belonging to the Fungi phylum (Candida albicans, Yarrowia lipolytica, Aspergillus terreus, A. fumigatus, A. nidulans, A. oryzae, Coccidioides immitis, Neurospora crassa, Gibberella zeae, Magnaporthe grisea, and Chaetomium globosum). This raises the question of the general relevance of the two yeasts as models in the study of E2 enzyme mechanisms. The missing gene families may be replaced on a functional level by genes obtained from duplications in other families because no organism that we analyzed had <14 E2 enzyme genes.
Families 13 to 17 possessed no member in either yeast species, S. cerevisiae or S. pombe. In family 13, in the absence of an asparagine residue in the tripeptide HPH, the enzymes of this family should have had no catalytic activity (Wu et al. 2003). A supplementary domain was present at the N-terminus (RLQKEL and GAPGTLYxyE, x = A or E and y = G or N).
Family 14 had no available 3D structure. There was an insertion of seven amino acids between the active cysteine and the conserved tryptophan, containing the conserved sequence TWxG and corresponding to a small “h” helix. These enzymes had an N-terminal extension.
Family 15 had no evident structural particularity. These enzymes were known to conjugate Ub and ISG15 (Zhao et al. 2004).
Family 16 had no known 3D structure, and we found no evident consensus primary sequence. Only one A. thaliana and the C. elegans orthologs had an active cysteine. RCE1 from A. thaliana was a RUB1-NEDD8–conjugating enzyme.
Family 17 lacked any member in both yeasts S. cerevisiae and S. pombe and in A. thaliana. This family had a particular orientation of helices 3 and 4. Strangely, there was no HPN motif in any but the human UBE2Q2 sequence. There is a large extension at the N-terminal extremity not shown in Fig. . Family 17 is present only in Bilateria and probably evolved from one of the initial ancestral genes; however, the phylogenetic information was lost in our set of species. This family was the only clade-specific family that we were able to identify and may have participated in the evolution of the Bilateria lineage. Further analysis of other species may narrow down the period in which this family appeared.
The 10 families present in all species may correspond to the minimal number of initial genes in the ancestors of eukaryotes. However, it is more probable that the common ancestor of all 3 phyla already possessed a set of 18 ancestral genes in 16 families, given the fact that A. thaliana possessed genes of 16 families. C. elegans lost 2 families (families 11 and 12), and yeasts lost 4 families (families 13 to 16). The genome of A. thaliana was the richest in E2 enzyme genes, indicating the importance of this pathway in plants. Several events of genome duplications were at the origin of this rich set of UBC genes in the lineage of Arabidopsis (Adams and Wendel 2005). A schematic representation of this discussion is proposed in Fig. .
General summary of the evolution hypothesis of the E2 enzyme families
The general tree of all proteins (Supplementary Fig. 3) illustrates that we cannot describe the relations between the different families with high confidence. Although more information could be gained by adding species from several other clades, it is also possible that the phylogenetic information contained in the primary sequences has been lost once and for all because of the long evolution of E2 genes in the common ancestors of all eukaryotes. Identification of specific primary sequence signatures or spatial signatures may possibly aid in distinguishing subgroups of families, such as the “ICLDIL” subgroup (see Fig. for a summary of hallmarks of all families).
Fig. 6 Schematic of primary and secondary structures summarizing the hallmarks of the 17 E2 enzyme families. Alpha helices are represented by rectangles; β-strands are represented by arrows; and several consensus sequences are highlighted. UBA = UBA ...
Although our definition of families was pragmatic, a biologic significance of such a classification can be determined. The timescale, however, was clearly not what the classification reflects. First, the subdivisions inside families 3 and 4 were anterior to the separation between the Animalia phyla, a separation estimated to have taken place 1.3 billion years ago (Feng and Doolittle 1997). Second, the unique family specific to Bilateria organisms represented a late “invention” because the separation of Bilateria from other organisms has been estimated to have occurred 615 million years ago (Peterson et al. 2004). Therefore, the proposed classification more probably represents strong selective functional pressures rather than real evolution time.
A functional significance that we could expect from a protein classification would be information about interactions with the protein partners evolving in parallel, such as the E3 enzymes. Unfortunately, available sequence information does not allow the drawing of a general picture. In fact, nothing is known about E3 interaction for families 13, 16, and 17. Only families 3, 4, 8, 11, and 15 are known to interact with an HECT E3, and no clear common sequence signature emerges in these families. Allosteric communication between the E3-binding and E2 active site relies on a complex structural unit formed by a large network of coevolving residues instead of a linear pathway consisting of a small set of residues (Özkan et al. 2005).
The topology of the family trees was generally coherent with known phylogenetic data; however, the family 5 tree was quite complex. It may be proposed that a duplication of the ancestral gene occurred in Bilateria and that one of these duplicated genes (ubc-26_Ce) may have strongly diverged in C. elegans. Ubc-26_Ce probably belonged to subfamily UBE2J2_Hs because subfamily UBE2J1_Hs already included two homolog genes, and subfamily UBE2J2_Hs was devoid of any homolog. Ubc-6_Ce and ubc-15_Ce were probably duplications in the C. elegans lineage in the UBE2J2_Hs subfamily, whereas the UBE2J1_Hs ortholog of Drosophila must have been lost because the nearest gene in Drosophila was CG5823_Dm, which was an ortholog of UBE2J2_Hs (e value 6.e−31 by way of blastp).
UEV proteins are as old as E2 proteins (Villalobo et al. 2002). Family 10 is a good example because it contains several UEV proteins that have been highly conserved in eukaryotes, from Protists to Humans (Andersen et al. 2005). Other families also contain several other UEV proteins; e.g., uev-3_Ce belongs to family 4, but uev-2_Ce is not an ortholog of the human UEV2-UBE2V2. The human UEV3-UEVLD and TSG101 proteins belong to another distant family and therefore were not included in our study, exemplifying that UEV proteins are a polyphyletic heterogeneous group.