Glycosylation is an important aspect of epigenetic regulation. Glycosyltransferase is a key enzyme in the biosynthesis of glycans, which glycosylates more than half of all proteins in eukaryotes and is involved in a wide range of biological processes. It has been suggested previously that homooligomerization in glycosyltransferases and other proteins might be crucial for their function. In this study, we explore functional homooligomeric states of glycosyltransferases in various organisms, trace their evolution and perform comparative analyses to find structural features which can mediate or disrupt the formation of different homooligomers. First we make a structure-based classification of the diverse superfamily of glycosyltransferases and confirm that the majority of the structures are indeed clustered into the GT-A or GT-B folds. We find that homooligomeric glycosyltransferases appear to be as ancient as monomeric glycosyltransferases and go back in evolution to the last universal common ancestor (LUCA). Moreover, we show that interface residues have significant bias to be gapped out or unaligned in the monomers implying that they might represent features crucial for oligomer formation. Structural analysis of these features reveals that the vast majority of them represent loops, terminal regions and helices indicating that these secondary structure elements mediate the formation of glycosyltransferases' homooligomers and directly contribute to the specific binding. We also observe relatively short protein regions which disrupt the homodimer interactions although such cases are rare. These results suggest that relatively small structural changes in the non-conserved regions may contribute to the formation of different functional oligomeric states and might be important in regulation of enzyme activity through homooligomerization.
Post-translational modifications play an important role in regulating protein function and glycosylation is one of the most complex forms of post-translational modifications making a large contribution to phenotypic diversity. It has been established that more than 50% of proteins in eukaryotes are glycosylated1 and glycans are involved in a wide range of biological processes, such as immune response, cell–cell interaction, cellular regulation, tumor growth, and cell invasion.2-7 Glycosyltransferase (GT) is a key enzyme in the biosynthesis of glycans, which sequentially transfers a monosaccharide unit from a nucleotide sugar donor to acceptor substrates including protein residues. Understanding the function and evolution of glycosyltransferases is crucial in providing the insights into the relationships between glycan diversification, protein glycosylation and the phenotype – the main area of research in glycomics. It has been shown that knockouts of some glycosyltransferases may result in cellular dysfunction and even death.8 Moreover, in bacteria, the majority of GTs are involved in the synthesis of glycolipids, peptidoglycans, and lipopolysaccharides and can be suitable targets for drug development against bacterial pathogens. Much effort has been devoted to the identification of genes encoding glycosyltransferases and the characterization of the structures and functions of these enzymes.9,10 Currently, there are more than 100 glycosyltransferase genes found in humans and according to the CAZy database glycosyltransferases consist of approximately 90 very diverse families,11 and account for 1-2% of protein-coding genes in eukaryotic genomes (the number of genes is correlated with the number of coding genes).12
Many soluble and membrane-bound proteins form homooligomeric complexes in a cell; in fact, the majority of all proteins in the Protein Data Bank are homooligomers. The activity of many proteins, such as enzymes, ion channel proteins, receptors, and transcription factors, is regulated through homooligomerization. Indeed, it has been suggested that large assemblies consisting of many identical subunits have advantageous regulatory properties as they can undergo sensitive phase transitions.13 Moreover, oligomerization can also provide sites for allosteric regulation, generate new binding sites at dimer interfaces, increase affinity through multivalent binding and enhance the diversity in the formation of regulatory complexes. The regulation of protein activity through the transitions between different oligomeric states has been experimentally demonstrated on several systems.14-17 Phosphorylation of a residue on a dimer interface or interaction with different ligands may destabilize the oligomeric structure and shift the reaction equilibrium, which might be important in the regulation of apoptosis and tumor formation. In the case of low affinity interactions between proteins and glycans, it has been shown that they can be compensated through the multivalent interactions realized through homooligomerization.2 In our recent work we showed, for example, that homooligomeric interfaces in proteins from the galectin family are very well conserved among a diverse spectrum of species.18 Such homooligomeric structures allow for precise positioning of glycoligands on galectins and increase their binding affinity.
Several glycosyltransferases have similarly been shown to form functional homooligomers, which play important roles in controlling enzyme activity and Golgi localization.19,20 For example, α2,6-sialyltransferase and GM2 synthase form very stable functional homodimers through disulfide bonds.21,22 In addition, a recent study indicated that glycosyltransferases may catalyze consecutive steps on the glycan biosynthesis pathways by interacting with each other and forming larger complexes.23 Indeed, there are some glycosyltransferases functioning as homooligomers while others function as monomers even though they can belong to the same protein fold class. In this paper we explore functional oligomeric states of these enzymes in various organisms and study their evolution. We also perform comparative analyses to find structural features which can mediate or disrupt the formation of biological homooligomers in glycosyltransferases. Detecting such features will help to understand the nature of regulation of enzyme activity through homooligomerization and set the stage for prediction of functionally important oligomeric states for glycosyltransferases and other proteins.
We obtained 68 non-redundant glycosyltransferase structures of 31 families from the CAZy database. Then we calculated the gapped structural alignment score (GSAS)24 and loop Hausdorff metric (LHM)25 as measures of structural similarity for clustering members of the GT-A and GT-B folds (Figure S1, see Methods). First we confirmed that the majority of the structures were correctly clustered into the GT-A or GT-B folds. Only three families, GT42 (sialyltransferase), GT51 (peptidoglycan glycosyltransferases), and GT66 (oligosaccharyltransferase), had no significant similarity to any other families and were excluded from the analysis. We generated another score matrix to assess similarities among members of the GT-B fold at the domain level (Figure S2). The heat map representation of this score matrix shows similarities within the same domain as well as similarities between the N-terminal and the C-terminal domains. There are some structures with very low similarity to others, such as GT44 (Toxin B) in the GT-A fold and GT23 (fucosyltransferases) in the GT-B fold. It is not a trivial task to classify members of the extremely diverse GT superfamily, but despite low similarity between the members of these folds (12% and 11% average sequence identity for the GT-A and GT-B folds, respectively) these structures still share the core elements described below.
Previous studies have pointed out the structural similarities in the diverse glycosyltransferase families.26,27 Here we defined structurally conserved core elements by using explicit structure-structure alignments and deliberately chosen structural similarity metrics. The core structure of the GT-A fold consists of a seven-stranded beta-sheet, flanked by six helices, and a two-stranded beta-sheet (Figure S3). The D×D motif, often interacting with the divalent cation, lies at the junction of the two beta-sheets. The structure of the N-terminal half of the core, including the first five strands, three helices, and the D×D motif, is more conserved than the C-terminal half of the core, in which insertions of secondary structures are frequently observed.
GT-B fold enzymes are composed of two domains that are similar to each other. The core structure of the N-terminal domain consists of a seven-stranded beta-sheet, flanked by six helices (Figure S3). This domain is involved in acceptor sugar molecule binding and is more variable, consistent with the diversity of acceptor molecules. The core structure of the C-terminal domain consists of a six-stranded beta-sheet, flanked by five helices, and three consecutive helices at the C-terminal end. This domain is involved in donor molecule binding (usually nucleotide-diphospho-sugar) and is more conserved, consistent with the conserved nature of the substrates. An additional strand is observed only in the GT72 family, indicating that the C-terminal domain is more conserved than the N-terminal domain at the secondary structure level.
Figure 1 presents the structural similarity dendrograms for the GT-A and GT-B folds, which should reflect the major course of evolution of glycosyltransferases. They were constructed based on the GSAS similarity matrix (trees constructed based on the loop similarity metric can be found in Figure S4). As can be seen from these figures, in the vast majority of cases the members of the same CAZy family are clustered together, with the exception of the poorly conserved GT4 family. Glycosyltransferase families can be classified into two groups based on the mechanism of their glycosylation reactions: retaining and inverting. For the former the stereochemistry of the anomeric carbon is retained between the donor substrate and the product; for the latter it is inverted between them. The trees show that enzymes with retaining and inverting mechanisms correspond very well to two phylogenetic groups on the tree (with the exceptions of the GT27 family of the GT-A fold and the GT70, GT63 and GT10 families of the GT-B fold). This indicates that the two types of glycosyltransferase reaction mechanisms diverged quite early in the evolution of the GT-A and GT-B folds and go back in evolution to the last universal common ancestor.
We identified and verified homooligomeric states (monomers, dimers, tetramers, and octamers) for all family members that are classified into the GT-A or GT-B folds using the PISA algorithm (Protein Interfaces, Surfaces and Assemblies) and additionally confirmed them with the IBIS (Inferred Biomolecular Interactions Server) algorithm.28 We mapped oligomeric states and their binding modes on the external nodes of the model phylogenetic trees for folds A and B and inferred ancestral states using maximum parsimony (see Methods). Several observations are evident from these trees. First, the higher order homooligomeric states (dimers, tetramers, and octamers) appear to be as ancient as the monomeric state for GT and go back in evolution to the last universal common ancestor. Indeed, as can be seen from Figure S5, the majority of glycosyltransferase families include all three major kingdoms of life (additional sequences without structures were purged from CAZy database to assign taxonomic diversity), and monomeric and oligomeric states existed before the family specifications. Moreover, there seem to be multiple gene duplications leading to a large number of paralogs with different GT specificities and homooligomeric states. There are three families in each fold which are consistently annotated as monomers (GT44, GT7, GT27 in GT-A and GT5, GT9 and GT80 in GT-B) while some families seem to function only as homodimers (GT81, GT43 in the GT-A fold and GT35, GT23 in the GT-B fold). In addition, five families from both folds show mixed oligomeric states, which suggests that their activity can be regulated through the transition between different oligomeric states. We also observed strikingly different binding arrangements in various homodimeric glycosyltransferases. There is only one homodimer binding mode, from the GT35 family, which is well conserved in a number of very diverse species. Others are conserved among paralogs from the same species (for example, binding modes from the GT43 and GT81 families) while the diverse families GT1 and GT4 can be characterized by multiple binding arrangements.
There are some glycosyltransferases functioning as monomers while others function as homooligomers even though they can belong to the same fold class. This indicates that relatively small differences between structures may enable or prevent the formation of a homooligomer by some enzymes. We performed a comparative analysis of monomers and homooligomers to detect these structural differences. We first identified the interface residues of all homooligomers using the PISA algorithm, and then mapped these interfaces to the closest homologs in the monomeric state using structure-structure superpositions (the mapping of monomers to the closest homooligomers produced similar results; Table S1). We call “gapped” those regions which are present in homooligomers and absent from the monomers (and vice versa), and “unaligned” those regions which include both gapped and misaligned residues (see Figure 2 for an illustration). We checked whether there is an association between the interaction interface and gapped/unaligned regions, and found that for the overall dataset the interface residues of the homooligomers correspond to both gapped and unaligned regions (Fisher exact test with Bonferroni's correction, p-values < 0.01). Table 1 shows a list of the individual homooligomers, their corresponding monomers, and p-values calculated for each case. In 22 out of 30 homooligomers (73%) the interface residues have a significant bias toward unaligned regions. In a smaller but significant fraction of cases (13 out of 30, 43%) the interface residues have a significant bias to be gapped out in the monomers, implying that they might represent features crucial for oligomer formation. Interestingly, for the GT-B fold this tendency is more pronounced and 9 out of 16 oligomers (56%) have interface features which are absent from monomers. 
We found that the presence of oligomer mediating (or disabling) features in a protein does not depend on the level of sequence and structural similarity between monomer and oligomer. This together with the observation that these features are preferentially located on the oligomer-forming interfaces implies that they cannot be attributed merely to the structural differences occurring as a result of evolutionary divergence.
Next we analyzed the structural and sequence characteristics of those unaligned/gapped peptides which affect the homooligomerization. We identified secondary structures for 1534 interface residues in 30 homooligomers. Gapped loops, terminal regions and helices account for approximately 33% of all interface residues, and unaligned loops, terminal regions and helices account for 53% of all interface residues, whereas beta strands are in a minority (Figure S6). There is a statistically significant preference for gapped loops, terminal regions and helices to be present on interface regions in comparison with noninterface regions (p-values < 0.01, Table S2); this tendency is especially evident for the GT-B fold. This indicates that these secondary structure elements play an important role in the formation of glycosyltransferase homooligomers. We also calculated the amino acid composition of unaligned interface regions compared to aligned core regions but did not find any significant amino acid bias (Figure S7).
A typical example which illustrates how unaligned loops and helices can contribute to the formation of a homodimer is the glucuronyltransferase (GlcAT) from the GT43 family, which is involved in the biosynthesis of glycosaminoglycans. The C-terminal end of GlcAT, especially a long loop and the last helix, participates in the formation of a homodimer interface (marked by yellow in Figure 3a). Each subunit in a dimer possesses a substrate binding site near the interface, where a functionally important glutamine in one subunit interacts with a substrate of the other subunit.29 On the other hand, the C-terminal part (marked by yellow in Figure 3b) of SpsA protein, the closest monomer to the GlcAT dimer, is tied up by a disulfide bond (blue), cannot participate in any interactions and is located on the opposite side compared to the C-terminal region of the aligned dimer subunit.
Another example is glycogen phosphorylase from the GT35 family, which is an allosterically regulated enzyme involved in glycogen metabolism. All six members of this family, including four from eukaryotes and two from bacteria, are homodimers, which is confirmed by previous experiments.30,31 Figure 4 shows a sequence alignment based on the pairwise structural alignment of six glycogen phosphorylases and three glycogen glucosyltransferases in the GT5 family, which were found to be the closest monomers. There are several alpha helices and loops in the glycogen phosphorylases that do not exist in the GT5 enzymes. Moreover, the majority of these structural elements are located on the homooligomeric interfaces. In particular, a long N-terminal tail containing many interface residues is distinctive because glycosyltransferases usually have only a short transmembrane domain at the N-terminal end. Interestingly, this N-terminal region also includes phosphorylation and ligand binding sites, such as those for ATP and glucose. It has been shown that conformational changes upon phosphorylation occur at the dimer interface, involve ordering of the flexible N-terminal region and affect the enzymatic activity.32,33
The functional and structural features mentioned above are examples of specific regions that exist only in dimeric forms. We also found specific protein regions which are present only in a monomeric form and might disrupt the oligomerization. We tried to detect those residues in a monomer that are absent from a dimer and that are located between residues aligned to the dimer interface. We found four such monomers; they are indicated by the pound sign in Table 1. One of them represents a case of two GT8 structures: one is glycogenin glucosyltransferase from a eukaryote (1LL2), and the other is galactosyltransferase from bacteria (1G9R), which catalyzes a key step in the biosynthesis of lipooligosaccharide structure. Our results show that the eukaryotic enzyme is a homodimer and the bacterial enzyme is a monomer. In the eukaryotic glycogenin glucosyltransferase the interface of the homodimer has a cleft, which allows the formation of a stable complex (Figure 5a), whereas in the bacterial galactosyltransferase the residues corresponding to the interface are covered by other residues, thereby filling the cleft (Figure 5b). The alignment of these two structures reveals that the residues that cover the interface region are two loops not present in the homodimer and located close to the interface region as judged from the monomer-dimer alignment (Figure 5c). This result indicates that these two inserted loops might disrupt the formation of a homodimer and that the deletion of the two loops enables the formation of the homodimer.
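The search for potentially disrupting insertions described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the alignment is given as a list of (monomer residue, dimer residue or `None`) pairs in monomer sequence order, and "located between residues aligned to the dimer interface" is interpreted to mean that the nearest aligned residue on each side of the inserted run maps to a dimer interface residue. The function name `disabling_candidates` and the residue numbers are hypothetical.

```python
def disabling_candidates(alignment, dimer_interface):
    """Return runs of monomer residues with no dimer partner that are flanked
    on both sides by residues aligned to dimer interface residues.

    alignment: list of (monomer_res, dimer_res_or_None) in monomer order.
    dimer_interface: set of dimer residue identifiers on the interface.
    """
    runs = []            # accepted insertion runs
    current = []         # insertion run being collected
    left_on_interface = False  # was the last aligned residue on the interface?
    for mono, dim in alignment:
        if dim is None:
            # residue present only in the monomer: extend the current run
            current.append(mono)
            continue
        on_interface = dim in dimer_interface
        if current and left_on_interface and on_interface:
            runs.append(current)
        current = []
        left_on_interface = on_interface
    # a run at the very end has no right flank and is ignored (assumption)
    return runs


# Hypothetical toy alignment: residues 3-4 and 6 are monomer-only insertions.
toy_alignment = [(1, 10), (2, 11), (3, None), (4, None), (5, 12), (6, None), (7, 13)]
candidates = disabling_candidates(toy_alignment, dimer_interface={11, 12})
```

In the toy case only the run flanked by two interface-aligned residues is reported; a looser criterion (for example, a distance cutoff in the superimposed structures) would be an equally plausible reading of the text.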
We analyzed the evolution, functional importance, and structural features which mediate or disable the formation of homooligomers in the glycosyltransferase family. Glycosyltransferases represent an extremely interesting and appropriate example to study these effects since this family is very ancient, encompassing all three kingdoms of life, and shows a diversity of biological functions, substrates, binding selectivities, reaction mechanisms, and oligomeric states. The interactions between subunits as well as with other proteins are extremely important for the functionality of glycosyltransferases. Several roles of homooligomerization in the functionality of glycosyltransferases have been discussed previously. For example, binding affinity and selectivity might be regulated through oligomerization, namely through bond formation between the substrate bound to one subunit in a dimer and specific residues of the other subunit.29 Oligomerization might also be required for stability and thermostability of some GTs (GT81 and GT78 families),34,35 while inter-subunit interactions in glycogenin dimers (GT8 family) are very important in self-glucosylation of glycogenin.36 At the same time, glycogen phosphorylase from the GT35 family can undergo conformational changes upon phosphorylation at the dimer interface which involve ordering of the flexible N-terminal region,32 thus affecting and regulating the enzymatic activity.
Our phylogenetic analysis of the glycosyltransferases points to the ancient origin of higher order homooligomeric states, which appear to be as ancient as the monomeric state and go back in evolution as early as the time of divergence of prokaryotes and eukaryotes. Moreover, the examination of the trees for both the GT-A and GT-B folds shows multiple gene duplications leading to a large number of paralogs with different specificities. Although many different glycosyltransferase families form biological dimers, the dimer binding modes are surprisingly diverse among different families, implying that paralogs might acquire different binding arrangements throughout evolution. Our recent study on a larger dataset of families exhibiting multiple binding arrangements showed that binding modes are conserved among homologous proteins from the same family sharing 50-70% identity or higher.18 This implies that oligomeric state, binding mode, and features affecting oligomerization can be reliably inferred only from close homologs.
As mentioned earlier, different oligomeric states may provide sites for allosteric regulation, generate new binding sites at dimer interfaces to increase specificity, increase affinity through multivalent binding, and provide the regulation through the transition between active and non-active forms which in turn depend on the protein oligomeric state. To understand such functional mechanisms and predict those features which might mediate/disrupt oligomerization and provide regulatory switches one might want to investigate the sequence and structural features which differentiate between different oligomeric states. In a recent landmark study this question was addressed on a set of fifty domain families.37 Manual inspection of these domains in monomeric and homooligomeric states revealed that for eleven domain families (22%) certain loop regions were responsible for enabling or disabling the oligomeric interfaces. In the current study we performed an automated analysis of different oligomeric forms from 31 families of two folds of glycosyltransferases and made several main observations. First, we have not observed any amino acid changes which would account for the difference between oligomeric states and even very remotely related glycosyltransferases with quite different amino acid usage on the homodimer interface regions have the same oligomeric state. We also have not observed examples of domain swapping which can be an important mechanism to form oligomeric proteins.38 Second, we found that there is a statistical bias for structurally unaligned and to a smaller degree for gapped regions (present in homooligomers, absent in the corresponding monomers) to occur on homodimer interfaces, which points to the importance of these regions in the formation of homooligomers. Finally, our further analysis showed that these unaligned/gapped interface regions for glycosyltransferases predominantly include loops and to a lesser extent alpha-helices. 
The observation that alpha-helices can affect the formation of specific homooligomeric interfaces is consistent with a recent study which showed that alpha-helices in some proteins from the p53 family might be essential for stabilizing p53 tetramers and can be lost in other species or acquire different functions.39 Although more flexible and variable than the core regions, loops are under strong evolutionary constraints to preserve their properties and might have pronounced functional significance for protein binding. This is evident from the correct GT classification provided by our loop similarity metric and is consistent with previous studies showing that loops might offer important clues to gauge remote evolutionary relatedness.25,40-42 Consistent with Akiva et al.,37 we also show that loops play a crucial role in forming homooligomers and might regulate the binding affinity of homooligomeric interactions. Loops on interaction interfaces can also be used in a negative design strategy to protect proteins from non-specific aggregation.43 Consistent with the negative design strategy, we also observed in some cases that inserted regions (with respect to the monomer) disrupt the homodimer interactions (so-called “disabling” loops in the terminology of Akiva et al.), although such cases are rare compared to those regions which mediate the homooligomer formation. This points to the relative importance of enabling compared to disabling regions in the monomer-oligomer evolution of glycosyltransferases.
First, we derived a list of glycosyltransferase structures from the CAZy database,11 which provides PDB codes of crystallographically determined glycosyltransferases. We then selected the representative structure with the highest resolution in cases where the CAZy database lists multiple structures for a single protein. Structures of the GT-A fold with divalent metal ions were given priority since these ions are essential for catalysis.10 The representative 73 non-redundant glycosyltransferase structures from 31 families are listed in Table S3. Additional information, such as function and reaction types of specific glycosyltransferases, was collected from CAZy, KEGG GLYCAN,44 and the Conserved Domain Database (CDD).45
Glycosyltransferases are classified into two different folds: the so-called GT-A and GT-B folds. The GT-A fold consists of one domain while the GT-B fold comprises two similar domains. The diversity between glycosyltransferases is very high. Structure superpositions should provide additional insights into their classification. To achieve this goal we use all pairwise structure-structure alignments between glycosyltransferase representative structures, which are pre-calculated and stored in the PubVAST database.46 We calculate the gapped structural alignment score (GSAS)24 and loop Hausdorff metric (LHM)25 as measures of structural similarity for clustering members of the GT-A and GT-B folds. GSAS together with the LHM scores were shown to perform very well in ranking structurally similar proteins with respect to their homology.47 The GSAS score is defined as following24 (see Supplementary material for the LHM score definition):
Where Nmat is the number of aligned residues, Ngap is the number of gaps, and RMS is the root mean square deviation between the aligned pairs of α-carbon atoms. These two scores produced very similar cluster trees and therefore we will present only the tree obtained using the GSAS score (the LHM based tree is shown in Figure S4). We normalized the scores by dividing them by the maximum GSAS score among all pairwise alignments. To obtain more accurate alignments and similarity scores in the case of the GT-B fold, we separately aligned the N-terminal and C-terminal domains and calculated an average GSAS score over these two domains. If there were other domains present in the chains (SH3, TRP and others) we excluded them from the analysis. The domain boundaries were taken from the Molecular Modeling Database (MMDB).48 The dendrograms were calculated from GSAS distance matrices using the neighbor-joining method49 from PHYLIP version 3.68.50 It should be mentioned that very recently a new classification of glycosyltransferases from both folds has been released in CDD, which overall agrees with our structural classification with the exception of some entries from the diverse GT4 and GT1 families which are not very well resolved on both the CDD tree and our tree.
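The scoring and normalization steps above can be illustrated with a short sketch. It assumes the commonly cited form of the gapped structural alignment score, GSAS = 100 × RMS / (Nmat − Ngap), with a fixed penalty value of 99.9 when Nmat does not exceed Ngap; this form is an assumption here and should be checked against reference 24. Under this form a lower score means a closer structural match. The function names and the alignment statistics in the example are hypothetical.

```python
def gsas(n_mat, n_gap, rms, penalty=99.9):
    """Assumed GSAS form: 100 * RMS / (Nmat - Ngap); lower means more similar.

    Returns a fixed penalty when the alignment has at least as many gaps
    as aligned residues (assumption, see lead-in).
    """
    if n_mat > n_gap:
        return 100.0 * rms / (n_mat - n_gap)
    return penalty


def normalize(scores):
    """Divide every pairwise score by the maximum score among all pairs."""
    top = max(scores.values())
    return {pair: score / top for pair, score in scores.items()}


def gtb_score(n_term, c_term):
    """Average of the per-domain scores for a GT-B pair.

    Each argument is a tuple (n_mat, n_gap, rms) for one domain alignment.
    """
    return (gsas(*n_term) + gsas(*c_term)) / 2.0


# Hypothetical pairwise alignment statistics for three structures.
raw = {
    ("A", "B"): gsas(120, 20, 2.0),
    ("A", "C"): gsas(100, 50, 3.0),
    ("B", "C"): gsas(80, 40, 4.0),
}
scaled = normalize(raw)
```

The resulting normalized matrix is the kind of input a neighbor-joining program would take; building the dendrogram itself is left to PHYLIP as described in the text.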
We annotated oligomeric states and interfaces of all the representative glycosyltransferases using the PISA algorithm (Protein Interfaces, Surfaces and Assemblies). PISA (http://www.ebi.ac.uk/msd-srv/prot_int/pistart.html) is based on chemical thermodynamics and estimates the stability of macromolecular assemblies in protein crystal structures.51 It should be mentioned that PISA can infer macromolecular assemblies and their interfaces using crystallographic symmetries even if such assemblies/interfaces are not present in the PDB asymmetric unit. To confirm the PISA results, we used IBIS (Inferred Biomolecular Interactions Server), a new server (http://www.ncbi.nlm.nih.gov/Structure/ibis/ibis.cgi), which reports interactions observed in experimentally determined protein structural complexes and infers conserved binding sites by inspecting protein complexes formed by close homologs.28 We then examined whether or not interfaces are conserved between homooligomers. Similarly to the conserved binding mode idea,52 if more than 50% of interface residues are structurally superimposed in different homooligomeric complexes, the interface is defined as a conserved binding mode.
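The 50% criterion for a conserved binding mode can be written down as a small check. This sketch assumes that the interface residues of both complexes have already been mapped to a common numbering through the structural superposition, and that the fraction is taken relative to the larger of the two interfaces; the text does not state which interface serves as the denominator, so that choice, like the function name, is an assumption.

```python
def is_conserved_binding_mode(interface_a, interface_b, threshold=0.5):
    """Return True if more than `threshold` of interface residues coincide.

    interface_a, interface_b: iterables of interface positions expressed in a
    shared, superposition-derived numbering (assumption).
    The fraction is computed against the larger interface (assumption).
    """
    a, b = set(interface_a), set(interface_b)
    if not a or not b:
        return False
    shared = len(a & b)
    return shared / max(len(a), len(b)) > threshold


# Hypothetical interfaces: 3 of 4 positions coincide in the first pair,
# exactly half in the second.
mostly_shared = is_conserved_binding_mode({1, 2, 3, 4}, {2, 3, 4, 5})
half_shared = is_conserved_binding_mode({1, 2, 3, 4}, {3, 4, 5, 6})
```

Because the criterion is "more than 50%", an overlap of exactly half does not count as conserved in this sketch.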
Oligomeric states were assigned to internal tree nodes using maximum parsimony. We implemented a modification of Fitch's algorithm,53 usually defined for a binary tree, to allow nodes to have any number of children. The first step of this algorithm applies the following rule recursively, starting from the leaves, to label every internal node: if a given oligomeric state (we distinguish here monomers and homooligomers) is present in more than, less than, or exactly half of labeled children, label the parent “present,” “absent,” or “unknown,” respectively. A second traversal of the tree from root to leaf removes unknown labels by assigning each node the same label as its parent. We break ties at balanced trees, i.e. trees with unknown root, by setting the root to “present”.
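A minimal sketch of the two-pass labeling described above is given below. It is not the implementation used in this work: the tree is represented as a plain dictionary from node name to list of children, "labeled children" is taken to mean children labeled present or absent (children labeled unknown are ignored in the count), and the function and variable names are hypothetical.

```python
PRESENT, ABSENT, UNKNOWN = "present", "absent", "unknown"


def infer_states(children, leaf_states, root):
    """Label every node of a multifurcating tree for one oligomeric state.

    children: dict mapping an internal node to the list of its children.
    leaf_states: dict mapping each leaf to True (state observed) or False.
    """
    labels = {}

    def up(node):
        # first pass, leaves to root: majority rule over labeled children
        kids = children.get(node, [])
        if not kids:
            labels[node] = PRESENT if leaf_states[node] else ABSENT
            return
        for kid in kids:
            up(kid)
        decided = [labels[k] for k in kids if labels[k] != UNKNOWN]
        n_present = decided.count(PRESENT)
        if 2 * n_present > len(decided):
            labels[node] = PRESENT
        elif 2 * n_present < len(decided):
            labels[node] = ABSENT
        else:
            labels[node] = UNKNOWN  # exactly half, or no labeled children

    def down(node, parent_label):
        # second pass, root to leaves: unknown nodes inherit the parent label
        if labels[node] == UNKNOWN:
            labels[node] = parent_label
        for kid in children.get(node, []):
            down(kid, labels[node])

    up(root)
    if labels[root] == UNKNOWN:
        labels[root] = PRESENT  # tie-break at a balanced tree
    down(root, labels[root])
    return labels


# Hypothetical tree: n1 has three leaves, n2 has two, L5 hangs off the root.
tree = {"root": ["n1", "n2", "L5"], "n1": ["L1", "L2", "L3"], "n2": ["L4", "L6"]}
leaves = {"L1": True, "L2": True, "L3": False, "L4": True, "L6": False, "L5": False}
result = infer_states(tree, leaves, "root")
```

In this toy tree `n1` is labeled present by majority, `n2` is tied and therefore unknown after the first pass, and the root is tied between `n1` and `L5`, so the tie-break sets it to present and `n2` inherits that label.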
We compared glycosyltransferases in monomeric and homooligomeric states to determine the regions which are structurally different between the monomers and the oligomers. Two types of such regions were considered: “unaligned” and “gapped” regions. Here “gapped region” refers to the set of residues that are present in the oligomer and absent from the monomer (or vice versa), whereas “unaligned region” includes both gapped residues and structurally unaligned residues (Figure 2). First, we selected the most similar structure between monomers and oligomers using the GSAS similarity score. We then detected unaligned and gapped residues between an oligomer and its closest monomer using their VAST structure-structure alignment.46 The Fisher exact test was applied to find whether unaligned/gapped residues have a tendency to be among the interface residues. For each oligomer, for example, we counted the number of unaligned/aligned residues on the interface/non-interface, and generated a 2 × 2 contingency table (Table S4). The one-sided P-values were calculated and adjusted for multiple-testing using Bonferroni's correction.
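The statistical test above can be sketched with the standard library alone. The sketch computes a one-sided Fisher exact p-value from the hypergeometric distribution for a 2 × 2 table whose rows are unaligned/aligned residues and whose columns are interface/non-interface residues, then applies Bonferroni's correction. The layout of the table, the function names and the example counts are assumptions made for illustration, not values from Table S4.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value for enrichment in the top-left cell.

    Table layout (assumed):
                   interface   non-interface
      unaligned        a             b
      aligned          c             d
    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    k_max = min(row1, col1)
    tail = sum(comb(row1, k) * comb(row2, col1 - k) for k in range(a, k_max + 1))
    return tail / total


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical counts for two oligomers.
p_strong = fisher_one_sided(4, 0, 0, 4)   # all unaligned residues on the interface
p_weak = fisher_one_sided(3, 1, 1, 3)
adjusted = bonferroni([p_strong, p_weak])
```

With the thirty homooligomers considered here, each raw p-value would be multiplied by 30 under this correction.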
We thank Yuri Wolf for careful reading of the manuscript. This work was supported by National Institutes of Health/DHHS (Intramural Research program of the National Library of Medicine). K.H. was supported by a JSPS Research Fellowship from the Japan Society for the Promotion of Science.