Here we assessed the use of domain families for predicting the functions of whole proteins. These 'functional families' (FunFams) were derived using a protocol that combines sequence clustering with supervised cluster evaluation, relying on available high-quality Gene Ontology (GO) annotation data in the latter step. In essence, the protocol groups domain sequences belonging to the same superfamily into families based on the GO annotations of their parent proteins. An initial test based on enzyme sequences confirmed that the FunFams resemble enzyme (domain) families much better than do families produced by sequence clustering alone. For the CAFA 2011 experiment, we further associated the FunFams with GO terms probabilistically. All target proteins were first submitted to domain superfamily assignment, followed by FunFam assignment and, eventually, function assignment. The latter included an integration step for multi-domain target proteins. The CAFA results put our domain-based approach among the top ten of 31 competing groups and 56 prediction methods, confirming that it outperforms simple pairwise whole-protein sequence comparisons.
The TATA binding protein (TBP) is an essential transcription initiation factor in Archaea and Eucarya. Bacteria lack TBP, and instead use sigma factors for transcription initiation. TBP has a symmetric structure comprising two repeated TBP domains. Using sequence, structural and phylogenetic analyses, we examine the distribution and evolutionary history of the TBP domain, a member of the helix-grip fold family. Our analyses reveal a broader distribution than for TBP, with TBP-domains being present across all three domains of life. In contrast to TBP, all other characterized examples of the TBP domain are present as single copies, primarily within multidomain proteins. The presence of the TBP domain in the ubiquitous DNA glycosylases suggests that this fold traces back to the ancestor of all three domains of life. The TBP domain is also found in RNase HIII, and phylogenetic analyses show that RNase HIII has evolved from bacterial RNase HII via TBP-domain fusion. Finally, our comparative genomic screens confirm and extend earlier reports of proteins consisting of a single TBP domain among some Archaea. These monopartite TBP-domain proteins suggest that this domain is functional in its own right, and that the TBP domain could have first evolved as an independent protein, which was later recruited in different contexts.
Genome3D, available at http://www.genome3d.eu, is a new collaborative project that integrates UK-based structural resources to provide a unique perspective on sequence–structure–function relationships. Leading structure prediction resources (DomSerf, FUGUE, Gene3D, pDomTHREADER, Phyre and SUPERFAMILY) provide annotations for UniProt sequences to indicate the locations of structural domains (structural annotations) and their 3D structures (structural models). Structural annotations and 3D model predictions are currently available for three model genomes (Homo sapiens, E. coli and baker’s yeast), and the project will extend to other genomes in the near future. As these resources exploit different strategies for predicting structures, the main aim of Genome3D is to enable comparisons between all the resources so that biologists can see where predictions agree and are therefore more trusted. Furthermore, as these methods differ in whether they build their predictions using CATH or SCOP, Genome3D also contains the first official mapping between these two databases. This has identified pairs of similar superfamilies from the two resources at various degrees of consensus (532 bronze pairs, 527 silver pairs and 370 gold pairs).
CATH version 3.5 (Class, Architecture, Topology, Homology, available at http://www.cathdb.info/) contains 173 536 domains, 2626 homologous superfamilies and 1313 fold groups. When focusing on structural genomics (SG) structures, we observe that the number of new folds for CATH v3.5 is slightly less than for previous releases, and this observation suggests that we may now know the majority of folds that are easily accessible to structure determination. We have improved the accuracy of our functional family (FunFams) sub-classification method and the CATH sequence domain search facility has been extended to provide FunFam annotations for each domain. The CATH website has been redesigned. We have improved the display of functional data and of conserved sequence features associated with FunFams within each CATH superfamily.
Evolutionary innovation in eukaryotes and especially animals is at least partially driven by genome rearrangements and the resulting emergence of proteins with new domain combinations, and thus potentially novel functionality. Given the random nature of such rearrangements, one could expect that proteins with particularly useful multidomain combinations may have been rediscovered multiple times by parallel evolution. However, existing reports suggest a minimal role of this phenomenon in the overall evolution of eukaryotic proteomes. We assembled a collection of 172 complete eukaryotic genomes that is not only the largest, but also the most phylogenetically complete set of genomes analyzed so far. By employing a maximum parsimony approach to compare repertoires of Pfam domains and their combinations, we show that independent evolution of domain combinations is significantly more prevalent than previously thought. Our results indicate that about 25% of all currently observed domain combinations have evolved multiple times. Interestingly, this percentage is even higher for sets of domain combinations in individual species, with, for instance, 70% of the domain combinations found in the human genome having evolved independently at least once in other species. We also show that previous, much lower estimates of this rate are most likely due to the small number and biased phylogenetic distribution of the genomes analyzed. The process of independent emergence of identical domain combination is widespread, not limited to domains with specific functional categories. Besides data from large-scale analyses, we also present individual examples of independent domain combination evolution. The surprisingly large contribution of parallel evolution to the development of the domain combination repertoire in extant genomes has profound consequences for our understanding of the evolution of pathways and cellular processes in eukaryotes and for comparative functional genomics.
Most proteins in eukaryotes are composed of two or more domains, evolutionary independent units with (often) their own individual functions. The specific repertoire of multidomain proteins in a given species defines the topology of pathways and networks that carry out its metabolic and regulatory processes. When proteins with new domain combinations emerge by gene fusion and fission, it directly affects topology of cellular networks in this organism. To better understand the evolution of such networks we analyzed a large set of eukaryotic genomes for the evolutionary history of known domain combinations. Our analysis shows that 70% of all domain combinations present in the human genome independently appeared in at least one other eukaryotic genome. Overall, over 25% of all known multidomain architectures emerged independently several times in the history of life. The difference between a global and species specific picture can be explained by the existence of a core set of domain combinations that keeps reemerging in different species, which are accompanied by a smaller number of unique domain combinations that do not appear anywhere else.
Measuring gene transcription using real-time reverse transcription polymerase chain reaction (RT-qPCR) technology is a mainstay of molecular biology. Technologies now exist to measure the abundance of many transcripts in parallel. The selection of the optimal reference gene for the normalisation of this data is a recurring problem, and several algorithms have been developed in order to solve it. So far nothing in R exists to unite these methods, together with other functions to read in and normalise the data using the chosen reference gene(s).
We have developed two R/Bioconductor packages, ReadqPCR and NormqPCR, intended for a user with some experience with high-throughput data analysis using R, who wishes to use R to analyse RT-qPCR data. We illustrate their potential use in a workflow analysing a generic RT-qPCR experiment, and apply this to a real dataset. Packages are available from http://www.bioconductor.org/packages/release/bioc/html/ReadqPCR.htmland http://www.bioconductor.org/packages/release/bioc/html/NormqPCR.html
These packages increase the repetoire of RT-qPCR analysis tools available to the R user and allow them to (amongst other things) read their data into R, hold it in an ExpressionSet compatible R object, choose appropriate reference genes, normalise the data and look for differential expression between samples.
The first crucial step in any structural genomics project is the selection and prioritization of target proteins for structure determination. There may be a number of selection criteria to be satisfied, including that the proteins have novel folds, that they be representatives of large families for which no structure is known, and so on. The better the selection at this stage, the greater is the value of the structures obtained at the end of the experimental process. This value can be further enhanced once the protein structures have been solved if the functions of the given proteins can also be determined. Here we describe the methods used at either end of the experimental process: firstly, sensitive sequence comparison techniques for selecting a high-quality list of target proteins, and secondly the various computational methods that can be applied to the eventual 3D structures to determine the most likely biochemical function of the proteins in question.
Structural genomics; target selection; function from structure; functional annotation
The mitotic spindle is an essential molecular machine involved in cell division, whose composition has been studied extensively by detailed cellular biology, high-throughput proteomics, and RNA interference experiments. However, because of its dynamic organization and complex regulation it is difficult to obtain a complete description of its molecular composition. We have implemented an integrated computational approach to characterize novel human spindle components and have analysed in detail the individual candidates predicted to be spindle proteins, as well as the network of predicted relations connecting known and putative spindle proteins. The subsequent experimental validation of a number of predicted novel proteins confirmed not only their association with the spindle apparatus but also their role in mitosis. We found that 75% of our tested proteins are localizing to the spindle apparatus compared to a success rate of 35% when expert knowledge alone was used. We compare our results to the previously published MitoCheck study and see that our approach does validate some findings by this consortium. Further, we predict so-called “hidden spindle hub”, proteins whose network of interactions is still poorly characterised by experimental means and which are thought to influence the functionality of the mitotic spindle on a large scale. Our analyses suggest that we are still far from knowing the complete repertoire of functionally important components of the human spindle network. Combining integrated bio-computational approaches and single gene experimental follow-ups could be key to exploring the still hidden regions of the human spindle system.
In order to understand the evolution of enzyme reactions and to gain an overview of biological catalysis we have combined sequence and structural data to generate phylogenetic trees in an analysis of 276 structurally defined enzyme superfamilies, and used these to study how enzyme functions have evolved. We describe in detail the analysis of two superfamilies to illustrate different paradigms of enzyme evolution. Gathering together data from all the superfamilies supports and develops the observation that they have all evolved to act on a diverse set of substrates, whilst the evolution of new chemistry is much less common. Despite that, by bringing together so much data, we can provide a comprehensive overview of the most common and rare types of changes in function. Our analysis demonstrates on a larger scale than previously studied, that modifications in overall chemistry still occur, with all possible changes at the primary level of the Enzyme Commission (E.C.) classification observed to a greater or lesser extent. The phylogenetic trees map out the evolutionary route taken within a superfamily, as well as all the possible changes within a superfamily. This has been used to generate a matrix of observed exchanges from one enzyme function to another, revealing the scale and nature of enzyme evolution and that some types of exchanges between and within E.C. classes are more prevalent than others. Surprisingly a large proportion (71%) of all known enzyme functions are performed by this relatively small set of 276 superfamilies. This reinforces the hypothesis that relatively few ancient enzymatic domain superfamilies were progenitors for most of the chemistry required for life.
Enzymes, as biological catalysts, are crucial to life. Understanding how enzymes have evolved to perform the wide variety of reactions found across all kingdoms of life is fundamental to a broad range of biological studies, especially those leading to new therapeutics. To unravel the evolution of novel enzyme function requires combining information on protein structure, sequence, phylogeny and chemistry (in terms of interacting small molecules and reaction mechanisms). We have developed a protocol for integrating this wide range of data, which we have applied to a relatively large number of families comprising some very diverse relatives. This has permitted us to present an initial overview of the evolution of novel enzyme functions, in which we observe that some changes in function between relatives are more common than others, with most of the functionality observed in nature confined to relatively few families. Moreover, we are able to identify the evolutionary route taken within a superfamily to change the enzyme function from one reaction to another. This information may help in predicting the function of an enzyme that has yet to be experimentally characterised as well as in designing new enzymes for industrial and medical purposes.
Many persistent pain states (pain lasting for hours, days, or longer) are poorly treated because of the limitations of existing therapies. Analgesics such as nonsteroidal anti-inflammatory drugs and opioids often provide incomplete pain relief and prolonged use results in the development of severe side effects. Identification of the key mediators of various types of pain could improve such therapies. Here, we tested the hypothesis that hitherto unrecognized cytokines and chemokines might act as mediators in inflammatory pain. We used ultraviolet B (UVB) irradiation to induce persistent, abnormal sensitivity to pain in humans and rats. The expression of more than 90 different inflammatory mediators was measured in treated skin at the peak of UVB-induced hypersensitivity with custom-made polymerase chain reaction arrays. There was a significant positive correlation in the overall expression profiles between the two species. The expression of several genes [interleukin-1β (IL-1β), IL-6, and cyclooxygenase-2 (COX-2)], previously shown to contribute to pain hypersensitivity, was significantly increased after UVB exposure, and there was dysregulation of several chemokines (CCL2, CCL3, CCL4, CCL7, CCL11, CXCL1, CXCL2, CXCL4, CXCL7, and CXCL8). Among the genes measured, CXCL5 was induced to the greatest extent by UVB treatment in human skin; when injected into the skin of rats, CXCL5 recapitulated the mechanical hypersensitivity caused by UVB irradiation. This hypersensitivity was associated with the infiltration of neutrophils and macrophages into the dermis, and neutralizing the effects of CXCL5 attenuated the abnormal pain-like behavior. Our findings demonstrate that the chemokine CXCL5 is a peripheral mediator of UVB-induced inflammatory pain, likely in humans as well as rats.
Gene3D http://gene3d.biochem.ucl.ac.uk is a comprehensive database of protein domain assignments for sequences from the major sequence databases. Domains are directly mapped from structures in the CATH database or predicted using a library of representative profile HMMs derived from CATH superfamilies. As previously described, Gene3D integrates many other protein family and function databases. These facilitate complex associations of molecular function, structure and evolution. Gene3D now includes a domain functional family (FunFam) level below the homologous superfamily level assignments. Additions have also been made to the interaction data. More significantly, to help with the visualization and interpretation of multi-genome scale data sets, we have developed a new, revamped website. Searching has been simplified with more sophisticated filtering of results, along with new tools based on Cytoscape Web, for visualizing protein–protein interaction networks, differences in domain composition between genomes and the taxonomic distribution of individual superfamilies.
InterPro (http://www.ebi.ac.uk/interpro/) is a database that integrates diverse information about protein families, domains and functional sites, and makes it freely available to the public via Web-based interfaces and services. Central to the database are diagnostic models, known as signatures, against which protein sequences can be searched to determine their potential function. InterPro has utility in the large-scale analysis of whole genomes and meta-genomes, as well as in characterizing individual protein sequences. Herein we give an overview of new developments in the database and its associated software since 2009, including updates to database content, curation processes and Web and programmatic interfaces.
Some superfamilies contain large numbers of protein domains with very different functions. The ability to refine the functional classification of domains within these superfamilies is necessary for better understanding the evolution of functions and to guide function prediction of new relatives. To achieve this, a suitable starting point is the detailed analysis of functional divisions and mechanisms of functional divergence in a single superfamily. Here we present such a detailed analysis in the superfamily of HUP domains. A biologically meaningful functional classification of HUP domains is obtained manually. Mechanisms of function diversification are investigated in detail using this classification. We observe that structural motifs play an important role in shaping broad functional divergence, whereas residue-level changes shape diversity at a more specific level. In parallel, we examine the ability of an automated protocol to capture the biologically meaningful classification, with a view to automatically extending this classification in the future.
FunTree is a new resource that brings together sequence, structure, phylogenetic, chemical and mechanistic information for structurally defined enzyme superfamilies. Gathering together this range of data into a single resource allows the investigation of how novel enzyme functions have evolved within a structurally defined superfamily as well as providing a means to analyse trends across many superfamilies. This is done not only within the context of an enzyme's sequence and structure but also the relationships of their reactions. Developed in tandem with the CATH database, it currently comprises 276 superfamilies covering ∼1800 (70%) of sequence assigned enzyme reactions. Central to the resource are phylogenetic trees generated from structurally informed multiple sequence alignments using both domain structural alignments supplemented with domain sequences and whole sequence alignments based on commonality of multi-domain architectures. These trees are decorated with functional annotations such as metabolite similarity as well as annotations from manually curated resources such the catalytic site atlas and MACiE for enzyme mechanisms. The resource is freely available through a web interface: www.ebi.ac.uk/thorton-srv/databases/FunTree.
Protein Kinases are a superfamily of proteins involved in crucial cellular processes such as cell cycle regulation and signal transduction. Accordingly, they play an important role in cancer biology. To contribute to the study of the relation between kinases and disease we compared pathogenic mutations to neutral mutations as an extension to our previous analysis of cancer somatic mutations. First, we analyzed native and mutant proteins in terms of amino acid composition. Secondly, mutations were characterized according to their potential structural effects and finally, we assessed the location of the different classes of polymorphisms with respect to kinase-relevant positions in terms of subfamily specificity, conservation, accessibility and functional sites.
Pathogenic Protein Kinase mutations perturb essential aspects of protein function, including disruption of substrate binding and/or effector recognition at family-specific positions. Interestingly these mutations in Protein Kinases display a tendency to avoid structurally relevant positions, what represents a significant difference with respect to the average distribution of pathogenic mutations in other protein families.
Disease-associated mutations display sound differences with respect to neutral mutations: several amino acids are specific of each mutation type, different structural properties characterize each class and the distribution of pathogenic mutations within the consensus structure of the Protein Kinase domain is substantially different to that for non-pathogenic mutations. This preferential distribution confirms previous observations about the functional and structural distribution of the controversial cancer driver and passenger somatic mutations and their use as a proxy for the study of the involvement of somatic mutations in cancer development.
The Gene3D structural domain database provides domain annotations for 7 million proteins, based on the manually curated structural domain superfamilies in CATH. These annotations are integrated with functional, genomic and molecular information from external resources, such as GO, EC, UniProt and the NCBI Taxonomy database. We have constructed a set of web services that provide programmatic access to this integrated database, as well as the Gene3D domain recognition tool (Gene3DScan) and protein sequence annotation pipeline for analysing novel protein sequences. Example queries include retrieving all curated GO terms for a domain superfamily or all the multi-domain architectures for the human genome. The services can be accessed using simple HTTP calls and are able to return results in a range of formats for quick downloading and easy parsing, graphical rendering and data storage. Hence, they provide a simple, but flexible means of integrating domain annotations and associated data sets into locally run pipelines and analysis software. The services can be found at http://gene3d.biochem.ucl.ac.uk/WebServices/.
The Midwest Center for Structural Genomics (MCSG) is one of the large-scale centres of the Protein Structure Initiative (PSI). During the first two phases of the PSI the MCSG has solved over a thousand protein structures. A criticism of structural genomics is that target selection strategies mean that some structures are solved without having a known function and thus are of little biomedical significance. Structures of unknown function have stimulated the development of methods for function prediction from structure.
We show that the MCSG has met the stated goals of the PSI and use online resources and readily available function prediction methods to provide functional annotations for more than 90% of the MCSG structures. The structure-to-function prediction method ProFunc provides likely functions for many of the MCSG structures that cannot be annotated by sequence-based methods.
Although the focus of the PSI was structural coverage, many of the structures solved by the MCSG can also be associated with functional classes and biological roles of possible biomedical value.
The inhibitory T-cell surface-expressed receptor, cytotoxic T lymphocyte-associated antigen-4 (CTLA-4), which belongs to the class of cell surface proteins phosphorylated by extrinsic tyrosine kinases that also includes antigen receptors, binds the related ligands, B7-1 and B7-2, expressed on antigen-presenting cells. Conformational changes are commonly invoked to explain ligand-induced “triggering” of this class of receptors. Crystal structures of ligand-bound CTLA-4 have been reported, but not the apo form, precluding analysis of the structural changes accompanying ligand binding. The 1.8-Å resolution structure of an apo human CTLA-4 homodimer emphasizes the shared evolutionary history of the CTLA-4/CD28 subgroup of the immunoglobulin superfamily and the antigen receptors. The ligand-bound and unbound forms of both CTLA-4 and B7-1 are remarkably similar, in marked contrast to B7-2, whose binding to CTLA-4 has elements of induced fit. Isothermal titration calorimetry reveals that ligand binding by CTLA-4 is enthalpically driven and accompanied by unfavorable entropic changes. The similarity of the thermodynamic parameters determined for the interactions of CTLA-4 with B7-1 and B7-2 suggests that the binding is not highly specific, but the conformational changes observed for B7-2 binding suggest some level of selectivity. The new structure establishes that rigid-body ligand interactions are capable of triggering CTLA-4 phosphorylation by extrinsic kinase(s).
Cell Surface Receptor; Crystal Structure; Phosphotyrosine Receptor; Receptor Structure-Function; Signal Transduction; Conformational Change; Receptor Triggering
CATH version 3.3 (class, architecture, topology, homology) contains 128 688 domains, 2386 homologous superfamilies and 1233 fold groups, and reflects a major focus on classifying structural genomics (SG) structures and transmembrane proteins, both of which are likely to add structural novelty to the database and therefore increase the coverage of protein fold space within CATH. For CATH version 3.4 we have significantly improved the presentation of sequence information and associated functional information for CATH superfamilies. The CATH superfamily pages now reflect both the functional and structural diversity within the superfamily and include structural alignments of close and distant relatives within the superfamily, annotated with functional information and details of conserved residues. A significantly more efficient search function for CATH has been established by implementing the search server Solr (http://lucene.apache.org/solr/). The CATH v3.4 webpages have been built using the Catalyst web framework.
Accurate modelling of biological systems requires a deeper and more complete knowledge about the molecular components and their functional associations than we currently have. Traditionally, new knowledge on protein associations generated by experiments has played a central role in systems modelling, in contrast to generally less trusted bio-computational predictions. However, we will not achieve realistic modelling of complex molecular systems if the current experimental designs lead to biased screenings of real protein networks and leave large, functionally important areas poorly characterised. To assess the likelihood of this, we have built comprehensive network models of the yeast and human proteomes by using a meta-statistical integration of diverse computationally predicted protein association datasets. We have compared these predicted networks against combined experimental datasets from seven biological resources at different level of statistical significance. These eukaryotic predicted networks resemble all the topological and noise features of the experimentally inferred networks in both species, and we also show that this observation is not due to random behaviour. In addition, the topology of the predicted networks contains information on true protein associations, beyond the constitutive first order binary predictions. We also observe that most of the reliable predicted protein associations are experimentally uncharacterised in our models, constituting the hidden or “dark matter” of networks by analogy to astronomical systems. Some of this dark matter shows enrichment of particular functions and contains key functional elements of protein networks, such as hubs associated with important functional areas like the regulation of Ras protein signal transduction in human cells. Thus, characterising this large and functionally important dark matter, elusive to established experimental designs, may be crucial for modelling biological systems. In any case, these predictions provide a valuable guide to these experimentally elusive regions.
To model accurate protein networks we need to extend our knowledge of protein associations in molecular systems much further. Biologists believe that high-throughput experiments will fill the gaps in our knowledge. However, if these approaches perform biased screenings, leaving important areas poorly characterized, success in modelling protein networks will require additional approaches to explore these ‘dark’ areas. We assess the value of integrating bio-computational approaches to build accurate and comprehensive network models for human and yeast proteomes and compare these models with models derived by combining multiple experimental datasets. We show that the predicted networks resemble the topological and error features of the experimental networks, and contain information on true protein associations within and beyond their constitutive first order binary predictions. We suggest that the majority of predicted network space is dark matter containing important functional areas, elusive to current experimental designs. Until novel experimental designs emerge as effective tools to screen these hidden regions, computational predictions will be a valuable approach for exploring them.
The ability to assign function to proteins has become a major bottleneck for comprehensively understanding cellular mechanisms at the molecular level. Here we discuss the extent to which structural domain classifications can help in deciphering the complex relationship between the functions of proteins and their sequences and structures. Structural classifications are particularly helpful in understanding the mosaic manner in which new proteins and functions emerge through evolution. This is partly because they provide reliable and concrete domain definitions and enable the detection of very remote structural similarities and homologies. It is also because structural data can illuminate more clearly the mechanisms by which the functions of homologues can be modified during evolution and a broader functional repertoire emerge.
One major objective of structural genomics efforts, including the NIH-funded Protein Structure Initiative (PSI), has been to increase the structural coverage of protein sequence space. Here, we present the target selection strategy used during the second phase of PSI (PSI-2). This strategy, jointly devised by the bioinformatics groups associated with the PSI-2 large-scale production centres, targets representatives from large, structurally uncharacterised protein domain families, and from structurally uncharacterised subfamilies in very large and diverse families with incomplete structural coverage. These very large families are extremely diverse both structurally and functionally, and are highly over-represented in known proteomes. On the basis of several metrics, we then discuss to what extent PSI-2, during its first three years, has increased the structural coverage of genomes, and contributed structural and functional novelty. Together, the results presented here suggest that PSI-2 is successfully meeting its objectives and provides useful insights into structural and functional space.
In order to understand how biological systems function it is necessary to determine the interactions and associations between proteins. Gene fusion prediction is one approach to detection of such functional relationships. Its use is however known to be problematic in higher eukaryotic genomes due to the presence of large homologous domain families. Here we introduce CODA (Co-Occurrence of Domains Analysis), a method to predict functional associations based on the gene fusion idiom.
We apply a novel scoring scheme which takes account of the genome-specific size of homologous domain families involved in fusion to improve accuracy in predicting functional associations. We show that CODA is able to accurately predict functional similarities in human with comparison to state-of-the-art methods and show that different methods can be complementary. CODA is used to produce evidence that a currently uncharacterised human protein may be involved in pathways related to depression and that another is involved in DNA replication.
The relative performance of different gene fusion methodologies has not previously been explored. We find that they are largely complementary, with different methods being more or less appropriate in different genomes. Our method is the only one currently available for download and can be run on an arbitrary dataset by the user. The CODA software and datasets are freely available from ftp://ftp.biochem.ucl.ac.uk/pub/gene3d_data/v6.1.0/CODA/. Predictions are also available via web services from http://funcnet.eu/.
Functionally analogous enzymes are those that catalyze similar reactions on similar substrates but do not share common ancestry, providing a window on the different structural strategies nature has used to evolve required catalysts. Identification and use of this information to improve reaction classification and computational annotation of enzymes newly discovered in the genome projects would benefit from systematic determination of reaction similarities. Here, we quantified similarity in bond changes for overall reactions and catalytic mechanisms for 95 pairs of functionally analogous enzymes (non-homologous enzymes with identical first three numbers of their EC codes) from the MACiE database. Similarity of overall reactions was computed by comparing the sets of bond changes in the transformations from substrates to products. For similarity of mechanisms, sets of bond changes occurring in each mechanistic step were compared; these similarities were then used to guide global and local alignments of mechanistic steps. Using this metric, only 44% of pairs of functionally analogous enzymes in the dataset had significantly similar overall reactions. For these enzymes, convergence to the same mechanism occurred in 33% of cases, with most pairs having at least one identical mechanistic step. Using our metric, overall reaction similarity serves as an upper bound for mechanistic similarity in functional analogs. For example, the four carbon-oxygen lyases acting on phosphates (EC 4.2.3) show neither significant overall reaction similarity nor significant mechanistic similarity. By contrast, the three carboxylic-ester hydrolases (EC 3.1.1) catalyze overall reactions with identical bond changes and have converged to almost identical mechanisms. The large proportion of enzyme pairs that do not show significant overall reaction similarity (56%) suggests that at least for the functionally analogous enzymes studied here, more stringent criteria could be used to refine definitions of EC sub-subclasses for improved discrimination in their classification of enzyme reactions. The results also indicate that mechanistic convergence of reaction steps is widespread, suggesting that quantitative measurement of mechanistic similarity can inform approaches for functional annotation.
When species evolve, their genes duplicate and diverge to allow for adaptation of their functional repertoires to the changing environment. In this scenario, unrelated genes can convergently evolve to produce proteins with the same molecular function, termed “functionally analogous.” A quantitative determination of the reaction similarities among functionally analogous enzymes could provide insight about the different structural solutions nature has used to evolve similar catalysts. Bond changes between substrates and products, and between successive reaction intermediates, were used to compare the reactions catalyzed and the mechanisms of catalysis for 95 pairs of functionally analogous enzymes. Less than half of the reactions catalyzed by unrelated enzymes, but defined as similar by the Enzyme Commission (EC) classification, are similar in terms of bond changes, suggesting that this classification often fails to capture quantitative differences between many enzyme reactions. Furthermore, we addressed for the first time whether the chemical mechanisms by which similar overall reactions are achieved in functional analogs are also similar. We conclude that convergence of reaction is often accompanied by convergence of chemical mechanism. These results will be useful for classifying enzymes, guiding functional annotation of newly determined enzyme sequences and structures and for informing the engineering of enzymes with new functions.