With genomic data skyrocketing, their biological interpretation remains a serious challenge. Diverse computational methods address this problem by pointing to the existence of recurrent patterns among sequence, structure, and function. These patterns emerge naturally from evolutionary variation, natural selection, and divergence—the defining features of biological systems—and they identify molecular events and shapes that underlie specificity of function and allosteric communication. Here we review these methods, and the patterns they identify in case studies and in proteome-wide applications, to infer and rationally redesign function.
Proteins remain difficult to characterize functionally despite the exponential growth in experimental data on sequence, structure, and function. There are many reasons for this persistent challenge. Proteins have not a single molecular function but rather multiple features that cooperatively sustain their biological fitness. The details and parameters of these features, e.g. folding, dynamics, cellular targeting, molecular interactions, catalytic activity, allosteric control, post-translational modifications, and degradation, to name a few, are often vague for lack of laboratory assays to measure them accurately, on a large scale, and in their relevant cellular context. As a consequence, as of March 2012, fewer than 0.1% of the 21 million protein sequences from 3173 completely sequenced genomes 1 had experimentally tested functions, and only two-thirds had at least one automated computationally inferred annotation 2–4. The fraction of genes without known function is 37% in eukaryotes, 24% in humans, 33% in the far simpler and much studied E. coli, and 40% in other bacteria 2, 5. Although most of the 4225 E. coli genes were recently assigned putative annotations of functional associations, they were not assigned biochemical function 6. Given concerns that some of these annotations may not be accurate 7, the problem of translating sequence into function, and more broadly of translating genotype into phenotype, remains daunting.
Computational methods have long sought to fill this gap. A remarkable early success was to realize that sequence and structure diverge smoothly: the root mean square deviation of protein backbones increases exponentially with the sequence divergence of evolutionarily related proteins, or homologs 8. This elegant observation is robust 9, and extends to other functional features besides folding 10 so that, in practice, it justifies homology-based predictions of structure and of function 11, arguably the two most widespread computational applications in biology. Other basic evolutionary principles are emerging from high throughput and systems biology 7: protein mutation rate and protein expression are inversely correlated 8; biological networks obey power-laws and are scale-free 12; and the evolutionary rates of orthologs follow a Gaussian spread 13. Despite their statistical power, these principles involve ensemble averages over whole sequences, structures, families, genomes and networks, as well as very long time-scales. They therefore carry limited information on the direct contribution of individual sequence positions to the function of a given protein.
Single residue variations may profoundly impact function, and explain why homology-based function prediction can lead to incorrect annotations: although alike in sequence and structure, two homologs may harbor differences at one or just a few residues with disproportionate impact on function 14. The identification of such key residues is therefore essential to distinguish meaningful variations of function. Accordingly, this review focuses on methods to identify functionally relevant evolutionary patterns among sequence, structure, and function. Such patterns emerge naturally from random variations and natural selection; they identify molecular events and shapes that determine function and specificity; and they can be approached by focusing on sequences, on structures, and on evolutionary classification. In the second part of the review, the focus will shift to the combination of these techniques in a unifying Evolutionary Trace framework.
Throughout the review, we will refer to two popular functional classification systems. Gene Ontology (GO) 4 provides well-defined terms for the molecular function, cellular component, and biological process of a gene product, along with evidence codes that specify the basis for the annotation and therefore its reliability. Enzyme Commission classification designates enzymatic function into four (EC) numbers 15, indicating the mechanism of the enzyme, the type of bond, the catalyzed reaction, and the substrate, respectively.
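Because EC numbers are hierarchical, the level at which two annotations agree gives a simple measure of how specific a predicted function is. The following is a minimal sketch of that comparison; the function names `parse_ec` and `shared_ec_depth` are hypothetical and introduced here only for illustration.

```python
def parse_ec(ec):
    """Split an EC string such as '3.4.21.4' into its four fields."""
    fields = ec.split(".")
    if len(fields) != 4:
        raise ValueError(f"expected four EC fields, got {ec!r}")
    return fields


def shared_ec_depth(ec_a, ec_b):
    """Count the leading EC fields on which two annotations agree (0 to 4).

    A '-' field marks an unspecified level and stops the comparison.
    """
    depth = 0
    for a, b in zip(parse_ec(ec_a), parse_ec(ec_b)):
        if a != b or a == "-":
            break
        depth += 1
    return depth


# Trypsin (3.4.21.4) and chymotrypsin (3.4.21.1) agree on three fields.
print(shared_ec_depth("3.4.21.4", "3.4.21.1"))
```

Accuracy figures quoted "at three-digit EC numbers" later in the review correspond to agreement at a depth of three in this sense.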
The simplest and most widespread evolutionary pattern for defining function is homology between proteins or domains. The rationale is that homology implies that proteins share a common ancestry and hence the function of that common ancestor. Once it is recognized by similarity searches with BLAST or PSI-BLAST 16, function is transferred between close homologs. A concern is that these homologs may have already evolved distinct functions. Thus homology-based annotation errors are not uncommon: divergence of activity has been observed even between enzymes with as much as 70% sequence identity 17. To compound this problem, these errors may in turn propagate across databases 7. To reduce incorrect annotations, multiple techniques, including GOtcha 18, ESG 19, and GOPred 20, tally the GO terms of all of the most significant sequence similarity matches and identify those with the best statistics. For example, GOtcha weights this tally by the significance of each PSI-BLAST match to a database of proteins with GO annotations, to generate a probability that the query protein performs a particular function.
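The idea of a significance-weighted tally can be illustrated with a short sketch. This is not GOtcha's published scoring scheme: the weight (minus the base-10 logarithm of the E-value) and the normalization by the total weight are assumptions chosen for simplicity, and `tally_go_terms` is a hypothetical name.

```python
import math
from collections import defaultdict


def tally_go_terms(hits):
    """Score GO terms from similarity hits.

    hits: list of (e_value, set_of_go_terms) pairs, one per database match.
    Returns {go_term: score}, where the score is the share of the total
    hit weight carried by matches annotated with that term (0 to 1).
    """
    scores = defaultdict(float)
    total = 0.0
    for e_value, terms in hits:
        # Assumed weight: more significant hits (smaller E-value) count more.
        weight = max(0.0, -math.log10(max(e_value, 1e-300)))
        total += weight
        for term in terms:
            scores[term] += weight
    if total == 0.0:
        return {}
    return {term: s / total for term, s in scores.items()}


hits = [
    (1e-50, {"GO:0004252"}),
    (1e-30, {"GO:0004252", "GO:0005515"}),
    (1e-20, {"GO:0005515"}),
]
for term, score in sorted(tally_go_terms(hits).items()):
    print(term, round(score, 2))
```

In this toy input the term shared by the two most significant hits receives the highest score, which is the behavior such tallies are meant to capture.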
Other methods go beyond whole sequence comparison to focus on alignment columns with significant conservation 21, 22. The results are generalized profiles to infer structural or functional similarities. Pfam 23 is a widely used database of Hidden Markov Model profiles generated by HMMER 24 applied to the UniProt database 2. To enhance specificity, Pfam-A uses a smaller set of almost 12,000 sequences representative of individual families that were hand-curated with functional annotations from literature references; to achieve sensitivity, Pfam-B uses a larger set of nearly 140,000 families that were clustered automatically and without dedicated annotation or reference. While Pfam and methods such as PROSITE 25 and InterPro 26 focus primarily on the entire protein domain, other sources, such as the ELM database 27, focus instead on smaller motifs.
Even more refined searches focus on specific residues that together define a functional signature. Transfer of function based on these signatures can increase annotation specificity, i.e. lower false positives, by recognizing functionally inconsistent differences among key residues. Several sequence motif-based algorithms were designed specifically for this task, including ConFunc 28, DME 29, and EFICAz2 30. All rely on discovering discriminatory sequence fragments shared by proteins with identical function and not others. ConFunc applies GO terms to partition homologs into multiple subsets. The sequences of each subset are then aligned to identify conserved residues. A GO term can then be transferred to a new homolog if it shares this residue signature. Controls suggest 24% greater accuracy of annotation compared to BLAST for homologs with less than 35% sequence identity. Likewise, DME and EFICAz2 use conservation to home in on functional residues specific to given enzyme functions.
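The signature idea can be made concrete with a toy example. The sketch below assumes pre-aligned sequences of equal length and strict invariance within a functional subset; it illustrates the general principle only and does not reproduce the ConFunc, DME, or EFICAz2 algorithms. All function names and sequences are hypothetical.

```python
def conserved_signature(aligned_seqs):
    """Return {column: residue} for columns that are invariant and non-gap."""
    signature = {}
    for col, residues in enumerate(zip(*aligned_seqs)):
        first = residues[0]
        if first != "-" and all(r == first for r in residues):
            signature[col] = first
    return signature


def discriminating_signature(subset, others):
    """Keep conserved columns of `subset` that at least one other homolog breaks."""
    signature = conserved_signature(subset)
    return {c: r for c, r in signature.items()
            if any(o[c] != r for o in others)}


def matches_signature(seq, signature):
    """True if `seq` carries every signature residue (empty signatures never match)."""
    return bool(signature) and all(seq[c] == r for c, r in signature.items())


# Toy alignment: two homologs sharing a GO term versus one homolog without it.
with_term = ["GDSGGP", "GDSGGA"]
without_term = ["GDAGGP"]
sig = discriminating_signature(with_term, without_term)
print(sig)
print(matches_signature("GDSGGK", sig), matches_signature("GDAGGK", sig))
```

Only the column that separates the two subsets survives in the signature, so a new homolog is judged on the residue that actually discriminates function rather than on overall similarity.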
Together these studies show that comparative sequence analyses identify evolutionary patterns at different levels of resolution, from whole sequence to profiles to motifs, that are all relevant to structure and function and useful to transfer annotations among proteins.
Structural information adds another dimension to the search for functionally relevant similarities among proteins. First, global structure alignments will detect homologies that elude sequence searches 8. Additionally, spatial correlation among key residues can reveal highly specific three-dimensional (3D) functional features 31. Some structural comparisons treat the structure as a rigid body, as in DALI 32 and TM-align 33, while others tolerate flexibility, as in TOPS++FATCAT 34. A challenge for these structural alignments is the lack of a universally accepted definition of structural similarity 35. In order to address this, CATH 36 and SCOP 37 created manually curated protein structure classification codes based on both domain and evolutionary similarities. These classifications enable functional inference from protein structure in many cases, but overall, and for the same reasons that a few amino acids prove determinant of function in sequence comparisons, the structure-to-function relationship over protein domains is not one-to-one 38.
This motivated searches for specific structural regions resembling previously characterized pockets for catalysis and ligand-binding or surface regions for macromolecular interactions 39. In a control set of 332 ligand-binding proteins, ConCavity 40 correctly predicted the binding site in 80% of cases by searching jointly for the local conservation of sequence and structural topology. Similar methods 41, 42 are listed in Table 1. FINDSITE 43 and 3DLigandSite 44 extend these ideas to homology models and detect the functional determinants of a ligand binding site. FINDSITE specifically creates homology models of the query, structurally aligns these to determine a likely binding site, and then suggests ligands and other GO functional annotations. In controls with less than 35% sequence identity to the nearest target protein, FINDSITE reached 67% accuracy. A related method, pevoSOAR 45, annotates structures for enzymatic function with 80% accuracy in limited controls. Together these studies show that patterns of local structural similarities add important information for functional inference.
Further following the logic of sequence comparisons, structural searches can also focus on just the few residues that mediate the most essential aspects of catalysis or binding. The example of the Ser-His-Asp catalytic triad of serine proteases illustrates that only a few amino acids in a well-defined structural conformation are sufficient to annotate function in structures 46. This suggests a general strategy in which a small but functionally essential structural motif, called a 3D template, is matched geometrically in other protein structures. A matched protein may then potentially perform the function associated with the template 47. Several methods, including FunClust 48, GASPS 49, SuMo 50, PAR-3D 51, and PINTS 52 follow this strategy. They typically rely on a source of structural motifs that are functionally relevant, such as The Catalytic Site Atlas 53 database, which compiles templates for enzyme activity taken from the experimental literature. To identify enzymatic templates more generally, FLORA defines them in terms of recurrent structural patterns in the superimposed structures of enzyme homologs 54.
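A geometric template match of this kind can be sketched as a search for residues of the right types whose pairwise distances agree with those of the template. The code below is a brute-force illustration under stated assumptions: one representative point per residue, a hypothetical distance tolerance of 1.0 Å, and made-up coordinates. It is not the algorithm of any of the cited methods, and because it compares distances only, it cannot distinguish a site from its mirror image.

```python
import itertools
import math


def match_template(template, structure, tol=1.0):
    """Find residue sets in `structure` that reproduce the template geometry.

    template, structure: lists of (residue_type, (x, y, z)).
    Returns a list of tuples of structure indices, in template order.
    """
    # Candidate structure residues for each template position, by residue type.
    candidates = [
        [j for j, (rtype, _) in enumerate(structure) if rtype == t_type]
        for t_type, _ in template
    ]
    pairs = list(itertools.combinations(range(len(template)), 2))
    matches = []
    for combo in itertools.product(*candidates):
        if len(set(combo)) < len(combo):
            continue  # a structure residue cannot be used twice
        if all(
            abs(math.dist(template[a][1], template[b][1])
                - math.dist(structure[combo[a]][1], structure[combo[b]][1])) <= tol
            for a, b in pairs
        ):
            matches.append(combo)
    return matches


# Toy Ser-His-Asp template and a structure containing a translated copy of it.
template = [("SER", (0, 0, 0)), ("HIS", (3, 0, 0)), ("ASP", (3, 4, 0))]
structure = [
    ("GLY", (0, 0, 0)),
    ("SER", (10, 10, 10)),
    ("HIS", (13, 10, 10)),
    ("ASP", (13, 14, 10)),
    ("ASP", (30, 30, 30)),
]
print(match_template(template, structure))
```

The exhaustive search is exponential in the template size, which is acceptable for templates of a handful of residues but would need pruning for larger motifs.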
Molecular function may also be inferred from phylogenomic classifications. Starting with an alignment of homologs and an associated phylogenetic tree, annotations are transferred within branches following the topology of the tree 55. Typically, uncharacterized proteins can inherit the annotation of the ortholog subfamily to which they belong. GeMMA 56, SCI-PHY 57, PROTONET 58, and SIFTER 59, 60 reflect these ideas. The phylogenetic tree of PROTONET 58 has nearly 10 million sequences, and a user can retrieve the evolutionary tree relevant to a query protein of their choice, and navigate its branches to search for functional information. In a more automated approach, SIFTER models protein evolution to propagate GO annotations within the tree 59, 60. This is a slow process, but limiting the number of possible combinations of molecular functions for individual proteins significantly raises efficiency without loss of prediction accuracy 60.
Because paralogs arise from gene duplication and usually evolve different functions, it is important to distinguish them from orthologs. Algorithms that detect orthology often rely on tree reconciliation approaches. Typically, a phylogenetic tree of homologs is compared to a speciation tree, allowing paralogs and orthologs to be identified by inferring the order of events for gene loss and duplication. TreeFam 61 provides ortholog and paralog assignments based on this approach, as well as phylogenetic trees for individual proteins for mammal families. PhylomeDB 62 uses a different species-overlap algorithm, which compares the species identity of closely related branches to decide whether their parental node is a duplication or a speciation. It provides orthology predictions, alignments, and phylogenetic trees for human, Saccharomyces cerevisiae, and Escherichia coli.
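The species-overlap rule described above lends itself to a compact sketch: a node is called a duplication when the species found under its two child branches overlap, and a speciation otherwise. The tree encoding (leaves as `(gene_id, species)` tuples, internal nodes as two-element lists) and the function name `annotate` are assumptions made for this illustration, not the PhylomeDB data model.

```python
def annotate(node, events):
    """Label internal nodes of a binary gene tree by species overlap.

    node: a leaf (gene_id, species) tuple, or a [left, right] list.
    events: list that receives (sorted gene ids under the node, label).
    Returns (set of species, list of gene ids) found under the node.
    """
    if isinstance(node, tuple):  # leaf
        gene, species = node
        return {species}, [gene]
    left, right = node
    species_l, genes_l = annotate(left, events)
    species_r, genes_r = annotate(right, events)
    # Shared species on both sides imply that the gene was duplicated.
    label = "duplication" if species_l & species_r else "speciation"
    events.append((sorted(genes_l + genes_r), label))
    return species_l | species_r, genes_l + genes_r


tree = [
    [("A1", "human"), ("A1m", "mouse")],
    [("A2", "human"), ("A2m", "mouse")],
]
events = []
annotate(tree, events)
for genes, label in events:
    print(genes, label)
```

In this toy tree the two inner nodes are speciations, so A1/A1m and A2/A2m are orthologous pairs, while the root is a duplication, so A1 and A2 are paralogs.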
It is possible to integrate the diverse evolutionary patterns seen in sequences, motifs, templates, and phylogenies through Evolutionary Trace (ET) analysis 63. This approach applies proteome-wide and has been extensively validated in experimental case studies. It yields tools to map functional sites in proteins, identify their key determinants, guide protein redesign studies, and extract 3D functional motifs with which to annotate protein function in novel structures. In view of this variety of applications, ET patterns arise from a surprisingly basic classification procedure.
In order to discover which residues are important to structure and function, ET systematically ranks amino acid positions by their phylogenetic patterns of variation. Starting with a protein family alignment and the corresponding evolutionary divergence tree, ET ranks residue positions better, or worse, depending on whether the substitutions in their alignment column correlate with larger, or smaller, tree divergences (Figure 1). Thus, by definition, variations of top-ranked ET residues entail big evolutionary steps, suggesting that they contribute importantly to structure and function. Variations of poorly-ranked residues, by contrast, entail small evolutionary steps and suggest at best a limited influence on structure and function. Thus, by systematizing these comparisons between alignment and tree, ET ranks residue positions relative to each other by the size of their phylogenetic variations. This procedure mimics the laboratory strategy of measuring with assays which substitutions disrupt function, replacing assays and mutations in the wet lab with divergences and variations, respectively, in silico 63.
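The ranking principle can be illustrated with a simplified, integer-valued sketch. It assumes that the tree has already been cut into successive partitions of 1, 2, ..., N-1 groups, from the root toward the leaves, and it raises the rank of a position by one for every partition in which the position is not invariant within each group. The `partitions` input, the function name `et_ranks`, and the toy alignment are hypothetical; published ET implementations include refinements not shown here.

```python
def et_ranks(alignment, partitions):
    """Rank alignment columns by the tree depth at which they stop varying.

    alignment: list of equal-length aligned sequences.
    partitions: list of partitions; each partition is a list of groups, and
        each group is a list of sequence indices. Partitions go from the
        root (one group) toward the leaves.
    Returns one integer per column; 1 is the best (most important) rank.
    """
    ranks = []
    for col in range(len(alignment[0])):
        rank = 1
        for groups in partitions:
            invariant_in_all_groups = all(
                len({alignment[s][col] for s in group}) == 1 for group in groups
            )
            if not invariant_in_all_groups:
                rank += 1
        ranks.append(rank)
    return ranks


alignment = ["AKLD", "AKMD", "ARLD", "ARLE"]
partitions = [
    [[0, 1, 2, 3]],        # root: all sequences together
    [[0, 1], [2, 3]],      # first split of the tree
    [[0], [1], [2, 3]],    # second split
]
print(et_ranks(alignment, partitions))  # [1, 2, 3, 4]
```

The invariant first column ranks best, the second column varies only across the deepest split and ranks next, and the columns that vary between closely related sequences rank worst, in line with the description above.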
A series of technical studies shows that the ET rank of evolutionary importance reveals structurally and functionally relevant patterns (Table 2). First, top-ranked ET residues cluster spatially in protein structure 63–65. Second, this clustering is widespread in the structural genome and greater than expected by chance as measured with a z-score to yield an overall measure of structural clustering of important residues (Figure 2). When no structure is available, sequence-based quality measures can also assess the significance of ET patterns 66. Third, these clusters overlap with functional sites as shown in 37 of 38 proteins with known ligand binding sites, and so can yield insights into the regions of a protein that mediate function most directly 64, 67. Fourth, the ET link between sequence and structure is such that a better clustering z-score strongly correlates with more accurate functional site discovery 67, as shown in 50 diverse proteins by varying the input parameters of ET and observing correlations mostly above 0.7 68. Mapping evolutionarily important residues to the structure has also been useful in other studies. Spatial clustering of important residues formed presumed functional sites useful for protein-protein docking 69 and the prediction of catalytic residues 70. Thus phylogenetic patterns of residue variations in sequences are linked to a clustering bias in structures that reveals functional sites. As discussed next, one may then interrogate a novel structure with ET to identify its functional sites and its residue determinants. In a variety of prospective experimental case studies, this guided the design of separation-of-function mutations; the rewiring of functional specificity, such as the discovery and reprogramming of an allosteric pathway; and the design of peptide inhibitors. On a structural proteomic scale, top-ranked ET residues enable large-scale function prediction.
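A clustering z-score of this kind compares the observed clustering of selected residues with what random selections of the same size would give. The sketch below is a simplified Monte Carlo version under stated assumptions: one coordinate per residue, a hypothetical 4.0 Å contact cutoff, an unweighted contact count, and a null distribution estimated by random sampling. The published measure differs in its details, and the function names are illustrative.

```python
import itertools
import math
import random
import statistics


def contact_count(coords, selected, cutoff=4.0):
    """Number of selected residue pairs closer than `cutoff`."""
    return sum(
        1
        for a, b in itertools.combinations(selected, 2)
        if math.dist(coords[a], coords[b]) <= cutoff
    )


def clustering_z(coords, selected, n_random=1000, cutoff=4.0, seed=0):
    """Z-score of the observed contact count against random residue picks."""
    rng = random.Random(seed)
    observed = contact_count(coords, selected, cutoff)
    indices = range(len(coords))
    null = [
        contact_count(coords, rng.sample(indices, len(selected)), cutoff)
        for _ in range(n_random)
    ]
    spread = statistics.pstdev(null)
    if spread == 0:
        return 0.0
    return (observed - statistics.mean(null)) / spread


# Toy "structure": 30 residues on a line, 3 Å apart.
coords = [(3.0 * i, 0.0, 0.0) for i in range(30)]
print(clustering_z(coords, [0, 1, 2, 3, 4]) > 0)      # contiguous patch
print(clustering_z(coords, [0, 6, 12, 18, 24]) < 0)   # scattered residues
```

A contiguous patch of top-ranked residues scores well above the random expectation, whereas the same number of scattered residues does not, which is the contrast the z-score is designed to quantify.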
Selective separation of function mutations helped clarify in the eukaryotic Ku70/80 heterodimer how different and antagonistic functions co-exist in the same complex, and suggested a long-sought interaction site with the gene repressor LexA in the prokaryotic protein RecA. The former study identified two structurally distant clusters of top-ranked ET residues that suggested distinct functional sites in Ku70/80. Targeted mutations to one of the clusters disrupted end-joining but not telomere-maintenance, and mutations of the other cluster did the reverse. Thus double-strand break DNA repair and telomere maintenance segregate to opposite ends of the Ku structure which explains how both functions may be performed without risking end to end chromosome fusion 71. Likewise, in RecA, ET revealed a number of new functional sites that were then mutated. These mutations disrupted either DNA repair by recombination, or LexA interaction, but not both. Thus, even though RecA is a heavily mutagenized, classic example for homologous DNA repair, ET patterns of evolutionary importance revealed previously unrecognized functional regions including the potential trigger of LexA-mediated error prone DNA repair—one of the root causes of antibiotic resistance 72.
ET patterns typically identify functional sites on protein surfaces, but they can also suggest internal mechanisms. An ET study mapped key functional residues in the seven-helical transmembrane core of G protein-coupled receptors (GPCR) and suggested that distinct internal functional modules couple allosterically the binding of extracellular ligands to intracellular signaling through G proteins or β-arrestin-mediated internalization. Consistent with predictions, mutations of top-ranked ET residues in each module variously inhibited ligand binding, caused constitutive activity 73, and could even block G protein signaling while leaving β-arrestin signaling intact 74. More recently, a difference analysis of ET applied solely to bioamine receptors and applied to all rhodopsin-related receptors suggested a set of residues uniquely important to bioamine function. Single point mutations then transferred these putative bioamine specificity determinants from the 5HT-2A serotonin receptor into the D2R dopamine receptor and, as a result, increased serotonin signaling and decreased dopamine signaling independent of changes in binding affinity 75. These mutations, located deep in the GPCR transmembrane core, show that the GPCR allosteric pathway can encode signaling response specificity independently of binding, demonstrating the concept of allosteric specificity, and that this specificity code can be traced back and rekeyed, at least in part, by swapping top-ranked ET residues between paralogs.
Besides point mutations, ET patterns have been moved whole into a new scaffold to create functional mimetics. A cluster of ET residues suggested a novel binding site on surface exposed helices of G protein-coupled receptor kinases (GRK), proteins that phosphorylate the intracellular loops of GPCRs to regulate their activity 67. This site was then mimicked with peptides designed to keep the evolutionarily important residues intact, while less important amino acids were substituted in order to stabilize a helical structure. Some of these peptides inhibited GPCR phosphorylation by 80% 67. Together these studies show that in diverse proteins and in diverse types of experimental manipulation, top-ranked ET residues consistently identify the key determinants of functional sites. They should therefore be useful as 3D functional motifs to annotate function in novel protein structures.
In order to annotate function of novel protein structures solved by structural genomics, ET Annotation (ETA) follows the 3D motifs strategies reviewed above. Uniquely, this approach repeatedly exploits ET patterns to select motifs and to filter acceptable matches. ETA applies ET ranks to the structure of an unknown protein, the query, to identify six best clustering, top-ranked ET residues at or near a protein structure’s surface: the 3D template. Simple geometric matches of such templates to protein structures of known function, the targets, often prove too non-specific to suggest identical functions accurately. However, false positives can be reduced dramatically by requiring that the matched sites in the target be composed of top-ranked residues 76; that a 3D template from the target reciprocally match the query 77; and that a plurality of targets concur in suggesting the same function 76. If so, this functional annotation may be reliably transferred to the query in high throughput fashion, with 92% accuracy for enzymes at three-digit EC numbers; and 94% accuracy for non-enzymes at the third GO depth level in over a thousand Structural Genomics protein controls 78. These studies confirm, on a large scale, that phylogenetic residue variation patterns convey highly specific structure-function information.
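Two of the filters described above, reciprocity and plurality, can be sketched in a few lines. The inputs are assumed to come from a separate template matching step that is not shown, the filter on the evolutionary rank of matched residues is omitted, and the rule that a tie yields no annotation is an assumption of this sketch. The function name `eta_annotate` and the identifiers in the example are hypothetical.

```python
from collections import Counter


def eta_annotate(query_to_target, target_to_query, target_function):
    """Transfer a function only if reciprocal matches agree by plurality.

    query_to_target: set of target ids matched by the query's 3D template.
    target_to_query: set of target ids whose own template matches the query.
    target_function: {target id: known function label}.
    Returns the winning function label, or None if there is no clear winner.
    """
    reciprocal = query_to_target & target_to_query
    votes = Counter(target_function[t] for t in reciprocal if t in target_function)
    if not votes:
        return None
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between functions: no annotation (assumed rule)
    return ranked[0][0]


functions = {"T1": "3.1.1", "T2": "3.1.1", "T3": "2.7.11", "T4": "2.7.11"}
print(eta_annotate({"T1", "T2", "T3", "T4"}, {"T1", "T2", "T4", "T5"}, functions))
```

Here T3 is discarded because its match is not reciprocated, and the remaining targets vote two to one for the first function.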
A recent extension of ETA exploits graph-based semi-supervised learning to improve function annotation specificity and coverage. The approach ties all-against-all ETA matches among all known protein structures into a network, in which nodes represent protein structures and links indicate ETA 3D structural template matches between proteins 79. Labels that indicate function are then diffused globally following the topology of this network. Although all labels reach nearly all nodes, only a fraction do so with any statistical significance. This global analysis improves accuracy by 6% (to 96% accuracy) at 65% coverage over all four EC numbers compared to ETA, and it also performs favorably against other methods 54. As further validation, a novel and nontrivial ETA network annotation was experimentally confirmed as a carboxylesterase (EC 3.1.1.1) in a vancomycin resistant strain of Staphylococcus aureus 79. This annotation was based on matches to three structures with sequence identities ranging between 11 and 13%. These data show that global comparison of phylogenetic variation patterns of 6 residues, in a well-defined structural arrangement, uncovers accurate and specific functional information, including the resolution of substrate specificity, far into the twilight zone of protein sequence similarity.
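Label diffusion over a match network can be illustrated with a generic iterative scheme in which each node repeatedly averages the label scores of its neighbors while being pulled back toward its own known label. This is a textbook-style sketch, not the specific formulation used in the cited work: the damping factor `alpha`, the unweighted edges, the fixed iteration count, and the absence of any significance test are all assumptions, and the names are hypothetical.

```python
def diffuse_labels(edges, seeds, alpha=0.8, n_iter=50):
    """Spread known labels over an undirected network.

    edges: iterable of (node_a, node_b) links (e.g. template matches).
    seeds: {node: label} for nodes of known function.
    Returns {node: {label: score}} after `n_iter` rounds of diffusion.
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    for node in seeds:
        neighbors.setdefault(node, set())
    labels = sorted(set(seeds.values()))
    initial = {
        n: {lab: 1.0 if seeds.get(n) == lab else 0.0 for lab in labels}
        for n in neighbors
    }
    scores = {n: dict(initial[n]) for n in neighbors}
    for _ in range(n_iter):
        updated = {}
        for n, nbrs in neighbors.items():
            updated[n] = {}
            for lab in labels:
                mean = sum(scores[m][lab] for m in nbrs) / len(nbrs) if nbrs else 0.0
                # Blend neighbor evidence with the node's own known label.
                updated[n][lab] = alpha * mean + (1 - alpha) * initial[n][lab]
        scores = updated
    return scores


# Toy chain A-B-C-D-E with known functions at both ends.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
scores = diffuse_labels(edges, {"A": "esterase", "E": "kinase"})
print(max(scores["B"], key=scores["B"].get))
print(max(scores["D"], key=scores["D"].get))
```

Every label reaches every connected node with some nonzero score, which is why a significance threshold of the kind mentioned above is needed before an annotation is accepted.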
The relationship between sequence, structure and function is part of the broad effort to understand how genotype is linked to phenotype. Some approaches rely on biophysical modeling and others are purely experimental. However, because genotype information is constantly in flux and a gene’s survival depends on the fitness that it encodes, evolutionary analysis is another central approach to understand how genotype relates to phenotype. The exponential dependence of deviations in structure and function as a result of deviations in sequence among homologs suggests that evolution proceeds smoothly following regular processes over long time periods. A challenge is to complement these statistical observations of evolutionary regularity with equally precise molecular level patterns that help to recover biological meaning from high throughput sequence, structure, and function data. This review shows that different approaches that compare sequences and structures, motifs and templates, correlations and phylogenetic classification are able to identify general patterns that contain precise information on molecular function.
Many of the benefits of each of these approaches are naturally contained in Evolutionary Trace analysis. This approach scores sequence positions by their relative evolutionary impact, as judged from the size of the evolutionary steps associated with their variations. Thus, residues are ranked by how well their own evolution correlates with the evolution of all other sequence positions, represented by the phylogenetic tree. Critically, residues with variations that correlate with root divergences are more important and have remarkable structural and functional properties: they cluster structurally; these clusters map functional sites; clustering quality correlates with functional site prediction; experimental mutations at top-ranked residues control function and specificity; and their mimicry enables the transfer of function to a peptide, or to other protein structures on a proteomic scale in silico. Thus top-ranked ET residues embody features in the sequence, in the structure, in the protein function, and in the phylogeny that are reproducible and general across the proteome. This suggests that they capture basic patterns linking genotype to phenotype during evolution. To fully support this view, however, it remains to reframe evolutionary trace analysis in a formal and extensible framework to make explicit the genotype to phenotype relationship. Such a relationship might then, in turn, help clarify the impact of missense mutations on protein function.
We wish to thank Rhonald Lua and Eric Venner for helpful discussions, and gratefully acknowledge grant support from the National Institutes of Health through R01GM079656 and R01GM066099, and from the National Science Foundation, through CCF 0905536 and NSF DBI-0851393, as well as from the Cancer Prevention Research Institute of Texas, through CPRIT RP120258.