Protein interactions give rise to networks that control cell fate in health and disease; selective means to probe these interactions are therefore of wide interest. We discuss here the Evolutionary Trace (ET), a comparative method to identify protein functional sites and to guide experiments that selectively block, recode, or mimic their amino acid determinants. These studies suggest, in principle, a scalable approach to perturb individual links in protein networks.
Protein interactions are an emerging frontier for therapy because they underlie all aspects of cellular activity. They organize cellular components into complexes, macromolecular machines, cellular pathways and biological networks that sustain development, growth and homeostasis. Upon disruption, deregulated interactions can lead to amyloidosis, to cancer, or to many other ailments.
Unfortunately, such disruptions are common and diverse. A survey of deleterious protein mutations recently suggested 65 diseases likely caused by a gain, or loss, of specific protein-protein interactions (PPI). Moreover, in a complex disorder such as ataxia, the same disease may arise in different individuals from defects in different interconnected proteins. Therapies directed to a single specific protein may thus fail. This realization, together with the slow rate of new drug development relative to the rapid expansion of biological knowledge, makes a case for a network approach to medicine, namely, discovering the components of a disease process; elucidating their interactions; diagnosing those at fault; and developing flexible therapeutic tools to counter their abnormal interaction. This review focuses on the last step in this process: approaches to understand the molecular details of protein functional sites in order to gain control over them.
A first step to manipulate a protein interaction is to characterize the amino acids that control it and which, together, define a functional site. Many different approaches try to detect various types of sites: for catalysis, for binding small ligands, for macromolecules, or sites and amino acids that control functional specificity. Nearly all these approaches search a protein structure for features typical of a functional signature. These include geometric searches for ligand binding pockets and clefts [11,12]; chemical titration models to identify catalytic residues; and energy calculations to identify ligand binding, or “hot spot”, residues [14–16]. Other approaches focus on evolutionary conservation, such as Consurf, siteFiNDER3, INTREPID, and Joint Evolutionary Trace (JET), and are often related to Evolutionary Trace (ET) [21–24].
In turn, these features can be combined to increase the accuracy of searches. DISCERN, for example, identifies catalytic residues from their structural clustering and conservation, and many other methods search for surface cavities that exhibit sequence conservation in order to detect potential small ligand sites [8,26]. The PPI-pred server focuses on protein-protein interfaces; it is a machine learning method that optimizes predictions by combining structural information, surface features, evolutionary information, and residue composition. FINDSITE is a threading-based method for ligand binding site prediction.
Comparing these many different functional site prediction methods is desirable but not straightforward. Ultimately, this is an experimental task that must be tailored to the problem at hand. Arguably, residues beyond a catalytic site can define its activity and thus could be part of it; conversely, not all PPI residues are energetically important to an interaction, so that the elements of an interface may not all contribute to it equally, or even significantly. Moreover, there are distinct approaches to define an interface. They agree broadly but can differ in details: some approaches are based on proximity between chains, and others are based on surface buried away from the solvent or on pure geometry. More problematic, gold standard mutational studies are themselves limited in the number of substitutions and assays that probe the biological role of a sequence position. Finally, while some amino acids are important for structure rather than function [34,35], it is clear that the former are a main cause of deficits in the context of human genetic variation. These observations suggest that stringent tests of functional site and residue predictions might focus on whether they are necessary and sufficient for function: (a) can they guide protein function redesign experiments efficiently; and (b) can they predict protein function based on residue similarities?
Functional site redesign strategies are distinct from the larger transfer of sequence segments that forms modular protein chimeras. Rather, redesign means to target, or graft, the amino acids of a functional epitope to modulate function [38,39]. Often the focus of these experiments is on controlling the character of an interaction.
Some studies manipulate a protein to raise its affinity. In calmodulin, affinity for CAM-dependent protein kinase II was increased 900-fold. The method requires a reliable structure as a starting point and knowledge of the actual site meant for redesign. The strategy is then to increase the hydrophobic surface area of the binding site, and assess whether this is likely to increase binding energy while maintaining a stable structure, as determined by energy calculations and protein models from Rosetta or CHARMM. Binding affinity may increase as much as 10-fold by replacing a single polar residue with a hydrophobic one, or small hydrophobic residues by larger ones. Strikingly, MHC peptide binding to its T cell receptor improved nearly 100-fold by several single point mutations that each alone improved affinity six-fold or less. Thus increasing the overall hydrophobicity of the interface synergistically improved affinity.
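As a point of reference, fold changes in affinity such as those quoted above can be converted to differences in binding free energy with the standard thermodynamic relation ΔΔG = −RT ln(fold). The sketch below assumes that "fold" denotes a ratio of dissociation constants and that the temperature is 298.15 K; neither is stated in the studies summarized here, and the function name is illustrative only.

```python
import math

# Gas constant in kcal/(mol*K).
R_KCAL = 1.987204e-3


def ddg_from_fold(fold, temperature_k=298.15):
    """Binding free energy change (kcal/mol) for a fold improvement in affinity.

    Assumes fold = Kd(wild type) / Kd(mutant), so fold > 1 gives a negative
    (more favorable) value. The default temperature is an assumption.
    """
    if fold <= 0:
        raise ValueError("fold must be positive")
    return -R_KCAL * temperature_k * math.log(fold)


if __name__ == "__main__":
    for fold in (6, 10, 100, 900):
        print(f"{fold:>4}-fold  ->  {ddg_from_fold(fold):+.2f} kcal/mol")
```

On this scale, a 10-fold gain corresponds to roughly 1.4 kcal/mol, which gives a sense of how small the energetic margins are in affinity redesign.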
Other studies aim to redesign specificity, either at protein-ligand binding sites or at protein-protein interfaces [47–50]. Thus, some enzymes have been rationally redesigned to be active with a particular substrate [51,52]. Typically, this entails mutations that: increase electrostatic complementarity to the desired ligand (positive design); decrease the fit to undesired ones (negative design); and maintain stability in the original structure. A challenging example of negative design developed target-specific peptides for 20 bZIP families, despite the high similarity in sequence and structure of these proteins. The method, CLASSY, systematically explored positive and negative designs. To gain efficiency, the group combined linear programming to optimize energy calculations and a cluster expansion to convert the structure-based problem to a sequence-based one. In 40 of the 48 cases, the new peptide bound the intended target while losing binding to undesired competitor targets.
Of note, these protein redesign methods are typically not closely linked to functional site predictions. Instead they exploit a priori knowledge of a known complex to focus on contact residues and compute potential gains in affinity or specificity based on energetics, taking the functional site for granted. Comparative methods such as the Evolutionary Trace, however, frequently link both together.
The Evolutionary Trace aims to guide experiment to the amino acids involved directly in protein function. It does so by ranking the impact of each sequence position on evolutionary divergence, as illustrated in Figure 1. Conceptually, ET mimics experimental mutational scanning. Whereas, in the laboratory, a sequence residue is “important” when its mutation changes the response of an assay, here ET assumes a residue is (more or less) important when its variations correlate with (greater or lesser) evolutionary divergences. Unlike conservation-based methods, ET information depends crucially on the evolutionary branching pattern, every split being interpreted as a functional divergence.
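The ranking idea can be illustrated with a deliberately simplified toy. The sketch below is not the published ET algorithm: it assumes the tree has already been cut into successive partitions of the sequences (coarsest first), and it ranks a column by counting the partition levels at which the column still varies within some group. The function name, the input layout, and the four toy sequences are all hypothetical.

```python
def trace_ranks(alignment, partitions):
    """Toy trace-style ranking of alignment columns (lower rank = more important).

    alignment:  dict mapping sequence name -> aligned sequence (equal lengths).
    partitions: list of partitions ordered from the root toward the leaves;
                each partition is a list of groups, each group a list of names.
    """
    length = len(next(iter(alignment.values())))
    ranks = []
    for col in range(length):
        rank = 1
        for groups in partitions:
            # A column is "invariant" at this level if every group is
            # internally uniform, even if the groups differ from each other.
            invariant = all(
                len({alignment[name][col] for name in group}) == 1
                for group in groups
            )
            if not invariant:
                rank += 1
        ranks.append(rank)
    return ranks


# Hypothetical four-sequence family with two branches, A and B.
alignment = {"A1": "GKDL", "A2": "GKDV", "B1": "GRDI", "B2": "GRDL"}
partitions = [
    [["A1", "A2", "B1", "B2"]],        # root: one group
    [["A1", "A2"], ["B1", "B2"]],      # first split
    [["A1"], ["A2"], ["B1", "B2"]],    # second split
]

print(trace_ranks(alignment, partitions))  # [1, 2, 1, 4]
```

In this toy, the second column (K in branch A, R in branch B) ranks just behind the fully invariant columns because its variation tracks the deepest split, whereas the last column varies even between close relatives and ranks last. Plain conservation scoring would not separate these two cases as cleanly.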
Critically, top-ranked ET residues are far from being random: they typically cluster together spatially in the protein structure, and these clusters map out functional sites. The structural clustering (for example at, but not restricted to, the 30th percentile-rank) was statistically significant in nearly all of the 46 proteins tested with diverse functions. This clustering can be quantified statistically on a large scale, and in closed form, by a clustering z-score: the distance between the observed clustering pattern and the one expected by chance, expressed in units of standard deviation. In turn, the surface clusters of top-ranked residues overlapped known functional sites significantly more than expected by chance, as seen retrospectively in 79 diverse proteins, or in prospective case studies [58,59].
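A z-score of this kind can be sketched numerically. The example below substitutes a Monte Carlo estimate for the closed-form expression mentioned above, and it assumes a particular clustering statistic (the number of selected residue pairs within a distance cutoff) that may differ from the one used in the cited work. The function names, the 4 Å cutoff, and the synthetic coordinates are assumptions made for illustration.

```python
import itertools
import math
import random
import statistics


def contact_count(coords, selected, cutoff=4.0):
    """Number of selected residue pairs closer than `cutoff` (assumed statistic)."""
    return sum(
        1
        for i, j in itertools.combinations(sorted(selected), 2)
        if math.dist(coords[i], coords[j]) <= cutoff
    )


def clustering_z_score(coords, selected, cutoff=4.0, n_random=2000, seed=0):
    """(observed - mean) / sd, with the null estimated by random residue picks."""
    rng = random.Random(seed)
    observed = contact_count(coords, selected, cutoff)
    indices = range(len(coords))
    null = [
        contact_count(coords, rng.sample(indices, len(selected)), cutoff)
        for _ in range(n_random)
    ]
    sd = statistics.pstdev(null)
    if sd == 0:
        return float("nan")  # degenerate null distribution
    return (observed - statistics.fmean(null)) / sd


if __name__ == "__main__":
    # Synthetic "structure": 30 points on a line, 3.8 units apart.
    coords = [(3.8 * i, 0.0, 0.0) for i in range(30)]
    print("contiguous:", clustering_z_score(coords, [10, 11, 12, 13, 14]))
    print("scattered: ", clustering_z_score(coords, [0, 6, 12, 18, 24]))
```

A contiguous selection yields a clearly positive z-score and a scattered one a negative z-score, which is the qualitative behavior the confidence measure relies on.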
The clustering and biological relevance of evolutionarily important residues are tightly interconnected and general features of sequence, structure and function. First, structurally, the better top-ranked residues clustered in decoy models of protein folds, the closer these models were to the native fold, an observation that was independently verified. Likewise, functionally, the more top-ranked residues clustered in the structure, the better these clusters predicted functional sites, a correlation tested in over 50 diverse proteins. Although the structural resolution was not sufficient to guide protein folding usefully, it did provide a feedback mechanism to improve functional site predictions. Together, these studies show that ET predictions are non-random; their reliability is quantified through a measure of confidence, the clustering z-score; they identify functional sites in retrospective controls; and they are widely applicable to the structural proteome.
Besides these retrospective controls, laboratory studies extensively tested whether ET information could guide experiments to perturb protein interactions. A simple test was to selectively separate functions in multifunctional proteins by targeting point mutations to top-ranked amino acids [59,63–65]. In one instance, the Ku heterodimer, ET-guided experiments produced in months many more separation-of-function mutants than a multi-year experimental screen in yeast. They also showed that the sites mediating double-strand break DNA repair and telomere maintenance segregated to opposite ends of the Ku structure, thereby clarifying how these antagonistic functions could coexist in one complex.
In a second type of perturbation, manufactured peptides copy the molecular determinants of a binding site and then compete with or substitute for a native interaction. In one case, a helical peptide engineered to mimic the most important residues of a new binding site suggested by ET disrupted function. Although peptides have non-traditional pharmaceutical profiles, much work aims to increase their delivery and stability and, clearly, they can be effective at disrupting critical pathways, such as Notch signaling. Moreover, the preponderance of protein-peptide interactions, estimated to form 15–40% of all interactions within a cell, makes this approach a rich potential source of new molecular perturbing agents.
A third type of network perturbation rewires an interaction to a different function. Since ET explicitly points out which residues are important and how they vary from branch to branch of the evolutionary tree, this provides, in principle, a protein family-specific cipher to recode function among protein homologs by swapping their cognate top-ranked residues. This hypothesis has been extensively tested and led, in vitro, to swapped activity or binding [73,74], and, in vivo, to the adaptation of a frog proneural transcription factor to a fly environment, and vice versa.
These different examples of functional perturbation demonstrate that bona fide predictions could be repeatedly validated in a variety of different experimental systems and laboratories. Of note, all these examples focused so far on functional residues at, or near, a protein surface. A fourth type of network perturbation can also target the internal components of a protein structure.
About 30% of current drugs target G protein-coupled receptors (GPCRs) or their associated protein network. ET was created specifically to study this pathway, which underlies smell, taste, vision, pain and much of endocrine and autonomic pharmacology. One goal is to identify and then rationally modify the molecular basis of signaling to identify novel possible therapeutic targets. Thus, following the same type of protein redesigns as above, separation-of-function mutations in the receptor, Gα, and Regulator of G protein Signaling (RGS) validated prior predictions of interaction sites for export, receptor coupling, and downstream effector activation, respectively. Functional rewiring was demonstrated in the RGS case by switching top-ranked cognate residues among homologs. And a peptide designed to mimic the top-ranked residues of a novel site in G protein receptor kinases impaired GPCR phosphorylation, confirming a role in this interaction. All these studies target top-ranked ET residues at, or near, the surface of the protein.
However, evolutionary analysis can also inform the internal mechanisms of a protein. A trace of visual rhodopsins versus the broader set of rhodopsin-like GPCRs identified two separate structural subdomains buried in the core of the seven-helical transmembrane bundle. The first one, unique to rhodopsins, was a putative ligand-specific binding site. The other one, common to all GPCRs, was a putative evolutionarily conserved allosteric pathway that, over a distance, transforms ligand binding into effector activation. As predicted, point mutations to these sites then respectively impaired ligand binding or caused constitutive activity. Moreover, three mutations in the allosteric pathway near the G protein coupling site blocked G protein signaling but kept the β-arrestin signaling intact. These studies highlight the existence of functional modules within the structure and how they may be exploited to effectively sever just one of the two signaling branches efferent from activated receptors. More recently, related work investigated the correlations between sequence positions within a protein family and found a similar structure-function partitioning of the protein into groups of residue positions referred to as “protein sectors”. This analysis was extended to the S1A serine proteases and found that mutations in individual “protein sectors” led to different effects, focused either on catalytic power or on thermal stability.
To test further how well such evolutionary modules guide protein engineering, and to understand the origin of ligand-biased signaling, whereby different ligands signal via G proteins or β-arrestin to different extents, a study swapped top-ranked ET residues from the putative common allosteric pathway between two antagonistic psychoactive receptors, those for serotonin and dopamine (Figure 2). All single point mutants were expressed and bound normally to a bioamine antagonist. Strikingly, all of them also exhibited altered binding or signaling, by either dopamine or serotonin, and these effects were mostly separable or even paradoxical. Notably, four mutants significantly enhanced serotonin response without increasing serotonin binding. And two of these four mutants had decreased dopamine signaling, even though dopamine affinity was as good as or better than in the wild type receptor.
This independent reprogramming of binding and of signaling from dopamine to serotonin highlights allosteric specificity, namely, the pathway itself can determine which bound ligands signal, separately from binding site affinity. Moreover, the key determinants of the allosteric pathway can be traced evolutionarily and then recoded, one top-ranked ET residue at a time, much as the tumblers of a lock are rekeyed. Presumably, during evolution, single mutations constantly change ligand affinity, effector biases, or the wiring between them, and thus probe alternative wiring at GPCR network nodes. In practice, many of the key residues surround structural waters, as shown in Figure 2, suggesting a potential site where a drug could influence ligand-biased signaling.
Case studies such as these are informative, but they cannot prove that a method is broadly applicable. To do so would require that functional determinants be identified and shown to be predictive of function on a proteomic scale. A simple example is the Serine-Histidine-Aspartate catalytic triad, a three amino-acid structural motif often sufficient to identify proteases. More generally, methods to annotate the unknown function of the novel structures produced by Structural Genomics follow this logic and transfer annotations between proteins based on local sequence and structure similarities [84,85].
Likewise, an Evolutionary Trace Annotation (ETA) server was developed to predict the function of novel protein structures, as illustrated in Figure 3. Starting with a query structure, ETA traces it, identifies the largest surface cluster of top-ranked residues and picks from those a structural motif, or 3D template. The template can then be used to search all PDB structures for similarities that suggest a common function. While geometric matches within 2 Å root mean square deviation are often random, the specificity rises to over 90% once these matches are also filtered for (i) the importance of the matched site, (ii) reciprocity, so the 3D template of the match matches back to the query, and (iii) plurality, so that multiple matches point to the same function more than to any other. This approach scales to the structural proteome: it annotated over 1200 structural genomics enzymes, up to three Enzyme Commission digits, with 92% accuracy, and non-enzymes using the Gene Ontology functional classification. These annotations suggest that only six top-ranked amino acids are sufficient to identify function. Moreover, in enzymes, simply substituting other residues that may be more directly associated with catalysis lowered rather than raised accuracy, showing that the definition of a necessary set of residues to define function is complex. Overall, these studies complement case controls to confirm the proteome-wide possibility of picking functional determinants from evolutionary comparisons.
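The three filters can be pictured as a short pipeline over candidate matches. The sketch below is a minimal illustration of that logic, not the ETA server's implementation: the `Match` record, its field names, and the tie-handling rule are assumptions, and the upstream steps (tracing, template extraction, geometric search) are taken as already done.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Match:
    """Hypothetical record for one geometric template match."""
    target: str              # identifier of the matched structure
    function: str            # annotation of the target, e.g. three EC digits
    rmsd: float              # template-to-match RMSD in angstroms
    site_is_important: bool  # filter (i): matched residues are top-ranked
    reciprocal: bool         # filter (ii): target's template matches the query


def annotate(matches: Iterable[Match], rmsd_cutoff: float = 2.0) -> Optional[str]:
    """Return the plurality function among filtered matches, or None."""
    kept = [
        m for m in matches
        if m.rmsd <= rmsd_cutoff and m.site_is_important and m.reciprocal
    ]
    votes = Counter(m.function for m in kept)
    top = votes.most_common(2)
    if not top:
        return None
    # Filter (iii): require a strict plurality; a tie yields no annotation.
    if len(top) == 2 and top[0][1] == top[1][1]:
        return None
    return top[0][0]


if __name__ == "__main__":
    # Made-up matches for illustration only.
    candidates = [
        Match("target_1", "3.4.21", 1.2, True, True),
        Match("target_2", "3.4.21", 1.8, True, True),
        Match("target_3", "1.1.1", 1.5, True, True),
        Match("target_4", "2.7.11", 0.9, True, False),  # fails reciprocity
    ]
    print(annotate(candidates))
```

Returning no annotation on a tie is one conservative reading of plurality; it trades coverage for specificity, in keeping with the emphasis on specificity in the text.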
Predictive algorithms must fulfill specific objective criteria: (a) to produce results that are non-random; (b) to match retrospective controls; (c) to also match prospective controls, i.e. make genuine predictions that are then experimentally validated; and (d) to be scalable to a well-defined domain of application. Since in biology a single method is unlikely to be unfailingly predictive, a fifth requirement is (e) to quantify prediction confidence in order to distinguish favorable cases from others that are less so. As seen above, the Evolutionary Trace fulfills these conditions.
A biological result is that ET servers offer a reliable approach to focus protein redesign studies on the most relevant parts of a protein [23,24,89]. The elucidation of allosteric specificity determinants linked to ligand-biased signaling in GPCRs is an example. From a theoretical perspective, ET also points to general proteomic rules and to fundamental evolutionary patterns: sequence residues are ranked by evolution; the important ones cluster; these clusters indicate functional sites; clustering quality correlates with functional site prediction quality; and variations at top-ranked residues generally control functional specificity.
It is helpful to understand the origin of this diverse information, which seems at odds with what might be expected from simple conservation analysis. First, it is important to stress that ET does not rely on conservation but rather on variation. The variations linked to small divergences are ranked poorly; those linked to major divergences are ranked highly. In effect, the tree divergences provide virtual functional “assays”. Since a tree with N sequences has N−1 nodes, ET analysis benefits from vastly more functional assays than an experimental laboratory. Moreover, the sequence variations under study are all informative since they occur in living species, and thus are evolutionarily successful. Thus, comparative analysis with ET benefits from a wealth of relevant biological information. Second, the clustering of top-ranked residues is also informative, as it links functional impact to the structural continuity of evolutionary selection forces. The tree thus literally acts as a cipher to deconvolute evolutionary variations in sequence and function and, with a structure, enables high-resolution definition of functional sites.
An open question remains the ultimate domains of application of these techniques, tested thus far in structurally well-defined proteins. For example, many important interactions couple protein folding with protein binding in intrinsically disordered regions. Whether disordered proteins can fit into the same evolutionary framework is not demonstrated. Even more challenging would be to fit in exotic experiments that demonstrate the remarkable change that a single residue mutation can bring about on both structure and function. Also, with the advent of personal genomes, improved interpretations of sequence variations from an evolutionary perspective would be desirable.
In the near term, since comparative analysis does not rely on energetics, comparative and energy-based approaches, which are distinct and each experimentally validated, should be complementary. This suggests that as protein networks become better defined, computational tools may reliably design protein variants and peptide tools that guide systematic perturbations in order to assess the mechanisms of networks and to screen for therapeutics aimed at them.
We thank Matthew Ward and Serkan Erdin for contributing Figure 3. O.L. gratefully acknowledges support by grants from the NIH, GM079656 and GM066099, and from the NSF, DBI-0547695 and CCF 0905536. A.D.W. was supported by training fellowships from the National Library of Medicine to the Keck Center for Interdisciplinary Bioscience Training of the Gulf Coast Consortia (NLM grant 5T15LM07093).