Directed evolution circumvents our profound ignorance of how a protein's sequence encodes its function by using iterative rounds of random mutation and artificial selection to discover new and useful proteins. Proteins can be tuned to adapt to new functions or environments via simple adaptive walks involving small numbers of mutations. Directed evolution studies have demonstrated how rapidly at least some proteins can evolve under strong selection pressures, and, because the entire ‘fossil record’ of evolutionary intermediates is available for detailed study, they have provided new insight into the relationship between sequence and function. Directed evolution has also shown how mutations that are functionally neutral can set the stage for further adaptation.
Millions of years of life's struggle for survival in different environments have led proteins to provide diverse, creative and efficient solutions to a wide range of problems, from extracting energy from the environment to repairing and replicating their own code. Good solutions to biological problems can also be good solutions to human problems — proteins are in fact widely used in the food, chemicals, consumer products, and medical fields. Not content with Nature's protein repertoire, however, protein engineers are working to extend known protein function to new environments or tasks1-4 and to create new functions altogether5-7.
Notwithstanding significant advances, a molecular-level understanding of why one protein performs a certain task better than another remains elusive. This state of affairs is perhaps not surprising when we remember that a protein often undergoes conformational changes during function and exists as a dynamic ensemble of conformers that are only slightly more stable than their unfolded and nonfunctional states and that might themselves be functionally diverse8. Mutations far from active sites can influence protein function9, 10. Engineering enzymatic activity is particularly difficult, because very small changes in structure or chemical properties can have very significant effects on catalysis. Thus predicting the amino acid sequence, or changes to an amino acid sequence, that would generate a specific behavior remains a challenge, particularly for applications requiring high performance (such as an industrial enzyme or a therapeutic protein). Unfortunately, where function is concerned, details matter, and we just don't understand the details.
Evolution, however, had no difficulty generating these impressive molecules. Despite their complexity and finely tuned nature, proteins are remarkably evolvable: they can adapt under the pressure of selection, changing behavior, function and even fold. Protein engineers have learned to exploit this evolvability using ‘directed evolution’: the application of iterative rounds of mutation and artificial selection or screening to generate new proteins. Hundreds of directed evolution experiments have demonstrated the ease with which proteins adapt to new challenges11. Notable recent examples include a recombinase evolved to remove proviral HIV from the host genome (providing a new strategy for treating retroviral infections)12, a cytochrome P450 fatty acid hydroxylase that was converted into a highly efficient propane hydroxylase (thereby proving that a cytochrome P450 is fully capable of hydroxylating small alkanes, even though most propane-utilizing organisms use structurally and mechanistically unrelated enzymes)13, a more than 40 °C increase in the thermostability (T50) of lipase A (extending its application in biocatalysis to a whole new set of environments)14, and a variant of GFP that tolerates having all its leucine residues replaced with a nonnatural amino acid, trifluoroleucine15. Roger Tsien shared the 2008 Nobel Prize in Chemistry for his work on the fluorescent proteins that have transformed biological imaging16. Directed evolution had a key role by improving many features of fluorescent proteins, including emission and excitation properties, quantum yield, multimerization state and maturation rate4, 17.
Directed evolution has become a common laboratory tool for altering and optimizing protein function (as well as that of other biological molecules and systems, including RNA, DNA regulatory elements, biosynthetic pathways and genetic regulatory circuits 18-20; BOX 1). To understand the power, and the limitations, of directed evolution, it is helpful to view it as a biological optimization process. We therefore introduce the concept of evolution on a fitness landscape in protein sequence space and use this framework to explain directed evolution strategies. Laboratory evolution experiments have revealed important features of this fitness landscape and the types of trajectories that can traverse it efficiently. This landscape picture can help explain why decomposing a large functional hurdle into a series of smaller ones and exploiting protein modularity and structural information are useful strategies for dealing with the combinatorial explosion of possible paths in an evolutionary search. This picture also helps us appreciate the power of recombination to generate functional sequences with large numbers of (mostly neutral) mutations, novel combinations of which can give rise to new protein behaviors and therefore new starting points for optimization of protein function.
Evolution is unique because it works at all scales, from molecules to ecosystems — no other engineering design algorithm can make that claim. A simple algorithm of mutation and artificial selection has proved effective for everything from the selective breeding of plants and animals to discovering self-replicating nucleic acid sequences. Biological components and systems have shown a remarkable ability to adapt under the pressure of artificial selection, an evolvability that very likely reflects their own history of natural selection 100.
Functional nucleic acids have been evolved in the laboratory to achieve new and improved properties 18-20, 101. Because the phenotype and genotype are encoded in the same molecules, these experiments involve in vitro selections, where pools of up to 10¹⁵ sequences can be synthesized and evaluated outside of cells 102. Hydrolysis of nucleic acid phosphodiester bonds and binding of specified ligands are among the functions that have been discovered this way 103, 104. Recently, a set of self-replicating RNA enzymes that catalyse their own synthesis in a self-sufficient manner was created 105.
Directed evolution can also be applied to enzyme pathways and networks of interacting molecules such as genetic regulatory networks 106, 107. These systems are intimately tied to cellular function. Experimental selections for the desired behavior can often be developed, allowing very high-throughput testing, particularly for evolution of gene regulation 108. However, the sequence space associated with these networks is enormous, encompassing multiple protein coding sequences in addition to their regulatory regions. Mathematical models of how elements interact to generate desired functions can help focus the directed evolution search to components that are more likely to produce the targeted behavior 109. For example, an analysis of a mathematical model identified a particular ribosome binding site (RBS) as having a key role in the target function of a circuit110. Experiments verified that mutations to the RBS were effective at altering this target function.
There is little doubt that directed evolution is one of the most effective and reliable approaches to engineering useful new proteins. Perhaps less well appreciated, however, is how much our understanding of protein function and evolution has been enriched by these experiments. Directed evolution allows us to disconnect a protein from its natural context and observe how adaptation to different functional challenges can occur. These experiments can explore the boundaries between biological relevance (the protein's ability to contribute to the reproductive fitness of an organism) and what is physically possible (the protein's ability to carry out a specific function in vitro or in vivo) in ways that studies on natural proteins alone cannot. Directed evolution can test alternative adaptive scenarios, explore the range of possible solutions to a given functional challenge, examine relationships (for example, tradeoffs, where improvements in one property are accompanied by losses of another) between different protein properties, and provide biophysical explanations for evolutionary phenomena. Much has been discovered since these topics were first reviewed in the context of temperature adaptation21, 22. In this Review, we revisit some of these early lessons and discuss new ones that have emerged.
In his influential 1970 paper, John Maynard Smith eloquently described protein evolution as a walk from one functional protein to another in the space of all possible protein sequences23. He arranged all proteins of length L such that sequences differing by one amino acid mutation were neighbors. Although the distance between any two sequences is small (that is, equals the number of mutations required to interconvert them and is therefore ≤L), this high-dimensional space contains an incomprehensibly large number of possible proteins. For even a small protein of 100 amino acids there are 20¹⁰⁰ (~10¹³⁰) possible sequences, more than the number of atoms in the universe. Searching in this space for billions of years for solutions to survival, nature has explored only an infinitesimal fraction of the possible proteins24. And, of course, natural evolution keeps only sequences that are biologically relevant; others are discarded, even if they represent solutions to other interesting problems. There are so many proteins waiting to be discovered, and we can only dream about the extent of their capabilities. Directed evolution is one way to extend protein function to new, nonnatural tasks and convert dreams into actual proteins.
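The size of this sequence space is easy to check numerically. The short Python sketch below is only an illustration of the arithmetic; the function names are our own, and we assume an alphabet of the 20 canonical amino acids.

```python
import math

AMINO_ACIDS = 20  # assumption: the 20 canonical amino acids only


def sequence_space_size(length: int) -> int:
    """Exact number of possible sequences of the given length."""
    return AMINO_ACIDS ** length


def log10_sequence_space(length: int) -> float:
    """Base-10 logarithm of the number of possible sequences."""
    return length * math.log10(AMINO_ACIDS)


def hamming_distance(seq_a: str, seq_b: str) -> int:
    """Number of substitutions separating two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


if __name__ == "__main__":
    exponent = log10_sequence_space(100)
    print(f"A 100-residue protein has about 10^{exponent:.0f} possible sequences")
```

The Hamming distance helper makes the other half of the argument concrete: any two sequences of length L are at most L substitutions apart, even though the space they live in is astronomically large.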
Each sequence in Maynard Smith's protein space can be assigned a ‘fitness’, which in natural evolution is a measure of the host organism's ability to reproduce in a given environment: more-fit organisms reproduce faster and their genes spread throughout the population25. When artificial selection is imposed, fitness is defined by the experimenter. High-fitness sequences satisfy all of the criteria for a protein to function as desired, or at least to perform well in the assay used for screening, and might include the ability to recognize one substrate but not another, to be expressed at high levels in a particular host organism, to not aggregate, to have a long life-time, and so on. Protein evolution can then be envisioned as a walk on this high-dimensional fitness landscape, in which regions of higher elevation represent desirable proteins, and iterations of mutation and artificial selection continuously discover new sequences further uphill, with higher fitnesses (FIG. 1a).
As with any optimization problem, the structure of the objective function, the fitness landscape, influences the effectiveness of a search strategy26. Possibilities range from smooth, single-peaked ‘Fujiyama’ landscapes to rugged, multi-peaked ‘Badlands’ landscapes27 (FIG 1b). The rougher the landscape, the harder it is for evolution to climb: local optima create traps that evolution cannot escape unless a side-step or even temporary decrease in fitness is permitted, or unless multiple simultaneous mutations enable a jump to a new peak. The easiest landscape to climb is one that offers many smooth, uphill paths to the desired fitness (the Fujiyama landscape).
This terrestrial landscape analogy should be interpreted cautiously, however, because it cannot accurately represent the large number of possible paths that evolution can take to higher fitness (or the even larger number of possible downhill paths). While it is easy to visualize being caught on a local optimum in a three-dimensional landscape, a local optimum in protein sequence space (in which all possible mutations are deleterious) might be quite rare, unless stability has been compromised and few new mutations can be accepted. The introduction of stabilizing mutations, for example, can increase a protein's mutational robustness, opening new routes for further adaptation28, 29.
The vast size of sequence space makes it impossible to characterize (or even model) more than a minute fraction of this fitness surface. Despite this, several important features have emerged from accumulated experimental studies. The first is the low overall density of functional sequences: the vast majority do not code for any functional protein, much less the desired protein30-32. Another important feature is the uneven distribution of functional sequences. Although representing a very small fraction of all possible sequences, functional sequences are often next to other functional sequences33-35. Maynard Smith recognized that this feature was a requirement for evolution by point mutation to have been successful. Evolution can step one mutation at a time only if there is a continuous network of functional proteins; otherwise mutation would always lead to lower fitness, and evolution would stop23. Proteins are in fact robust to mutation — a significant fraction of possible mutations retain fold and function36, 37.
While natural evolution can discover new protein functions along circuitous paths that involve many neutral or even slightly deleterious mutations, directed evolution does not have that luxury. Because the possible evolutionary paths grow exponentially as mutations accumulate and there are too many ways to take neutral or deleterious steps that do not ultimately lead uphill, directed evolution is constrained to moving continuously uphill in an adaptive walk38. This is often not a severe limitation because many interesting proteins are accessible by short and simple adaptive walks. Although the resulting proteins, or even the mutations, might not be the same as those discovered by more convoluted paths to the same fitness level, they nonetheless provide valuable insights into protein function and routes of adaptation.
Before we describe some of the key lessons that directed evolution studies have taught about protein function and evolution, we would like to briefly discuss the experimental strategy. How the experiment is performed obviously influences the outcome and therefore the information that one extracts from it. Finding a sequence that performs a desired function in a vast space of possible sequences that is only sparsely populated might seem like a daunting task. Inefficient searches of this space could take essentially forever, and the task of the protein engineer is to choose a strategy that will reach the objective and do so quickly and easily. Starting with a functional protein, directed evolution uses repeated generations of mutation to create functional variation and selection of the most fit variants to direct the search to higher elevations on the fitness landscape. It involves four key steps (FIG. 2): first, identifying a good starting sequence; second, mutating this ‘parent’ to create a library of variants; third, identifying variants with improved function; and last, repeating the process until the desired function is achieved. There are many options for implementation of each step, the choice of which can greatly affect both the efficiency and the endpoint of an evolutionary search.
Directed evolution (and for that matter natural evolution) relies on the ability of proteins to function over a wider range of environments or carry out a wider range of functions than what might be biologically relevant at a given time and therefore selected for. This ability to tolerate a nonnatural environment or to exhibit ‘promiscuous’ functions at some minimal level provides the jumping off point for optimization towards that new goal. A good parent protein for directed evolution, then, exhibits enough of the desired function that small improvements (expected from a single mutation) can be reliably discerned in a high throughput screen38. It is also easy to work with and sufficiently stable to accommodate multiple, potentially destabilizing mutations if the target function is some other property. Some proteins can be significantly more evolvable than others11, 29, 39, 40. Possible molecular mechanisms that contribute to evolvability have been discussed, including the key role of the chemical mechanism in enzyme functional evolution41, 42 and the idea that evolvable proteins exist in multiple closely related but functionally diverse conformations whose distribution is easily altered by mutation8. These ideas, however, are still largely speculative, and little other than the ability to accept mutations29, 43 has been conclusively demonstrated in laboratory evolution experiments to contribute directly to allowing one protein to adapt to a new challenge more readily than another. A good heuristic indicator of a protein family's evolvability is its natural functional diversity40, 44: proteins that have adapted to exhibit a range of functions across the family, for example members of an enzyme family that accepts a wide range of substrates (although individual enzymes in the family might be specific), are likely to be adaptable in the laboratory.
The next step is to create a library of variants. Since screening is often the most difficult experimental step, the library is usually created to generate the highest probability of finding improved proteins given the screening capability. Because most mutations are deleterious and multiple mutations frequently inactivate proteins (vide infra), this usually involves a low mutation rate (1 or 2 amino acid substitutions per gene). If screening is not difficult (for example, there is a good genetic selection), then the library can be constructed to generate the largest potential improvement. This might mean a slightly higher mutation rate 45. In either case, mutations can be introduced randomly 1 or, if structural or mechanistic information is available, they can be made in a more directed fashion46-48, in an effort to increase the frequency of improved proteins and reduce the load in the next step.
Screening (with high-throughput functional assays) or selection (for example, a genetic selection in which hosts having improved proteins outcompete the others) is used to identify the library members improved in the target property. A good screen or selection accurately assesses the target properties. The rule ‘you get what you screen for’ is always useful to remember — screening (or selecting) for something else is risky49. It is also important not to demand too much improvement in a single generation. The hurdle must be tuned to the screening capacity and should usually be no greater than the improvement that can be provided by a single mutation. If the desired function is beyond what a single mutation can accomplish, the problem can be broken down into a series of smaller ones that can be solved by the accumulation of single mutations, for example by gradually increasing the selection pressure or evolving against a series of intermediate challenges13. The process of mutation and selection is repeated until the fitness objective is met; the number of iterations required obviously depends on the starting fitness and the improvement that can be achieved in each round, but is often only 5-10 generations.
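The mutate, screen and repeat cycle described above can be written as a simple greedy adaptive walk. The Python sketch below is a toy illustration rather than a protocol: the fitness function (the number of positions matching a hypothetical target sequence) stands in for a real assay and defines a deliberately smooth, Fujiyama-like landscape, and all names and parameter values are our own assumptions.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
HYPOTHETICAL_TARGET = "MKTAYIAKQR"  # assumption: arbitrary toy optimum


def toy_fitness(seq, target=HYPOTHETICAL_TARGET):
    """Toy assay: count positions that match the hypothetical target."""
    return sum(a == b for a, b in zip(seq, target))


def make_library(parent, size, rng):
    """Create a library of random single-substitution variants."""
    library = []
    for _ in range(size):
        pos = rng.randrange(len(parent))
        new_residue = rng.choice([a for a in ALPHABET if a != parent[pos]])
        library.append(parent[:pos] + new_residue + parent[pos + 1:])
    return library


def directed_evolution(parent, fitness, generations, library_size, rng):
    """Greedy adaptive walk: keep the best variant only if it improves."""
    history = [(parent, fitness(parent))]
    for _ in range(generations):
        library = make_library(parent, library_size, rng)  # mutate
        best = max(library, key=fitness)                   # screen
        if fitness(best) > fitness(parent):                # select
            parent = best
        history.append((parent, fitness(parent)))          # repeat
    return history


if __name__ == "__main__":
    walk = directed_evolution("AAAAAAAAAA", toy_fitness, generations=10,
                              library_size=100, rng=random.Random(0))
    for generation, (seq, score) in enumerate(walk):
        print(generation, seq, score)
```

Because a variant replaces the parent only when it scores strictly higher, the walk never moves downhill and accepts at most one substitution per generation, which mirrors the advice to set the hurdle no higher than what a single mutation can deliver.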
An evolutionary search relies on the presence of functional diversity within a population, which is the result of underlying genetic variation. At the molecular level, this genetic variation can take many forms: point mutations, insertions, deletions, recombination, circular permutation, etc50-52. To search efficiently and minimize the screening load, the underlying genetic variation should be set to generate the highest probability of improvement. Statistically, random mutations tend to be quite harsh, usually decreasing activity and sometimes destroying it altogether. Typically, 30-50% of single amino acid mutations are strongly deleterious, 50-70% are neutral or slightly deleterious, and 0.01-1% are beneficial11, 29, 37, 53-56. If the fitness landscape is Fujiyama-like with many smooth uphill paths, one need only accumulate beneficial mutations (either in multiple rounds of mutagenesis and screening or by recombining beneficial mutations found in each round57, 58) until the desired fitness is reached. In a single-peaked landscape, all beneficial mutations make a cumulative contribution to the desired function, and all paths uphill eventually converge to the same, optimal solution.
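These frequencies translate directly into a screening load. As a rough sketch, if we assume that each screened variant is an independent single mutant with the same probability p of being beneficial (a simplification; real libraries are less uniform), the chance of finding at least one improved variant among N is 1 − (1 − p)^N. The function names below are our own.

```python
import math


def prob_at_least_one_hit(p_beneficial, n_screened):
    """Probability that a screen of n variants contains at least one
    beneficial mutant, assuming independent, identically likely hits."""
    return 1.0 - (1.0 - p_beneficial) ** n_screened


def variants_needed(p_beneficial, confidence=0.95):
    """Smallest library size that yields at least one beneficial mutant
    with the requested probability, under the same assumptions."""
    if not 0.0 < p_beneficial < 1.0:
        raise ValueError("p_beneficial must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_beneficial))


if __name__ == "__main__":
    # The two ends of the 0.01-1% range quoted in the text.
    for p in (0.0001, 0.01):
        print(f"p = {p:.4%}: screen about {variants_needed(p)} variants")
```

Across the quoted range of beneficial-mutation frequencies, the required screening effort spans roughly two orders of magnitude, which is why the choice of parent and mutation strategy matters so much.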
Of course, no real protein landscape consists of a single peak. Most mutations are deleterious and therefore most paths end downhill, with inactive proteins, rather than uphill at more-fit sequences. Furthermore, epistatic interactions occur when the presence of one mutation affects the contribution of another. Such epistatic interactions lead to curves in the fitness landscape and constrain evolutionary searches. Extreme forms of epistasis, in which mutations that are negative in one context become beneficial in another (so-called sign epistasis59), create local optima on the landscape that can frustrate evolutionary optimization. Epistatic interactions are a ubiquitous feature of protein fitness landscapes60, 61. We argue, however, that they are not important for most optimizations by directed evolution, which instead follow one of many smooth paths that bypass the more rugged, epistatic routes on this high-dimensional surface62-64. Among the large number of mutational trajectories between a starting point and a solution, smooth uphill paths can often be found (FIG. 1c).
Knowing of epistatic interactions and local fitness optima, some protein engineers worry about the need to make and find multiple mutations at one time. If multiple mutations are in fact needed to climb the peak, the combinatorial explosion of mutational possibilities makes them especially challenging to find. For even a small protein of 100 amino acids, there are 1,900 single amino acid mutants and more than 1.5 million double mutants. The number of possible sequences increases exponentially with the number of mutations, and a complete sampling of even just the double mutants is beyond the capacity of most screens.
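The counts quoted above follow from choosing which k positions to mutate and then one of the 19 alternative amino acids at each. A minimal Python sketch of that arithmetic (the function name is our own) is:

```python
from math import comb


def count_mutants(length, k, alphabet_size=20):
    """Number of distinct variants carrying exactly k substitutions:
    choose k positions, then one of (alphabet_size - 1) new residues each."""
    return comb(length, k) * (alphabet_size - 1) ** k


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(f"{k} substitution(s): {count_mutants(100, k):,} variants")
```

For a 100-residue protein this reproduces the 1,900 single mutants and gives about 1.8 million double mutants; triple mutants already exceed a billion.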
Ever higher-throughput screening approaches have been developed to enable sampling of more mutants and more combinations of mutations3, 65, 66. These screens can allow multiple paths to be explored simultaneously, increasing the probability of discovering good adaptive routes to higher fitness. Higher-throughput screens or selections usually come at the cost of decreased accuracy, however, especially when a surrogate function that is more amenable to high throughput measurement is substituted for the desired function. Furthermore, increasing the mutation rate to capture rare synergistic mutations can make it more difficult to identify improved single-mutation variants, because common deleterious mutations will tend to mask the rare beneficial ones. It is thus often better to focus on sampling single mutants with a higher quality, lower-throughput screen rather than on increasing the throughput to capture multiple simultaneous mutations. Although a search through single adaptive steps cannot find mutations exhibiting negative epistasis, there are usually other, step-wise adaptive routes to the objective.
The high dimensionality of sequence space that makes finding simultaneous beneficial mutations so difficult can be reduced by taking advantage of structural, functional or phylogenetic information to focus mutations to those residues most likely to lead to the desired properties. For example, the modularity of protein structures permits the separate optimization of protein domains13, 67. Phylogenetic analyses suggest that nature might separately optimize other, structurally non-obvious subunits, or ‘sectors’68, which could prove to be appropriate targets for directed evolution. The search space can also be reduced by focusing mutations to specific residues within a domain, for example, in an active site or binding pocket in which functional changes might be more likely to occur11, 46, 69-71. This strategy only works, however, when the experimenter is able to select the right residue combinations for random mutagenesis and leaves out the possibility of finding surprising and informative solutions elsewhere. Numerous studies have shown, for example, that plenty of activating mutations lie outside enzyme catalytic sites and exert their influence through mechanisms that might not be obvious from structural analysis 9, 10, 72.
Evolution by the accumulation of single mutations has proven to be very effective at optimizing a function or property that already exists or can be reached through a series of intermediate steps. Some functions, however, simply cannot be reached through a series of small uphill steps and instead require longer ‘jumps’ that include mutations that would be neutral or even deleterious when made individually. Examples of functions that might require multiple simultaneous mutations include the appearance of a new catalytic activity or activity on a substrate for which the parent and its single mutants show no measurable activity.
Because most mutations are deleterious, the probability that a variant retains its fold and function declines exponentially with the number of random substitutions 36, 37, and random jumps in sequence space uncover mostly inactive proteins. Thus new functions are extremely difficult to obtain without altering some aspect of the search. One approach is to create a new starting point, a parent protein with at least some minimal function, and improve that by directed evolution7. Where natural examples of a desired function are not practical or might not even exist, emerging protein design tools have identified functional sequences5. Expanding the sequence space by incorporation of nonnatural amino acids can also introduce a whole array of new functions, and directed evolution can do the fine-tuning that might be needed to optimize these novel designs15. Another approach is to find more conservative ways to make multiple mutations, for example, using computational protein design tools to identify sets of mutations that are likely to be compatible with retention of structure47.
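The exponential decline can be illustrated with the simplest possible model, in which each random substitution independently preserves fold and function with the same probability. This independence is our simplifying assumption, and the value used for that probability below is hypothetical, chosen only to fall within the range of mutational tolerance mentioned earlier.

```python
def fraction_functional(num_mutations, neutrality):
    """Expected fraction of variants that remain functional after
    num_mutations random substitutions, assuming each substitution is
    tolerated independently with probability `neutrality`."""
    if not 0.0 <= neutrality <= 1.0:
        raise ValueError("neutrality must be between 0 and 1")
    if num_mutations < 0:
        raise ValueError("num_mutations must be non-negative")
    return neutrality ** num_mutations


if __name__ == "__main__":
    HYPOTHETICAL_NEUTRALITY = 0.6  # assumption, not a measured value
    for m in (1, 2, 5, 10, 20):
        frac = fraction_functional(m, HYPOTHETICAL_NEUTRALITY)
        print(f"{m:2d} random substitutions: {frac:.2e} functional")
```

Under this model, even a fairly tolerant protein yields almost no functional variants once a few tens of random substitutions are made at once, which is why large random jumps in sequence space are so unproductive.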
An approach to making multiple mutations that is used extensively in nature is recombination. Naturally-occurring homologous proteins can be recombined to create genetic diversity within protein sequence libraries73-75 (FIG 3a). It has been shown that mutations made by homologous recombination are much less disruptive and generate functional proteins with much higher frequency than random mutations56 (FIG 3b). Methods based on homologous recombination direct crossovers to regions of high sequence identity and are generally limited to sequences that are very similar (more than 70% identity) 75, whereas various sequence-independent methods can recombine at random 76, 77 or user-specified sites78, 79. Recombining homologous proteins by choosing crossovers based on structural information allows construction of libraries of chimeric proteins that simultaneously exhibit a high level of functionality and significant genetic diversity80. In all cases, the chimeric proteins inherit the best (and worst) residues the parents have to offer, in new combinations not observed in nature.
Chimeric proteins can differ by tens or even hundreds of mutations from their parent sequences and still function. The conservative nature of recombination can be exploited to make whole families of novel enzymes. For example, in one set of more than 6,000 chimeric cytochrome P450 proteins having an average of 70 mutations from the closest parent, approximately half folded properly, and at least 75% of the folded P450 proteins displayed enzymatic activity80.
The new combinations of residues can give rise to novel properties81. Because many of the mutations made by recombination are neutral or nearly neutral, recombination is an efficient way to generate the ‘neutral drifts’, or accumulation of neutral mutations, that have been demonstrated to lead to increases in promiscuous functions 82, 83 and mutational robustness 84, 85. For example, members of the chimeric cytochrome P450 library exhibited higher enzymatic activity than any of the three parents across a panel of 11 non-native substrates that included substrates on which the parent enzymes showed no measurable activity86. A large number of P450 chimeras were also more thermostable than the most thermostable parent enzyme, and the thermostable chimeras could be readily identified based on a small sampling of the library 87 (FIG. 3c). This approach was subsequently used to generate dozens of highly stable, highly active fungal cellobiohydrolase II enzymes that degrade cellulose into fermentable sugars (for example, for biofuels applications) 79. Recombination is thus an interesting way to explore new functions, although it might not be the best way to obtain or optimize a specific desired property or set of properties.
In addition to generating a plethora of novel proteins, directed evolution studies have elucidated available pathways and molecular mechanisms of adaptation, demonstrated a key role for stability in epistasis and evolvability, identified important evolutionary trade-offs in protein properties, and demonstrated the simultaneously conservative and exploratory nature of recombination, all shedding light on long-standing questions in protein chemistry and evolutionary biology. First and foremost, directed evolution experiments have demonstrated time and again how rapidly proteins can adapt to exhibit new functions and properties. Protein behavior can change dramatically upon mutating a very small fraction of the protein sequence. Directed evolution also provides a detailed view into the adaptive process.
A directed evolution approach to studying sequence-function relationships circumvents several challenges associated with inferring mechanisms of adaptation using comparisons of evolutionarily-related natural amino acid sequences21, 22. Such studies are confounded by the large numbers of mostly neutral mutations that accumulated during divergence of the sequences and the complex and largely unknown selection pressures under which the natural sequences evolved. In contrast, the sequences generated by directed evolution contain a small number of adaptive mutations that accumulated under well-defined selective pressures. Furthermore, performing the evolution in the laboratory permits access to the full ‘fossil record’ of evolutionary intermediates, whose sequences, structures, and functions can be analysed in an attempt to explain how new properties were acquired10, 44, 72, 88. Fasan and coworkers analysed selected intermediates that arose during the directed evolution of a cytochrome P450 fatty acid hydroxylase into a highly efficient and highly specific propane monooxygenase13, 72 (FIG. 4). The gradual increase in activity on propane (as measured by total turnovers of propane to propanol, the property targeted during directed evolution) was accompanied by other interesting changes in the enzyme's behavior, the most notable of which was the decrease in thermal stability (T50). Activating mutations came at the cost of stability, to the point that it became necessary to incorporate stabilizing mutations (generation 9 in FIG. 4) before further increases in activity could appear. This apparent trade-off between functionally beneficial mutations and stability reflects the fact that most mutations are destabilizing and therefore most activating mutations are also destabilizing. Because evolution favors the most likely solutions over rarer ones, it favors marginal stability in the absence of selection for higher stability. It also favors properties that are compatible with marginal stability32. Such trade-offs have also been demonstrated to constrain the evolution of antibiotic resistance enzymes89 and will be discussed further below.
The mutations that accumulated in the heme domain of the cytochrome P450 are also depicted in Figure 4b, color-coded according to the generation in which they appeared. Many of the mutations that conferred the increased activity on propane lie outside the substrate-binding pocket, where they influence substrate recognition through mechanisms that are difficult to discern from crystal structures or modeling. That the effects of the adaptive mutations are difficult to rationalize, much less predict, underscores how little we understand of how sequence determines protein structure and function. Directed evolution deals with the details of molecular interactions, and one hopes that those details will eventually inform protein design efforts7.
Directed evolution can explore alternative evolutionary scenarios, for example, to identify other possible solutions to the same functional challenge or to determine whether multiple paths can lead to the same solution, as was done with a laboratory-evolved β-lactamase variant that contains five mutations responsible for a 100,000-fold increase in cefotaxime resistance63. In this study, the authors constructed variants having all 32 (2⁵) combinations of the adaptive mutations, representing all intermediate sequences along all 120 (5 factorial) possible mutational pathways. They were able to estimate the probability of each pathway based on the relative change in antibiotic resistance conferred to the bacteria by each mutation along each path. Whereas most of the possible paths were constrained by epistasis and were therefore highly unlikely, there were 18 different, simple uphill walks to the final solution.
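The combinatorics of such a study can be made concrete with a short sketch. The Python below enumerates the 2⁵ = 32 genotypes and the 5! = 120 orders in which five mutations can be acquired, then counts the orders along which fitness rises at every step. The two fitness tables are hypothetical toy landscapes chosen for illustration, not the measured resistance values, and the sketch only counts strictly uphill paths; it does not reproduce the pathway probabilities estimated in the study.

```python
from itertools import permutations, product

N_SITES = 5  # five adaptive mutations, each absent (0) or present (1)

# All 2**5 = 32 genotypes and all 5! = 120 orders of acquiring the mutations.
genotypes = list(product((0, 1), repeat=N_SITES))
orderings = list(permutations(range(N_SITES)))


def is_uphill(order, fitness):
    """Return True if fitness strictly increases at every step of the path."""
    genotype = [0] * N_SITES
    previous = fitness[tuple(genotype)]
    for site in order:
        genotype[site] = 1
        current = fitness[tuple(genotype)]
        if current <= previous:
            return False
        previous = current
    return True


def count_uphill(fitness):
    """Count the mutational orderings that are uphill at every step."""
    return sum(is_uphill(order, fitness) for order in orderings)


# Hypothetical landscape 1: every mutation adds one unit of fitness (no epistasis).
additive = {g: sum(g) for g in genotypes}

# Hypothetical landscape 2: the mutation at site 0 is deleterious (-1) unless
# the mutation at site 4 is already present, in which case it is beneficial (+1).
epistatic = {
    g: sum(g[1:]) + (g[0] if g[4] else -g[0])
    for g in genotypes
}

print(len(genotypes), len(orderings))  # 32 120
print(count_uphill(additive))          # 120
print(count_uphill(epistatic))         # 60
```

In the toy epistatic landscape a single conditional interaction already closes half of the paths, because every ordering that acquires site 0 before site 4 includes a downhill step. Measured landscapes with several such interactions can leave only a small fraction of the 120 paths accessible.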
Even the earliest directed evolution experiments noted how rapidly proteins could adapt to new selective pressures1, 58, indicating the ready availability of smooth uphill paths in the fitness landscapes. Stability, the ability to tolerate new environments, and low-level side reactions (‘promiscuous’ functions) all tend to respond well to directed evolution. One study used a well-controlled set of experiments to select for six different promiscuous activities starting from three different enzymes11. After two rounds of directed evolution, yielding just one to four mutations, the promiscuous enzyme activities (kcat/KM) had increased by up to 150-fold over the activities of the parent enzymes. Interestingly, these newly evolved activities came at little cost to the native enzymatic activities, suggesting a particular robustness of the native functions to mutation and supporting a scenario for evolution of new activities that allows both the native and novel activities to be displayed in the same gene for some period of time8.
While demonstrating the availability of smooth paths uphill, directed evolution has also provided insight into the molecular epistasis that curves the landscapes. Several studies have revealed a key link between stability and epistasis, where the effect of a mutation can be conditional on the stability of the parent sequence36, 43, 90 (FIG. 5). This was demonstrated, for example, in a study of cephalosporin antibiotic resistance mutations in β-lactamase, where the fitness effects of several active site mutations were found to depend on the presence of a stabilizing M182T mutation89 (FIG. 5a). These epistatic interactions are the result of catalytically beneficial but destabilizing mutations in the active site that cannot be tolerated unless the stabilizing M182T mutation is present. Without M182T, the active site mutations destabilize the enzyme to the point that total activity is compromised.
Many examples of stability-mediated epistasis are best explained in terms of a protein stability threshold, where stability is under selection only insofar as it allows a protein to fold and function36, 43, 91 (FIG. 5). The consequences for evolution are profound: a protein whose stability is low cannot accept more than a small fraction of the possible mutations, because most mutations are destabilizing. Thus it can become trapped on a local optimum, unable to go further. As illustrated in Figure 5b, proteins enjoying a larger margin above the minimal stability threshold can explore many more mutations and can therefore continue to adapt to other tasks such as acquiring activity towards a new substrate or partner29. Stability-mediated epistasis is a mechanism whereby neutral mutations can shape the available adaptive pathways during natural evolution as well as in the laboratory. When an evolutionary search in the laboratory seems to have exhausted all options for further uphill steps, the incorporation of stabilizing mutations has opened up new adaptive routes13.
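The threshold picture can be illustrated with a minimal numerical sketch. The Python below assumes a simple two-state folding model in which observed activity is the specific activity of the folded protein multiplied by the fraction folded. That model, and every number in it (the free energies, the ΔΔG values and the three-fold activity gain), are hypothetical choices made for illustration and are not taken from the studies cited above.

```python
import math

RT = 0.592  # kcal/mol at 298 K (R = 1.987e-3 kcal/(mol*K))


def fraction_folded(dg_fold):
    """Two-state fraction folded; a more negative dg_fold means a more stable protein."""
    return 1.0 / (1.0 + math.exp(dg_fold / RT))


def total_activity(specific_activity, dg_fold):
    """Observed activity = activity of the folded protein times the fraction folded."""
    return specific_activity * fraction_folded(dg_fold)


# Hypothetical numbers, chosen only to illustrate the threshold effect.
MARGINAL_DG = -1.0      # folding free energy of a marginally stable parent (kcal/mol)
STABILIZING_DDG = -3.0  # effect of a functionally neutral stabilizing mutation
ACTIVATING_DDG = 2.5    # destabilizing cost of an activating mutation
ACTIVATING_GAIN = 3.0   # fold increase in specific activity from that mutation

results = {}
backgrounds = (
    ("marginal", MARGINAL_DG),
    ("stabilized", MARGINAL_DG + STABILIZING_DDG),
)
for name, dg in backgrounds:
    parent = total_activity(1.0, dg)
    mutant = total_activity(ACTIVATING_GAIN, dg + ACTIVATING_DDG)
    results[name] = (parent, mutant)
    print(f"{name:>10}: parent {parent:.2f}, activating mutant {mutant:.2f}")
```

With these assumed values the same activating mutation lowers total activity in the marginally stable background and raises it in the stabilized one. The stabilizing mutation itself barely changes the parent's activity, which is the sense in which a nearly neutral mutation can open a new adaptive route.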
Despite being performed on different protein folds with selection for different protein functions, the repeated evaluation of thousands of random mutations has revealed the general features of protein fitness landscapes. In addition to the uphill paths that lie alongside a large number of less favorable, epistatic routes, there is an even larger number of sideways steps in the protein fitness landscape. The high frequency of neutral mutations observed during evaluation of random mutant libraries suggests a myriad of sequences with essentially equivalent fitness. This is completely consistent with the existence of natural protein homologs that differ at a large number of positions, the majority of which are functionally neutral. Even sequences that are highly optimized are likely just one of many potential solutions to a given functional challenge. In fact, it is probably more accurate to imagine protein evolution occurring on ‘neutral networks’, rather than on fitness landscapes where each neighbor has a different fitness28, 62. This pervasive neutrality is in fact exploited when families of functional proteins are constructed by recombination of homologous proteins79, 80.
As discussed above in the context of stability-mediated epistasis, however, mutations that are neutral in one context might not be neutral in all, and can therefore provide new opportunities for evolution. Directed evolution has demonstrated an important role for stabilizing mutations (which can be functionally neutral or only slightly deleterious) in adaptation. Laboratory evolution experiments have also demonstrated that purposefully accumulated neutral mutations alter promiscuous activities and create new starting points for subsequent adaptive evolution82, 83, 92. Genetic drift and pre-existing diversity might have a similarly important role in natural adaptive evolution62.
An overall picture of the protein function landscape is thus emerging from accumulated directed evolution experience. This picture offers a description of the physical features that all proteins (synthetic or natural) must exhibit and the effects of mutations on those features. Extending lessons learned from directed evolution to natural evolution, however, requires caution because these search processes operate under different time scales, population sizes, mutation rates, strength of selection, etc. Furthermore, natural evolution works on a different fitness landscape, and it is unclear how the protein fitness assayed during directed evolution is related to the organismal fitness that natural evolution optimizes. Differences reflect the consequences of interactions between the protein and the cellular environment, and might include constraints related to metabolic burden, regulation, nonspecific interactions, or other factors.
The ability to disconnect a protein from its in vivo function is actually a valuable asset of directed evolution, because it allows the exploration of physically possible proteins without the often-severe constraint of their being biologically relevant and contributing to organismal fitness. Thus directed evolution can be used to identify which features of proteins are dictated by their physical properties versus those that are due to biological constraints or evolutionary origins and history. The laboratory evolution of the cytochrome P450 propane monooxygenase (FIG. 4), for example, demonstrated the physical possibility, and indeed the ready availability, of such an enzyme, even though known organisms that live on small alkanes use mechanistically and evolutionarily unrelated enzymes for this transformation72. Another example is the generation of proteins with combinations of properties that are usually not found in natural proteins, such as high catalytic activity at low temperature and high stability at elevated temperature21, 22. When properties seem to trade off like this, it might be tempting to infer that such trade-offs are dictated by physical requirements such as the incompatibility between molecular rigidity needed for high stability and the flexibility required for catalytic activity93, 94. If stability and enzymatic activity placed mutually exclusive demands on protein flexibility, then highly active and highly stable enzymes could not exist (a statement that protein engineers did not want to hear). Directed evolution, however, has little trouble finding enzymes that are both highly active and stable when the experiments select for both properties95. Clearly, such proteins are far rarer than highly active, marginally stable proteins, and without a good reason, natural sequences would not exhibit both features21, 22, 32, 96.
Despite the vast size of sequence space and the complex nature of protein function, the Darwinian algorithm of mutation and selection provides a powerful method to generate proteins with altered functions. This simple uphill walk on a fitness landscape in sequence space works because proteins are wonderfully evolvable and can adapt to new conditions or even take on new functions with only a few mutations.
In addition to providing useful proteins, directed evolution experiments have also taught us how proteins adapt and shed light on processes at work during natural evolution21, 62, 97. These experimental results allow us to look at sequence data in a functional context, providing a bridge between the long-separated fields of evolutionary and molecular biology98. Directed evolution experiments have been used to address important evolutionary questions about the average effects of mutations, mechanisms of functional divergence, evolvability and evolutionary constraints11, 85, 96, 99.
With the growing number of applications for engineered proteins, directed evolution will continue to be an important strategy for making proteins that are well adapted to new environments and new functions. More advanced high-throughput screens and higher quality sequence libraries will make the searches easier and will enable evolution to solve more and more complex problems. Advances in our understanding of proteins can be incorporated into library design, and the rapidly decreasing cost of DNA synthesis will relieve many sequence construction constraints. Directed evolution will help teach us how biological systems adapt to changing demands; it might also help us to address some of today's most challenging problems of providing effective treatments for disease or producing fuels and chemicals from renewable resources.
The authors acknowledge support from the U.S. Army Research Office, Department of Energy, National Science Foundation, and the National Institutes of Health.
Frances H. Arnold is the Dick and Barbara Dickinson Professor of Chemical Engineering, Bioengineering and Biochemistry at the California Institute of Technology, where her research focuses on evolution of biological molecules and systems in the laboratory. She is an elected member of all three branches of the U.S. National Academies, the Academy of Science, Institute of Medicine and Academy of Engineering, and has served as a Science Advisor to Maxygen, Codexis, Mascoma, Fluidigm, Arzeda and Amyris Biotechnologies and cofounded advanced biofuels company Gevo in 2005.
Philip A. Romero is a graduate student in Biochemistry and Molecular Biophysics at the California Institute of Technology.
Philip A. Romero, M/C 210-41, Pasadena, CA 91125 USA, Tel: (626) 395-4553.
Frances H. Arnold, Dick and Barbara Dickinson Professor of Chemical Engineering and Biochemistry, Division of Chemistry and Chemical Engineering, 210-41, California Institute of Technology, Pasadena, CA 91125 USA, Tel: (626) 395-4162.