Before we describe some of the key lessons that directed evolution studies have taught about protein function and evolution, we would like to briefly discuss the experimental strategy. How the experiment is performed obviously influences the outcome and therefore the information that one extracts from it. Finding a sequence that performs a desired function in a vast space of possible sequences that is only sparsely populated might seem like a daunting task. Inefficient searches of this space could take essentially forever, and the task of the protein engineer is to choose a strategy that will reach the objective and do so quickly and easily. Starting with a functional protein, directed evolution uses repeated generations of mutation to create functional variation and selection of the most fit variants to direct the search to higher elevations on the fitness landscape. It involves four key steps: first, identifying a good starting sequence; second, mutating this ‘parent’ to create a library of variants; third, identifying variants with improved function; and last, repeating the process until the desired function is achieved. There are many options for implementation of each step, the choice of which can greatly affect both the efficiency and the endpoint of an evolutionary search.
Directed evolution (and for that matter natural evolution) relies on the ability of proteins to function over a wider range of environments or carry out a wider range of functions than what might be biologically relevant at a given time and therefore selected for. This ability to tolerate a nonnatural environment or to exhibit ‘promiscuous’ functions at some minimal level provides the jumping off point for optimization towards that new goal. A good parent protein for directed evolution, then, exhibits enough of the desired function that small improvements (expected from a single mutation) can be reliably discerned in a high-throughput screen [38]. It is also easy to work with and sufficiently stable to accommodate multiple, potentially destabilizing mutations if the target function is some other property. Some proteins can be significantly more evolvable than others [11, 29, 39, 40]. Possible molecular mechanisms that contribute to evolvability have been discussed, including the key role of the chemical mechanism in enzyme functional evolution [41, 42] and the idea that evolvable proteins exist in multiple closely related but functionally diverse conformations whose distribution is easily altered by mutation [8]. These ideas, however, are still largely speculative, and little other than the ability to accept mutations [29, 43] has been conclusively demonstrated in laboratory evolution experiments to contribute directly to allowing one protein to adapt to a new challenge more readily than another. A good heuristic indicator of a protein family's evolvability is its natural functional diversity [40, 44]: proteins that have adapted to exhibit a range of functions across the family, for example members of an enzyme family that accepts a wide range of substrates (although individual enzymes in the family might be specific), are likely to be adaptable in the laboratory.
The next step is to create a library of variants. Since screening is often the most difficult experimental step, the library is usually created to generate the highest probability of finding improved proteins given the screening capability. Because most mutations are deleterious and multiple mutations frequently inactivate proteins (vide infra), this usually involves a low mutation rate (1 or 2 amino acid substitutions per gene). If screening is not difficult (for example, there is a good genetic selection), then the library can be constructed to generate the largest potential improvement. This might mean a slightly higher mutation rate [45]. In either case, mutations can be introduced randomly [1] or, if structural or mechanistic information is available, they can be made in a more directed fashion [46-48], in an effort to increase the frequency of improved proteins and reduce the load in the next step.
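To make the effect of the mutation rate concrete, the sketch below estimates how a random library is distributed over variants carrying 0, 1, 2 or more substitutions. It assumes, as a simplification that is not taken from the text, that the number of amino acid substitutions per gene follows a Poisson distribution; the function name `substitution_count_probs` and the example rates are hypothetical.

```python
import math

def substitution_count_probs(mean_subs, max_k=6):
    """Poisson probability of exactly k substitutions per gene, for k = 0..max_k.

    Assumption (not from the text): substitutions per gene are Poisson
    distributed with mean `mean_subs`.
    """
    return [math.exp(-mean_subs) * mean_subs ** k / math.factorial(k)
            for k in range(max_k + 1)]

# Compare a low mutation rate with higher ones (illustrative values only).
for mean_subs in (1.0, 2.0, 4.0):
    probs = substitution_count_probs(mean_subs)
    three_or_more = 1.0 - sum(probs[:3])
    print(f"mean {mean_subs}: unmutated {probs[0]:.2f}, "
          f"single {probs[1]:.2f}, three or more {three_or_more:.2f}")
```

Under this assumption, raising the mean rate shifts the library away from single mutants and towards variants carrying several substitutions at once, which is the trade-off described above.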
Screening (with high-throughput functional assays) or selection (for example, a genetic selection in which hosts having improved proteins outcompete the others) is used to identify the library members improved in the target property. A good screen or selection accurately assesses the target properties. The rule ‘you get what you screen for’ is always useful to remember: screening (or selecting) for something else is risky [49]. It is also important not to demand too much improvement in a single generation. The hurdle must be tuned to the screening capacity and should usually be no greater than the improvement that can be provided by a single mutation. If the desired function is beyond what a single mutation can accomplish, the problem can be broken down into a series of smaller ones that can be solved by the accumulation of single mutations, for example by gradually increasing the selection pressure or evolving against a series of intermediate challenges [13]. The process of mutation and selection is repeated until the fitness objective is met; the number of iterations required obviously depends on the starting fitness and the improvement that can be achieved in each round, but is often only 5-10 generations.
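The mutate, screen and repeat cycle described above can be sketched as a simple greedy search. This is only a toy illustration under stated assumptions: the additive landscape, the library size, the number of generations and all function names are hypothetical, and a real screen measures function experimentally rather than looking it up in a table.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def make_additive_landscape(length, rng):
    """Hypothetical toy landscape: each residue contributes independently."""
    return [{aa: rng.random() for aa in ALPHABET} for _ in range(length)]

def fitness(seq, landscape):
    """Sum of the per-site contributions of the residues in `seq`."""
    return sum(site[aa] for site, aa in zip(landscape, seq))

def mutate(seq, rng):
    """Return a copy of `seq` with one random amino acid substitution."""
    pos = rng.randrange(len(seq))
    new_aa = rng.choice([aa for aa in ALPHABET if aa != seq[pos]])
    return seq[:pos] + new_aa + seq[pos + 1:]

def directed_evolution(parent, landscape, rng, library_size=200, generations=8):
    """Each generation: build a single-mutant library, keep the best variant
    as the new parent only if it improves on the current parent."""
    history = [fitness(parent, landscape)]
    for _ in range(generations):
        library = [mutate(parent, rng) for _ in range(library_size)]
        best = max(library, key=lambda s: fitness(s, landscape))
        if fitness(best, landscape) > fitness(parent, landscape):
            parent = best
        history.append(fitness(parent, landscape))
    return parent, history

rng = random.Random(0)
landscape = make_additive_landscape(30, rng)
start = "".join(rng.choice(ALPHABET) for _ in range(30))
evolved, history = directed_evolution(start, landscape, rng)
print([round(f, 2) for f in history])
```

Because a variant is accepted only when it beats its parent, the recorded fitness never decreases from one generation to the next; on a rugged landscape the same loop could stall at a local optimum.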
An evolutionary search relies on the presence of functional diversity within a population, which is the result of underlying genetic variation. At the molecular level, this genetic variation can take many forms: point mutations, insertions, deletions, recombination, circular permutation, etc. [50-52]. To search efficiently and minimize the screening load, the underlying genetic variation should be set to generate the highest probability of improvement. Statistically, random mutations tend to be quite harsh, usually decreasing activity and sometimes destroying it altogether. Typically, 30-50% of single amino acid mutations are strongly deleterious, 50-70% are neutral or slightly deleterious, and 0.01-1% are beneficial [11, 29, 37, 53-56]. If the fitness landscape is Fujiyama-like with many smooth uphill paths, one need only accumulate beneficial mutations (either in multiple rounds of mutagenesis and screening or by recombining beneficial mutations found in each round [57, 58]) until the desired fitness is reached. In a single-peaked landscape, all beneficial mutations make a cumulative contribution to the desired function, and all paths uphill eventually converge to the same, optimal solution.
Of course, no real protein landscape consists of a single peak. Most mutations are deleterious and therefore most paths end downhill, with inactive proteins, rather than uphill at more-fit sequences. Furthermore, epistatic interactions occur when the presence of one mutation affects the contribution of another. Such epistatic interactions lead to curves in the fitness landscape and constrain evolutionary searches. Extreme forms of epistasis, in which mutations that are negative in one context become beneficial in another (so-called sign epistasis [59]), create local optima on the landscape that can frustrate evolutionary optimization. Epistatic interactions are a ubiquitous feature of protein fitness landscapes [60, 61]. We argue, however, that they are not important for most optimizations by directed evolution, which instead follow one of many smooth paths that bypass the more rugged, epistatic routes on this high-dimensional surface [62-64]. Among the large number of mutational trajectories between a starting point and a solution, smooth uphill paths can often be found.
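A two-mutation toy example illustrates how sign epistasis closes some trajectories while leaving others open. The fitness values below are invented for illustration and the function name `uphill_paths` is hypothetical; genotypes are written as tuples of 0 (parent residue) and 1 (mutated residue).

```python
from itertools import permutations

def uphill_paths(fitness, n_sites):
    """Return the orders of single mutations along which fitness strictly
    increases at every step, from the all-0 parent to the all-1 variant."""
    paths = []
    for order in permutations(range(n_sites)):
        genotype = [0] * n_sites
        previous = fitness[tuple(genotype)]
        accessible = True
        for site in order:
            genotype[site] = 1
            current = fitness[tuple(genotype)]
            if current <= previous:
                accessible = False
                break
            previous = current
        if accessible:
            paths.append(order)
    return paths

# Invented fitness values for illustration only.
additive = {(0, 0): 1.0, (1, 0): 1.5, (0, 1): 1.3, (1, 1): 1.8}
sign_epistasis = {(0, 0): 1.0, (1, 0): 0.8, (0, 1): 1.3, (1, 1): 1.8}
reciprocal = {(0, 0): 1.0, (1, 0): 0.8, (0, 1): 0.9, (1, 1): 1.8}

for name, table in [("additive", additive),
                    ("sign epistasis", sign_epistasis),
                    ("reciprocal sign epistasis", reciprocal)]:
    print(name, uphill_paths(table, 2))
```

In the second table, the first mutation is deleterious on its own but beneficial once the second is present, so only the path that makes the second mutation first remains uphill. In the third table neither single step is uphill, which is the situation in which a step-wise search cannot reach the double mutant.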
Dealing with the combinatorial explosion
Knowing of epistatic interactions and local fitness optima, some protein engineers worry about the need to make and find multiple mutations at one time. If multiple mutations are in fact needed to climb the peak, the combinatorial explosion of mutational possibilities makes them especially challenging to find. For even a small protein of 100 amino acids, there are 1,900 single amino acid mutants and more than 1.5 million double mutants. The number of possible sequences increases exponentially with the number of mutations, and a complete sampling of even just the double mutants is beyond the capacity of most screens.
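The counts quoted above follow from choosing which positions to mutate and then choosing one of the 19 alternative amino acids at each chosen position. A minimal sketch, with the hypothetical helper name `count_variants`:

```python
from math import comb

def count_variants(length, n_mutations, alphabet_size=20):
    """Number of distinct variants carrying exactly `n_mutations` amino acid
    substitutions in a protein of `length` residues."""
    return comb(length, n_mutations) * (alphabet_size - 1) ** n_mutations

for k in (1, 2, 3):
    print(k, count_variants(100, k))
```

For a 100-residue protein this gives 1,900 single mutants and 1,786,950 double mutants, consistent with the figures in the text.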
Ever higher-throughput screening approaches have been developed to enable sampling of more mutants and more combinations of mutations [3, 65, 66]. These screens can allow multiple paths to be explored simultaneously, increasing the probability of discovering good adaptive routes to higher fitness. Higher-throughput screens or selections usually come at the cost of decreased accuracy, however, especially when a surrogate function that is more amenable to high-throughput measurement is substituted for the desired function. Furthermore, increasing the mutation rate to capture rare synergistic mutations can make it more difficult to identify improved single-mutation variants, because common deleterious mutations will tend to mask the rare beneficial ones. It is thus often better to focus on sampling single mutants with a higher quality, lower-throughput screen rather than on increasing the throughput to capture multiple simultaneous mutations. Although a search through single adaptive steps cannot find mutations exhibiting negative epistasis, there are usually other, step-wise adaptive routes to the objective.
The high dimensionality of sequence space that makes finding simultaneous beneficial mutations so difficult can be reduced by taking advantage of structural, functional or phylogenetic information to focus mutations to those residues most likely to lead to the desired properties. For example, the modularity of protein structures permits the separate optimization of protein domains [13, 67]. Phylogenetic analyses suggest that nature might separately optimize other, structurally non-obvious subunits, or ‘sectors’ [68], which could prove to be appropriate targets for directed evolution. The search space can also be reduced by focusing mutations to specific residues within a domain, for example, in an active site or binding pocket in which functional changes might be more likely to occur [11, 46, 69-71]. This strategy only works, however, when the experimenter is able to select the right residue combinations for random mutagenesis and leaves out the possibility of finding surprising and informative solutions elsewhere. Numerous studies have shown, for example, that plenty of activating mutations lie outside enzyme catalytic sites and exert their influence through mechanisms that might not be obvious from structural analysis [9, 10, 72].
Alternative search strategies
Evolution by the accumulation of single mutations has proven to be very effective at optimizing a function or property that already exists or can be reached through a series of intermediate steps. Some functions, however, simply cannot be reached through a series of small uphill steps and instead require longer ‘jumps’ that include mutations that would be neutral or even deleterious when made individually. Examples of functions that might require multiple simultaneous mutations include the appearance of a new catalytic activity or activity on a substrate for which the parent and its single mutants show no measurable activity.
Because most mutations are deleterious, the probability that a variant retains its fold and function declines exponentially with the number of random substitutions [36, 37], and random jumps in sequence space uncover mostly inactive proteins. Thus new functions are extremely difficult to obtain without altering some aspect of the search. One approach is to create a new starting point, a parent protein with at least some minimal function, and improve that by directed evolution [7]. Where natural examples of a desired function are not practical or might not even exist, emerging protein design tools have identified functional sequences [5]. Expanding the sequence space by incorporation of nonnatural amino acids can also introduce a whole array of new functions, and directed evolution can do the fine-tuning that might be needed to optimize these novel designs [15]. Another approach is to find more conservative ways to make multiple mutations, for example, using computational protein design tools to identify sets of mutations that are likely to be compatible with retention of structure [47].
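The exponential decline noted at the start of this paragraph can be illustrated with a simple model in which each random substitution independently preserves function with the same probability. Both the independence assumption and the example value of 0.6 are illustrative choices, not figures from the text, and `fraction_functional` is a hypothetical helper name.

```python
def fraction_functional(n_substitutions, retention_per_substitution=0.6):
    """Expected fraction of variants that remain functional after
    `n_substitutions` random substitutions, assuming each substitution
    independently preserves function with the given probability."""
    return retention_per_substitution ** n_substitutions

for k in (1, 2, 5, 10, 20):
    print(k, f"{fraction_functional(k):.2e}")
```

Under this model the functional fraction shrinks by a constant factor with every added substitution, which is why large random jumps in sequence space return mostly inactive proteins.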
An approach to making multiple mutations that is used extensively in nature is recombination. Naturally-occurring homologous proteins can be recombined to create genetic diversity within protein sequence libraries [73-75]. It has been shown that mutations made by homologous recombination are much less disruptive and generate functional proteins with much higher frequency than random mutations [56]. Methods based on homologous recombination direct crossovers to regions of high sequence identity and are generally limited to sequences that are very similar (more than 70% identity) [75], whereas various sequence-independent methods can recombine at random [76, 77] or user-specified sites [78, 79]. Recombining homologous proteins by choosing crossovers based on structural information allows construction of libraries of chimeric proteins that simultaneously exhibit a high level of functionality and significant genetic diversity [80]. In all cases, the chimeric proteins inherit the best (and worst) residues the parents have to offer, in new combinations not observed in nature.
Recombination of homologous sequences
Chimeric proteins can differ by tens or even hundreds of mutations from their parent sequences and still function. The conservative nature of recombination can be exploited to make whole families of novel enzymes. For example, in one set of more than 6,000 chimeric cytochrome P450 proteins having an average of 70 mutations from the closest parent, approximately half folded properly, and at least 75% of the folded P450 proteins displayed enzymatic activity [80].
The new combinations of residues can give rise to novel properties [81]. Because many of the mutations made by recombination are neutral or nearly neutral, recombination is an efficient way to generate the ‘neutral drifts’, or accumulation of neutral mutations, that have been demonstrated to lead to increases in promiscuous functions [82, 83] and mutational robustness [84, 85]. For example, members of the chimeric cytochrome P450 library exhibited higher enzymatic activity than any of the three parents across a panel of 11 non-native substrates that included substrates on which the parent enzymes showed no measurable activity [86]. A large number of P450 chimeras were also more thermostable than the most thermostable parent enzyme, and the thermostable chimeras could be readily identified based on a small sampling of the library [87]. This approach was subsequently used to generate dozens of highly stable, highly active fungal cellobiohydrolase II enzymes that degrade cellulose into fermentable sugars (for example, for biofuels applications) [79]. Recombination is thus an interesting way to explore new functions, although it might not be the best way to obtain or optimize a specific desired property or set of properties.