|Home | About | Journals | Submit | Contact Us | Français|
Computational structure-based protein design programs are becoming an increasingly important tool in molecular biology. These programs compute protein sequences that are predicted to fold to a target structure and perform a desired function. The success of a program’s predictions largely relies on two components: (i) the input biophysical model, and (ii) the algorithm that computes the best sequence(s) and structure(s) according to the biophysical model. Improving both the model and the algorithm in tandem is essential to improving the success rate of current programs, and here we review recent developments in algorithms for protein design, emphasizing how novel algorithms enable the use of more accurate biophysical models. We conclude with a list of algorithmic challenges in computational protein design that we believe will be especially important for the design of therapeutic proteins and protein assemblies.
Computational structure-based protein design is one of the most promising tools for engineering proteins with new functions, including the development of therapeutic proteins and protein assemblies [1–4]. Despite important successes, however, many of the current computational protein design tools often have low success rates, and designed proteins sometimes fail to achieve the functional properties of native proteins. New advances in protein design methodologies are required to improve the functional properties and success rate of computationally-designed proteins.
The problem of engineering a new functional protein using computational methods is typically divided in two challenging stages. The first stage is selecting a target tertiary/quaternary protein fold that will be designed for a specific function. Often the selected fold is one that performs a similar function and can later be redesigned to a new one [5–9]. In other cases, a protein that has a completely different function is used as a scaffold and repurposed for a new one [10–12]. And, increasingly, protein engineers are incorporating empirical folding and structural principles [13–18] to design proteins from scratch (de novo design) [12–19]. The second stage is to design a protein sequence, together with side chain rotamers and residue conformations , that will adopt the overall target fold (often allowing some backbone flexiblity [13–18, 20–27]) and perform a desired function (e.g., binding with specificity to a target molecule). This latter stage has been historically referred to as protein design . Many computational protein design protocols implement different variations of these two stages, and these have resulted in many successfully-engineered new proteins [5–19, 29–35]. Here we focus on protein design.
Protein design can be formulated as a well-defined computational problem by reducing it to an optimization over a family of parameterized structure-based protein redesign problems. In this well-posed version, an optimization algorithm (also known as a search algorithm) computes and outputs the best protein amino acid sequence(s) and structure(s) in a space defined by a biophysical input model. This biophysical model defines the sequence and structural search space (e.g., template input structure, the allowed flexibility, the amino acid sequences allowed, etc), the optimization objective (e.g., single state, multi-state, ensemble-based, etc), and the scoring potential for protein energetics (i.e., the energy function [36, 37]). To our knowledge, all structure-based protein design programs conform to this formulation [13, 38–42]. For example, one of the most frequently used biophysical models for backbone flexibility in Rosetta  consists of a target structure, an ensemble of allowed backbone moves (e.g., backbone dihedral changes), a rotamer library, the energy function, and a predefined sequence space [13, 43]. This biophysical model describes a space which is then searched using Rosetta’s iterative relaxation/design algorithm . Iterative relaxation/design iteratively intercalates two steps: (i) a design step, where the backbone is held constant while the conformations and amino acid identities of the side chains are optimized; and (ii) a relaxation step, where the sequence is held constant, while the backbone and side chains are optimized using a hybrid stochastic/gradient descent optimization [13, 44, 45].
The accuracy of a computational protein design relies largely on the biophysical model and it is thus highly desirable to improve this model. Biophysical model improvements, however, often come at the cost of exponentially increasing the computational complexity of the search problem. Since computational hardware cannot grow at the same rate, the only practical solution to search more complex biophysical models is through novel algorithms. Therefore, substantial improvement in computational protein design necessitates the development of novel algorithms (see Figure 1). For this reason, we focus on algorithms for protein design, and review those that we believe represent new algorithmic breakthroughs and that have potential for the design of therapeutic proteins and protein assemblies. We focus on developments since 2010 (foundational and earlier algorithms are discussed in [37, 46, 47]) in four areas: optimization algorithms for protein design, algorithms to search improved flexibility models, multi-state design, and ensemble-based design. Due to constraints on the length of this survey, we exclude related algorithms that are important for therapeutic and assembly protein design that have also been highly productive recently, such as docking algorithms (for a review see ), scaffold search algorithms (e.g., [49, 50]), and algorithms to optimize libraries for in vitro evolution of designed proteins (e.g., [51, 52]).
Protein design, like many other problems in the field of computational structural biology, belongs to a hard class of computational problems [47•]. Consider, for example, a simple yet common biophysical model for the protein design problem, which we call the global minimum energy conformation model (GMEC-model). In this model the backbone of the protein is not allowed to move, the interaction energies between atoms are pairwise additive, the amino acid side chains are allowed to change only between discrete states (known as rotational isomers, or rotamers) [37, 46, 47], and the goal is to compute GMEC and its corresponding amino acid sequence. It has been shown that, even under this simple GMEC-model, the protein design problem belongs to a hard class of computational problems known as NP-hard (reviewed in [47•]), which implies that an efficient algorithm for all instances of the problem is unlikely. Moreover, other complex models (discussed in Sections 6 and 7) can belong to even more challenging classes of computational problems.
Because of the complexity inherent to these hard classes of problems, many protein design programs use heuristic optimization algorithms to compute low-energy sequences in the search space described by the biophysical model . Heuristic algorithms use stochastic processes to search the space described by the model and are usually fast. Their speed makes them attractive because protein designers can expand the biophysical model (i.e., expand backbone flexibility, rotamer libraries, the number of protein sequences considered, or the number of states in a multistate design), without incurring a significant performance penalty. For these reasons, the field of heuristic algorithms has been highly prolific both before 2010  and since 2010 (reviewed here).
However, heuristic algorithms cannot guarantee the optimality of the solution relative to the input biophysical model. This can be a particular disadvantage when models are expanded: a heuristic algorithm may not be able to accurately search the expanded model [53••]. Provable protein design algorithms, in contrast, guarantee that if the algorithm runs to completion, the computed sequences are the best ones, or are provably close to the best one, as defined by the model. Although provable algorithms can be empirically slower than heuristic algorithms, their guarantees present several advantages.
First, provable algorithms can (and often do) compute lower energy sequences than heuristic algorithms. For example, in a recent study [53••], Simoncini et al. compared their provable optimization algorithm implemented in the Toulbar2 program [54, 55] vs. the heuristic simulated annealing (SA) algorithm implemented in the Rosetta program using a GMEC-model. They found that, in a set of 100 test protein designs, SA often fails to compute the optimal answer even after running hundreds of times. In another study, Donald and co-workers compared the performance of a commonly used heuristic to model the continuous low-energy regions around rotamers, against provably searching the space [56•]. The heuristic method consisted of an initial discrete search followed by post hoc minimization. They found that the provable answer was far lower in energy and more similar to native proteins than the heuristically-computed sequences.
When a protein designer defines a biophysical model for a specific, empirical protein design problem she usually cannot predict the computational complexity of the model (although algorithms such as  can upper bound the complexity, but these are exceptions to the rule). Since protein design search spaces quickly grow exponentially along multiple dimensions (i.e., sequence space, side chain conformation space, backbone conformation space), the input biophysical model might be too complex for any method to optimize accurately, heuristic or provable. Therefore, a second, and very important advantage of provable algorithms, is that they provide a signalling mechanism on the complexity of the input biophysical model. The importance of this signalling mechanism was evinced in the work by Simoncini et al. [53••], where they found a clear tendency for the difference between the SA solution and the GMEC computed by the provable algorithm to increase as the size of the problem grows. If a heuristic algorithm, such as SA, is used to optimize a biophysical model that exceeds the capabilities of provable algorithms, then it is impossible to know the size of the difference, and the designer has no information on the quality of the heuristic search. In contrast, provable algorithms in these cases warn the designer of the need to either (i) restrict the biophysical model or (ii) improve the algorithms.
Finally, when the output of provable algorithms fails to perform experimentally as predicted, the error can be isolated and attributed solely to the biophysical model, and the biophysical model can then be improved. With heuristic algorithms, it is difficult to know whether the design process failed because of the algorithm or the biophysical model. An interesting example of the ambiguity introduced by heuristic algorithms was recently reported in a closely related field within computational structural biology, in Nuclear Magnetic Resonance (NMR) structure determination [58••]. With sparse data (e.g., for membrane proteins) this problem is a form of protein structure prediction, and has similarities to backbone conformation search and scoring in protein design. Martin et al. showed that a recently published structure of Diacylglycerol Kinase, solved by NMR was vastly different from the crystal structure likely because the SA optimization protocol used to interpret the NMR data failed to find up to 18 other low-energy structures with completely different topologies that also fit the data. Although this discrepancy occurred in the field of NMR structure determination, it is a warning about the ability of SA to explore backbone conformational space, and we suspect that similar dangers can occur when using SA or other heuristic algorithms in computational protein design.
Thus, there are trade-offs to using provable vs. heuristic algorithms: heuristic algorithms can more quickly search complex biophysical models yet cannot guarantee accuracy, while provable algorithms guarantee accuracy but are typically slower. Since 2010 there have been considerable developments in algorithms for both categories and we review these in the following sections.
The GMEC-model is one of the most used protein design models and it is often employed as a benchmark to test the performance of new protein design optimization algorithms, provable or heuristic. Most of the heuristic algorithms that are currently used for the protein design GMEC-model, however, were developed prior to 2010 (and are surveyed in [37, 46]). One recent, promising approach, however, is the CLEVER algorithm for protein design, which builds on previous cluster expansion algorithms developed in Amy Keating’s laboratory [59, 60]. Cluster expansion for protein design is a technique that maps the complex three-dimensional atomic-level energy function, which is a function of atomic coordinates, to a simpler linear function dependent only on the sequence . Thus, cluster expansion maps the biophysical input model to a much simpler one. Then, integer linear programming solvers can be used to efficiently find the optimal sequence in the new model. Since this mapping is only an approximation, the error resulting from a cluster expansion can potentially be large. In more recent work, Hahn et al. analyzed the sources of error in cluster expansion and developed a series of techniques to reduce them . Although it is likely very difficult to eliminate all approximation errors in cluster expansion, its speed makes it an attractive approach for protein design.
In contrast to progress in heuristic algorithms, several notable advances in provable protein design optimization have been reported over the past few years. Tommi Jaakkola and co-workers developed the max-product linear programming algorithm (MPLP) for computer vision and protein design applications . MPLP uses a message passing-based [47•], block-coordinate descent algorithm to approximate tight lower bounds on the energy of partial protein design solutions [61, 62]. Then, using information from the bounding process, the algorithm groups residues into clusters, which iteratively results in tighter MPLP bounds until the exact, best solution is computed . Other algorithms have adopted bounding techniques, similar to MPLP, and used them in different branch-and-bound frameworks. The BroMAP algorithm  also uses a message passing algorithm as the bounding technique, followed by a branch-and-bound algorithm. At each step in the branch-and-bound process, BroMAP prunes solutions that it can prove will not lead to the optimal solution. Donald and co-workers developed the Dynamic A* algorithm [64•], which is based on the traditional A* algorithm for protein design (reviewed in [37, 46]). Dynamic A* radically improves the performance of A* through both better bounding and by introducing dynamic residue ordering to the design process. Zeng and co-workers  developed an AND/OR branch-and-bound method that exploits the sparse nature of protein design biophysical models. In an AND/OR tree the protein design optimization problem can be split into different components on-the-fly, and each subtree can be solved independently, which can result in a significant performance improvement. Other recent developments use advanced computer science tools, such as dynamic programming [57, 66•] or solvers for the boolean satisfiability problem .
Promising improvements in the performance of protein design optimization algorithms for the GMEC-model were recently reported by Thomas Schiex and co-workers [53••–55, 68]. Their protein design optimization algorithm exploits weighted constraint satisfaction (WCSP) techniques, including fast soft local consistencies for bounding, an advanced branch and bound implementation, and sophisticated ordering techniques, to compute the GMEC significantly faster than competing approaches . These algorithms are implemented in the Toulbar2 program [53••–55], with support only for discrete flexibility models, and in the Osprey protein design program [39, 68] (with full support for continuous flexibility models, described in the next section [64•, 69••]).
Although the algorithms described in this section were tested on the discrete rotamer GMEC-model they often are integral components of other algorithms that are used to search more complex biophysical models [56•, 70]. Often, the improvements in optimization power described in this Section are necessary to allow for search over these more complex models in a reasonable amount of time.
Protein design energy functions can be sensitive to small changes in atom coordinates. A protein sequence that is predicted to be energetically disfavorable under a rigid backbone biophysical model, for example, could be very favorable with some slight adjustments to the backbone or side chain angles. Moreover, certain protein secondary structure elements, such as backbone loops, can change their conformations after introducing mutations, and protein design algorithms must model these new conformations to find the lowest energy conformation of a new sequence. Thus, protein design algorithms can improve the quality of their design predictions by expanding/improving their flexibility models.
A widely used method for improved flexibility is the iterative relaxation/design method in Rosetta . This method intercalates sequence design steps (which hold constant the backbone coordinates) with backbone relaxation steps (which hold constant the amino acid sequence ). The relaxation step was recently improved by Tyka et al. in the FastRelax  and BatchRelax  algorithms, which improve the speed of these methods by annealing the weight of the van der Waals term during the minimization process.
Separating backbone flexibility and sequence design into different steps may not always find the best sequences under the flexibility model because the best sequence might only be reached after a concerted backbone movement and residue mutation. Thus, designers have attempted to search backbone and sequence space simultaneously [20–22].
The importance of searching backbone flexibility and sequence space simultaneously has been validated retrospectively in biologically-relevant computational experiments. For example, Kortemme and co-workers  showed that incorporating backbone flexibility greatly improved the ability to predict mutational tolerance in HIV-1 protease and reverse transcriptase. In addition, in  they showed that coupling backbone movements with side-chain movements improved the ability to predict mutations in enzymes and to predict sequences for a protein to bind a particular ligand.
Searching backbone flexibility and sequence space simultaneously represents an important algorithmic challenge for protein design algorithms along two areas: (i) algorithms are needed to improve the flexibility model by defining a restricted, yet comprehensive, conformational search space, and (ii) algorithms are needed to search the improved, yet often more complex, flexibility models. In the first area the main challenge lies in improved models of backbone flexibility. The size of the conformation space with backbone flexibility can be huge and it would be useful to have methods to restrict the space to subsets of good backbone conformations. This problem is particularly hard for protein loops because the space of possible loop conformations is usually large. To address this problem, Kortemme and co-workers developed the KIC algorithm [23•, 24], which generates a library of backbone loops for both protein design and protein modelling. KIC uses a kinematic closure algorithm to analytically solve all possible solutions to a set of three “pivot” residues and uses a Ramachandran plot to compute a library of additional torsional values for the remaining torsional angles in a loop of arbitrary size. Notably, KIC has been compared against loop backbones generated using molecular dynamics and shown to be more accurate . The Baker laboratory incorporated KIC and several other backbone flexibility techniques into RosettaRemodel, a general framework for backbone flexibility in Rosetta . Tripathy et al.  developed a method that uses NMR orientational restraints to generate ensembles of loop backbones. Their method, POOL, exploits sparse NMR orientational restraints and kinematic restrictions on the backbone to systematically search all loop conformations that satisfy the data. In cases where some NMR data on a loop of the wildtype structure is available, POOL can be used to accurately generate candidate loop conformations, and these can be used as input for the design. Floudas and co-workers [27••] provide an algorithm to predict loop structures by iteratively refining the bounds on the dihedral angles of the loop residues. Importantly, the algorithm is able to handle cases in which the coordinates of the flanking secondary structure elements are not known.
Once these discrete backbones are computed, then they can be incorporated into the biophysical model, and any algorithm to compute the GMEC can be used to compute the optimal conformation for the improved model. In the case of cluster expansion approximation where the three dimensional coordinates of atoms are mapped to a simple linear function dependent only on sequence, however, it is not straightforward to see that the approximation will be robust enough for backbone flexibility. To investigate this, Agpar et al.  recently evaluated the performance of cluster expansion with a backbone flexibility model and demonstrated that the approximation is robust enough for the improved model.
Discrete models of flexibility, however, are limited because protein energetics can be very sensitive to small changes. With discrete models it becomes hard to know beforehand how much discrete sampling is necessary for a specific design. Rather than model conformations as discrete conformations, Donald and co-workers have developed a suite of provable algorithms [20, 56•, 69••, 74••] that incorporate continuous flexibility to model the low-energy torsional regions around discrete conformations. They have found that modeling the continuous flexibility around discrete, low-energy conformations results in designed protein sequences that are much more similar to native protein sequences [56•]. These algorithms to model continuous flexibility [20, 56•, 69••, 74••] build on the A* protein design algorithm [64•] to compute bounds on the energies of the continuous regions around discrete conformations, and guarantee to find the lowest energy conformations and sequences under the continuous flexibility model. The iMinDEE algorithm [56•] introduces a new, powerful provable technique to remove continuous rotamers in a protein design search that can be shown to not be part of the optimal solution. The DEEPer algorithm  models simultaneous continuous flexibility in both backbone and side-chain movements. The EPIC algorithm [74••] reduces the burden of using continuous flexibility in protein design by precomputing energy functions between pairs of residues as compact, quick-to-evaluate polynomials. The HOT and PartCR algorithms further accelerate the speed of continuous flexibility algorithms by merging/or partitioning discrete conformations, improving the speed of continuous flexibility algorithms 100-fold [69••]. The LUTE algorithm , conceptually similar to cluster expansion, performs a black-box reduction of protein design with continuous flexibility, to (two calls to) a discrete optimization algorithm for protein design (such as those described in Section 4). LUTE also provides a similar reduction for design with non-pairwise energy functions, such as Poisson-Boltzmann solvation. By reducing these more complex optimizations to pairwise discrete optimization, LUTE can exploit many of their advantages while obtaining the speed of a discrete algorithm.
The design of protein therapeutics and protein assemblies often requires optimizing over multiple states, which is known as multistate design (MSD). For example, the MSD goal might be to optimize binding affinity (which can be modeled by optimizing the difference in energy between the bound state of a complex and the unbound states of the monomers), to optimize for affinity to multiple targets, or to optimize for binding to one target while preventing binding to other targets. MSD has many practical applications. For example, Meiler and co-workers  recently showed that MSD resulted in better agreement with observed evolutionary sequence profiles of antibodies when compared with single state design. In another example [5, 6], Anderson, Donald and co-workers used MSD to predict active site mutations in a drug target that confer resistance to a drug while maintaining the natural function of the enzyme.
In practice, MSD can be harder than traditional protein design optimization problems because multiple states must be simultaneously optimized, instead of just one, but since 2010 there have been several developments of new heuristic algorithms for MSD (earlier MSD algorithms are discussed in ). Building on cluster expansion, the Keating laboratory developed the CLASSY algorithm for MSD [29, 78]. CLASSY uses a protocol called a “CLASSY specificity sweep” which consists of multiple steps. In the first step, CLASSY designs a sequence for affinity to a target. On the second and subsequent rounds, new constraints are added to increase the difference in binding energy of the sequence to the target vs. the off-targets. An optimization process is run anew to optimize for binding to the target. The procedure continues until a highly specific sequence is found. Fromer et al. [79, 80] developed an algorithm for multispecificity using probabilistic graphical models [47•]. Their method combines the energy function of the multiple targets to a single energy function and models the graphical model for each target structure as a separate graphical model. Constraints between the different graphical models ensure that the sequences of all target structures are the same, and the loopy belief propagation optimization algorithm is then used to compute the optimal sequence. Allen and Mayo  adapted the FASTER algorithm, a heuristic optimization algorithm developed for protein design, to the multistate case. Their method, MSD-FASTER, iteratively computes sequences for each of the target states and then applies them together, until convergence. Roberto A. Chica and co-workers  developed an approach to separate backbone ensembles generated computationally into native vs. non-native clusters using multiple linear regression against a training set with known experimental data. They then apply the FASTER algorithm to optimize the sequence with the largest Boltzmann-weighted average energy difference between native and non-native ensembles. Leaver-Fay  et al. developed a genetic algorithm for multistate design within a multistate program that allows the user to define custom multistate optimization objectives. Finally, Meiler and co-workers developed the RECON algorithm , which allows the multiple states to separately explore the sequence space, while slowly converging, under the assumption that this should simplify the energy landscape of the sequence space.
The problem of computing the MSD belongs to a hard class of computational problems (we conjecture that its class is harder than computing the GMEC). In practice MSD is computationally challenging because the number of sequences is exponential in the number of mutable residue positions, and the optimization objective considers multiple states for each sequence. To our knowledge, the Comets algorithm developed in the Donald laboratory [83••] is the only provable algorithm for multistate design with an all-atom energy functions that does not exhaustively enumerate the sequence space. Comets uses a search tree based on the A* algorithm over sequences, where each internal node encodes a partially defined sequence, and each leaf encodes a fully defined sequence. Comets computes a multistate lower bound on the best energy of every internal node, and at each step the internal node with the best bound is expanded. This process continues until the optimal MSD sequences are computed.
Proteins exist in solution as thermodynamic ensembles, and the binding affinity of two proteins depends on the free energy of these ensembles [9••, 47•, 84]. In particular, when two proteins bind each other, such as when a therapeutic protein binds its target, the binding affinity is often affected by an associated loss in conformational entropy. Thus, protein design protocols must account for the entropy change [9••], and failure to do so can result in weakly-binding proteins. This was validated recently in a computational study by Fleishman et al. , which compared the conformational flexibility of the side chains in the binding interface of computationally designed proteins, compared with that of native proteins. Their computational study showed that native binding proteins had restrictions on the side-chain plasticity of proteins, ensuring they undergo a small entropic loss upon binding. Roberts et al. [9••] also performed a comparison between the computational predictions of a GMEC-model vs. the predictions of an ensemble-based model, in an experimentally-validated design of peptide inhibitors for cystic fibrosis. They found that of the top 30 sequences designed by a single-structure (GMEC) model vs. an ensemble model, the two models only shared 4 sequences. The top experimentally-validated sequences were not top-ranked sequences in the single-structure model [9••].
Computing protein ensembles, and the associated partition function for a protein, belongs to a difficult class of computer science problems, #P-hard. However, several approximation algorithms have been developed to compute rotamer-based partition functions that can be used to estimate conformational entropy upon binding. Fleishman et al.  developed a method similar to self consistent mean field to compute a penalty for rotamers that change conformation upon binding. Kamisetty et al. [86•] used probabilistic graphical models to formulate the partition function computation as an inference problem. Then, they used the GOBLIN algorithm, which is based on loopy belief propagation, to compute the free energy of binding in protein-protein interactions [86•]. Their method was also used to show that explicitly accounting for conformational entropy can significantly improve predictions of binding affinities. Gevorg Grigoryan developed the molecular-dynamics based Valocidy algorithm [87••], which uses ensembles to compute free energy. Grigoryan showed how to use Valocidy to calculate free-energy estimates in protein design by incorporating it into cluster expansion.
Despite the hardness of computing partition functions, some algorithms can approximate the free energy of binding with guarantees on the input for continuous rotameric models. The K* algorithm [9••] computes approximations to the binding constant of two proteins by computing the ratio of the partition functions between bound and unbound states [47•]. To approximate the partition function of each state, K* enumerates conformations in order of energy, using the A* algorithm and stops once a provable, ε-approximation to each partition function has been computed. Jou et al. developed the BWM* algorithm, based on dynamic programming techniques, to rapidly enumerate conformational ensembles for use in partition function computation . Viricel et al. adapted weighted constraint satisfaction techniques to compute ε-approximations to the partition function and K* score more quickly than previous methods . Silver et al. [88••] developed a method to compute configurational entropy using A*-enumerated ensembles, similar to the K* algorithm , followed by a novel entropy expansion. The method represents the entropy as a series of mutually exclusive terms, each corresponding to the marginal entropy of a particular degree of freedom or to coupling between degrees of freedom. This expansion offers a clear interpretation of the contribution of particular degrees of freedom to the total entropy of the molecule [88••].
The accuracy of structure-based computational protein design depends on the quality of the input biophysical model. Improving this model in turn requires new algorithms capable of searching the more complex search spaces created by the improved model. The perspicacious reader will note, however, that an arbitrary improvement to the biophysical model, no matter how beneficial in principle, may not admit a corresponding algorithmic improvement to re-expand the equicomplexity curve (Figure 1). The most significant improvements to algorithms for protein design are likely to come from algorithms that exploit the structure of new features in the biophysical model. Yet not all features will be equally exploitable by the algorithms. For example, modeling the solvent molecules in a protein’s environment explicitly would most likely improve the accuracy of protein design, yet it is challenging to develop efficient design algorithms that can search these solvent models. Thus, there is an impedance matching between algorithms and model, and therefore both algorithms and model must be improved in tandem.
Since 2010 there have been important improvements to many important algorithmic challenges in protein design. There are many areas where protein design algorithms can be further improved, both within and outside those listed here. First, as shown by [53••] and argued in this review, significant progress in the speed of provable optimization algorithms is not only possible, but it is essential to ensure the quality of the designed sequences with respect to the input biophysical model. Developing algorithms that improve over the best state-of-the-art algorithms will enable enhanced modeling and, in consequence, better designs. In addition, as shown in multiple studies [9••, 20, 21, 25, 47•, 56•] improving flexibility models in protein design can result in sequences that are lower energy, are more similar to those of native proteins, or perform better experimentally [9••]. Developing new algorithms that expand the equicomplexity curve for expanded flexibility models, especially continuous flexibility, is likely to result in better protein designs.
The successful engineering of proteins for new function depends on a two stage strategy: selecting a tertiary or quaternary protein fold that can be designed for a specific function, followed by protein design of the fold. Recently, as the functional goals of protein design have become increasingly ambitious, the first stage (selecting the fold) has become more relevant because most protein folds are likely not designable for a specific function. Designers often utilize scaffold search algorithms or increasingly more commonly, de novo strategies to find folds, followed sequentially by protein design to compute a sequence for these folds. This sequential strategy, known as a greedy strategy in computer science, suffers from an important limitation from decoupling the two stages: because the stages are sequential, the fold selection process may not be able to foresee whether a protein sequence will be able to adopt the fold and perform a function (e.g., binding). Thus, filtering criteria at the fold selection stage will remove solutions that could be successfully designed to a functioning protein. The engineering of proteins would greatly benefit from a tighter coupling between fold search and protein design. Several recent strategies address in part this limitation [15, 89] from a structural perspective: they predict whether the fold is designable based on observed protein structures. Others, such as inverse rotamers , address this from a functional (i.e., binding or catalysis) perspective. However, a tighter integration between fold selection and protein design is needed to improve the functional capabilities of computational protein engineering, and such integration will likely require new algorithms.
We sincerely apologize to the many authors of outstanding work that was overlooked or not included in this review because of space limitations. We thank Marcel Frankel, Anna Lowegard, and Jonathan Jou for comments on the manuscript, as well as Mark A. Hallen for helpful discussions. This work is supported by the following grant from the National Institutes of Health: R01 GM-78031 to B.R.D.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.