|Home | About | Journals | Submit | Contact Us | Français|
First principles approaches to the protein structure prediction problem must search through an enormous conformational space to identify low-energy, near-native structures. In this paper, we describe the formulation of the tertiary structure prediction problem as a nonlinear constrained minimization problem, where the goal is to minimize the energy of a protein conformation subject to constraints on torsion angles and interatomic distances. The core of the proposed algorithm is a hybrid global optimization method that combines the benefits of the αBB deterministic global optimization approach with conformational space annealing. These global optimization techniques employ a local minimization strategy that combines torsion angle dynamics and rotamer optimization to identify and improve the selection of initial conformations and then applies a sequential quadratic programming approach to further minimize the energy of the protein conformations subject to constraints. The proposed algorithm demonstrates the ability to identify both lower energy protein structures, as well as larger ensembles of low-energy conformations.
The protein folding problem can be simply stated: Of all the possible conformations, how does a denatured protein fold into its unique structure? Leventhal’s paradox raised the concern of how a protein could find the native functional state if the time scale for folding is only on the order of milliseconds or even microseconds . The basic premise of protein folding is Anfinsen’s thermodynamic hypothesis, which states that the native structure of a protein can be determined directly from its sequence by identifying the global minimum free energy conformation . However, the major difficulty in this approach is efficiently distinguishing the global minimum energy confirmation from the numerous local minima of the free energy landscape. As Levinthal’s paradox suggests, a random search through all possible protein conformations is unreasonable, so the careful application of global optimization methods has been useful in addressing these problems.
Due to the complexity of simulating the protein folding problem, comparative modeling techniques have been developed to exploit the information contained in proteins with experimentally-determined structures. Comparative modeling techniques require the identification of a suitable template from a homologous protein and the alignment of the target protein to that template. Once an appropriate template and alignment have been selected, a model of the target in the structurally conserved regions is built, the positions of the side chain atoms and loop residues are added, and a refinement and/or evaluation of the target structure. Comparative modeling techniques have traditionally been based on sequence-sequence alignment methods such as BLAST , but can be significantly improved using profile-sequence comparisons , hidden Markov models , or profile-profile alignment methods .
Fold recognition and threading methods have also been proposed to detect remotely homologous proteins for comparative modeling efforts [7–9]. The premise of fold recognition strategies is the idea that protein folds are more conserved than protein sequences and therefore the number of protein folds is orders of magnitudes less than the number of protein sequences [10, 11]. Threading methods, a well-studied class of protein fold recognition algorithms, attempt to fit or “thread” the target protein sequence onto the template protein structure. Several methods have been proposed to address the protein threading problem, including dynamic programming techniques , iterative approaches [13, 14], and optimization-based approaches [15, 16].
The majority of the remaining approaches to the tertiary structure prediction problem are based on the premise that the minimum energy conformation of a protein corresponds to its native state. These approaches can be roughly divided into two categories: (1) methods that depend on homology-based information for their searches (i.e., statistical potentials and/or biased conformational searches) and (2) methods that are independent of homology-based information (i.e., physics-based potentials and conformational searches with optimization algorithms).
One popular homology-based tertiary structure prediction approach, fragment assembly, uses short subsequences of known structures from the Protein Data Bank  to build an overall protein structure. This structure can then be optimized using scoring functions or statistical potentials in combination with optimization algorithms. Baker and co-workers have used simulated annealing approaches to assemble compact structures from these protein fragments [18, 19]. These compact structures can then be optimized through the minimization of a scoring function, based on conformational statistics of known proteins, with Monte Carlo minimization techniques.
Many of the more successful protein structure prediction approaches are hybrid methods that combine comparative modeling techniques with the minimization of a scoring function. One example of this is the work of Skolnick and co-workers, which combines multiple sequence comparisons, threading, a united atom lattice model, the optimization of a scoring function, and clustering for the prediction of protein structures [20–22]. Their methods have been applied iteratively to the ab initio modeling of small proteins with promising results . Other tertiary structure prediction algorithms focus on the minimization of a knowledge-based scoring function, using a lattice-based approach  or Monte Carlo techniques .
Methods that are independent of homology information make direct use of Anfinsen’s thermodynamic hypothesis , exploring the conformational space of a protein in an attempt to identify the minimum free energy structure of the protein in its environment. These physics-based approaches are typically computationally demanding, requiring times on the order of CPU years on a single processor for all but the smallest proteins. Despite this heavy demand, physics-based protein folding studies have several advantages over homology based approaches. These advantages include the ability to (1) predict structures independent of the existence of homologous structures, (2) extend predictions to proteins in different environmental conditions, and (3) provide insight into the mechanism, thermodynamics, and kinetics of protein folding.
Monte Carlo and molecular dynamics techniques have been used for a number of protein folding studies. Dill and co-workers proposed a zipping and assembly model using replica exchange molecular dynamics for the folding of small, single-domain proteins . A hierarchical approach using a Metropolis Monte Carlo algorithm has been applied to protein structure prediction (LINUS) by identifying conformational biases from a discrete set of moves and evaluating a simplified physics-based force field [27, 28]. Pande and co-workers have folded the protein villin to an RMSD of 3 Å using molecular dynamics implemented through their popular distributed grid computing application, Folding@Home .
Scheraga and co-workers introduced a simplified united-residue force field (UN-RES) to the problem of protein structure for their initial calculations and then refine the coarse protein structure models using an all-atom potential [30–32]. In this united-residue force field, the representation of each amino acid residue is reduced to two interaction sites. This reduction allows the stochastic conformational space annealing algorithm to more efficiently identify low energy regions of the conformational space [33–35]. The use of the united-residue force field with replica-exchange Monte Carlo and minimization search techniques was also investigated .
One successful prediction method is the first principles ASTRO-FOLD protein folding approach developed by Floudas and coworkers . The main thrusts of this approach are (1) α-helical prediction through detailed free energy calculations , (2) a mixed-integer linear optimization formulation for the β-sheet prediction , (3) derivation of secondary structure restraints and loop modeling, and (4) the application of hybrid global optimization algorithms to tertiary structure prediction. The goal of the ASTRO-FOLD tertiary structure prediction approach is to minimize the potential energy of a protein subject to constraints imposed on the dihedral angles and distances. The algorithm is a combination of the deterministic αBB global optimization algorithm, a stochastic global optimization algorithm, and a molecular dynamics approach in torsion angle space [37, 40].
The use of the αBB global optimization algorithm guarantees convergence to the global minimum solution by a convergence of upper and lower bounds on the potential energy minimum. Upper bounds to this model can be obtained through local minimizations of the original non-convex problem. The addition of separable quadratic terms to the objective and constraint functions produces a convex lower bounding function. With these bounding functions, the problem can be iteratively branched over the variable space, fathoming portions when a region’s lower bound rises above the best upper bound. The highly nonlinear force field and vast conformational space present an exceptionally difficult problem for deterministic approaches by themselves. The introduction of torsion-angle dynamics methods to quickly identify feasible low energy conformers and a stochastic optimization method, conformational space annealing, for enhanced searching of the upper bounding function can greatly increase the power of this algorithmic approach. These hybrid algorithms, their convergence properties, and their parallel implementations have been discussed in detail [41, 42]. The ASTRO-FOLD framework can be applied to proteins of any size, but the detailed energetics and deterministic guarantees of the approach are best-suited for small to medium-sized proteins (up to approximately 200 amino acids in length). The methodology has been applied to a varied set of proteins throughout this range  and in a recent double blind prediction . The proposed hybrid global optimization algorithm is an extension of the tertiary structure prediction work of Floudas and co-workers [37, 40–42]. The goal of this proposed method is to combine effective global optimization techniques with efficient local minimization strategies.
Due to the broad scope of the protein structure prediction literature, the methods presented here are merely a representative sample of the algorithmic variety that exists. A more thorough description of protein structure prediction methods is available in a number of recent reviews [44–47].
The proposed tertiary structure prediction algorithm will be presented in the following order. In Sect. 2.1, a suitable potential energy function is selected for the tertiary structure prediction problem. In Sect. 2.2, techniques for introducing bounds on the torsion angles and distances within a protein are discussed. Given an appropriate energy function and a set of constraints, Sect. 2.3 describes the mathematical formulation of the protein structure prediction problem as a minimization problem. The use of sequential quadratic programming techniques as a local minimization tool to address the constrained nonlinear programming problem of protein structure prediction is discussed in Sect. 2.4. The application of constrained local minimization techniques is heavily dependent upon the identification of initial feasible points with low energy. Sect. 2.5 describes how a torsion angle dynamics algorithm can be used to satisfy both the dihedral angle and distance constraints and also avoid large numbers of steric clashes between atoms. Further improvement of the initial feasible points can be achieved using the rotamer optimization techniques in Sect. 2.6 that optimize the placement of the side-chain atoms using a discrete library and a fixed protein backbone. Sect. 2.7 describes the application of the αBB deterministic global optimization algorithm to problem of protein structure prediction, incorporating the algorithmic components of Sects. 2.4–2.6. The conformational space annealing algorithm, a combination of simulated annealing and genetic algorithms, is a stochastic optimization technique that can be applied to the protein structure prediction problem as described in Sect. 2.8. The αBB and CSA algorithms can be effectively combined into a hybrid algorithm to retain the advantages of each individual method. This hybrid algorithm and its parallel design are outlined in Sect. 2.9. A formal presentation of the algorithms is available as Supplementary Material and can be accessed at http://titan.princeton.edu/ProtAlg2009/.
The importance of the protein structure prediction problem has led to the development of a wide variety of force fields. Physics-based force fields include energetic contributions of atomic bonds, angles, and torsion angles, as well as non-bonded interactions such as van der Waals interactions and electrostatics. AMBER  and CHARMM  are physics-based atomic potentials of this type. When the atomic bond lengths and bond angles are fixed to constant values, the energetic contributions of these terms can be ignored. The ECEPP , ECEPP/3  and ECEPP-05  force fields fix these values to allow protein conformations to be represented by only its torsion angles.
The ECEPP/3 atomistic level force field was chosen as the energy function. The ability to represent the protein conformation using only the torsion angles is a significant advantage for minimization approaches as it drastically reduces the variable space that must be considered. This force field calculates the potential energy of the protein as the sum of electrostatic, non-bonded, hydrogen bonded and torsional contributions. Equation (1) details the form of each of these terms.
The interatomic distance of the atomic pair (ij), rij, is present in the first three terms. qi and qj are dipole parameters for the respective atoms, where the dielectric constant of 2 has been incorporated. Fij assumes a value of 0.5 for 1–4 interactions and 1.0 for 1–5 and higher interactions. The non-bonded parameters specific to the atomic pair, Aij and Cij are also necessary to calculate the non-bonded contributions. Likewise, and Bij are the hydrogen bonded parameters specific to the atomic pair. The sets ES, NB, HB are the atomic pairs (ij) that have electrostatic, non-bonded and hydrogen bonding interactions, respectively. The set TOR are the torsion angles, k, that contribute to the torsion potential contributions.
The tertiary structure prediction problem for proteins requires the search of a vast conformational space. Various constraints on both torsion angles and interatomic distances can be imposed to narrow the size of this conformational space. One source of constraints is through the identification and use of homologous proteins with structures that have been determined by experimental techniques . An alternative approach to generating constraints is to approach the protein folding problem using a framework-based algorithm. A logical way to generate constraints for such a problem, such as the Astro-Fold framework , is to first predict the protein secondary structure using either homology-based techniques or free energy modeling, then to predict the arrangement/topology of the secondary structure elements [39, 54], and finally to predict the loop regions of the protein [55, 56].
There are a variety of ways to constrain the dihedral angles based on the prediction of secondary structure elements. The ideal, right-handed α-helix adopts a angle of approximately −57° and a ψ angle of approximately −47° . In a folded natural protein, the geometry of an α-helix may deviate from these values depending upon its environment. As shown in Table 1, Klepeis and Floudas have used two similar sets of dihedral angle bounds for residues predicted to be part of an α-helix.
The ideal antiparallel β-strand adopts a angle of approximately −139° and a ψ angle of approximately 135°, whereas the parallel β-strand has a angle of −119° and a ψ angle of 113° . The geometry of the β-strands also deviates from these values in folded proteins depending on the environment and other factors. As shown in Table 2, Klepeis and Floudas have used two similar sets of dihedral angle bounds for residues predicted to be part of a β-strand. These values are used regardless of the existence of a parallel or an antiparallel orientation.
Distance constraints can also be introduced based on predicted secondary structure elements or predicted topology. The formation of an α-helix is dependent upon the formation of a hydrogen bond between the carbonyl oxygen at residue i and the backbone – NH at position i + 4. Klepeis and Floudas  introduced bounds on the distances between the Cα atoms of these helical residues to enforce the formation of the hydrogen bonding network. Knowledge of β-sheet topology can also be used to introduce bounds on the distances before the application of the tertiary structure prediction algorithm. Klepeis and Floudas  used bounds on the distances between the Cα atoms for β-sheets. These bounds are non-local and help to enforce the formation of a β-sheet between two opposing β-strands. The lower and upper distances used for these bounds are presented in Table 3.
Given the definition of an energy function in Sect. 2.1 and a set of constraints in Sect. 2.2, the protein structure prediction problem can be formulated as a minimization problem. This formulation has been described in detail previously , so only the key aspects of formulating the problem will be covered here.
The prediction of protein tertiary structure can be formulated as either a constrained or unconstrained minimization problem. The unconstrained formulation requires the minimization of a hybrid energy objective function that combines both the ECEPP/3 potential energy as well as penalty values for the violation of the enforced dihedral angle and distance bounds as shown in (2). In this equation, θ is the vector of dihedral angles representing the protein conformation, EECEPP/3 (θ) is the ECEPP/3 potential energy contribution of this conformer, Eres (θ) is an energy term quantifying the restraint violations, and Wres is a weight factor. The weight factor should be large enough such that when this objective function is minimized, the sum of the bound violations is driven to zero.
The energy term quantifying the restraint violations includes contributions from both the dihedral angle bounds and the distance bounds. The violations of the distance bounds are quantified in a quadratic form using a simple square well potential. Equations (3)–(4) illustrate the separation of the distance bound violations into lower and upper bounding terms, respectively. The distance violation terms are individually weighted by the values and , which only contribute when the calculated distance dj assumes a value below the lower bounding distance or exceeds the upper bounding distance , respectively. A similar form of a quadratic function for violations of the dihedral angle bounds using a square well potential can be derived for unconstrained minimization, but becomes more complicated due to the periodic nature of dihedral angles .
An alternative to the formulation of (2) is to address the dihedral angle and distance bounds as mathematical constraints and minimize an objective function that contains only the potential energy contribution. Equation (5) describes the formulation of the protein tertiary structure prediction problem as a constrained minimization problem. The distance bound violations may be grouped as a single constraint, or separated into any number of subsets of multiple constraints, l = 1, …, Ncon, as a specific problem may require. The distance bound violation value of a given conformation θ for the set of bounds l is represented by . This violation is required to be less than or equal to a reference parameter , which may be zero or some small value to account for uncertainties in the values of and . The dihedral angle values, θk are also constrained to be within lower ( ) and upper ( ) bounding values. If no information is available to constrain a dihedral angle θk, the default range of [−π, π] is used.
Nonlinear minimization problems with nonlinear constraints are typically represented in the form of (6). The objective function, f(x) is a nonlinear function to be minimized subject to the constraints r(x), where x represents the vector of variables that can be altered to find the minimum value of f(x). It should be clear that the problem presented in Sect. 2.3 can be cast in this fashion.
Constrained nonlinear minimization techniques can be addressed using sequential quadratic programming (SQP) approaches, augmented Lagrangian methods, reduced-gradient methods, and interior-point methods.
The overall structure of sequential quadratic programming approaches solve the nonlinear programming problem of (6) by using both major and minor iterations. The major iterations identify a series of line search steps that converge to a solution, x*, that satisfies first order conditions for optimality. These iterates are defined by a step length, α, and a search direction, p, which are combined to move from the current point to the next point x. The search direction can be identified by solving the quadratic program of (7). In this equation, g(x) is the gradient vector of first order derivatives of f(x), H is a positive definite quasi-Newton approximation to the Hessian of the Lagrangian, and J(x) is the Jacobian matrix of first derivatives of r(x).
The formulation in (7) is solved using a second iterative procedure, which constitute the minor iterations of the overall SQP algorithm. Once the search direction p, has been identified, an appropriate step length is calculated that decreases the value of a merit function. This merit function measures the quality of each iterate by weighing both the current value of the function to be minimized, f(), and the feasibility of the solution with respect to any non-linear constraints. Further details of the SQP method are available elsewhere [58–61].
Although the basics concepts of a constrained nonlinear programming method remain constant across implementations, there may be considerable differences between various solvers in both applicability and performance. The NPSOL package  is especially attractive for the protein structure prediction minimization problem of Sect. 2.3 as it requires relatively few evaluations of the computationally expensive ECEPP/3 potential energy function.
One important parameter that is varied during the minimizations using NPSOL is the line search tolerance. This parameter can assume a value in the range of [0.0, 1.0), with a default value of 0.9. As the minimization takes a step, α, during a given iteration, the line search parameter controls the accuracy of the step when compared to the minimum of the merit function. For applications of the protein structure prediction problem, the use of the NPSOL algorithm with a single line search tolerance converges to an infeasible point that cannot be improved for as many as 50% of the minimizations in complex problems with hard to satisfy distance constraints. By performing several minimizations and altering the value of the line search tolerance, the algorithm can effectively and robustly identify low-energy, feasible protein conformations in the region of the initial conformation.
The overall approach for each constrained nonlinear minimization to be performed as part of the proposed tertiary structure prediction algorithm in this paper is outlined below and presented formally in the Supplementary Material. The main idea of this approach is to perform repeated calls of NPSOL using varying values of the line search tolerance until a termination criteria is met. The minimization of the given conformer terminates if a specified number of initial point improvements have been achieved or a thorough exploration of the range of line search parameters has been explored.
The application of constrained local minimization techniques is heavily dependent upon the identification of initial feasible points with low energy. Methods for identifying feasible initial conformations are presented in Sect. 2.5. The energy values of these initial conformations can be effectively minimized using the rotamer optimization techniques described in Sect. 2.6. By applying the constrained local minimization routines to these improved initial conformations, the time required by the local minimization is significantly reduced and the energetic quality of the resulting conformers is noticeably improved.
The identification of initial feasible points is critical to the success and efficiency of the constrained nonlinear minimization algorithm. The quality of an initial conformation can be measured by how well the conformation satisfies the imposed distance and dihedral angle constraints and how well the conformation avoids steric clashes that result in large energetic penalties.
If the exact distance values are available for all pairs of atoms, the Cartesian coordinates for these atoms can be determined uniquely (with the exception of translational and rotational degrees of freedom and inversion) using metric matrix distance geometry [62–64]. For protein tertiary structure prediction problems, the distance values possess neither the exactness nor the completeness required for the unique embedding of a single set of atomic coordinates. Distance geometry algorithms such as EMBED  and dgsol  can be used to produce molecular conformations that satisfy a set of sparse distance constraints.
Several alternatives to distance geometry algorithms exist, including molecular dynamics in Cartesian space [66, 67], variable target function methods [68, 69], and molecular dynamics in torsion angle space (torsion angle dynamics) [70–72]. The various algorithms for protein structure determination are reviewed in further detail elsewhere .
The basic torsion angle dynamics routine in the initial implementation of the ASTRO-FOLD framework has been replaced by a more detailed torsion angle dynamics annealing procedure by interfacing with the CYANA (Version 2.1) software package . The number of structures that satisfy both the distance and dihedral angle constraints after this routine is significantly increased, especially in the cases of distance constraints with tight bounds or large numbers of distance constraints. The algorithmic steps are outlined below and presented formally in the Supplementary Material.
The interactions of the side chain atoms in a protein are crucial to both the stability and specificity of its native state. The placement of these side chain atoms can be posed as a combinatorial problem by approximating the continuous space of atom placements using a library of rotational isomers. The rotational isomers, commonly referred to as rotamers, are residue-specific and are typically represented as a set of dihedral angles, χ1…N, where N is the number of side chain dihedral angles required to specify the placement of this side chain.
A variety of rotamer libraries have been proposed to represent the likely conformations of side-chain atoms in native protein structures. One of the earliest rotamer libraries was built from merely 19 well-refined proteins and contained only 67 rotamers . Dunbrack and Karplus included the dependence of a rotamer conformation on the local backbone conformation of a protein in their library, producing the first backbone-dependent rotamer library . Lovell et al. created a “Penultimate” rotamer library by applying stringent criteria to select only highly resolved and refined protein structures and to eliminate specific side-chains with high B-factors, atomic clashes, or uncertain ring orientations . A thorough review of the history, limitations and role of rotamer libraries can be found in .
Many algorithms have been introduced to address the problem of protein side-chain prediction using rotamer optimization techniques and these approaches can be divided into two categories. The first class of algorithms are able to guarantee convergence to the placement of rotamers that yields the minimum energy. The Dead-End-Elimination (DEE) approach was one of the earliest optimization approaches that has been applied to the side-chain placement problem . By proving that a better alternative rotamer positioning exists, DEE eliminates single rotamers or rotamer combinations that cannot be part of the minimum energy solution. These DEE methods have improved to be used alone for rotamer optimization , or in concert with other search algorithms, such as an A* search  or a residue-rotamer-reduction approach . Alternative methods with optimality guarantees for the side-chain placement problem include mixed-integer linear programming [82, 83] and graph theory approaches .
As an alternative to algorithms that provide guaranteed convergence to the global minimum conformation, a number of heuristic algorithms have been proposed that are faster than many of the previously described algorithms. These approaches include the use of Monte Carlo searches [85, 86], cyclical search methods [75, 87], genetic algorithms , and mean field optimization . Desmet et al. proposed the Fast and Accurate Side-Chain Topology and Energy Refinement (FASTER) approach that combines the DEE techniques with a multi-pass algorithm that systematically overcomes local minima of increasing order .
Rotamer optimization is an integral part of many protein structure prediction approaches. Many of the earliest ab initio protein structure prediction methods separate the problem into two distinct subproblems: (a) the generation of a protein backbone structure and (b) the assignment of side-chain atom positions . Rotamer optimization techniques continue to be important in homology modeling methods, fold recognition and threading approaches, fragment assembly algorithms, and ab initio protein structure prediction [77, 92].
A rotamer optimization stage has been introduced in the proposed tertiary structure prediction algorithmic framework prior to the constrained nonlinear minimization of a protein conformation. Given an initial protein backbone, either from torsion angle dynamics runs or through conformational space annealing, the goal of the rotamer optimization stage is to remove any steric clashes that may exist between protein side chains and provide a better starting point for local minimization. In this respect, rotamer optimization acts as an efficient local minimizer. Therefore, the selected approaches have the following goals (a) they are fast, heuristic-based approaches and (b) they search a large rotamer library.
The efficiency of the rotamer optimization algorithms outlined here can be significantly enhanced by the use of an approximate energy function. This approximate energy function will be used to closely represent the repulsive energetic penalty that results from steric clashes between the atoms in the Lennard-Jones and hydrogen bonding potential energy contributions in the ECEPP/3 force field. Both the torsion and electrostatic potential energy terms are ignored in this approximation because they are heavily outweighed by the energy of repulsion. Any interaction energy between two atoms that is less than 2 kcal/mol is ignored. The contribution of repulsion is then approximated by a piece-wise linear function that intersects the interaction energy at values of 2, 5, 10, 20, 50, 100 kcal/mol. The linear approximation between 50 and 100 kcal/mol is extended for distances that lead to interaction energy values beyond 100 kcal/mol.
A number of rotamer optimization algorithms, especially those that are combinatorial in nature, function by dividing the energy contribution into two parts, the self energy and the pair energy. The self energy of rotamer k at residue position i, , represents the energetic interaction of this rotamer with all atoms that remain fixed during the rotamer optimization. Thus, the moveable side chain atoms of residue i are evaluated against all the backbone atoms, the Cβ side chain atoms, and any other immovable side chain atoms, such as Hα or Hβ, to obtain the rotamer self energy. The pair energy, , is evaluated for the placement of rotamer k at residue position i and rotamer l at residue position j. This energy contribution evaluates the interactions of the moveable side-chain atoms at both residue positions concurrently. Equations (8)–(9) define the self and pair energies, where Ri is defined as the set of rotamers that are valid for the amino acid at position i, the indices m, n represent atoms on the protein, Mi,k is the set of movable atoms for residue i (whose positions are defined by rotamer k), and Fi is the set of fixed atoms for residue i.
Given this separation of the energy terms, the rotamer optimization problem is concisely stated as (10).
The rotamer optimization strategy for the proposed tertiary structure prediction approach is divided into three stages. The first rotamer optimization algorithm is an application of the FASTER algorithm. The original FASTER algorithm has been shown to produce nearly identical results to DEE-based rotamer optimization methods but is nearly 100–1000 times faster . In their computational studies, the initialization stage (specifically the calculation of the self and pair energy contributions) actually becomes the dominant and rate-limiting computational phase. In this application, the FASTER algorithm is applied to the current protein structure using the reasonably-sized Penultimate rotamer library. The algorithmic details can be found in . For tertiary structure prediction problems with distance constraints for the moveable side chain atoms (such as those encountered with NMR structure prediction and refinement problems), the self and pair energies evaluated for the FASTER approach are penalized (with a constant penalty term) if the rotamer assignment would result in a violation of these distance constraints.
The second rotamer optimization algorithm described here is a simple cyclical search algorithm. The algorithm steps through the entire Xiang and Honig rotamer library at each residue position, saving changes that lead to an improvement in the ECEPP/3 energy function. The approximate energy function is used to accelerate the search, by only visiting the detailed energy function if the approximation suggests the rotamer positioning in question may be in the same energetic range. The algorithmic steps are outlined below and presented formally in the Supplementary Material.
The final rotamer optimization algorithm presented here is a random local search algorithm that proceeds in a similar fashion to the cyclical search algorithm. Instead of using the Xiang and Honig rotamer library, random rotamers are generated from a narrow Gaussian distribution in the region of the current rotamer. This algorithm is intended to provide a degree of refinement after the application of either or both of the first two algorithms, helping to bridge the gap to the continuous χ angle space while retaining some of the favorable properties of discrete rotamer searches. The algorithmic steps are outlined below and presented formally in the Supplementary Material.
The use of the rotamer optimization as a local minimizer provides a better initial protein conformation for subsequent constrained local minimizations with solvers such as NPSOL. Any reasonable rotamer optimization protocol for this framework should require significantly less CPU time than the gradient based minimization approach. For rotamer optimizations of the protein PDB:1o2f, the associated CPU times are approximately 1 second for the FASTER approach, 8 seconds per cyclic search iteration, and 1 second per random local search iteration. The total time of approximately 25 seconds for the rotamer optimization protocol compares favorably to constrained local minimization runs that require on the order of 5–15 minutes per protein conformation.
The overall approach to the problem of rotamer optimization for the tertiary structure prediction algorithm can be summarized as (1) a single application of the FASTER algorithm using the Penultimate rotamer library, (2) two iterations of a cyclic search algorithm using a large rotamer library from Xiang and Honig’s research efforts and (3) ten iterations of a random local search algorithm using randomly generated χ angles in the vicinity of the original rotamer position. This approach demonstrates the ability to act as an efficient local minimizer and can provide better initial configurations to the NPSOL constrained local minimization.
The αBB algorithm is a deterministic global optimization approach that can provide theoretical guarantees of convergence to the global optimal solution of a wide-variety of optimization problems with twice-continuously differentiable objective and constraint functions [93–97]. The algorithm achieves convergence by creating a nondecreasing series of lower bounds on the global optimum as well as a non-increasing series of upper bounds on this optimum. These two series of bounds eventually converge to the global optimum value of the optimization problem. The αBB approach has been previously applied to the protein structure prediction problem [37, 40].
The use of the αBB global optimization algorithm guarantees the identification of the global minimum solution by a convergence of upper and lower bounds on the potential energy minimum. The upper bound on the global minimum is obtained by constrained nonlinear local minimization on any protein structure. The lower bound is determined by creating a valid convex underestimating function and identifying its minimum function value. The algorithm converges by successively partitioning regions of conformational space at every level of a branch and bound tree. The lower bounding formulation can be written as shown in (11).
The term LECEPP/3 (θ) is the lower bounding function of the force field on the current region and can be expressed by (12). The αθi terms are nonnegative convexification parameters that can be calculated rigorously through a variety of approaches and must be greater than −1/2 of the minimum eigenvalue of the Hessian of the energy function over the domain in question .
The term is the convex relaxation of the distance bound violation term, , as shown in (13). This relaxation produces a convex overestimation of the feasible region as defined by the constraint on the distance bound violations.
Given the solutions of the lower and upper bounding problems, the algorithm proceeds by branching on a region of dihedral angle space. Bisecting a single dihedral angle across all nodes at a given level of the tree has been suggested as an appropriate branching strategy for protein conformation problems . The subproblem with the infimum of all the minimum values of the lower bounding functions is identified as the next candidate for branching to ensure non-decreasing lower bounds. The upper bounding values of a region can be identified using constrained nonlinear local minimization techniques, such as those found in Sect. 2.4. A non-increasing sequence of these upper bounds can be determined by selecting the current upper bound as the minimum value of all previously determined protein conformations. Any subregion where the lower bounding value exceeds the current upper bound can be fathomed, as it can no longer contain the minimum of the potential energy function.
The implementation of the αBB algorithm will be presented as three phases, (i) initialization, (ii) algorithm control and (iii) single iteration. This separation will be useful in Sect. 2.9, where a parallel implementation of a hybrid global optimization algorithm is proposed. The initialization phase must be executed prior to the other two phases to identify global variables, dihedral angle bounds, distance bounds, α values, and other necessary information.
The αBB algorithm control handles the iterative nature of the method. The lower and upper bounding values of each subregion are stored here and used to identify the non-increasing sequence of upper bounds and non-decreasing sequence of lower bounds. In addition, this information is used to identify branching directions and candidates for fathoming. The control of the algorithm proceeds as shown below and is presented formally in the Supplementary Material.
A single αBB iteration focuses on identifying the lower and upper bounding function values for a given subregion. For the protein structure prediction applications, torsion angle dynamics and rotamer optimization methods are included as part of the algorithm structure. The torsion angle dynamics approach, as described in Sect. 2.5, allows for the rapid identification of protein conformations that are feasible with respect to the distance and angle bound constraints. The rotamer optimization methods, as described in Sect. 2.6, function as quick and effective local minimizers. The integration of these two components into an αBB iteration and the overall layout of this iteration is described below and presented formally in the Supplementary Material.
The conformational space annealing (CSA) algorithm has been proposed by Scheraga and co-workers as a stochastic optimization technique that is well-suited for the problem of protein structure prediction [33–35, 99–101]. This approach can be classified as a member of the simulated annealing class of algorithms , where the conformational search space is initially unrestrained but becomes gradually narrowed as the algorithm progresses. This slow reduction of the achievable conformations ideally restricts the search to the lowest energy regions where the global minimum conformation is likely to occur. The CSA algorithm combines this simulated annealing approach with elements of a genetic algorithm as a hybrid stochastic global optimization approach. Unlike the αBB deterministic global optimization approach presented in Sect. 2.7, this stochastic optimization technique offers no theoretical guarantee of finding the global minimum protein conformation in finite time.
The outline of the basic CSA approach used as part of the proposed protein tertiary structure prediction algorithm is presented below and also presented formally in the Supplementary Material. The source of the protein conformers to create the initial bank and add additional conformers to the bank will be the solutions of the αBB subproblems, which will be described in detail in Sect. 2.9. The three genetic operations of steps (2(a)ii), (2(a)iii), and (2(a)iv) are similar to the genetic operations previously implemented by Scheraga and co-workers [33, 100].
The bank acceptance criteria is dependent upon a measure of distance between two protein conformers. The distance between two protein conformers, i and j, in conformational space is defined as the absolute deviation of the differences of the dihedral angle values as shown in (14). In this equation, and represent the dihedral angle k for the protein conformations i and j, respectively.
Given the definition of a distance metric in (14), it is straightforward to define an average pairwise distance value over all of the protein conformations in the CSA bank. If Nbank is the current size of the CSA bank, the average pairwise separation distance between conformations in the bank, Davg is defined by (15). It is obvious that this equation is only meaningful for Nbank ≥ 2.
A cutoff distance, Dcut, is defined to prohibit the protein bank from becoming heavily biased towards one region of the conformational space during the early stages of the CSA algorithm. This cutoff distance is initialized to Davg/2. Since this algorithm is an annealing method, the value of Dcut should be gradually reduced as the algorithm progresses. Several annealing schedules have been proposed to reduce Dcut exponentially [41, 100], linearly , or as a function of the difference between the lower bounding function value (from the αBB results) and potential energy value of the conformers in the CSA bank . The annealing schedule used for the proposed tertiary structure prediction algorithm relates the value of Dcut to a weighted combination of the Davg value and the improvement/convergence of the αBB upper and lower bounds as shown in (16). In this equation, δαBB is the current difference between the αBB lower and upper bounds, which is compared to the initial difference of these bounds, δαBB,0
Given the definition and annealing schedule of this Dcut value, the following procedure describes how an altered and minimized conformer is evaluated for acceptance into the conformational bank of the CSA algorithm.
The local minimization strategy for the trial conformations of the CSA approach is outlined below and presented formally in the Supplementary Material. The use of rotamer optimization algorithms, in a similar approach to the one described in Sect. 2.7, provides a good starting point for further minimization by eliminating the large energetic penalties that accompany any steric clashes between side chain atoms. The altered conformer with improved side-chain atom positions can then be subject to constrained nonlinear minimization techniques where the positions of all the protein atoms (represented by the dihedral angles) are allowed to vary.
The deterministic nature of the αBB provides a theoretical guarantee of convergence to the global optimal solution. Due to the lower-bounding problems that must be solved over each problem domain, the computational requirements of this algorithm are significant. The stochastic CSA algorithm cannot provide any information regarding the lower bound of the energy function within a given region of conformational space, but it is an efficient method of identifying and improving protein conformations that place an upper bound on the potential energy. The ideal algorithm would be a combination of these two techniques, which would yield an algorithm that can both identify lower bounds on protein conformers and efficiently identify and refine protein conformations with low energies.
Two parallel hybrid algorithms have been previously proposed to combine the benefits of the αBB and CSA algorithms. The first approach, an integrated hybrid method, executes the αBB and CSA algorithms in succession as the overall minimization approach . In this approach, an αBB iteration is performed to identify a low-energy conformation, which is passed to the CSA approach as a trial conformation and subjected to mutation and crossover operations. The second approach, the alternating hybrid algorithm, is based on the separation of the αBB and CSA algorithms . The αBB algorithm is repeatedly performed to build and update the bank of conformers necessary for CSA approach. These conformers are then subject to the standard CSA mutation and crossover operations followed by local minimization. The alternating hybrid approach is used as the basic framework for the algorithm proposed in this article for tertiary structure prediction.
Several modifications of the parallel implementation of the alternating hybrid algorithm have also been made. The duties of the two control processors have been combined and assigned to a single overall control processor. This combination is effective because the work distributed to the work processors requires several orders of magnitude more time for processing than for the communication associated with delivering the work and receiving the results. Very little delay of the work distribution due to communication was observed. The initial implementation of the algorithm assigned a fixed set of processors to specific duties (αBB iterations, CSA iterations). Since no CSA work exists until a bank has been filled by low energy structures from the αBB algorithm, these processors would sit idle during the first rounds of the algorithm. This inefficiency has been removed by initially assigning all processors to perform iterations of the αBB method. Once the initial CSA bank of conformers is filled, the control processor sends a message to a subset of the work processors indicating that their duties have changed. The algorithmic steps of the improved hybrid global optimization approach are shown below and presented formally in the Supplementary Material.
The utility of an algorithm can be gauged by knowing both its strengths and its limitations. Previous algorithms have used global optimization-based approaches to identify low energy protein conformations using the ECEPP/3 force field [37, 40]. The proposed tertiary structure prediction algorithm will be evaluated using four protein test cases that have been studied in detail including published energy values, secondary structure predictions, distance and dihedral angle restraints, and root mean square deviations (RMSDs) of the Cα atoms from the native state. The computational studies described below were executed in parallel using 50 Intel 3.0 GHz Pentium processors for 72 hours on an available Beowulf cluster.
The immunoglobulin binding domain of protein G from the Streptococcus species contains 56 amino acids and has shown unique properties, such as an extreme thermal stability. The structure of this protein has been determined using NMR spectroscopy  and X-ray crystallography . This protein folds into a ββαββ motif, where the four β-strands form a single β-sheet and the single α-helix is located above this plane. The tightly-packed hydrophobic core and the extensive hydrogen bonding network make this protein an interesting and widely-used system for computational and theoretical studies.
The ASTRO-FOLD protein structure prediction approach was applied to protein G to further validate this ab initio approach . The ASTRO-FOLD approach predicts a single α-helix between residues 23–34, so the and ψ dihedral angles of these residues are restricted to the range of [−85, −55] and [−50, −10], respectively. Klepeis and Floudas  correctly predicted the existence of four β-strands, located from residues 1–7, 16–21, 43–45, and 51–55. The and ψ dihedral angles of these β-strands are bounded from [−155, −75] and [110, 180], respectively. The hydrogen bonding network within the helix was enforced with 8 lower and upper distance bound restraints on Cα–Cα (i, i + 4) distances. The prediction of the β-sheet topology identified 12 β-sheet contacts that were enforced with lower and upper distance bounds of 4.5–6.5 Å on the Cα–Cα distances. Klepeis and Floudas  obtained and ψ dihedral angle bounds for the loop regions by additional free energy runs for oligopeptide segments. Their combined global optimization and torsion angle dynamics algorithm with these restraints yielded a protein conformation with an ECEPP/3 potential energy value of −267.0 kcal/mol and a Cα RMSD to the experimentally determined structure of 4.2 Å.
The restraints on the dihedral angles and distances described by Klepeis and Floudas  were approximately recreated for the purposes of comparing tertiary structure prediction algorithms for protein G. The specifics of the free energy runs for the oligopeptides representing the loop residues were not provided, so the comparison run was performed with unrestrained loop residues (i.e., and ψ bounds of [−180, 180]). Figure 1 summarizes the evolution of the results of the proposed hybrid tertiary structure prediction algorithm by plotting the minimum ECEPP/3 energy conformation achieved versus the number of CSA iterations that have been completed. The lowest energy conformation, with an ECEPP/3 potential energy of −418.254 kcal/mol, was identified after 2361 iterations of the CSA portion of the proposed hybrid global optimization algorithm.
The quality of the ensemble of predicted protein conformers can be further evaluated by calculating their RMSDs from the native protein G structure. This evaluation of the ensemble is presented in Fig. 2. The conformer with the lowest energy in the ensemble has an RMSD of 4.97 Å from the native structure. The ensemble does contain several structures with a lower RMSD, including a conformer with an energy of −328.38 kcal/mol and a 2.81 Å RMSD from the native structure. Despite the significant population of low energy structures that can be found using the proposed hybrid tertiary structure prediction algorithm, the tertiary structure prediction method of Klepeis and Floudas  is unable to find structures within 100–200 kcal/mol of the minimum energy.
The lowest energy conformer from the application of the proposed tertiary structure prediction algorithm is aligned to the native protein G structure in Fig. 3. There are differences in the exact placement of the secondary structure elements, especially the orientation of the α-helix with respect to the β-sheet, but the overall native topology is well-maintained in this lowest energy conformation.
The target T59 was one of 43 protein sequences with unknown structure released during the CASP3 experiment. This sequence, representing the human Sm D3 protein, contains 75 amino acids. After the CASP experiment, this protein was later determined with a resolution of 2.0 Å using X-ray crystallography techniques . The topology of this protein is similar to the common SH3 fold, which can be recognized by the antiparallel β-sheets that fold together to produce a barrel-like structure.
Klepeis and Floudas used the ASTRO-FOLD methodology to predict the structure of this protein . The predicted location of the α-helical region for this protein is residues 6–11, which favorably overlaps the assignment of residues 6–13 as α-helical when using the experimental coordinates. The and ψ dihedral angles for these residues are restricted to the range of [−90, −40] and [−60, −10], respectively. This protein is predicted to have eight β-strands, spanning the residues 16–21, 26–29, 31–34, 39–43, 46–51, 53–57, 61–64, and 68–73. The and ψ dihedral angles of these residues with extended conformations are bounded by the ranges of [−180, −80] and [80, 180], respectively. The remaining residues, which define the loop regions between these secondary structure elements, are allowed to assume the entire range of possible and ψ angles, [−180, 180]. Klepeis and Floudas identified 30 lower and upper distance bounds to introduce on Cα–Cα distances representing the predicted β-sheet contacts. The combined global optimization and torsion angle dynamics algorithm of Klepeis and Floudas with these restraints yielded a protein conformation with an ECEPP/3 potential energy value of −395 kcal/mol and a Cα RMSD to the experimentally determined structure of 5.4 Å.
T59 represents yet another protein system to validate the application of the proposed tertiary structure prediction algorithm. Figure 4 describes the progress of this proposed hybrid algorithm by plotting the minimum ECEPP/3 energy conformation that has been identified versus the number of CSA iterations. The conformer with lowest ECEPP/3 energy, at a value of −582.858 kcal/mol, was identified by the algorithm after 480 iterations. The algorithm shows consistent progress for the first 483 iterations, indicating that further exploration of the conformational space may lead to structures with lower potential energies than the minimum value reported here. This observation was indeed validated by the identification of near-native protein structures with energies as low as −630 kcal/mol.
The RMSD values of the predicted T59 conformers, identified by the proposed tertiary structure prediction algorithm, versus the T59 native structure are plotted as a function of their ECEPP/3 potential energy in Fig. 5. The conformer with the lowest energy in the ensemble has an ECEPP/3 potential energy of −582.858 kcal/mol and an RMSD to the native structure of 4.73 Å. The ensemble does contain several structures with a lower RMSD, including the conformer with the lowest RMSD of 3.43 Å and an energy of −545.47 kcal/mol. More than 3600 conformers are identified with an energy below −400 kcal/mol.
Figure 6 illustrates the alignment of the lowest energy conformer from the application of the proposed tertiary structure algorithm versus the native T59 structure. There are differences in the exact placement of the secondary structure elements, but the overall native topology of this lowest energy conformation is well-maintained. The reasonable RMSD of 4.5 Å indicates that the differences easily identified by visual inspection do not contribute heavily to the structural deviations. Additional dihedral angle restraints on the loops or more extensive conformational sampling may lead to structures with lower energies and lower RMSDs from the native T59 structure.
Serine protease inhibitors are responsible for regulating serine proteases like trypsin and chymotrypsin. The serine proteases are necessary for hydrolyzing peptides, which is an integral role for cell function. The bovine pancreatic trypsin inhibitor (BPTI) is a well-studied protein within the class of serine protease inhibitors. Its three-dimensional structure is characterized by 3 disulfide bonds (between Cysteine residues at positions 5–55, 14–38, and 30–51) that stabilize the protein in its native state. These disulfide bonds are conserved across the class of serine protease inhibitors. The tertiary structure of BPTI has been experimentally determined using standard X-ray crystallography techniques (PDB: 4PTI)  and by methods that combine X-ray and neutron diffraction experiments (PDB: 5PTI) .
BPTI was studied in detail by Klepeis and Floudas using the ASTRO-FOLD methodology . The α-helical regions of the protein were determined to be between residues 2–5 and 47–54, whereas the residues 17–23, 29–35, and 44–46 were identified as β-strands. Klepeis and Floudas introduced and ψ dihedral angle bounds of [−85, −55] and [−50, −10] for the α-helical regions, bounds of [−155, −75] and [110, 180] for the β-strand regions, and unrestricted bounds ([−180, 180] and [−180, 180]) for the loop regions. For α-helical residues separated by four positions (i.e. positions i, i + 4) in the primary amino acid sequence, Cα–Cα distance bounds of 5.5–6.5 Å were introduced to enforce the formation of the intrahelical hydrogen bonding network. The topology of the β-sheets and disulfide bridges within BPTI were predicted using integer linear programming techniques  and enforced through additional distance restraints. For residues predicted to be contacts between two β-strands, Cα–Cα distance bounds of 4.5–6.5 Å were introduced. Finally, sulfur atoms of Cystine residues predicted to be in a disulfide bridge network are bounded by distances of 2.01–2.03 Å. The combined global optimization and torsion angle dynamics algorithm of Klepeis and Floudas with these restraints yielded a protein conformation with an ECEPP/3 potential energy value of −428.0 kcal/mol and a Cα RMSD to the experimentally determined structure of 4.1 Å.
The restraints described by  were recreated for the purposes of comparing tertiary structure prediction algorithms with a consistent basis. Figure 7 shows the energy values of the predicted protein conformers plotted against the RMSD values of these structure from the native structure of BPTI. Although the conformer with the lowest energy in the ensemble has an RMSD of 8.96 Å from the native structure, the ensemble contains several structures with a lower RMSD, including a conformer with an energy of −417.852 kcal/mol and a 3.41 Å RMSD from the native structure.
Figure 8 illustrates the alignment of the lowest energy conformer from the application of the proposed tertiary structure algorithm versus the native BPTI structure. The lowest energy conformation has the correct β-sheet topology, the correct disulfide bridge contacts, and the correct α-helical topology. However, there are significant differences in the placement of the α-helices relative to the β-strands.
The target T114 was one of 43 protein sequences with unknown structure released during the CASP4 experiment. This 87 residue protein is an anti-fungal, chitin-binding protein from the organism Streptomyces tendae TÜ901. The structure of this protein was determined using multidimensional NMR spectroscopy techniques and released after the CASP4 experiment . The structure of this protein contains 9 β-strands that pack together into 2 antiparallel β-sheets (one with 4 strands, one with 5 strands) that form a parallel β-sandwich motif.
Klepeis and Floudas closely examined the structure prediction of T114 as an evaluation of the ASTRO-FOLD protocol . They predicted 7 β-strands, defined by the residues 12–16, 22–26, 31–37, 39–42, 48–54, 61–70, and 77–86, that were enforced by introducing and ψ dihedral angle bounds of [−180, −80] and [80, 180]. In addition, 34 β-sheet contacts were predicted and imposed as Cα–Cα distance bounds of 4.5–6.5 Å. A disulfide bridge was also predicted between residues 7 and 25, which bounded the distance between the sulfur atoms of these residues to a range of 2.01–2.03 Å. After applying their tertiary structure prediction algorithm, Klepeis and Floudas presented a structure with an energy of −530 kcal/mol and an RMSD to the native structure of 4.5 Å.
Figure 9 evaluates the RMSD to the native T114 structure for each of the conformers identified by the proposed tertiary structure prediction algorithm as a function of their ECEPP/3 potential energy. The lowest energy structure, with an RMSD to the native structure of 6.24 Å, has an energy value of −744.31 kcal/mol. A number of conformers with a better RMSD value exist in the ensemble, including a conformer with an energy value of −736.43 and an RMSD value of 4.18 Å and also the conformer with the lowest RMSD value of 4.00 Å at an energy value of −534.66 kcal/mol. This figure also illustrates the location of the original prediction of Klepeis and Floudas. In total, the proposed tertiary structure prediction algorithm identifies 1750 conformers with an energy value better than −530 kcal/mol.
The lowest energy conformer is aligned to the native T114 structure in Fig. 10. Although the prediction of the overall topology, guided by the predicted β-sheet contacts, is correct, there are noticeable differences between the native structure and the lowest energy conformer in the predicted ensemble in the relative locations of several of the β-strand elements.
A new hybrid global optimization algorithm for protein tertiary structure prediction and its parallel implementation were presented. This hybrid global optimization algorithm successfully combines the αBB deterministic optimization algorithm and the conformational space annealing approach with an effective local minimization strategy. The local minimization requires torsion angle dynamics to identify initial conformations, rotamer optimization to achieve rough but efficient energy gains, and sequential quadratic programming to further minimize a protein conformation. The proposed method was able to outperform a related algorithmic approach in the literature for a computational study of four available test proteins, leading to lower energy protein structures and larger conformations of low-energy structures.
CAF gratefully acknowledges financial support from the National Science Foundation (R01 GM52032), the National Institutes of Health (R24 GM069736), and the US EPA (GAD R 832721-010). This work has not been reviewed by and does not represent the opinions of the funding agencies.