The side-chain conformation prediction problem is an integral component of protein structure determination, protein structure prediction, and protein design. In single-site mutants and in closely related proteins, the backbone often changes little and structure prediction can be accomplished by accurate side-chain prediction [1]. In docking of ligands and other proteins, taking into account changes in side-chain conformation is often critical to accurate structure predictions of complexes [2–4]. Even in methods that take account of changes in backbone conformation, one step in the process is recalculation of side-chain conformation or “repacking” [5]. Because many backbone conformations may be sampled in model refinements, side-chain prediction must also be very fast. In protein design, as changes in the sequence are proposed by Monte Carlo steps or other algorithms, conformations of side chains need to be predicted accurately in order to determine whether the change is favorable or not [6–8].
Most side-chain prediction methods are based on a sample space that depends on a rotamer library, which is a statistical clustering of observed side-chain conformations in known structures [9]. Such rotamer libraries can be backbone-independent, lumping all side chains together regardless of the local backbone conformation [10], or backbone-dependent, such that frequencies and dihedral angles vary with the backbone dihedral angles. An alternative to using statistical rotamer libraries is to use conformer libraries, which are samples of side chains from known structures, usually in the form of Cartesian coordinates, thus accounting for bond length, bond angle, and dihedral angle variability [13–16]. Once a search space in the form of rotamers (and samples around rotamers in some cases) or conformers is defined, a scoring function is required to evaluate the suitability of the sampled conformations. Its terms may include the negative logarithm of the observed rotamer library frequencies [17–20], van der Waals or hard sphere steric interactions of side chains with other side chains or the backbone, and sometimes electrostatic, hydrogen bonding, and solvation terms [20–24]. Many search algorithms have been applied, including cyclic optimization of single residues or pairs of residues [11,16], Monte Carlo [5,18,25], dead-end elimination [26,27], self-consistent mean field optimization [28], integer programming [29], and graph decomposition [17,30,31]. These methods vary in how fast they can solve the combinatorial problem, and in whether they guarantee a global minimum of the given energy function or instead search for a low energy without such a guarantee. In general, such a guarantee is not necessary, given the approximate nature of the energy functions, and it is the overall prediction accuracy and speed that are more important features of a prediction method. In recent years, it has become clear that some flexibility around rotameric positions [15,16,32] and more sophisticated energy functions [20,33] are needed for improved side-chain packing and prediction.
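As a minimal sketch of the general setup described above (not the scoring function of any particular program), the following Python example assigns each rotamer a self-energy equal to the negative logarithm of its library frequency, adds a caller-supplied pairwise term, and finds the lowest-energy assignment by exhaustive enumeration. The names `rotamer_self_energy`, `best_assignment`, and `pair_energy` are hypothetical, and brute-force enumeration is used only because the toy system is tiny; practical methods rely on the search algorithms listed above.

```python
import math
from itertools import product


def rotamer_self_energy(frequency):
    """Self-energy of a rotamer as the negative log of its library frequency."""
    return -math.log(frequency)


def best_assignment(frequencies, pair_energy):
    """Exhaustively search all rotamer assignments of a very small system.

    frequencies: one list of rotamer frequencies per residue.
    pair_energy: function (i, r, j, s) -> interaction energy between
                 rotamer r of residue i and rotamer s of residue j
                 (hypothetical signature chosen for this sketch).
    Returns (total_energy, assignment) for the lowest-energy assignment.
    """
    n = len(frequencies)
    self_e = [[rotamer_self_energy(p) for p in freqs] for freqs in frequencies]
    best = None
    for assignment in product(*(range(len(f)) for f in frequencies)):
        # Sum of single-residue terms.
        total = sum(self_e[i][assignment[i]] for i in range(n))
        # Sum of pairwise terms over all residue pairs.
        total += sum(
            pair_energy(i, assignment[i], j, assignment[j])
            for i in range(n)
            for j in range(i + 1, n)
        )
        if best is None or total < best[0]:
            best = (total, assignment)
    return best


if __name__ == "__main__":
    # Two residues, two rotamers each; the two most frequent rotamers clash.
    def clash(i, r, j, s):
        return 10.0 if (r, s) == (0, 0) else 0.0

    print(best_assignment([[0.7, 0.3], [0.6, 0.4]], clash))
```

In this toy case the steric penalty outweighs the frequency term, so the second residue is moved to its less frequent rotamer. The number of assignments grows exponentially with the number of residues, which is why the choice of search algorithm matters.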
SCWRL3 is one of the most widely used programs of its type, with 2986 licenses in 72 countries as of April 30, 2009. It uses a backbone-dependent rotamer library [12], a simple energy function based on the library rotamer frequencies and a purely repulsive steric energy term, and a graph decomposition to solve the combinatorial packing problem [30]. It has a number of features that have made it widely used. The first of these is speed, which has enabled the program to be used on a number of web servers that predict protein structure from sequence-structure alignments [34] and may perform many hundreds of predictions per day. The second is accuracy. At the time of its publication, it was one of the most accurate side-chain prediction methods. However, a number of other methods have since appeared claiming higher accuracy [15,18,20,35], although often at much longer CPU times. The third feature of SCWRL3 is usability. The program takes input PDB coordinates for the backbone and, optionally, a new sequence, and outputs coordinates for the structure with predicted side chains using the same residue numbering and chain identifiers as the input structure. This feature is simple, but in fact many if not most side-chain prediction programs renumber the residues of the input structure and strip the chain identifiers, making them difficult to use in homology modeling. One unfortunate feature of SCWRL3 is that the graph decomposition method used may not always result in a combinatorial optimization that can be solved quickly. In such cases, the program may run for many hours instead of finishing in a few seconds, since it lacks any heuristic method for simplifying the problem and finishing quickly.
In developing a new generation of SCWRL, called SCWRL4, we had several goals. First, we wanted to increase the accuracy over SCWRL3 such that SCWRL4’s accuracy would be comparable to or better than that of programs developed in the last several years. Second, we wanted to maintain the speed advantage that SCWRL has over most similar programs. Third, we wanted to maintain the usability of the program for homology modeling and other purposes. As part of this, we wanted to make sure that the program always solves the structure prediction problem in a reasonable time, even if the graph is not sufficiently decomposable. This is accomplished with an approximation that, while not guaranteeing a global minimum of the energy function given the rotamer search space, does complete the calculation quickly in all cases tested.
In this paper, we describe the development of the SCWRL4 program for prediction of protein side-chain conformations. We used a number of different approaches to accomplish the goals described above. We have improved the SCWRL energy function using a new backbone-dependent rotamer library (Shapovalov and Dunbrack, in preparation), which uses kernel density estimates and kernel regressions to provide rotamer frequencies, dihedral angles, and variances that vary smoothly as a function of the backbone dihedral angles φ and ψ. SCWRL4 also uses a short-range, soft van der Waals interaction potential between atoms rather than the linear repulsive-only function used in SCWRL3, as well as an anisotropic hydrogen bond function similar to that used in Rosetta [36] (but using a different functional form that is faster to evaluate). To account for variation of dihedral angles around the mean values given in the rotamer library, we used the approach of Mendes et al. [32], which samples χ angles around the library values and averages the energy of interaction between rotamers of different side chains over these samples, resulting in a free-energy-like scoring function. In order to determine the interaction graph, as used in SCWRL3, we implemented a fast method for detecting collisions (i.e., atom-atom interactions at less than some distance) using k-discrete oriented polytopes (“kDOPs”). kDOPs are three-dimensional shapes with faces perpendicular to common fixed axes, such that kDOPs around two groups of atoms can be rapidly tested for overlap [37].
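The overlap test for kDOPs can be illustrated with a short Python sketch. It is a generic illustration of the bounding-volume idea, not the SCWRL4 implementation: the seven-direction set below (three coordinate axes plus four cube diagonals, giving a 14-DOP), the single shared `radius`, and the function names `build_kdop` and `kdops_overlap` are all assumptions made for the example. Each kDOP is stored as one interval per fixed direction, and two kDOPs can be declared disjoint as soon as their intervals fail to overlap along any one direction.

```python
import math

# Assumed direction set for this sketch: 3 coordinate axes + 4 cube diagonals.
DIRECTIONS = [
    (1, 0, 0), (0, 1, 0), (0, 0, 1),
    (1, 1, 1), (1, -1, 1), (1, 1, -1), (1, -1, -1),
]


def build_kdop(atoms, radius):
    """Return one (low, high) interval per direction enclosing all atoms.

    atoms:  list of (x, y, z) atom centers.
    radius: padding added on both sides, so that spheres of this radius
            around every atom lie inside the polytope.
    """
    slabs = []
    for d in DIRECTIONS:
        norm = math.sqrt(sum(c * c for c in d))
        # Project each atom center onto the unit-length direction.
        proj = [sum(a * c for a, c in zip(atom, d)) / norm for atom in atoms]
        slabs.append((min(proj) - radius, max(proj) + radius))
    return slabs


def kdops_overlap(a, b):
    """Two kDOPs overlap only if their intervals overlap along every direction."""
    return all(
        lo_a <= hi_b and lo_b <= hi_a
        for (lo_a, hi_a), (lo_b, hi_b) in zip(a, b)
    )


if __name__ == "__main__":
    group_a = build_kdop([(0.0, 0.0, 0.0)], radius=1.0)
    group_b = build_kdop([(1.5, 1.5, 1.5)], radius=1.0)
    # Axis-aligned intervals overlap, but the (1, 1, 1) diagonal separates them.
    print(kdops_overlap(group_a, group_b))
```

The test is conservative: a negative result proves that no atom pair from the two groups is closer than twice the padding radius, while a positive result only means that a finer check is still required. Using more directions than the three coordinate axes tightens the bound, as the diagonal case in the usage example shows.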
In SCWRL3, we used a graph decomposition method that broke down the interaction graph of residues into biconnected components, which overlap by single residues called articulation points. In most cases, this solves the graph quickly. However, with a longer-range energy function and sampling about the rotameric dihedral angles, this is no longer true. We therefore implemented our own version of a tree decomposition of the graphs, as suggested by Jinbo Xu for the side-chain prediction problem [31]. This is almost always successful, but in a small number of cases it may still not result in an easily solvable combinatorial problem. We therefore added a heuristic projection of the pairwise energies onto self-energies within some threshold. This approximation of the full prediction problem always results in a solution, even if it is not guaranteed to find the global minimum. Finally, the new program has been developed as a library, so that its functions can be called easily by other programs, such as those for loop modeling and protein design.
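To make the notion of articulation points concrete, the following Python sketch finds them in a small residue interaction graph using the standard depth-first search (Tarjan-style) algorithm. It is a textbook illustration under assumed conventions, not code from SCWRL3 or SCWRL4: the function name `articulation_points` and the adjacency-dictionary representation are hypothetical, and the recursive form is suitable only for small graphs.

```python
def articulation_points(graph):
    """Return the set of articulation points of an undirected simple graph.

    graph: dict mapping each residue to the set of residues it interacts with.
    """
    disc = {}      # discovery order of each node in the depth-first search
    low = {}       # lowest discovery index reachable from the node's subtree
    points = set()
    counter = [0]

    def dfs(node, parent):
        disc[node] = low[node] = counter[0]
        counter[0] += 1
        children = 0
        for nbr in graph[node]:
            if nbr == parent:
                continue
            if nbr in disc:
                # Back edge to an already visited node.
                low[node] = min(low[node], disc[nbr])
            else:
                children += 1
                dfs(nbr, node)
                low[node] = min(low[node], low[nbr])
                # A non-root node separates the graph if some child subtree
                # cannot reach any node discovered before it.
                if parent is not None and low[nbr] >= disc[node]:
                    points.add(node)
        # The root separates the graph if it has more than one DFS child.
        if parent is None and children > 1:
            points.add(node)

    for node in graph:
        if node not in disc:
            dfs(node, None)
    return points


if __name__ == "__main__":
    # Two triangles of interacting residues that share residue 2.
    graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3, 4}, 3: {2, 4}, 4: {2, 3}}
    print(articulation_points(graph))
```

In the usage example, residue 2 is the only articulation point, so the graph splits into two biconnected components that can be optimized separately for each rotamer of residue 2. When the interaction graph is dense, few such points exist and the components stay large, which is the situation that motivates the tree decomposition and the heuristic projection described above.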