Clash-score is dependent on the resolution of the protein structure
We explore the prevalence of clashes in protein structures and ask if the extent of clashes in a protein structure depends on the quality of the protein structure. Using our energetic definition of clashes (Methods), we first study the distribution of the energy of different clashes, setting the minimum clash-energy as half kBT (0.3 kcal/mol). We find an exponential distribution of the energy of clashes in crystal structures of proteins: most of the clashes in a structure correspond to very low repulsion energy (0.3-0.6 kcal, ), and the probability of finding clashes of higher energies decreases exponentially (, inset). The clash energy distribution confirms that severe clashes are very rare in protein structures. However, a lot of low-energy clashes are observed, which may be required for close packing of the protein core and also for the formation of hydrogen bonds. Even though we exclude hydrogen-bonded atoms in enumerating clashes, during the formation of a hydrogen bond, adjacent atoms are also brought very close to each other. We also observe that the probability of finding high-energy clashes is higher in low-resolution crystal structures compared to high-resolution crystal structures (, inset)
In order to obtain an aggregate clash-score for a structure, we sum up the energies of all clashes and divide it by total number of ‘contacts’. Contacts include all possible local overlaps in the protein (Methods). The clash-score is thus a single number that describes the maximum permissible extent of clashes in a protein structure. In order to determine whether the clash-score of a given protein is dependent on the resolution of its crystal structure, we compute the clash-score for all the protein structures in our datasets. Since the clash-score itself is a sum of independent variables (the energy of different clashes in a protein), the clash-scores of proteins in a dataset are expected to form a normal distribution. As expected, the clash-scores of the three datasets (high-resolution, low-resolution and homology models) fit well to normal distributions (). We find that the clash score of a given protein structure is strongly dependent on its resolution (). High-resolution crystal structures feature low clash scores, while the mean clash score of lower resolution structures is significantly higher than that of high-resolution crystal structures with the homology models having the highest clash-scores in our three datasets (). These results indicate that the increased clash-scores in lower resolution structures is more the consequence of model building than being an inherent property of the protein structure. Furthermore, distribution of clash-energies in a protein show that low-resolution structures on an average feature more number of high-energy clashes compared to high-resolution structures (, inset).
Given the differences in distribution of clash-scores of different datasets, we arrived at an acceptable clash-score as the mean plus one standard deviation of the distribution of clash-scores of high-resolution structures. From the distribution of clash-scores for high and low resolution crystal structures and homology models (), it is clear that we require a tool that reduces the clash-score to be within the permissible value. To check if side-chain repacking alone can reduce abnormal clash-scores, we performed side-chain optimization using SCWRL
27 on 20 structures from our low-resolution PDB dataset with the worst clash-scores (
Supplementary Table I). We find that just side-chain repacking is not enough to improve the clash-score in these structures, and in many cases, optimizing side-chain rotamers results in higher clash-scores compared to input structures. Thus, small perturbations in the protein-backbone may be required for resolving clashes in low-resolution structures.
Automated protocol for minimization of steric clashes
To address the need for removing severe clashes in protein structures with minimal perturbations in the protein backbone, we develop a protocol using discrete molecular dynamics (DMD) simulations
12 (). Severe clashes have high values of Lennard-Jones potential, the release of which results in high velocities of clashing atoms. The high velocities in the simulation result in a rapidly expanding simulation system with bonds between atoms being broken. Thus, we need to quench the high velocities arising due to clashes, which we achieve using a high heat exchange rate of the solute (protein) with the bath. In our simulations, we maintain the temperature by periodically scaling the velocities of particles according to Maxwell’s distribution. By using a high heat exchange rate (or the frequency of rescaling the velocities) we ensure that the high velocities of clashing atoms are quenched. In this protocol, we rescale the velocities of all particles every 5 fs instead of ~1-5 ps commonly implemented in equilibrium MD simulations. To minimize deviation from the initial structure in these simulations, we constrain the backbone and Cβatoms to their initial positions using a harmonic potential. In some cases (like enzyme active sites), the rotameric state of certain residues is important for protein function, and those may not be involved in any clashes. In such cases, we could also constrain the side-chain atoms to their initial positions in order to maintain the initial rotameric orientations of side-chain atoms. In our protocol (), we first perform a short DMD simulation (10 ps) at a higher temperature (0.7 ε/k
B, roughly corresponds to 350 K; ε/k
B is the reduced unit of temperature
11), and then compute clash-score of every snapshot in the simulation trajectory to see if the system has converged to feature a clash-score less than or equal to the acceptable clash-score (). We report as the minimized structure, the first snapshot in the trajectory that has an acceptable clash-score. If the clash-score is not low enough during the first simulation cycle, we perform another cycle of DMD simulation at 0.5 ε/k
B, to quench the system. The “quench” approach aids in the formation of contacts and hydrogen bonds in the protein core that might remain unsatisfied at higher temperatures. We repeat the alternating cycles of simulations at 0.7 ε/k
B and 0.5 ε/k
B until a structure with acceptable clash-score is obtained.
We benchmark our protocol by attempting to resolve clashes in protein structures with high clash-scores. We select 20 structures from our low-resolution PDB dataset with the worst clash-scores. We reduce the clash-score of all these structures within the range of high-resolution structures (less than 0.02 kcal.mol
−1.contact
−1), with Cα RMSD of utmost 1 Å with respect to the corresponding initial structures (,
Supplementary Table II; shows one of the structures). The backbone and all-atom RMSD follow similar trends, but as expected, they are slightly higher than Cα RMSD (
Supplementary Table III). Some of these structures initially have severe clashes that cannot be acceptable for MD simulations (see below), but we can robustly minimize those clashes using DMD and make the structures acceptable for MD simulations. In order to ensure that stabilizing interactions in the core are not sacrificed for the sake of lowered clash energy, we examine the hydrogen bonds in the core of the protein. We find that in most of the cases, the number of unsatisfied hydrogen bond donors/acceptors decrease, and the core seldom has perturbation that causes significant loss of contacts (,
Supplementary Table IV). When we compare the initial and DMD-minimized structures, we notice that slight perturbations in the backbone ensure that the clashes are resolved while contacts in the core are maintained intact (). In order to ensure that the clash minimization does not result in distortion of side-chain geometry, we calculate the Z-score of χ angles of each side-chain (
Supplementary Table V). On average, we find the Z-scores of DMD minimized structures to be close to or less than the Z-scores of the input structures. We would expect the Z-scores of DMD-minimized structures to be close to the input structures since we constrain the Cβ atoms in our simulations, which maintains the side-chains close to the initial rotameric states.
| Table IPerformance of Chiron compared to Rosetta and CHARMM |
| Table IINumber of unsatisfied hydrogen bonds before and after minimization |
From our benchmarking simulations, we conclude that our protocol based on DMD can be used as a tool for rapid and efficient minimization of steric clashes in protein structures. Our protocol () is available as a web-based application – Chiron (
http://chiron.dokhlab.org). Chiron accepts protein-structures and attempts to resolve clashes with minimum backbone perturbation relative to the input structure. In the web server, any given input structure is first examined for missing atoms – missing side-chain atoms are reconstructed using Medusa
17,18, which uses Dunbrack rotamer libraries
27. Chiron then computes the clash-score to determine if the structure requires minimization (if the clash-score is greater than 0.02). Chiron then iteratively runs constrained DMD simulations till the clash-score is acceptable. Chiron also accepts protein-structures with ligands (designated as HETATM in the input PDB files). The user can choose the ligands to be included in determination of clash-score and for further minimization. When ligands are included in an input structure, we reconstruct side-chains in context of the ligands in order to ensure that the reconstruction does not introduce clashes with the ligands.
We now compare our approach in minimizing clashes in protein structures to other widely used simulation programs using the publicly available versions of the respective programs.
Comparison with Rosetta protein design suite
We use Rosetta, one of the widely used tools for protein design, to minimize clashes in protein structures to benchmark DMD. Rosetta uses a knowledge-based energy function for treating inter-atomic interactions in proteins.
1 We choose Rosetta for comparison with DMD since both the tools use the same bonded and non-bonded parameters derived from the CHARMM force field.
13 We use two different protocols for clash minimization using Rosetta (see Methods) and report that we are able to resolve clashes in 8 out of 10 test cases with Rosetta, but with a higher Cα RMSD relative to the starting structure when compared to DMD (, gray cells). We observe that for the structures for which Rosetta is able to successfully minimize clashes, the clash-scores are much lesser than those obtained by DMD or all-atom conjugate gradient (CG) minimization (see below). The lower clash-score of the final structure obtained using Rosetta is only a consequence of the simulation protocol: while DMD and CG/MD simulations are terminated as soon as the clash-score is less than the acceptable clash-score, the Rosetta protocol has a fixed routine which provides a single structure as an output for each iteration.
Rosetta also takes much longer CPU time to minimize clashes in the given protein structure () compared to DMD. Since Rosetta performs protein design by selectively replacing fragments of the given structure by determining the best fragment from a knowledge-driven fragment database, it is imperative that the protocol must be executed multiple times for statistically significant results. We perform 100 relaxation iterations-for every chosen protein in a parallel computing environment, with a unique random seed for every individual attempt. The runtime reported in is the sum of times taken for all the independent iterations on the cluster. It is logical to compare runtimes in this manner because DMD performs minimization on a single processor in the times reported in .
We also attempt to implement the “classic relax” protocol designed to perform more extensive relaxation employing small and shear moves of the protein backbone along with minimization and repacking of the side chain atoms. Since the classic relax protocol is extensive, we would expect it to take longer than the fast or constrained relax runs. Minimization of a 71 amino acid protein (1CTX) takes almost one day of CPU time, with no significant improvement in the final output compared to fast and constrained relaxation routines (data not shown). Hence, we do not perform classic relaxation of the remaining test cases. Overall, although we can use Rosetta to minimize the number of clashes in a given protein in 8 out of 10 cases tested, the downside of this approach is higher RMSD relative to the initial structure and longer computation time required for minimization compared to the other protocols tested. From the results obtained for the first ten PDB structures () and also from literature
8, we observe that Rosetta is not ideal for minimization of proteins longer than 250 residues. Hence, we do not use Rosetta to minimize the ten additional structures we consider for benchmarking DMD and CG/MD simulations (
Supplementary Table II, IV).
Comparison with CG minimization
We compare our results to those obtained using conjugate gradient (CG) minimization using all-atom forcefield (CHARMM). We perform CG minimization of the test set of proteins (,
Supplementary Table II; see methods for the protocol). We perform subsequent molecular dynamics (MD) simulations for those proteins whose clash scores are not reduced below the acceptable clash score using CG minimization alone. Most of the protein structures attain acceptable clash-scores with CG alone, which leads to much shorter time required for minimization compared to DMD or Rosetta. The quality of the structure is also maintained in these cases. Thus, CG is an efficient way to remove clashes in most protein structures. If the structures require subsequent MD simulations for complete minimization, the time taken is comparable to DMD.
However, in four out of twenty structures considered, CG does not converge because of unreasonable Van der Waals repulsion energy leading to high velocity, which in turn causes some of the bonds to break. We show these clashes before and after DMD minimization for one such structure (PDB ID 1GFF, ) that does not converge due to severe clashes leading to infinite forces in the simulation system. We are able to perform DMD simulations with these structures because we employ soft potentials for non-bonded interactions coupled with frequent rescaling of velocities. Soft-core potentials are an available option in MD forcefields too, but have been implemented for free energy perturbation simulations, but not for use in CG or MD to resolve severe steric clashes. Thus, existing minimization protocol using CG/MD method is not robust when applied to structures with severe clashes, which can be resolved using DMD. Furthermore, in all the four cases in which CG/MD does not converge, the DMD minimized structures are acceptable for subsequent MD simulations.
Robustness of DMD in resolving clashes
Minimization of steric clashes with minimal backbone perturbation need not imply a final structure that has formed proper contacts in its buried core. In order to estimate the quality of the structures generated by the different relaxation techniques we use, we enumerate the number of unsatisfied hydrogen bond donors/acceptors in the buried core before and after minimization, which informs us if all polar contacts in the core are well formed. Since a very small number of unsatisfied hydrogen bond donors/acceptors are seen in the core of high-resolution crystal structures, lesser the number of unsatisfied hydrogen bond donors/acceptors in the core, better is the quality of the structure, provided the steric overlaps are absent. In our benchmark set, we find that the numbers of buried unsatisfied hydrogen bond donors/acceptors of the test set of proteins are either maintained or decreased in 19/20, 6/10 and 15/16 cases for DMD, Rosetta and CG/MD respectively (,
Supplementary Table IV; CG/MD does not converge in 4/20 cases). If we were to categorize these unsatisfied partners as backbone or sidechain atoms, we find that all the methods are much better at forming new backbone hydrogen bonds, but lose many hydrogen bonds involving side chain atoms. This effect could be due to removal of clashes involving side chain, while at the same time stabilizing secondary structural elements in the protein. We conclude from the analysis of unsatisfied hydrogen bond donors/acceptors that all protocols we test (DMD, Rosetta and CG/MD) maintain the hydrogen bonds in the core of the protein, and thus provide final structures that can be compared on the basis of backbone RMSD to initial structures.
In order to test whether severe steric clashes are the cause for failure of CG during minimization of four of the twenty structures we use for benchmarking, we use the structure generated by DMD upon minimization to run a 2 ps MD simulation using CHARMM in GROMACS. These simulations proceed normally, further establishing the robustness of DMD in minimizing steric clashes. The reason DMD succeeds in minimizing steric clashes in cases where CG fails is that DMD automatically defines soft potentials between clashing atoms, thus accommodating even structures with severe clashes for minimization.
To further test the different protocols, we perform minimization simulations on 25 structures of proteins smaller than 250 residues with the worst clash scores from our swiss-model dataset (see methods). We observe a trend (
Supplementary Table VI) similar to that of the low resolution crystal structures: minimization using Rosetta takes longer with the minimized structure having higher RMSD to the starting structure, while DMD and CG/MD are comparable in terms of RMSD, while CG minimization alone is faster. However, there are three structures (Q03EE3, A6L8G0, P49048) where CG/MD does not converge due to severe clashes in the starting structures. Thus, simulations on homology models reiterate that DMD protocol is robust in accepting structures with severe clashes. To evaluate the robustness of these protocols in minimizing larger structures, we perform DMD and CG/MD simulations on 25 structures that are more than 250 residues long with worst clash-scores (
Supplementary Table VII). We notice that CG/MD does not converge on a majority of these larger structures, mainly because of the poor quality of these structures. Most of these structures feature bond-lengths that are more than ten standard deviations away from mean, causing CG to fail, and hence we do not use these simulations for arriving at any conclusions.
Our results clearly indicate that Chiron is able to resolve unnatural clashes in low-resolution crystal structures and homology models efficiently with minimum deviation of the protein backbone from the initial structure. However, attempting to refine homology models using simulations may drive the models away from the native state while featuring an RMSD of 1 Å with respect to the input structure. Hence it is important to ensure that the minimization protocol we employ does not result in a structure that is farther away from the native state compared to the initial structure. To verify whether Chiron drives a given structure away from its native state upon minimization, we consider five homology models from CASP8 predictions featuring the worst clash-scores, to compare against their native structures that are now available in the PDB. We perform refinement and analyses similar to those described above on these five models and observe that Chiron is able to resolve clashes from the homology models within 1 Å of the initial structure and yet not drift away from the native structure (
Supplementary Tables VIII and IX). We also observe that the other methods we tested present similar trend with respect to the Cα-RMSD from the native state after refinement (
Supplementary Tables VIII and IX). We can conclude from this analysis that clash-refinement using Chiron does not drive the structures further away from the native state.