Protein structure determination and predictive modeling have long been guided by the paradigm that the peptide backbone has a single, context-independent ideal geometry. Both quantum-mechanics calculations and empirical analyses have shown this is an incorrect simplification in that backbone covalent geometry actually varies systematically as a function of the Φ and Ψ backbone dihedral angles. Here, we use a nonredundant set of ultrahigh-resolution protein structures to define these conformation-dependent variations. The trends have a rational, structural basis that can be explained by avoidance of atomic clashes or optimization of favorable electrostatic interactions. To facilitate adoption of this new paradigm, we have created a conformation-dependent library of covalent bond lengths and bond angles and shown that it has improved accuracy over existing methods without any additional variables to optimize. Protein structures derived from both crystallographic refinement and predictive modeling stand to benefit from incorporation of the new paradigm.
Structural details at the 0.1 Å scale guide our understanding of enzyme catalysis, how mutations cause disease, and what makes a good inhibitor and potential drug. Since the work of Pauling and Corey (1951), protein model building at all levels has been guided by the assumption that the peptide backbone has a certain ideal geometry independent of context (Figure 1). This paradigm underlies the restraints used to guide protein structure refinement (e.g., Evans, 2007) and is also the basis of the rigid-geometry approximation used to simplify model building in the most successful structure-prediction packages such as Rosetta and I-TASSER (Rohl et al., 2004; Zhang, 2009). The rigid-geometry approximation uses fixed bond lengths and angles, leaving torsion angles as the only variables needed to define the structure. Ideal target values for the peptide backbone have varied little over the years, and a set of values most recently updated in 1999 (EH; Engh and Huber, 1991; Engh and Huber, 2001) is commonly used (Figure 1).
Experimentally derived crystal structures at all but the highest resolutions reflect the influence of the single-value ideal-geometry paradigm that is applied in the form of geometric restraints. However, strong evidence exists that this paradigm is flawed. Quantum-mechanics calculations and empirical analyses of high-resolution protein structures from over a decade ago suggested that the concept of a single, context-independent ideal value for backbone bond angles and lengths was wrong (Schäfer et al., 1995; Karplus, 1996). Instead, both approaches showed that backbone covalent geometry varies systematically as a function of the conformation of the backbone torsion angles. The systematic conformation dependence of ideal geometry was most notable for the N-Cα-C bond angle (NCαC) that varied by 8.8°, from 105.7° to 114.5° (Karplus, 1996). Similarly, systematic distortions of geometry are known to occur for classically disallowed but experimentally observed conformations (e.g., Gunasekaran 1996, Ramakrishnan 2007). And finally, particularly intriguing has been the observation that at increasingly higher resolution, protein structures are in progressively worse agreement with the supposedly “ideal” values (e.g., Longhi et al., 1998). This observation resulted in a recent literature debate about how to adjust the target values used for geometric restraints and how heavily to weight them (Jaskolski et al., 2007a; Tickle, 2007; Jaskolski et al., 2007b; Stec, 2007). We contributed to this debate with the suggestion that the root of the problem is not simply a matter of incorrect ideal target values or weights but instead is a matter of an incorrect paradigm in that ideal geometry should be a function, not a single value (Karplus et al., 2008).
With the explosion of protein structures solved at 1.0 Å resolution or better, the time is ripe to extend the earlier analysis (Karplus, 1996) and more accurately determine the nature and extent of the systematic variations of peptide geometry with conformation. To accomplish this, we created a nonredundant database of atomic-resolution structures that has nearly 20,000 residues. Here, we use this database to analyze conformation-dependent trends in backbone geometry in all bond angles and lengths. We also show that accounting for these trends has the potential to improve both crystallographic refinement and homology modeling.
To accurately characterize the nature and extent of conformation-dependent variations in geometry, we used a data set of 16,682 well-defined three-residue segments from 108 diverse protein chains determined at 1.0 Å resolution or better (see Experimental Procedures). A three-residue segment includes all of the atoms in two complete peptide units, and the data set included the bond lengths and bond angles for the peptide units uniquely identified by whether they mostly involve atoms from residue −1, 0, or +1 in the three-residue segment (Figure 1). Based on previous work (Karplus, 1996) indicating distinct geometric behavior of Gly, Pro, the β-branched residues Ile and Val (Thr behaves more like a general residue because of stabilizing sidechain-backbone hydrogen bonds) and residues preceding proline (prePro), we carried out separate statistical analyses for those five groups. The data set used here included 1,379 Gly, 639 Pro, 511 general prePro (644 before exclusion of Gly/Pro/Ile/Val), 1,822 Ile/Val, and 10,921 general residues (the 16 other residue types taken together). All prePro residues are excluded from the other classes. As seen in Figure 2, these residues were distributed in Φ,Ψ as has been seen for many well-filtered data sets (Karplus, 1996; Kleywegt and Jones, 1996, Lovell et al., 2003). Figure 2 also provides the shorthand nomenclature we will use for certain regions of the Ramachandran plot.
We analyzed these results to visualize and to document the Φ,Ψ-dependent variations in bond lengths and angles. Our approach was to use kernel-regression methods to smooth the data and to produce continuously variable functions for each parameter (see Experimental Procedures). The figures and tables in this paper are based on the kernel-regression analysis and only include regions of the Ramachandran plot having an observation density of at least 0.03 residues/degree² (i.e., 3 residues in a 10° × 10° area) and a finite standard error of the mean.
The data reveal that for general residues, all 15 bond angles in the two peptides adjacent to the central residue vary systematically with Φ and Ψ (Figure 3 and Table 1). The most prominent observation is that the variations do not occur only in rare outlier conformations, but they occur throughout even the most populated areas of the plot for all residue types (Figure 3, S1–S4). Consistent with the lower-resolution analysis (Karplus, 1996), NCαC varies the most (6.5°), but four other angles also vary by ≥5°. An important difference from the results of the earlier study is that the conformation-dependent standard deviations of the bond angles are about half what was seen previously (Karplus, 1996), ranging from 1.3°–1.8° (Table 1). These are also substantially smaller than the standard deviations of ~2.5° used for the single ideal values defined by Engh and Huber (1991) based on small-molecule structures. It is notable that ultrahigh-resolution crystal structures are generally refined using geometric restraints that do not match the local averages, so the narrow (small σ) distributions cannot be an artifact of the restraints used. Interestingly, the variations in the averages are 2–4 times the standard deviations (Table 1), implying that current modeling restraints would work to wrongly pull angles away from their actual optimal values in many regions. Dramatically, the distributions at the extremes can even be completely non-overlapping because of the small standard deviations (Figure 4). The standard errors of the Φ,Ψ-dependent means (i.e., σ/√N) for bond angles are less than 0.5° in nearly all regions and typically less than 0.2° in the highly populated regions (Figures S5–S9)—however, errors should be considered when examining averages for the lowest-populated edges and other regions, such as the prePro region for general residues. In comparison, the 2°–7° ranges seen for the expected values are 10–50 times greater than their uncertainties. 
This shows that the variations are well-determined and backbone geometry in no way obeys the single ideal value paradigm.
In the 1996 study, the resolution of the data did not allow reliable visualization of bond-length variations. Here at atomic resolution, systematic Φ,Ψ-dependent trends are now visible in bond lengths (Figure 5) but the variation ranges (0.01 Å–0.02 Å) are only on par with the standard deviations (0.012 Å–0.016 Å), meaning the distributions are highly overlapping. The standard errors of the mean are smaller (~0.002 Å), so the variations in the means seen are nevertheless significant (~10-fold larger). Given that the standard deviations are on par with the expected coordinate accuracy, we hypothesize that the true underlying bond lengths are distributed more narrowly and thus will require still higher resolution analyses to determine accurately. Because of this limitation and the expectation that, because of the very small distances involved, the bond-length variations will have little impact on modeling accuracy, we will not further describe the bond-length trends here. Nevertheless, we suspect the variations involved will be chemically informative (e.g., Esposito et al., 2000; Figure 5).
Comparison of conformation-dependent trends across the two sequential peptide units reveals that the trends are largely locally influenced. For each of the seven angles associated with the central residue, the range is larger than the range for the same angle associated with the previous or subsequent residue (Table 1). For instance, N−1Cα−1C−1 and N+1Cα+1C+1 have ranges of 5.5° and 3.0°, whereas NCαC has a range of 6.5°. This implies that the angles in Table 1 associated with residues −1 and +1 show highly local effects, being more influenced by the Φ,Ψ values of their respective residues than the Φ,Ψ values of residue 0 (the central residue). For modeling purposes, it makes sense to assign the “ideal” target values for all seven of these angles based on Φ,Ψ of the central residue.
Furthermore, among these seven angles, additional evidence of the dominance of local effects is seen as each angle is influenced mostly by the single closest torsion angle, whether it is Φ or Ψ. Starting at the N-terminal end, C−1NCα is heavily Φ-dependent as is seen in the vertical pattern of variation, then the Cα-centered angles are a mixture, displaying diagonal patterning, and the angles at the C-terminal end, such as CαCN+1, have Ψ-dependent horizontal patterning. Even among the Cα-centered angles, NCαCβ shows enhanced dependence on Φ and CβCαC shows enhanced dependence on Ψ. This extreme locality agrees with much prior work noting that local steric interactions are critical factors in determining observed conformational and secondary-structure preferences (e.g., Dunbrack and Karplus, 1994; Baldwin and Rose, 1999).
As noted in the introduction, quantum-mechanical (QM) calculations of isolated alanine peptides (Jiang et al., 1997; Yu et al., 2001) also produce conformation-dependent trends in bond angles and bond lengths. The QM calculations are computationally intensive and they have only been carried out at 30° resolution in Φ,Ψ (Jiang et al., 1997; Yu et al., 2001), making detailed features of the trends unavailable. Beyond a certain level, the method and basis set used in QM calculations are unimportant to this analysis because they produce trends on the same scale with a nearly constant offset (Yu et al., 2001). As was reported by Karplus (1996), the QM results have similar trends, but now it is apparent that QM results show larger deviations, ranging farther both positively and negatively than experimental protein structures. For example, the empirical deviations from the central value for NCαC are roughly 70% of the calculated deviations. Additionally, QM calculations show serious discrepancies in some less populated regions, such as a false global maximum for O−1C−1N in Lδ (Figures 2 and 3). The mis-scaling seen in QM-calculated angles has been suggested by others to be caused by a lack of long-distance structural effects (Jiang et al., 1997; Yu et al., 2001; Feig, 2008). However, if that were the case, comparison of residues in secondary structure versus those in loops should show this same difference, but Karplus (1996) did not see a difference, and here we confirm that observation (Figures S10–S11). One potential underlying cause is the difference between a protein environment and vacuum rather than a long-distance effect caused by repeating secondary structure, but the reason that calculations in small peptides fail to predict the correct details of conformation-dependent geometry for proteins is uncertain.
The bond-angle trends for five classes of residues for all Φ,Ψ possibilities comprise a massive amount of information that cannot be exhaustively described in the context of this article. A survey of the results, however, reveals a general principle that the observed trends in geometry make structural sense in terms of accommodating local steric and electrostatic interactions, extending the rationale for observed conformations proposed by Ho et al. (2003). In Karplus (1996), the behavior of NCαC in the well-populated α, β, and δ regions (Figure 2) was rationalized in these terms, including the proposal of a π-peptide interaction in the δ region optimized by the opening of NCαC (see Figure 8 of Karplus, 1996). Instead of rehashing those observations, here we present four illustrative examples of Φ,Ψ regions with significant distortions. The conformations are shown in Figure 2, the relevant bond-angle values can be seen in Figure 3, and the specific collisions being ameliorated are illustrated in Figure 6.
In the Lα/Lδ region, non-Gly residues are disfavored because when using single ideal values for bond angles and lengths, there is a close-contact collision between O−1 and CβH. As Φ increases, this collision becomes worse. The conformation-dependent trends show that these conformations become accessible by a systematic increase in O−1C−1N, C−1NCα, and NCαCβ that opens the ring between O−1 and Cβ. At the extreme tip of the region near (+90°, 0°), these angles open compared to the EH values (Figure 1) by 0.4°, 4.3°, and 2.8°, respectively, to increase the O−1…Cβ distance from 2.59 Å to 2.85 Å. Although this change in distance is small, as are the others described in this section, such changes can make large energetic differences by transforming unfavorable atomic clashes into close contacts.
The II′ region is adopted by the i+1 residue of type II′ turns, a tight turn with a hydrogen bond between O−1 and N+2H. In this conformation, Cβ is quite close to both O−1 and N+1, which results in this region being unfavorable for nonglycine residues. Under the rigid-geometry approximation, the entire region should be disallowed because of this clash, but distortions in covalent geometry allow it to be accessible. The variations seen in Figure 3 show that the distortions relative to EH values (Figure 1) include a large opening in CβCαC (5.9°) as well as opening of CαCN+1 (3.3°) to reduce the Cβ…N+1 clash. This also reduces the O−1…Cβ clash, where the CβCαC distortion acts like opening jaws to move Cβ away from O−1. The extreme bond openings are enabled by a closing of NCαC (2.5°), CαCO (1.8°), and OCN+1 (2.0°). The Cβ…N+1 distance increases from 2.65 Å to 2.71 Å, and the O−1…Cβ distance increases from 3.06 Å to 3.09 Å.
Left of the δ region is a Ramachandran-allowed but sparsely populated region. The primary clash is between HN and HN+1. This clash is prevented through a combination of distortions relative to EH values: the dominant increases are in NCαC (3.5°) and CαCN+1 (2.8°) that both exhibit their extreme values (Figure 3), coupled with a decrease in CαCO (2.0°). The combined effect is to open and twist a nearly planar ring between NH and N+1H to prevent a van der Waals overlap by increasing the HN…HN+1 distance from 1.78 Å to 1.92 Å and the N…N+1 distance from 2.66 Å to 2.76 Å.
As a final example, we illustrate the importance of treating prePro as a special residue type. Preproline residues are classically disallowed in the α region, yet they are experimentally observed with low populations (Hurley et al., 1992). The primary clash occurs between N and Cδ+1 with a secondary clash between CβH and Cδ+1H (Figure 6). To alleviate this clash, the Pro ring bends away from the prePro residue through increases in NCαC (2.0°), CβCαC (2.4°), and CαCN+1 (3.3°), enabled by decreases in CαCO (2.3°), OCN+1 (2.6°), and CN+1Cα+1 (3.8°). In comparison to calculations by Hurley et al. (1992) that suggested a single, very large deviation of 8.5° in CβCαC, here we observe that the distortions have diffused across all of the angles between the sterically hindered atoms. These distortions increase the N…Cδ+1 distance from 2.65 Å to 2.85 Å and the CβH…Cδ+1H distance from 1.86 Å to 1.90 Å to reduce the van der Waals overlap. CN+1Cδ+1 was not included in the database, but we expect it also opens to further alleviate the collision.
With the knowledge of these systematic trends comes the possibility of leveraging them to improve the accuracy of crystallographic refinement and homology modeling. To provide a convenient form in which the documented systematic variations can be used in modeling applications, we created a binned conformation-dependent library (CDL) for distribution. Similar to the approach taken by Karplus (1996), we divided Φ,Ψ space into 1296 10° × 10° bins and calculated the averages and standard deviations for each bin for each of the five residue-type categories (Gly, Pro, prePro, Ile/Val, General). This first-generation CDL (v1.0), available from the authors or at http://proteingeometry.sourceforge.net/, uses a simple precalculated lookup table that accepts conformations and returns the appropriate target value for each bond angle and length. When considering crystallographic refinement and homology modeling, it is important to note that using more accurate CDL values in place of EH values does not increase the number of variable parameters used in the modeling.
A variety of simple control calculations can be carried out to show that the CDL is an improvement over the single-value paradigm (EH values) and even context-dependent values derived from molecular mechanics (MM) force fields. Because an MM force field allows bond angles and lengths to vary with conformation, it could in theory provide better conformation-dependent values than the empirical approach.
As one simple assessment, we compared how well the NCαC values in a 1.15 Å ribonuclease structure (PDB code 1rge; Sevcik et al., 1996) matched with EH values, the CDL, and bond-angle values from the structure after minimization using a molecular mechanics force field (see Experimental Procedures). As seen in Figure 7, the conformation-dependent library outperforms both the single ideal value and molecular mechanics. Importantly, the CDL produces more angles with very close (<1°) agreement with the reference crystal structure as well as fewer extremely large deviations. In terms of modeling accuracy, there appears to be no downside to using the CDL.
For a more thorough comparison of the CDL with EH values, we compared how well each matched the NCαC values for the set of protein structures used to generate the CDL, with each protein jackknifed out during its comparison. Averaged over the whole data set, the median deviation from the native bond angles for the EH single-value paradigm was 1.5°/residue and the median deviation for the CDL dropped to 1.1°/residue. This amounts to an improvement of ~25% in NCαC accuracy relative to the old paradigm.
To understand the impact this difference could have upon protein modeling, coordinates for each jackknifed structure were rebuilt from torsion and bond angles using EH or CDL values. Holmes and Tsai (2004) have shown that the replacement of experimental bond angles with ideal ones while holding Φ and Ψ fixed shifts coordinates by an average of 6 Å (unnormalized by protein length), and this limits model-building accuracy. Here, applying the same approach, we find that the median Cα RMSD100 (normalized to the length of a 100-residue protein) from the native structure for EH values was 3.23 Å, and for CDL values it was 2.85 Å. The CDL produced a significant improvement in the Cα RMSD100 of ~0.4 Å over the old single-value paradigm.
To assess the potential impact of accounting for Φ,Ψ-dependent variations upon X-ray crystal structures at various resolutions, we evaluated how much the experimental NCαC values deviated from those in the CDL as a function of resolution (Figure 8). To avoid bias, none of the structures used in the survey were used in the generation of the CDL. Analysis of the data shows that for structures solved at near 1 Å resolution, the RMSD of NCαC from the CDL is ~1.6°. This matches well with the standard deviation seen in the CDL for this angle and serves as an effective validation of the CDL. Additionally, the small standard deviation of the RMSDs at this resolution shows that each of the individual high-resolution structures is well-described by the CDL. Already at a resolution of 1.5 Å, normally considered very high resolution, the match of NCαC values to the CDL is nearly twice as poor as for the 1.0 Å resolution structures. This loss of accuracy became steadily more pronounced in lower-resolution structures, rising to nearly 4° at 3.0 Å resolution. We conclude that by using the CDL, high-, medium-, and low-resolution structures could all be improved. We suspect that at resolutions worse than 3 Å, the CDL would have less impact because dihedral angles would be less reliable.
To understand the potential benefit of accounting for Φ,Ψ-dependent geometry variations in predictive modeling of protein structure, we carried out a test using the Rosetta modeling program (Rohl et al., 2004). A standard control calculation for homology modeling is to ask how far a crystal structure moves from the experimental structure when minimized by the force field. This provides a lower limit on how accurately a structure can be predicted (e.g., Bradley et al., 2005). For our test, we performed a series of 100 Monte Carlo energy minimizations starting with different random seeds using both native and “ideal” bond geometries for two ultrahigh-resolution protein structures: ribonuclease chain A at 1.15 Å resolution (PDB code 1rge; Sevcik et al., 1996; Figure 9) and the PDZ domain of syntenin at 0.73 Å (PDB code 1r6j; Kang et al., 2004; data not shown). “Native” geometry refers to the bond lengths and angles as seen in the crystal structure. As seen in Figure 9A, minimizations using the “native” bond geometry instead of the idealized geometry resulted in better convergence (tighter grouping) and allowed the minimized structure to be about 30% closer to the true structure (~0.6 Å vs ~0.9 Å). One notable feature is that the improved behavior occurs despite the force field’s optimization based on the traditional “ideal” geometry values. We conclude from this that the use of the rigid-geometry approximation with standard single ideal values limits modeling accuracy substantially. Thus, it is worthwhile to adapt modeling programs to account for the new conformation-dependent geometry paradigm.
To pinpoint exactly where in the structure the improvements occurred, we calculated the deviations between the crystal structure and the energy-minimized structures using native versus ideal geometry (Figure 9B). As an indication of the variation that can occur for this protein in two environments, the deviations with chain B from the same structure are also shown. The largest differences between EH and experimental geometry occur in loops rather than regular secondary structure (Figure 9B). This meets the expectation that the largest systematic deviations from single ideal values should occur in parts of the protein with less frequently observed, more diverse Φ,Ψ values. This result was expected because the most highly populated regions dominate the global averages, resulting in the illusion of single ideal values assumed in EH, whereas more conformationally diverse, less populated regions contribute less to the global average. Importantly, the two loops that were highly improved by using experimental geometry are at the active site of the protein, so inaccuracies in how they are modeled would significantly degrade the ability of this mock homology model to provide insight.
The studies here show that the dependence of backbone geometry on conformation is unmistakably real, significant, and systematic, and it has a rational structural basis. These systematic distortions in covalent geometry additionally explain how some conformations are accessible to amino-acid residues despite being theoretically disallowed by modeling based on single ideal values for backbone geometry. Extending these studies to the conformation dependence of the ω and χ1 torsion angles will be described elsewhere. The conformation-dependent library we derived from the database represents the first step toward implementing the new paradigm of “ideal-geometry functions.” With its much-improved agreement to ultrahigh-resolution crystal structures, the ideal-geometry functions provide an intellectually satisfying resolution to the debate among crystallographers as to what ideal values should be used during refinement. Also, because the ideal-geometry functions captured in the CDL are simply a highly enlarged set of immutable ideal values, their use in the place of single ideal values represents no increase in algorithmic complexity. Use of the CDL thus offers the potential for improved modeling accuracy in a wide variety of experimentally based and predictive modeling applications without increasing the risk of overfitting.
A Protein Geometry Database being developed in our laboratory (Berkholz et al., submitted) was used to generate our data set of atomic-resolution geometry information. To optimally analyze Φ,Ψ-dependent geometry trends, the data set must be large but also have independent and accurate information about geometry. The plethora of new atomic-resolution protein structures allowed us to use stringent criteria for independence and accuracy, yet still have sufficient observations for reasonable statistics. To ensure independence, we used the PDBSelect (Hobohm and Sander, 1994) list from March 2006 to choose protein chains with less than 90% sequence identity to any other chain in the data set. To ensure high accuracy, we only used structures determined at 1.0 Å or better. At this resolution, we estimate Φ and Ψ dihedral angle accuracy to be better than 3° (see next paragraph). Also, as in Karplus (1996), to ensure that individual residues used were well-resolved, we required that all residues in a five-residue segment were well-ordered, having B-factors <25 Å² for the mainchain average, the sidechain average, and Cγ, and alternative conformations were discarded.
To estimate the experimental uncertainty in Φ and Ψ for 1 Å resolution structures, we chose to use a straightforward, empirical approach—randomize and re-refine a test structure multiple times and then examine the spread of the dihedral angles among the structures. Specifically, we applied 10 coordinate randomizations with a mean shift of 0.2 Å using phenix.pdbtools (Adams et al., 2002) to the coordinates of glutathione reductase at 0.95 Å resolution (PDB ID: 3dk9; Berkholz et al., 2008) and re-refined each in SHELXL (Sheldrick, 2008). Dihedral RMSDs for the vast majority of residues were between 1°–2°. The 90th percentile of the per-residue RMSDs in both Φ and Ψ was 2.2°, and the RMSD values of the per-residue RMSDs for Φ and Ψ were 1.7° and 2.4°, respectively.
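The spread of a dihedral angle across repeated refinements has to be computed with care because angles wrap around at ±180°. The sketch below shows one way this could be done. It is an illustration only: the text does not specify whether the spread was measured about a mean or pairwise, so measuring about the circular mean is an assumption, and the function names are hypothetical.

```python
import math

def angular_difference(a_deg, b_deg):
    """Smallest signed difference a - b in degrees, wrapped into [-180, 180)."""
    return (a_deg - b_deg + 180.0) % 360.0 - 180.0

def circular_mean(angles_deg):
    """Mean direction of a set of angles, computed from summed unit vectors."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c))

def dihedral_rmsd(angles_deg):
    """RMS deviation of repeated measurements of one dihedral about their
    circular mean (assumed definition, see lead-in)."""
    mean = circular_mean(angles_deg)
    squared = [angular_difference(a, mean) ** 2 for a in angles_deg]
    return math.sqrt(sum(squared) / len(squared))

# Toy values: two refinements of the same residue that straddle the wrap point.
spread = dihedral_rmsd([179.0, -179.0])
```

A naive arithmetic mean of 179° and −179° would be 0°, giving a spurious spread of about 179°; the wrapped calculation treats the two values as 2° apart.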
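The data value of any structural parameter a of residue i (or of the left or right neighbor of residue i) may be expressed:
$$a_i = m(\phi_i,\psi_i) + \nu^{1/2}(\phi_i,\psi_i)\,\varepsilon_i$$
where m is a regression function, and ε are random Gaussian-distributed errors with mean 0 and σ=1:
$$E(a \mid \phi,\psi) = m(\phi,\psi), \qquad \mathrm{Var}(a \mid \phi,\psi) = \nu(\phi,\psi)$$
In these expressions, E is the expectation value of a and Var is the variance of a.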
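To obtain an estimate of m and ν, we use a zeroth-order or Nadaraya-Watson kernel regression (Nadaraya, 1964) by summing over N data points:
$$\hat{m}(\phi,\psi) = \frac{\sum_{i=1}^{N} a_i\,K(\phi-\phi_i)\,K(\psi-\psi_i)}{\sum_{i=1}^{N} K(\phi-\phi_i)\,K(\psi-\psi_i)}, \qquad \hat{\nu}(\phi,\psi) = \frac{\sum_{i=1}^{N} \left[a_i-\hat{m}(\phi,\psi)\right]^2 K(\phi-\phi_i)\,K(\psi-\psi_i)}{\sum_{i=1}^{N} K(\phi-\phi_i)\,K(\psi-\psi_i)}$$
The latter is Var(a|φ,ψ), an estimate of the heteroscedastic data variance as a function of φ and ψ.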
The functions K are kernels that weight the data points based on how far away they are from the query φ,ψ value. Since φ and ψ are angles, we use the product of two von Mises kernel functions (Mardia and Zamroch, 1975):
$$K(\phi-\phi_i)\,K(\psi-\psi_i) = \frac{\exp\left[\kappa\cos(\phi-\phi_i)\right]}{2\pi I_0(\kappa)} \cdot \frac{\exp\left[\kappa\cos(\psi-\psi_i)\right]}{2\pi I_0(\kappa)}$$
where I_0 is the modified Bessel function of the first kind of order zero. At large values of κ, these functions behave very similarly to Gaussian distributions, except that they are periodic. We investigated several values of κ and plotted the resulting regressions as a function of φ and ψ. We empirically chose a value of κ=50 to produce distributions that varied smoothly with φ and ψ in a reasonable way.
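The kernel regression described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study: the function names and the toy data are hypothetical, the same κ is assumed for both angles, and the von Mises kernel is left unnormalized because its normalizing constant cancels in the Nadaraya-Watson ratio.

```python
import math

KAPPA = 50.0  # concentration parameter quoted in the text

def von_mises_weight(delta_deg, kappa=KAPPA):
    """Unnormalized von Mises kernel for an angular offset in degrees.
    Written as exp(kappa * (cos(d) - 1)) so the peak value is 1."""
    return math.exp(kappa * (math.cos(math.radians(delta_deg)) - 1.0))

def kernel_regression(phi, psi, data, kappa=KAPPA):
    """Nadaraya-Watson estimate at the query (phi, psi).

    data: list of (phi_i, psi_i, a_i) observations, angles in degrees.
    Returns (weighted mean, weighted variance) of a at the query point.
    """
    weights = [von_mises_weight(phi - p, kappa) * von_mises_weight(psi - s, kappa)
               for p, s, _ in data]
    total = sum(weights)
    mean = sum(w * a for w, (_, _, a) in zip(weights, data)) / total
    variance = sum(w * (a - mean) ** 2 for w, (_, _, a) in zip(weights, data)) / total
    return mean, variance

# Toy data (illustrative values only): two observations at the same conformation.
toy = [(-60.0, -45.0, 110.0), (-60.0, -45.0, 112.0)]
mean, variance = kernel_regression(-60.0, -45.0, toy)
```

Because the kernel depends on the cosine of the offset, a data point at (−180°, −180°) receives full weight for a query at (180°, 180°), which is the periodic behavior a Gaussian kernel would not provide.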
The φ,ψ map is not uniformly populated by data points, each of them representing a single residue backbone conformation. Therefore, for the unpopulated regions of the map, the kernel regression analysis generates non-local estimates of m and ν. A query point (φ,ψ) at which we estimate expectation and variance values of a can be surrounded by an effective radius r, equal to half of the bandwidth b of the kernel function K. We can count the effective number of data points, Neff, within the radius r around any query point. These points will have an impact on the estimated local values of m and ν.
We define the bandwidth, b(κ), as the diameter of the circle centered on the query point (φ0,ψ0) within which the von Mises kernel function integrates to 68.2% (the value of the integral of the normal distribution PDF within one standard deviation of its center).
The bandwidth of the von Mises kernel at κ=50 is approximately 16°.
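The quoted bandwidth can be checked numerically. The sketch below assumes a one-dimensional reading of the definition, in which the bandwidth is the full width of the interval, centered on the kernel peak, that contains 68.2% of a single von Mises density. That interpretation is an assumption made here for illustration, and the helper names are hypothetical.

```python
import math

def kernel(x_rad, kappa):
    """Unnormalized von Mises density, scaled so the peak equals 1."""
    return math.exp(kappa * (math.cos(x_rad) - 1.0))

def integrate(lo, hi, kappa, steps=4000):
    """Midpoint-rule integral of the kernel over [lo, hi] in radians."""
    h = (hi - lo) / steps
    return h * sum(kernel(lo + (k + 0.5) * h, kappa) for k in range(steps))

def bandwidth_deg(kappa, target=0.682):
    """Full width (degrees) of the centered interval holding `target`
    of the von Mises probability mass, found by bisection."""
    total = integrate(-math.pi, math.pi, kappa)
    lo, hi = 0.0, math.pi
    for _ in range(40):
        mid = 0.5 * (lo + hi)
        if integrate(-mid, mid, kappa) / total < target:
            lo = mid
        else:
            hi = mid
    return math.degrees(lo + hi)  # lo + hi is twice the half-width

b = bandwidth_deg(50.0)
```

Under this reading the result for κ=50 comes out close to 16°, which agrees with the Gaussian approximation of a standard deviation near 1/√κ radians (about 8°) on either side of the center.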
In order to depict the trends of m(φ,ψ) and ν(φ,ψ), we only plot their estimates at φ,ψ grid points where Neff(φ,ψ) ≥ 3 within a circle with a diameter equal to the bandwidth b(κ=50) = 16°.
In the sparsely populated areas of the φ,ψ map, the threshold of at least 3 data points within the effective bandwidth may lead to estimates with high standard errors of the mean (SEM). We calculated an estimate of the SEM as
$$\mathrm{SEM}(a \mid \phi,\psi) = \sqrt{\frac{\hat{\nu}(\phi,\psi)}{N_{\mathrm{eff}}(\phi,\psi)}}$$
It is very important to analyze the trends of m and ν as a function of φ,ψ together with SEM(a|φ,ψ). The values of SEM will indicate the significance of the trend in the more sparsely populated areas.
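The counting and error estimate described in the preceding paragraphs might look like the following sketch. Two points are assumptions made for illustration rather than statements from the text: distance on the φ,ψ map is taken as the Euclidean norm of the wrapped angular differences, and the SEM is taken as the square root of the variance estimate divided by Neff, in line with the σ/√N form mentioned in the Results. The function names and toy data are hypothetical.

```python
import math

def wrapped_diff(a_deg, b_deg):
    """Smallest signed angular difference in degrees, in [-180, 180)."""
    return (a_deg - b_deg + 180.0) % 360.0 - 180.0

def effective_count(phi, psi, points, radius_deg=8.0):
    """Number of (phi_i, psi_i) points within radius_deg of the query.
    The default radius is half of the 16 degree bandwidth quoted in the text."""
    return sum(
        1 for p, s in points
        if math.hypot(wrapped_diff(phi, p), wrapped_diff(psi, s)) <= radius_deg
    )

def standard_error(variance, n_eff):
    """SEM under the assumed form sqrt(variance / N_eff)."""
    return math.sqrt(variance / n_eff)

# Toy conformations (illustrative only).
points = [(-60.0, -45.0), (-62.0, -44.0), (-50.0, -45.0), (175.0, 178.0)]
n_eff = effective_count(-60.0, -45.0, points)  # the point 10 degrees away is excluded
plot_this_grid_point = n_eff >= 3              # threshold stated in the text
```

With these toy points the query at (−60°, −45°) has only two neighbors inside the radius, so it would fall below the plotting threshold of three.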
To create a binned conformation-dependent library (CDL) for each residue class, averages and standard deviations were calculated in 10° × 10° bins in Φ,Ψ. The results were stored in a set of files, one per residue class. Python scripts provide an interface to the CDL, allowing easy retrieval of the conformation-dependent values when given a residue name and conformation. Additional tools building upon this simple interface are also part of the distributed code, including a tool that will compare the bond angles and lengths in any PDB coordinate file with CDL values, EH values, or another PDB coordinate file. The CDL and accessory tools are available under an open-source license from http://proteingeometry.sourceforge.net/.
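As a rough illustration of the binning and lookup idea, the sketch below builds a table of per-bin statistics and retrieves the entry for a given conformation. It does not reproduce the distributed CDL scripts or their file format: the function names, the in-memory dictionary, the choice of population standard deviation, and the toy observations are all assumptions made for this example.

```python
from collections import defaultdict
import statistics

BIN_WIDTH = 10                    # degrees, matching the 10 x 10 degree bins
BINS_PER_AXIS = 360 // BIN_WIDTH  # 36 per axis, 36 * 36 = 1296 bins in total

def bin_index(phi_deg, psi_deg):
    """Return the (i, j) bin for a conformation, with angles wrapped so that
    -180 and +180 fall in the same bin."""
    i = int(((phi_deg + 180.0) % 360.0) // BIN_WIDTH) % BINS_PER_AXIS
    j = int(((psi_deg + 180.0) % 360.0) // BIN_WIDTH) % BINS_PER_AXIS
    return i, j

def build_binned_library(observations):
    """observations: iterable of (phi, psi, value) for one residue class and
    one geometric parameter. Returns {(i, j): (mean, stdev, count)}."""
    bins = defaultdict(list)
    for phi, psi, value in observations:
        bins[bin_index(phi, psi)].append(value)
    return {
        key: (statistics.mean(vals), statistics.pstdev(vals), len(vals))
        for key, vals in bins.items()
    }

def lookup(library, phi_deg, psi_deg):
    """Return (mean, stdev, count) for the bin containing the conformation,
    or None when the bin holds no observations."""
    return library.get(bin_index(phi_deg, psi_deg))

# Toy observations of a single bond angle (illustrative values only).
toy = [(-60.0, -45.0, 110.0), (-58.0, -41.0, 112.0), (60.0, 45.0, 113.0)]
library = build_binned_library(toy)
target = lookup(library, -55.0, -42.0)
```

A real library would need a policy for empty or sparsely populated bins, for example falling back to a global value; that choice is left open here.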
Molecular mechanics-derived context-dependent values for bond angles for two test cases (PDB codes 1rge (Sevcik et al., 1996) and 1r6j (Kang et al., 2004)) were generated using the following protocol: the structures were minimized in CHARMM (Brooks et al., 1983) using the parm_all22_prot force field with the CMAP correction (MacKerell, 2004) using the GBMV implicit solvent model (Lee et al., 2003). The protocol used cycles of 100 steps of steepest-descent minimization with heavy-atom restraints of 5, 3, 1 and 0 * atomic mass kcal/mol/Å². Following the last cycle (which had no restraints), 1000 steps of adopted basis Newton-Raphson minimization were performed, and the typical gradient RMS was about 0.05 kcal/mol/Å.
Ideal peptides with EH or CDL backbone geometry were built using PyRosetta (http://graylab.jhu.edu/~sid/pyrosetta/), Python bindings to Rosetta (Rohl et al., 2004). To account for the length dependence of RMSD calculations (e.g., Holmes and Tsai, 2004), we linearly rescaled all RMSDs to those of 100-residue proteins using the EH RMSDs and the assumption that RMSDs intersect the origin. Based on the linear fit of EH RMSDs versus length, which gave a slope of 0.0332519 Å per residue, we calculated a scaling factor of (0.0332519 × 100)/(0.0332519 × length), which reduces to 100/length. To understand the structural basis of variations between these theoretical peptides, van der Waals clashes were visually analyzed using the Coot (Emsley and Cowtan, 2004) interface to MolProbity (Davis et al., 2007).
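The length normalization can be written out as a short worked example. The reasoning assumed here is that a linear fit through the origin gives RMSD ≈ slope × length, so rescaling a protein of a given length to 100 residues multiplies its RMSD by (slope × 100)/(slope × length), and the slope cancels. The function names and the toy RMSD value are hypothetical.

```python
SLOPE = 0.0332519  # fitted slope quoted in the text, taken as Angstrom per residue

def scale_factor(length, reference_length=100):
    """Ratio of the fitted RMSD at the reference length to that at `length`.
    The slope cancels, leaving reference_length / length."""
    return (SLOPE * reference_length) / (SLOPE * length)

def rmsd100(rmsd, length):
    """Rescale an RMSD measured on a protein of `length` residues to the
    value expected for a 100-residue protein under the linear assumption."""
    return rmsd * scale_factor(length)

# Toy example: a 200-residue protein with an RMSD of 5.0 Angstrom.
normalized = rmsd100(5.0, 200)
```

The fitted line itself predicts about 3.3 Å for a 100-residue protein (0.0332519 × 100), which is on the same scale as the median EH value reported in the Results.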
Nonredundant structures with a 25% sequence-identity threshold were taken from PDBSelect (Hobohm and Sander, 1994). Among these, 50 structures were selected from each of five resolution ranges: 1.0–1.1 Å, 1.5–1.6 Å, 2.0–2.1 Å, 2.5–2.6 Å, 3.0–3.1 Å. For each residue in these structures, we then calculated the difference in the observed NCαC and the CDL value. These were used to calculate the per-structure RMSDs, which were then used to calculate averages, standard deviations, and standard errors of the mean for each of the five resolution shells.
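The two-level statistics described above, a per-structure RMSD followed by summary statistics over each resolution shell, can be sketched as follows. The function names and numbers are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
import math
import statistics

def structure_rmsd(observed, expected):
    """RMSD between observed angles and the matching library values for
    one structure (both in degrees, same residue order)."""
    squared = [(o - e) ** 2 for o, e in zip(observed, expected)]
    return math.sqrt(sum(squared) / len(squared))

def shell_summary(per_structure_rmsds):
    """Mean, sample standard deviation, and standard error of the mean
    of the per-structure RMSDs within one resolution shell."""
    mean = statistics.mean(per_structure_rmsds)
    stdev = statistics.stdev(per_structure_rmsds)
    sem = stdev / math.sqrt(len(per_structure_rmsds))
    return mean, stdev, sem

# Toy numbers (illustrative only): three structures in one shell.
one_structure = structure_rmsd([111.0, 113.0], [112.0, 112.0])
mean, stdev, sem = shell_summary([1.0, 2.0, 3.0])
```

Summarizing per structure first, rather than pooling all residues, keeps large proteins from dominating a shell and lets the standard deviation reflect structure-to-structure variation.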
We thank Charles L. Brooks III (University of Michigan) for performing the molecular-mechanics minimizations used in this study. We additionally thank the David Baker lab (University of Washington at Seattle), in particular Srivatsan Raman, James Thompson, and Elizabeth Kellogg, for their help with Rosetta. We thank Jeffrey Gray (Johns Hopkins University) for providing PyRosetta, the Python bindings to Rosetta. We thank Lothar Schäfer (University of Arkansas) for providing a database of QM-calculated dipeptides and an extrapolation program to obtain values for conformation-dependent bond angles and lengths. This work was supported in part by NIH grant R01-GM083136 (to PAK), NSF grant MCB-9982727 (to PAK), and NIH grant P20-GM76222 (to RLD).