Data Source and Analysis Strategy
To accurately characterize the nature and extent of conformation-dependent variations in geometry, we used a data set of 16,682 well-defined three-residue segments from 108 diverse protein chains determined at 1.0 Å resolution or better (see Experimental Procedures). A three-residue segment includes all of the atoms in two complete peptide units, and the data set included the bond lengths and bond angles for the peptide units uniquely identified by whether they mostly involve atoms from residue −1, 0, or +1 in the three-residue segment (). Based on previous work (Karplus, 1996) indicating distinct geometric behavior of Gly, Pro, the β-branched residues Ile and Val (Thr behaves more like a general residue because of stabilizing sidechain-backbone hydrogen bonds) and residues preceding proline (prePro), we carried out separate statistical analyses for those five groups. The data set used here included 1,379 Gly, 639 Pro, 511 general prePro (644 before exclusion of Gly/Pro/Ile/Val), 1,822 Ile/Val, and 10,921 general residues (the 16 other residue types taken together). All prePro residues are excluded from the other classes. As seen in , these residues were distributed in Φ,Ψ as has been seen for many well-filtered data sets (Karplus, 1996; Kleywegt and Jones, 1996; Lovell et al., 2003). also provides the shorthand nomenclature we will use for certain regions of the Ramachandran plot.
We analyzed these results to visualize and to document the Φ,Ψ-dependent variations in bond lengths and angles. Our approach was to use kernel-regression methods to smooth the data and to produce continuously variable functions for each parameter (see Experimental Procedures). The figures and tables in this paper are based on the kernel-regression analysis and only include regions of the Ramachandran plot having an observation density of at least 0.03 residues/degree² (i.e., 3 residues in a 10° × 10° area) and a finite standard error of the mean.
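As a rough illustration of this kind of smoothing, the following minimal sketch computes a kernel-weighted (Nadaraya-Watson) average on a 10° grid and masks cells below the 0.03 residues/degree² density cutoff. The product of von Mises kernels, the concentration parameter `kappa`, and the function name `smooth_phi_psi` are assumptions made for illustration; the actual kernel and bandwidth are described in Experimental Procedures and are not reproduced here.

```python
import numpy as np

def smooth_phi_psi(phi_deg, psi_deg, values, step=10.0, kappa=50.0, min_density=0.03):
    """Kernel-weighted average of `values` on a periodic phi,psi grid.

    kappa is an illustrative concentration parameter, not a value from the
    paper. Grid cells with fewer than `min_density` observations per square
    degree are returned as NaN.
    """
    phi_deg = np.asarray(phi_deg, dtype=float)
    psi_deg = np.asarray(psi_deg, dtype=float)
    values = np.asarray(values, dtype=float)

    edges = np.arange(-180.0, 180.0 + step, step)      # 37 edges, 36 cells per axis
    centers = np.radians(edges[:-1] + step / 2.0)
    gphi, gpsi = np.meshgrid(centers, centers, indexing="ij")

    # Product of two von Mises kernels, so that -180 and +180 are neighbors.
    dphi = gphi[..., None] - np.radians(phi_deg)
    dpsi = gpsi[..., None] - np.radians(psi_deg)
    weights = np.exp(kappa * (np.cos(dphi) + np.cos(dpsi) - 2.0))
    estimate = (weights * values).sum(axis=-1) / weights.sum(axis=-1)

    # Observation density (residues per square degree) in each grid cell.
    counts, _, _ = np.histogram2d(phi_deg, psi_deg, bins=[edges, edges])
    estimate[counts / step**2 < min_density] = np.nan
    return estimate
```

The weight array has one entry per grid cell per observation, so for a data set of the size used here it would be more economical to process the observations in chunks; the sketch favors clarity over memory use.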
Ubiquitous, Systematic, Φ,Ψ-Dependent Variations Exist in Peptide Geometry
Bond angles
The data reveal that for general residues, all 15 bond angles in the two peptides adjacent to the central residue vary systematically with Φ and Ψ ( and ). The most prominent observation is that the variations do not occur only in rare outlier conformations, but they occur throughout even the most populated areas of the plot for all residue types (, S1–S4). Consistent with the lower-resolution analysis (Karplus, 1996), NCαC varies the most (6.5°), but four other angles also vary by ≥5°. An important difference from the results of the earlier study is that the conformation-dependent standard deviations of the bond angles are about half what was seen previously (Karplus, 1996), ranging from 1.3°–1.8° (). These are also substantially smaller than the standard deviations of ~2.5° used for the single ideal values defined by Engh and Huber (1991) based on small-molecule structures. It is notable that ultrahigh-resolution crystal structures are generally refined using geometric restraints that do not match the local averages, so the narrow (small σ) distributions cannot be an artifact of the restraints used. Interestingly, the variations in the averages are 2–4 times the standard deviations (), implying that current modeling restraints would work to wrongly pull angles away from their actual optimal values in many regions. Dramatically, the distributions at the extremes can even be completely non-overlapping because of the small standard deviations (). The standard errors of the Φ,Ψ-dependent means (i.e., σ/√N) for bond angles are less than 0.5° in nearly all regions and typically less than 0.2° in the highly populated regions (Figures S5–S9); however, errors should be considered when examining averages for the lowest-populated edges and other regions, such as the prePro region for general residues. In comparison, the 2°–7° ranges seen for the expected values are 10–50 times greater than their uncertainties. This shows that the variations are well-determined and that backbone geometry does not obey the single-ideal-value paradigm.
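The significance argument above rests on the standard error of the mean, σ/√N, being far smaller than the range of the Φ,Ψ-dependent averages. A minimal sketch of that arithmetic follows; the function names are hypothetical, and the bin population used in the usage note is an invented number, not a value from the data set.

```python
import math

def standard_error(sigma, n_obs):
    """Standard error of the mean for a bin with spread sigma and n_obs observations."""
    if n_obs <= 0:
        raise ValueError("a bin needs at least one observation")
    return sigma / math.sqrt(n_obs)

def range_to_error_ratio(value_range, sigma, n_obs):
    """How many standard errors the observed range of the averages spans."""
    return value_range / standard_error(sigma, n_obs)
```

For instance, a bin with σ = 1.6° (within the 1.3°–1.8° reported above) and a hypothetical 64 observations has a standard error of 0.2°, so the 6.5° range of NCαC would span about 32 standard errors.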
Figure 3 Conformation-dependent variation in bond angles of general residues as a function of the Φ,Ψ of the central residue. A Ramachandran plot is shown for each backbone bond angle in the two peptide units surrounding the central residue of …
Expected and observed ranges for peptide geometries
Figure 4 NCαC distributions are well-defined and distinct. Shown are observations from selected 10° × 10° bins in each of four conformations: α (gray), β (green), PII (blue), and a region left of δ at (−125°, …
Bond lengths
In the 1996 study, the resolution of the data did not allow reliable visualization of bond-length variations. Here at atomic resolution, systematic Φ,Ψ-dependent trends are now visible in bond lengths (), but the variation ranges (0.01 Å–0.02 Å) are only on par with the standard deviations (0.012 Å–0.016 Å), meaning the distributions are highly overlapping. The standard errors of the mean are smaller (~0.002 Å), so the variations in the means seen are nevertheless significant (~10-fold larger). Given that the standard deviations are on par with the expected coordinate accuracy, we hypothesize that the true underlying bond lengths are distributed more narrowly and thus will require still higher resolution analyses to determine accurately. Because of this limitation and the expectation that, because of the very small distances involved, the bond-length variations will have little impact on modeling accuracy, we will not further describe the bond-length trends here. Nevertheless, we suspect the variations involved will be chemically informative (e.g., Esposito et al., 2000).
Figure 5 Conformation-dependent variation in bond lengths is partially masked by experimental uncertainty. Ramachandran plots are shown for average lengths and standard deviations of the C=O bond (left panels) and the C-N bond (right panels) using colors as in …
Variations are Correlated with Local Interactions
Comparison of conformation-dependent trends across the two sequential peptide units reveals that the trends are largely locally influenced. For each of the seven angles associated with the central residue, the range is larger than the range for the same angle associated with the previous or subsequent residue (). For instance, N−1Cα−1C−1 and N+1Cα+1C+1 have ranges of 5.5° and 3.0°, whereas NCαC has a range of 6.5°. This implies that the angles in associated with residues −1 and +1 show highly local effects, being more influenced by the Φ,Ψ values of their respective residues than the Φ,Ψ values of residue 0 (the central residue). For modeling purposes, it makes sense to assign the “ideal” target values for all seven of these angles based on Φ,Ψ of the central residue.
Furthermore, among these seven angles, additional evidence of the dominance of local effects is seen as each angle is influenced mostly by the single closest torsion angle, whether it is Φ or Ψ. Starting at the N-terminal end, C−1NCα is heavily Φ-dependent as is seen in the vertical pattern of variation, then the Cα-centered angles are a mixture, displaying diagonal patterning, and the angles at the C-terminal end, such as CαCN+1, have Ψ-dependent horizontal patterning. Even among the Cα-centered angles, NCαCβ shows enhanced dependence on Φ and CβCαC shows enhanced dependence on Ψ. This extreme locality agrees with much prior work noting that local steric interactions are critical factors in determining observed conformational and secondary-structure preferences (e.g., Dunbrack and Karplus, 1994; Baldwin and Rose, 1999).
Comparison of Trends with Quantum Mechanics
As noted in the introduction, quantum-mechanical (QM) calculations of isolated alanine peptides (Jiang et al., 1997; Yu et al., 2001) also produce conformation-dependent trends in bond angles and bond lengths. The QM calculations are computationally intensive and they have only been carried out at 30° resolution in Φ,Ψ (Jiang et al., 1997; Yu et al., 2001), making detailed features of the trends unavailable. Beyond a certain level, the method and basis set used in QM calculations are unimportant to this analysis because they produce trends on the same scale with a nearly constant offset (Yu et al., 2001). As was reported by Karplus (1996), the QM results have similar trends, but now it is apparent that QM results show larger deviations, ranging farther both positively and negatively than experimental protein structures. For example, the empirical deviations from the central value for NCαC are roughly 70% of the calculated deviations. Additionally, QM calculations show serious discrepancies in some less populated regions, such as a false global maximum for O−1C−1N in Lδ ( and ). The mis-scaling seen in QM-calculated angles has been suggested by others to be caused by a lack of long-distance structural effects (Jiang et al., 1997; Yu et al., 2001; Feig, 2008). However, if that were the case, comparison of residues in secondary structure versus those in loops should show this same difference, but Karplus (1996) did not see a difference, and here we confirm that observation (Figures S10–S11). One potential underlying cause is the difference between a protein environment and vacuum rather than a long-distance effect caused by repeating secondary structure, but the reason that calculations in small peptides fail to predict the correct details of conformation-dependent geometry for proteins is uncertain.
Local Variations Make Structural Sense
The bond-angle trends for five classes of residues for all Φ,Ψ possibilities comprise a massive amount of information that cannot be exhaustively described in the context of this article. A survey of the results, however, reveals a general principle that the observed trends in geometry make structural sense in terms of accommodating local steric and electrostatic interactions, extending the rationale for observed conformations proposed by Ho et al. (2003). In Karplus (1996), the behavior of NCαC in the well-populated α, β, and δ regions () was rationalized in these terms, including the proposal of a π-peptide interaction in the δ region optimized by the opening of NCαC (see Figure 8 of Karplus, 1996). Instead of rehashing those observations, here we present four illustrative examples of Φ,Ψ regions with significant distortions. The conformations are shown in , the relevant bond-angle values can be seen in , and the specific collisions being ameliorated are illustrated in .
Figure 6 Structural basis for geometry variations of selected conformations. Four Ala residues with adjacent peptides are shown, built using EH values and with dots showing van der Waals overlap between atoms: blue (wide contact), green (close contact), yellow …
In the Lα/Lδ region, non-Gly residues are disfavored because when using single ideal values for bond angles and lengths, there is a close-contact collision between O−1 and CβH. As Φ increases, this collision becomes worse. The conformation-dependent trends show that these conformations become accessible by a systematic increase in O−1C−1N, C−1NCα, and NCαCβ that opens the ring between O−1 and Cβ. At the extreme tip of the region near (+90°, 0°), these angles open compared to the EH values () by 0.4°, 4.3°, and 2.8°, respectively, to increase the O−1…Cβ distance from 2.59 Å to 2.85 Å. Although this change in distance is small, as are others described in this section, such changes can make large energetic differences by transforming unfavorable atomic clashes to close contacts.
The II′ region is adopted by the i+1 residue of type II′ turns, a tight turn with a hydrogen bond between O−1 and N+2H. In this conformation, Cβ is quite close to both O−1 and N+1, which results in this region being unfavorable for nonglycine residues. Under the rigid-geometry approximation, the entire region should be disallowed because of this clash, but distortions in covalent geometry allow it to be accessible. The variations seen in show that the distortions relative to EH values () include a large opening in CβCαC (5.9°) as well as opening of CαCN+1 (3.3°) to reduce the Cβ…N+1 clash. This also reduces the O−1…Cβ clash, where the CβCαC distortion acts like opening jaws to move Cβ away from O−1. The extreme bond openings are enabled by a closing of NCαC (2.5°), CαCO (1.8°), and OCN+1 (2.0°). The Cβ…N+1 distance increases from 2.65 Å to 2.71 Å, and the O−1…Cβ distance increases from 3.06 Å to 3.09 Å.
Left of the δ region is a Ramachandran-allowed but sparsely populated region. The primary clash is between HN and HN+1. This clash is prevented through a combination of distortions relative to EH values: the dominant increases are in NCαC (3.5°) and CαCN+1 (2.8°) that both exhibit their extreme values (), coupled with a decrease in CαCO (2.0°). The combined effect is to open and twist a nearly planar ring between NH and N+1H to prevent a van der Waals overlap by increasing the HN…HN+1 distance from 1.78 Å to 1.92 Å and the N…N+1 distance from 2.66 Å to 2.76 Å.
As a final example, we illustrate the importance of treating prePro as a special residue type. Preproline residues are classically disallowed in the α region, yet they are experimentally observed with low populations (Hurley et al., 1992). The primary clash occurs between N and Cδ+1 with a secondary clash between CβH and Cδ+1H (). To alleviate this clash, the Pro ring bends away from the prePro residue through increases in NCαC (2.0°), CβCαC (2.4°), and CαCN+1 (3.3°), enabled by decreases in CαCO (2.3°), OCN+1 (2.6°), and CN+1Cα+1 (3.8°). In comparison to calculations by Hurley et al. (1992) that suggested a single, very large deviation of 8.5° in CβCαC, here we observe that the distortions have diffused across all of the angles between the sterically hindered atoms. These distortions increase the N…Cδ+1 distance from 2.65 Å to 2.85 Å and the CβH…Cδ+1H distance from 1.86 Å to 1.90 Å to reduce the van der Waals overlap. CN+1Cδ+1 was not included in the database, but we expect it also opens to further alleviate the collision.
A 10°-Resolution Conformation-Dependent Library
With the knowledge of these systematic trends comes the possibility of leveraging them to improve the accuracy of crystallographic refinement and homology modeling. To provide a convenient form in which the documented systematic variations can be used in modeling applications, we created a binned conformation-dependent library (CDL) for distribution. Similar to the approach taken by Karplus (1996), we divided Φ,Ψ space into 1296 10° × 10° bins and calculated the averages and standard deviations for each bin for each of the five residue-type categories (Gly, Pro, prePro, Ile/Val, General). This first-generation CDL (v1.0), available from the authors or at http://proteingeometry.sourceforge.net/, uses a simple precalculated lookup table that accepts conformations and returns the appropriate target value for each bond angle and length. When considering crystallographic refinement and homology modeling, it is important to note that using more accurate CDL values in place of EH values does not increase the number of variable parameters used in the modeling.
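The lookup described above can be sketched as follows, assuming a dictionary keyed by residue class and by the Φ and Ψ bin numbers. The data layout, the names `bin_index` and `lookup`, and the numbers in the usage note are hypothetical and do not reflect the file format of the distributed CDL.

```python
from typing import Dict, Optional, Tuple

BIN_WIDTH = 10                     # degrees
BINS_PER_AXIS = 360 // BIN_WIDTH   # 36 per axis, so 36 * 36 = 1296 bins

# (residue class, phi bin, psi bin) -> {angle name: (mean, standard deviation)}
Library = Dict[Tuple[str, int, int], Dict[str, Tuple[float, float]]]

def bin_index(angle_deg: float) -> int:
    """Map a torsion angle in degrees to a bin number 0..35, wrapping at +/-180."""
    return int(((angle_deg + 180.0) % 360.0) // BIN_WIDTH) % BINS_PER_AXIS

def lookup(library: Library, res_class: str, phi: float, psi: float,
           angle_name: str) -> Optional[Tuple[float, float]]:
    """Return (target value, standard deviation), or None if the bin is unpopulated."""
    entry = library.get((res_class, bin_index(phi), bin_index(psi)))
    if entry is None:
        return None
    return entry.get(angle_name)
```

With a toy library such as `{("General", 11, 13): {"NCaC": (111.0, 1.5)}}` (made-up numbers), `lookup(toy, "General", -65.0, -45.0, "NCaC")` returns `(111.0, 1.5)`. Because the table is fixed in advance, a lookup of this kind adds no adjustable parameters to a refinement or modeling run; a caller would still need a policy for unpopulated bins, for example falling back to the single ideal value.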
Conformation-Dependent Angles are More Accurate
A variety of simple control calculations can be carried out to show that the CDL is an improvement over the single-value paradigm (EH values) and even context-dependent values derived from molecular mechanics (MM) force fields. Because an MM force field allows bond angles and lengths to vary with conformation, it could in theory provide better conformation-dependent values than the empirical approach.
As one simple assessment, we compared how well the NCαC values in a 1.15 Å ribonuclease structure (PDB code 1rge; Sevcik et al., 1996) matched with EH values, the CDL, and bond-angle values from the structure after minimization using a molecular mechanics force field (see Experimental Procedures). As seen in , the conformation-dependent library outperforms both the single ideal value and molecular mechanics. Importantly, the CDL produces more angles with very close (<1°) agreement with the reference crystal structure as well as fewer extremely large deviations. In terms of modeling accuracy, there appears to be no downside to using the CDL.
Figure 7 CDL NCαC values match ultrahigh-resolution structures best. Deviations of predicted angles from the experimental ones for atomic-resolution ribonuclease (PDB code 1rge; Sevcik et al., 1996) with various methods are shown: EH single ideal values …
For a more thorough comparison of the CDL with EH values, we compared how well each matched the NCαC values for the set of protein structures used to generate the CDL, with each protein jackknifed out during its comparison. Averaged over the whole data set, the median deviation from the native bond angles for the EH single-value paradigm was 1.5°/residue and the median deviation for the CDL dropped to 1.1°/residue. This amounts to an improvement of ~25% in NCαC accuracy relative to the old paradigm.
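A leave-one-protein-out comparison of this kind could be organized as in the sketch below. It assumes each observation has already been assigned to a Φ,Ψ bin and that one median is taken over all residues; the function name, the input layout, and the pooling are assumptions for illustration, since the exact averaging procedure is not spelled out in the text.

```python
from collections import defaultdict
from statistics import median

def jackknife_deviations(observations, ideal_value):
    """Compare bin-specific and single-value targets with each protein held out.

    observations: list of (protein_id, bin_key, angle_deg) tuples.
    Returns (median deviation from jackknifed bin means,
             median deviation from the single ideal value).
    """
    totals = defaultdict(lambda: [0.0, 0])    # bin -> [sum, count], all proteins
    own = defaultdict(lambda: [0.0, 0])       # (protein, bin) -> [sum, count]
    for pid, key, angle in observations:
        totals[key][0] += angle
        totals[key][1] += 1
        own[(pid, key)][0] += angle
        own[(pid, key)][1] += 1

    binned_dev, single_dev = [], []
    for pid, key, angle in observations:
        # Remove this protein's own contribution before predicting its residues.
        held_sum = totals[key][0] - own[(pid, key)][0]
        held_n = totals[key][1] - own[(pid, key)][1]
        if held_n == 0:
            continue  # bin populated only by the held-out protein
        binned_dev.append(abs(angle - held_sum / held_n))
        single_dev.append(abs(angle - ideal_value))
    return median(binned_dev), median(single_dev)
```

Residues whose bin is populated only by their own protein are skipped in both lists, so the two medians are computed over the same set of residues.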
To understand the impact this difference could have upon protein modeling, coordinates for each jackknifed structure were rebuilt from torsion and bond angles using EH or CDL values. Holmes and Tsai (2004) have shown that the replacement of experimental bond angles with ideal ones while holding Φ and Ψ fixed shifts coordinates by an average of 6 Å (unnormalized by protein length), and this limits model-building accuracy. Here, applying the same approach, we find that the median Cα RMSD (normalized to the length of a 100-residue protein) from the native structure for EH values was 3.23 Å, and for CDL values it was 2.85 Å. The CDL produced a significant improvement in the Cα RMSD of ~0.4 Å over the old single-value paradigm.
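The text does not state which length normalization was applied. One widely used choice is the RMSD100 of Carugo and Pongor, RMSD100 = RMSD / (1 + ln √(N/100)), and the sketch below assumes that formula purely for illustration. It also assumes the two coordinate sets are already superimposed and listed in the same residue order; the function names are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

def rmsd100(rmsd_value, n_residues):
    """Scale an RMSD to the value expected for a 100-residue chain (assumed formula)."""
    denominator = 1.0 + math.log(math.sqrt(n_residues / 100.0))
    if denominator <= 0.0:
        raise ValueError("chain too short for this normalization")
    return rmsd_value / denominator
```

Under this normalization a 100-residue chain is unchanged, longer chains are scaled down, and shorter ones are scaled up, which makes medians over proteins of different sizes comparable.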
Potential Applications: Crystallographic Refinement and Homology Modeling
To assess the potential impact of accounting for Φ,Ψ-dependent variations upon X-ray crystal structures at various resolutions, we evaluated how much the experimental NCαC values deviated from those in the CDL as a function of resolution (). To avoid bias, none of the structures used in the survey were used in the generation of the CDL. Analysis of the data shows that for structures solved at near 1 Å resolution, the RMSD of NCαC from the CDL is ~1.6°. This matches well with the standard deviation seen in the CDL for this angle and serves as an effective validation of the CDL. Additionally, the small standard deviation of the RMSDs at this resolution shows that each of the individual high-resolution structures is well-described by the CDL. Already at a resolution of 1.5 Å, normally considered very high resolution, the match of NCαC values to the CDL is nearly twice as poor as for the 1.0 Å resolution structures. This loss of accuracy became steadily more pronounced in lower-resolution structures, rising to nearly 4° at 3.0 Å resolution. We conclude that by using the CDL, high-, medium-, and low-resolution structures could all be improved. We suspect that at resolutions worse than 3 Å, the CDL would have less impact because dihedral angles would be less reliable.
Figure 8 NCαC deviation of the CDL values from crystal structures as a function of resolution of the analysis. At each of five resolutions ranging from 1.0–3.0 Å, the NCαC RMSDs from the CDL were calculated for 50 nonredundant structures. …
To understand the potential benefit of accounting for Φ,Ψ-dependent geometry variations in predictive modeling of protein structure, we carried out a test using the Rosetta modeling program (Rohl et al., 2004). A standard control calculation for homology modeling is to ask how far a crystal structure moves from the experimental structure when minimized by the force field. This provides a lower limit on how accurately a structure can be predicted (e.g., Bradley et al., 2005). For our test, we performed a series of 100 Monte Carlo energy minimizations starting with different random seeds using both native and “ideal” bond geometries for two ultrahigh-resolution protein structures: ribonuclease chain A at 1.15 Å resolution (PDB code 1rge; Sevcik et al., 1996; ) and the PDZ domain of syntenin at 0.73 Å (PDB code 1r6j; Kang et al., 2004; data not shown). “Native” geometry refers to the bond lengths and angles as seen in the crystal structure. As seen in , minimizations using the “native” bond geometry instead of the idealized geometry resulted in better convergence (tighter grouping) and allowed the minimized structure to be about 30% closer to the true structure (~0.6 Å vs ~0.9 Å). One notable feature is that the improved behavior occurs despite the force field’s optimization based on the traditional “ideal” geometry values. We conclude from this that the use of the rigid-geometry approximation with standard single ideal values limits modeling accuracy substantially. Thus, it is worthwhile to adapt modeling programs to account for the new conformation-dependent geometry paradigm.
Figure 9 Energy minimization behaves better using experimental geometry as opposed to the rigid-geometry approximation. (A) Shown are 100 trials minimized with experimental (squares) and with EH (triangles) geometries. They are plotted as Rosetta energy versus …
To pinpoint exactly where in the structure the improvements occurred, we calculated the deviations between the crystal structure and the energy-minimized structures using native versus ideal geometry (). As an indication of the variation that can occur for this protein in two environments, the deviations with chain B from the same structure are also shown. The largest differences between EH and experimental geometry occur in loops rather than regular secondary structure (). This meets the expectation that the largest systematic deviations from single ideal values should occur in parts of the protein with less commonly observed, more diverse Φ,Ψ values: the most highly populated regions dominate the global averages, resulting in the illusion of single ideal values assumed in EH, whereas more conformationally diverse, less populated regions contribute less to the global average. Importantly, the two loops that were highly improved by using experimental geometry are at the active site of the protein, so the accuracy with which they are modeled would significantly affect the ability of this mock homology model to provide insight.
The studies here show that the dependence of backbone geometry on conformation is unmistakably real, significant, and systematic, and it has a rational structural basis. These systematic distortions in covalent geometry additionally explain how some conformations are accessible to amino-acid residues despite being theoretically disallowed by modeling based on single ideal values for backbone geometry. Extending these studies to the conformation dependence of the ω and χ1 torsion angles will be described elsewhere. The conformation-dependent library we derived from the database represents the first step toward implementing the new paradigm of “ideal-geometry functions.” With its much-improved agreement to ultrahigh-resolution crystal structures, the ideal-geometry functions provide an intellectually satisfying resolution to the debate among crystallographers as to what ideal values should be used during refinement. Also, because the ideal-geometry functions captured in the CDL are simply a highly enlarged set of immutable ideal values, their use in the place of single ideal values represents no increase in algorithmic complexity. Use of the CDL thus offers the potential for improved modeling accuracy in a wide variety of experimentally based and predictive modeling applications without increasing the risk of overfitting.