Conventional protein structure determination from nuclear magnetic resonance data relies heavily on side-chain proton-proton distances. The necessary side-chain resonance assignment, however, is labor intensive and prone to error. Here we show that structures can be accurately determined without NMR information on the sidechains for proteins up to 25 kDa by incorporating backbone chemical shifts, residual dipolar couplings, and amide proton distances into the Rosetta protein structure modelling methodology. These data, which are too sparse for conventional methods, serve only to guide conformational search towards the lowest energy conformations in the folding landscape; the details of the computed models are determined by the physical chemistry implicit in the Rosetta all atom energy function. The new method is not hindered by the deuteration required to suppress nuclear relaxation processes for proteins greater than 15 kDa, and should enable routine NMR structure determination for larger proteins.
The first step in protein structure determination by NMR is chemical shift assignment for the backbone atoms. In contrast to the subsequent assignment of the sidechains, this is now rapid, reliable, and largely automated (1–5). Global backbone structural information, complementing the local structure information provided by backbone chemical shift assignments (6, 7), can be obtained from HN-HN NOESY, residual dipolar coupling (RDC) (8), and other (9, 10) experiments. For larger proteins, deuteration becomes necessary to circumvent the efficient spin relaxation that results from their longer rotational correlation times (11, 12), but removing protons also eliminates long-range NOESY information from sidechains, except for selectively protonated sidechain moieties (13). The difficulty in determining accurate structures with no or limited sidechain information is a major bottleneck that currently prevents routine application of NMR to larger (> 15 kDa) systems (14).
Here we show that structures of proteins up to 200 residues (23 kDa) can be determined using information from backbone (HN, N, Cα, Cβ, C') NMR data by taking advantage of the conformational sampling and all atom energy function in the Rosetta structure prediction methodology, which for small proteins in favorable cases can produce atomic accuracy models starting from sequence information alone (15). Structure prediction in Rosetta proceeds in two steps: first, a low resolution exploration phase using Monte Carlo fragment assembly and a coarse-grained energy function, and second, a computationally expensive refinement phase that cycles between combinatorial sidechain optimization and gradient-based minimization of all torsional degrees of freedom in a physically realistic all-atom force field (15). The primary obstacle to Rosetta structure prediction from amino acid sequence information alone is conformational sampling: native structures almost always have lower energies than non-native conformations, but are very seldom sampled in unbiased trajectories. Incorporating NMR chemical shift information in the selection of the fragments used in the exploration phase (CS-Rosetta) (16, 17) provides a robust approach to determining accurate structures of small (< 100 residue) proteins using only backbone and 13Cβ chemical shift data. For larger (> 12 kDa) proteins, the performance of CS-Rosetta is very target dependent: structures sufficiently close to the native structure for the energy to drop significantly may be generated rarely or not at all.
We investigated whether RDC data, which provide long-range information on the relative orientations of bond vectors, can guide the low resolution search closer to the native structure and overcome the sampling problem for larger (100–200 residue) proteins. For every attempted Monte Carlo move, the alignment tensor is calculated by singular value decomposition (18), and the decision to accept or reject the conformation is biased by the change in the agreement between the back-calculated and experimental couplings (19). Incorporation of RDCs dramatically improved convergence on the correct structure in a benchmark of 11 alpha, beta and alpha/beta proteins ranging in size from 62 to 166 residues (Figs 1, S1 and Table 1). As indicated in Table 1, CS-RDC-Rosetta consistently generates accurate models for proteins up to 120 residues, and in favorable cases for larger proteins.
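The RDC-biased acceptance step described above can be illustrated with a minimal sketch. It assumes the standard linear relation between an RDC and the five independent elements of a traceless, symmetric Saupe matrix, solved through an SVD pseudo-inverse, followed by a Metropolis test on a sum-of-squares penalty. The function names, the `d_max` scaling constant, the penalty form, and the `temperature` parameter are illustrative assumptions and are not taken from the Rosetta source.

```python
import numpy as np


def design_matrix(bond_vectors):
    """One row per bond vector; columns multiply (Sxx, Syy, Sxy, Sxz, Syz)."""
    v = np.asarray(bond_vectors, dtype=float)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)  # unit vectors
    x, y, z = v[:, 0], v[:, 1], v[:, 2]
    # Szz = -(Sxx + Syy) has been substituted, which gives the (.. - z*z) terms.
    return np.column_stack([x * x - z * z, y * y - z * z,
                            2 * x * y, 2 * x * z, 2 * y * z])


def fit_alignment_tensor(bond_vectors, rdcs, d_max=1.0):
    """Least-squares Saupe elements via an SVD pseudo-inverse.

    d_max is a hypothetical scaling constant for the coupling type.
    Returns the five tensor elements and the back-calculated couplings.
    """
    a = design_matrix(bond_vectors)
    d = np.asarray(rdcs, dtype=float) / d_max
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    keep = s > s.max() * max(a.shape) * np.finfo(float).eps
    s_inv = np.zeros_like(s)
    s_inv[keep] = 1.0 / s[keep]  # drop numerically null directions
    saupe = vt.T @ (s_inv * (u.T @ d))
    return saupe, d_max * (a @ saupe)


def rdc_penalty(back_calculated, experimental):
    """Sum of squared deviations (an assumed form of the agreement score)."""
    diff = np.asarray(back_calculated) - np.asarray(experimental)
    return float(diff @ diff)


def metropolis_accept(old_score, new_score, temperature, rng):
    """Accept improvements always, worse moves with Boltzmann probability."""
    delta = new_score - old_score
    if delta <= 0:
        return True
    return bool(rng.random() < np.exp(-delta / temperature))
```

In a full protocol the RDC penalty would be added to the coarse-grained energy before the acceptance test; here it is scored on its own to keep the example short.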
For proteins with over 120 residues, conformational sampling becomes limiting even for the CS-RDC-Rosetta protocol and the low energy ensemble is not always close to the native structure. To further focus sampling, we developed an iterative refinement protocol that incorporates assigned backbone HN-HN NOEs in addition to backbone RDCs. As in the previously described “rebuild and refine” protocol, a pool of diverse low energy conformations is maintained and the highest energy structures in the pool are periodically replaced with offspring(20). The new protocol, a genetic algorithm, generates hybrid conformations by recombining first beta sheet pairings and subsequently fragments of the low energy structures (see methods). To further enhance sampling, trajectories are seeded with conformations harvested from previous trajectories that led to low energy conformations(21).
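The pool maintenance idea in this paragraph can be sketched as a generic steady-state genetic algorithm. This is a toy illustration under stated assumptions, not the authors' protocol: `Model.features` is a hypothetical stand-in for strand pairings or fragments, `recombine` is a simple single-point crossover, and offspring displace the highest-energy pool members only when they score better.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    features: tuple  # hypothetical stand-in for strand pairings / fragments
    energy: float


def recombine(parent_a, parent_b, rng):
    """Single-point crossover between two parents' feature tuples."""
    cut = rng.randrange(1, len(parent_a.features))
    return parent_a.features[:cut] + parent_b.features[cut:]


def evolve_pool(pool, score, n_offspring, rng):
    """One generation: create hybrids, then keep the lowest-energy models.

    The pool size is preserved, so the highest-energy members are the ones
    displaced whenever offspring score better.
    """
    children = []
    for _ in range(n_offspring):
        parent_a, parent_b = rng.sample(pool, 2)
        features = recombine(parent_a, parent_b, rng)
        children.append(Model(features, score(features)))
    merged = sorted(pool + children, key=lambda m: m.energy)
    return merged[:len(pool)]


def toy_score(features):
    """Toy energy: lower is better, minimum at all zeros."""
    return float(sum(v * v for v in features))


if __name__ == "__main__":
    rng = random.Random(0)
    pool = [Model(f, toy_score(f))
            for f in (tuple(rng.randint(-5, 5) for _ in range(6))
                      for _ in range(8))]
    for generation in range(5):
        pool = evolve_pool(pool, toy_score, n_offspring=16, rng=rng)
        print(generation, min(m.energy for m in pool))
```

Seeding new trajectories from earlier low energy conformations, as the text describes, would correspond to building the initial `pool` from a previous run rather than from random models.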
The improvement in the model population with increasing generations in the iterative protocol is illustrated in Figure 2 for the 200-residue ALG13 protein using experimentally determined chemical shift, RDC, and assigned backbone amide HN-HN NOE data(22). The Cα-RMSD to the native structure and the energy improve from generation to generation, and after several rounds, discrimination towards lower RMSD structures is apparent (Figure 2a, cyan to yellow). After high resolution refinement (Figure 2a, orange to red), the lowest energy structures are close to the native structure. The final low energy structural ensemble (Figure 2b) recapitulates the unusual topology in the previously determined NMR structure(22) (Figure 2d) to within 3.4 Å RMSD (Table 1). The Rosetta ensemble fits independent RDC data as well as the NMR structure, and the backbone variation in the ensemble is correlated with backbone dynamics as probed by the R1 relaxation rate (Figure 2c). The iterative-CS-RDC-NOE-Rosetta models of ALG13 thus appear to be comparable in quality to the previously published structure that required substantial effort, including preparation of selectively methyl and aromatic-protonated samples(22).
The iterative-CS-RDC-NOE protocol was tested further on 12 proteins with sizes ranging from 120 to 266 residues (Table 1 and Fig S3). For all proteins but 1g68, a considerable part of the structure converges (Table 1). Backbone HN-HN NOE data were required for convergence of 2z2i, 1i1b, arf1, 2rn2 and 1sua but not for 5pnt, 1s0p, 1f21 and er553. The RMSDs to the native structures over the converged regions range from 1.7 to 4.3 Å, with the exceptions of 1sua and 1f21. For 1f21, high accuracy (1.6 Å) was reached for a 92 residue subset (Fig S3). Sidechain accuracy was generally quite high in the converged regions (Fig S5).
We carried out a blind test of the new methods on five data sets generated in the Northeast Structural Genomics (NESG) Center before conventional NMR structures were determined. For four of the proteins, the CS-RDC protocol converged (Figure 3a–d), while for the fifth, convergence was not observed and blind structure determination was instead carried out using the iterative CS-RDC-NOE protocol (Figure 3e). In all five cases (Table 1), the resulting Rosetta-determined structure is very similar to the conventionally determined NMR solution structure over both the backbone (Figure 3, left panels) and the core sidechains (Figure 3, right panels), which is notable because no experimental sidechain information is used in the Rosetta protocol; the details of core packing are determined by the Rosetta all atom energy function.
Thus, our methodology is able to generate accurate structures of proteins up to ~25 kDa from sparse NMR data without side-chain assignments. To be useful in practice, it is important that there be a means of assessing the reliability of the computed models. Cross-validation with independently collected data is an excellent way to do this, but truly independent data may not always be available, and if the available data are already sparse, it may not be possible to remove a subset for independent validation.
Our approach to structure validation is based on the interplay between the two contributing sources of structural information—the detailed physical chemistry implicit in the Rosetta all atom energy function, and the experimental NMR data. As illustrated in Figure 4a, the all atom energy landscape (black) is rugged with many local minima, making optimization difficult. The experimental bias based on backbone NMR data (red), although smoother, is degenerate and lacks resolution. Since the constrained minimization of a function will almost always result in higher function values than unconstrained minimization, NMR data constrained optimization in general should result in higher energy structures than bias-free optimization (arrow 1 in Figure 4a). This scenario may hold for traditional structure determination in which the search is almost completely driven by the experimental data. However, if the two sources of information are in concordance, the bias from the experimental data can have two favorable effects (Figure 4b)—first, optimization far from the native minimum is impeded, resulting in an upward shift of the energy of non-native structures (arrow 1), and second, optimization near the native minimum is improved as the data guide the search towards the global minimum (Figure 4a–b, arrow 2).
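The argument in this paragraph can be written compactly. The notation below is introduced here for illustration and is not part of the original text: $X$ is the conformation space, $E$ the all atom energy, $B \ge 0$ the penalty for disagreement with the NMR data, $\lambda > 0$ its weight, and $\hat{x}_0$, $\hat{x}_\lambda$ the structures actually returned by an incomplete search without and with the bias.

```latex
\begin{align*}
x^{*} &= \arg\min_{x \in X} E(x),
\qquad
x_{\lambda} = \arg\min_{x \in X} \bigl[ E(x) + \lambda B(x) \bigr] \\
% Exhaustive optimization: adding the bias can never lower the energy.
E(x_{\lambda}) &\ge E(x^{*}) \\
% Incomplete search: both runs stay at or above the global minimum,
E(\hat{x}_{0}) &\ge E(x^{*}),
\qquad
E(\hat{x}_{\lambda}) \ge E(x^{*}) \\
% but the biased run can still come out lower when B is small near x^{*}.
E(\hat{x}_{\lambda}) &< E(\hat{x}_{0})
\quad \text{is possible if } B(x) \approx 0 \text{ for } x \text{ near } x^{*}
\end{align*}
```

The last line corresponds to arrow 2 in Figure 4b: a lower energy in the data-guided run signals that the data and the energy function favor the same region of conformation space.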
The scenario illustrated in Figure 4b is unlikely to occur if there is no sampling near the correct structure: the experimental data and the energy function will almost never independently favor the same incorrect structure. Hence, we propose the following three criteria for evaluating the reliability of a calculated structure (Table 1, columns 6–8). First, the calculation should converge: the lowest energy conformations should be very similar to each other over a large fraction of the structure. For both the CS-RDC-Rosetta and the iterative protocol, whenever the calculation converged for more than 60% of the structure, the RMSD to native over this region was less than 4 Å (Table 1, column 6). Second, the converged structures should be clearly lower in energy than all significantly different (RMSD greater than 7 Å) structures; this was the case for nearly all of our test cases (Table 1, column 7). Third, the structures generated with experimental data should be at least as low in energy as those generated without experimental data; for none of the successful calculations does the energy increase significantly when the experimental data are included in the optimization (Table 1, column 8). For larger proteins (> 120 residues), the data in fact guide the trajectories to lower energy structures than those obtained by unconstrained optimization (Figure 4d and Table 1, column 8); as argued above, this is a strong indicator that the correct structure has been found.
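The three criteria can be expressed as a small checklist function. This is a minimal sketch under stated assumptions: the 60% convergence fraction and the 7 Å cutoff come from the text, while `near_rmsd`, `min_gap`, `tolerance`, the use of the mean energy of the converged cluster, and all function and field names are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ValidationReport:
    converged: bool           # criterion 1
    energy_gap: bool          # criterion 2
    data_lowers_energy: bool  # criterion 3

    @property
    def reliable(self):
        return self.converged and self.energy_gap and self.data_lowers_energy


def assess(converged_fraction, models, best_energy_without_data,
           min_fraction=0.6, near_rmsd=2.0, far_rmsd=7.0,
           min_gap=0.0, tolerance=0.0):
    """Evaluate the three reliability criteria for a data-guided run.

    models: (energy, rmsd_to_lowest_energy_model) pairs from the run that
    used experimental data. near_rmsd, min_gap and tolerance are assumed
    parameters with no counterpart in the original text.
    """
    near = [e for e, r in models if r <= near_rmsd]
    far = [e for e, r in models if r > far_rmsd]

    # 1. Lowest energy models agree over a large fraction of the structure.
    converged = converged_fraction > min_fraction

    # 2. Converged cluster is lower in energy than all distant structures.
    energy_gap = bool(near) and (not far or min(far) - mean(near) > min_gap)

    # 3. Including the data does not raise the best energy reached.
    best_with_data = min(e for e, _ in models)
    data_lowers_energy = best_with_data <= best_energy_without_data + tolerance

    return ValidationReport(converged, energy_gap, data_lowers_energy)
```

For example, `assess(0.85, [(-310.0, 0.0), (-305.0, 1.2), (-280.0, 8.5)], best_energy_without_data=-295.0).reliable` evaluates to `True`, whereas raising the data-free energy floor to `-330.0` makes the third criterion fail.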
For the twenty proteins in our test set, whenever all three criteria were satisfied, the low energy ensemble resembled the independently determined structures. Importantly, the one clear structure calculation failure, 1f21, which converged to a wrong conformation with an RMSD of 9.4 Å to the native structure, fails the third criterion: the energy is higher rather than lower when the experimental data are included in the optimization (Figure 4c and Table 1, column 8). Since we had only one such failure, we simulated additional failures by deleting all near-native structures from the model populations and computed the three metrics described above for these 'fake' minima (Table S1, cf. Methods). For almost all the proteins, these constructed pathological cases again fail the third criterion: they have higher energies in the experimentally biased optimization.
For the proteins in our set in the ~30 kDa molecular weight range, the computed structures are not completely converged and have large disordered regions. This is clearly a sampling problem, since the native structure has lower energy (Figures 4d, S3); even with the NMR data as a guide, Rosetta trajectories fail to sample very close to the native state. Increased convergence on the low energy native state can be achieved either by collecting and utilizing additional experimental data (1i1b_2 in Fig S3) or by improved sampling. While at present the former is the more reliable solution, the latter will likely become increasingly competitive as the cost of computing decreases and conformational search algorithms improve.
We have shown that accurate structures can be computed for a wide range of proteins using backbone-only NMR data. These results suggest a change in the traditional NOE-constraint-based approach to NMR structure determination (Suppl. Fig. S4). In the new approach, the bottlenecks of sidechain chemical shift assignment and NOESY assignment are eliminated, and more backbone information is collected instead: RDCs in one or more media, and a small number of unambiguous HN-HN constraints from 3D or 4D experiments, which restrict the number of possible β-strand registers. Advantages of the approach are that 1H,15N-based NOE and RDC data quality is relatively unaffected in slower tumbling larger proteins and that the analysis of resonance and NOESY peak assignments can be done in a largely automated fashion with fewer opportunities for error. The approach is compatible with the deuteration necessary for proteins greater than 15 kDa, and for larger proteins can be extended to include methyl NOEs on selectively protonated samples. The method should also enable a more complete structural characterization of transiently populated states (24), for which the available data are generally quite sparse.
Protein structures can be determined using the limited NMR information obtainable for larger proteins
We are thankful to the DoE INCITE award for providing access to the Blue Gene/P supercomputer at the Argonne Leadership Computing Facility and to Rosetta@home participants for their generous contributions of computing power. We thank Yang Shen and Ad Bax for fruitful discussions, Y. Janet Huang and Yeufeng Tang for their contribution during preliminary studies using sparse NOE constraints with CS-Rosetta, Sonal Bansal, Hsiau-wei Lee, and Yizhou Liu for collection of RDC data, Alexander Lemak for providing the CNS RDC refinement protocol, and the NESG consortium for access to other unpublished NMR data that has facilitated methods development. S.R., O.F.L., P.R., G.T.M. and D.B. designed research, S.R. designed and tested the CS-RDC-Rosetta protocol, O.F.L. designed and tested the iterative CS-RDC-NOE-Rosetta protocol, S.R., O.F.L. and D.B. designed and performed research for energy based structure validation, X.W and J.P analysed the ALG13 ensemble, J.A, G.L, T.R, A.E, M.K, T.S provided blind NMR datasets, S.R., O.F.L., P.R., G.T.M., and D.B. wrote the manuscript. This work was supported by the Human Frontiers of Science Program (to O.F.L.), by the National Institutes of Health grant GM76222 (to D.B), the HHMI, the National Institutes of General Medical Science Protein Structure Initiative program, grants U54 GM074958 (to G.T.M) and the Research Resource grant RR005351 (to J.P).