NMR chemical shifts provide important local structural information for proteins. Consistent structure generation from NMR chemical shift data has recently become feasible for proteins with sizes of up to 130 residues, and such structures are of a quality comparable to those obtained with the standard NMR protocol. This study investigates the influence of the completeness of chemical shift assignments on structures generated from chemical shifts. The Chemical-Shift-Rosetta (CS-Rosetta) protocol was used for de novo protein structure generation with various degrees of completeness of the chemical shift assignment, simulated by omission of entries in the experimental chemical shift data previously used for the initial demonstration of the CS-Rosetta approach. In addition, a new CS-Rosetta protocol is described that improves robustness of the method for proteins with missing or erroneous NMR chemical shift input data. This strategy, which uses traditional Rosetta for pre-filtering of the fragment selection process, is demonstrated for two paramagnetic proteins and also for two proteins with solid-state NMR chemical shift assignments.
Chemical shifts are key to protein NMR spectroscopy not only because they allow separate observation of each 1H, 13C, and 15N nucleus in the molecule, but also because they carry important information on the local conformation (Saito 1986; Spera and Bax 1991; Williamson and Asakura 1993; Williamson et al. 1995; Asakura et al. 1997; Ando et al. 1998; Cornilescu et al. 1999; Castellani et al. 2003; Neal et al. 2006), including secondary structure (Wishart et al. 1991), hydrogen bonding (Wagner et al. 1983; Shen and Bax 2007) and the position and orientation of aromatic rings (Haigh and Mallion 1979; Case 1995). Protein structural information derived from chemical shifts, such as the backbone φ/ψ torsion angles predicted by the program TALOS (Cornilescu et al. 1999), is widely used in NMR structure determination, but almost invariably as a complement to conventional NOE distance restraints or to internuclear distance restraints obtained by solid-state NMR. Recently, several computational approaches have been developed to use the NMR chemical shifts alone as input for protein structure generation (Cavalli et al. 2007; Gong et al. 2007; Shen et al. 2008; Wishart et al. 2008). These approaches, represented by CHESHIRE (Cavalli et al. 2007), CS-Rosetta (Shen et al. 2008) and CS23D (Wishart et al. 2008), match the experimental chemical shifts of the backbone and 13Cβ atoms, which are commonly available at the early stage of the conventional NMR structure determination procedure, to a structural database to identify protein fragments with similar chemical shifts. Because the structural database of proteins for which actual NMR assignments are available remains relatively small, empirical relationships (Cornilescu et al. 1999; Neal et al. 2003; Kontaxis et al. 2005; Shen and Bax 2007) are commonly used to “assign” chemical shift values to nuclei in proteins of known structure.
Selected protein fragments are then used as input for a fragment assembly procedure, which also aims to optimize empirical energy terms related to hydrogen bonding, hydrophobic packing, etc., to generate an all-atom protein structure. These approaches have been evaluated for over two dozen proteins with sizes of up to 15 kD and a wide variety of folds. For the vast majority, convergence is obtained, which then invariably yields all-atom protein models that compare well with experimental structures, with root-mean-square deviations (rmsd's) from the conventionally determined reference structure in the 0.7–2 Å range for the backbone atoms, and ~1.4–3 Å when considering all atoms. Structures generated by the CS-Rosetta procedure for nine structural genomics target proteins, prior to completion of the conventional NMR structure determination process (Shen et al. 2008), prove the procedure to be a viable alternative for small to medium-size proteins (Gryk and Hoch 2008).
To date, the chemical shift based structure determination methods have been evaluated for proteins with complete or nearly complete NMR chemical shift assignments. In practice, however, resonance assignments are often incomplete, and also may contain a small fraction of erroneous assignments. Often, a completeness of >80–90% of the backbone sequence-specific assignments makes it possible to obtain a sufficient number of side-chain resonance and NOE assignments for deriving a dense network of distance restraints, needed for the conventional NMR structure determination procedure. The present study investigates the impact of incomplete chemical shift assignments on the NMR chemical shift based CS-Rosetta protocol by using chemical shift assignments with various degrees of completeness or correctness, simulated by omission and/or modification of entries in the experimental chemical shift data. For cases where a substantial fraction of the chemical shifts is missing or in error and the standard fragment CS-Rosetta protocol is found to fail, a more robust hybrid fragment selection method is described which largely resolves this limitation.
In recent years, several viable routes to resonance assignment and structure determination of small globular proteins by solid-state NMR (ssNMR) have been demonstrated (Castellani et al. 2002; Igumenova et al. 2004; Siemer et al. 2005; Zech et al. 2005; Nadaud et al. 2007; Loquet et al. 2008; Manolikas et al. 2008), relying mostly on 13C-13C, 15N-13C, and/or indirectly measured 1H-1H distance restraints. Chemical shift assignments of ssNMR spectra typically are obtained by sophisticated two- and three-dimensional 13C-detected analogs of the widely used triple resonance J-connectivity experiments used in solution NMR. However, with few exceptions (Agarwal et al. 2006; Chevelkov et al. 2006), 1H resonance assignments are usually not determined when studying a protein structure by these methods. For a variety of technical reasons, spectral resolution obtained for small proteins by ssNMR is often lower than what can be obtained for such proteins in solution (Tycko 1996), resulting in increased signal overlap and a considerable fraction of missing resonance assignments. For cases where protein structures have been determined both by solution and by solid-state NMR methods, results are generally quite similar (Manolikas et al. 2008), and chemical shifts observed in the solid state generally agree well with those seen in solution (Igumenova et al. 2004; Zech et al. 2005). On the other hand, exceptions are often seen for residues involved in intermolecular contacts, i.e., surface-exposed residues, reflecting the different protein sample conditions. It is therefore interesting to evaluate to what extent the CS-Rosetta approach is applicable to proteins whose chemical shifts have been determined by solid-state NMR. Indeed, as we demonstrate for two small proteins, ubiquitin and GB3, CS-Rosetta yields good structural models when using solely the ssNMR chemical shift assignments as input.
A second challenging area, where often a considerable fraction of chemical shift assignments is missing, concerns paramagnetic metalloproteins. About 25% of all proteins in living systems contain metal ions (Andreini et al. 2004) and in many of these cases the metal is paramagnetic (Fe2+/3+, Cu2+, Co2+, Ni3+), where the presence of unpaired electrons causes very rapid transverse relaxation for nearby nuclei, interfering with use of the standard 1H-detected triple resonance assignment strategy (Ikura et al. 1990; Montelione and Wagner 1990). Although 13C-detected experiments can provide relief (Montelione and Wagner 1990; Bertini et al. 2005; Bermel et al. 2006), collection of 1H-1H NOE restraints remains problematic in the vicinity of paramagnetic centers. The degree of paramagnetic broadening scales with the inverse sixth power of the distance to the metal, resulting in a sphere with a radius of ca. 5–15 Å around the metal where assignments are missing. In addition, if protons are observed and assigned, their chemical shifts may contain paramagnetic pseudo-contact contributions, which are not easily accounted for in the absence of a known structure, and therefore can impact the molecular fragment search of the CS-Rosetta protocol in a manner similar to assignment errors. We will show, however, that the hybrid CS-Rosetta protocol is quite tolerant to these problems, and demonstrate its application to two small paramagnetic proteins of known structure.
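To give a feel for how steep the inverse sixth power dependence mentioned above is, the ratio of the paramagnetic broadening at two distances can be written out explicitly. The symbol for the paramagnetic contribution to the transverse relaxation rate is introduced here for illustration only, and the two distances are simply the limits of the range quoted in the text.

```latex
\[
\frac{R_2^{\mathrm{para}}(r_1)}{R_2^{\mathrm{para}}(r_2)}
  = \left(\frac{r_2}{r_1}\right)^{6},
\qquad
\left(\frac{15\ \text{\AA}}{5\ \text{\AA}}\right)^{6} = 3^{6} = 729 .
\]
```

Under this scaling alone, a nucleus at 5 Å from the metal experiences roughly 700 times more paramagnetic broadening than one at 15 Å, which is consistent with a fairly sharp boundary between observable and unobservable residues.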
In this work, the original complete experimental chemical shift assignments, including δ15N, δ13C’, δ13Cα, δ13Cβ, δ1Hα and δ1HN, for proteins MrR16 (90 residues; PDB code: 1YWX; 514 available chemical shifts from BMRB #6799) and TM1442 (110 residues; PDB code: 1SBO; 647 available chemical shifts from BMRB #5921) are used. The entries of the chemical shift assignments of each protein are regrouped and/or modified to create new datasets that simulate chemical shift inputs with various degrees of completeness and/or chemical shift errors. The CS-Rosetta protein structure generation protocol is carried out for these differently prepared chemical shift input datasets, following exactly the same computational procedures in each case. The impact of the incompleteness and/or incorrectness of chemical shift assignments on the CS-Rosetta procedure is evaluated both by monitoring the accuracy of the selected fragments and by the quality and convergence of the generated CS-Rosetta all-atom models.
Three groups of incomplete or partially erroneous chemical shift assignments were generated using the original (nearly complete) experimental chemical shift assignments of proteins MrR16 and TM1442. Details regarding the assignments of the two paramagnetic proteins, and the proteins studied by solid-state NMR, are also provided below.
Depending on the strategy used for backbone resonance assignment, chemical shift assignments for certain types of backbone nuclei may not be available. Table 1 lists the chemical shift datasets generated for MrR16 and TM1442 by omitting the entries of the experimental chemical shift assignments of up to four types of nuclei (represented by datasets Ia-Ii). Except for datasets Ig (containing δ15N, δ13Cα, δ13Cβ and δ13C’ for all residues) and Ii (δ13Cα and δ13Cβ), these datasets all include δ15N, δ13Cα and δ1HN, which constitute the minimum set of protein backbone chemical shifts required for conventional triple resonance assignment. Dataset Ig, containing only 13C and 15N chemical shifts, was generated to simulate a typical solid-state NMR chemical shift dataset. Dataset Ik contains no chemical shifts and is included so that the impact of chemical shifts can be compared against standard Rosetta fragment selection.
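The omission of whole nucleus types described above can be sketched in a few lines of Python. This is only an illustration: the dictionary layout, the atom labels and the function name `omit_atom_types` are assumptions made here, and the toy shift values are not taken from the BMRB entries.

```python
# Hypothetical in-memory layout: {(residue_number, atom_name): shift_ppm}.
ALL_ATOMS = ("N", "HN", "CA", "CB", "C", "HA")


def omit_atom_types(shifts, omitted):
    """Return a copy of `shifts` without entries whose atom type is in `omitted`."""
    omitted = set(omitted)
    unknown = omitted - set(ALL_ATOMS)
    if unknown:
        raise ValueError(f"unknown atom types: {sorted(unknown)}")
    return {key: ppm for key, ppm in shifts.items() if key[1] not in omitted}


# Toy shifts for two residues (illustrative values only).
toy = {
    (1, "N"): 120.1, (1, "HN"): 8.2, (1, "CA"): 56.3,
    (1, "CB"): 30.1, (1, "C"): 176.0, (1, "HA"): 4.3,
    (2, "N"): 118.7, (2, "HN"): 7.9, (2, "CA"): 58.0,
    (2, "CB"): 29.5, (2, "C"): 177.2, (2, "HA"): 4.1,
}

# In the spirit of dataset Ig: only 13C and 15N shifts remain.
ssnmr_like = omit_atom_types(toy, {"HN", "HA"})

# In the spirit of dataset Ii: only 13CA and 13CB shifts remain.
ca_cb_only = omit_atom_types(toy, {"N", "HN", "C", "HA"})
```

Returning a filtered copy rather than editing in place keeps the original assignment table intact, so every reduced dataset is derived from the same starting point.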
To simulate the situation of proteins with unassigned residues, for both MrR16 and TM1442 two sets of ‘incomplete’ chemical shift assignments were generated by omitting all chemical shifts (δ15N, δ13C’, δ13Cα, δ13Cβ, δ1Hα and δ1HN) for ~10% or ~20% of the residues from their original complete chemical shift datasets. Two different sets of partial chemical shift assignments were generated in this manner. First, a favorable but perhaps unrealistic set was generated where the unassigned residues are evenly distributed along the protein sequence by deleting chemical shifts of residue numbers N×10 (dataset IIa) or N×5 (dataset IIb), where N = 1,2,3,…. Second, two more realistic sets of partial assignments were generated where the unassigned residues are consecutive along the protein sequence, exemplifying the situation where residues of one or two segments in the protein are not assigned. Considering that such unassigned stretches of residues are often located in loop or turn regions, we arbitrarily selected such regions with a length of ca. 8–10% of the entire sequence, and removed their chemical shift assignments from the datasets. For MrR16, the deleted regions comprise two loops, residues 24–32 (between the second β-strand and the first α-helix, referred to as loop I) and 43–50 (which connects the first α-helix and the third β-strand, referred to as loop II); for TM1442, the two loops comprise residues 21–29 (between the third β-strand and the first α-helix, loop I) and 52–59 (between the fourth β-strand and the second α-helix, loop II).
For each protein, three chemical shift assignment datasets were generated and named as follows: dataset IIc, for which all assignments of loop I are omitted, simulating the situation that the residues in loop I are “unassigned”; dataset IId, for which the residues in loop II are “unassigned”; and dataset IIe, for which the residues in both loops I and II, comprising ~16–19% of the total number of residues in the protein, are “unassigned”.
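The residue selections behind datasets IIa to IIe can be written down compactly. The sketch below uses MrR16 (90 residues) and the loop boundaries quoted above; the data layout and the helper names `omit_residues`, `every_nth` and `segment` are assumptions introduced for illustration, not part of the CS-Rosetta software.

```python
def omit_residues(shifts, residues):
    """Return a copy of `shifts` with every entry of the listed residues removed.

    `shifts` is assumed to map (residue_number, atom_name) -> shift in ppm.
    """
    dropped = set(residues)
    return {key: ppm for key, ppm in shifts.items() if key[0] not in dropped}


def every_nth(n_residues, step):
    """Residue numbers N*step (N = 1, 2, 3, ...) that fall within the sequence."""
    return list(range(step, n_residues + 1, step))


def segment(first, last):
    """Consecutive residue numbers from `first` to `last`, inclusive."""
    return list(range(first, last + 1))


MRR16_LENGTH = 90
mrr16_sets = {
    "IIa": every_nth(MRR16_LENGTH, 10),        # evenly spaced, every 10th residue
    "IIb": every_nth(MRR16_LENGTH, 5),         # evenly spaced, every 5th residue
    "IIc": segment(24, 32),                    # loop I
    "IId": segment(43, 50),                    # loop II
    "IIe": segment(24, 32) + segment(43, 50),  # loops I and II
}

# Fraction of residues left without any chemical shift in each dataset.
fractions = {name: len(res) / MRR16_LENGTH for name, res in mrr16_sets.items()}
```

For MrR16 this gives 9, 18 and 17 unassigned residues for datasets IIa, IIb and IIe, respectively, in line with the ~10%, ~20% and ~16–19% figures in the text.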
In practice, various kinds of chemical shift assignment errors can occur during the protein resonance assignment process, resulting either from mistakes during automated resonance assignment or from human errors. In order to evaluate the impact of such “random” errors on the CS-Rosetta structure generation, several chemical shift assignment datasets were generated by swapping the chemical shift assignments for two dipeptides with identical amino acid types along the protein sequence: dataset IIIa, for which the chemical shift assignments of two dipeptides located in the same type of secondary structure were swapped (for MrR16, Val42-Leu43 and Val85-Leu86, both in α-helices; for TM1442, Ile16-Val17 and Ile47-Val48, both in β-strands); and dataset IIIb, for which the chemical shift assignments of two dipeptides in different secondary structures were swapped (for MrR16, Leu39-Val40 in the first α-helix and Leu50-Val51 in the third β-strand; for TM1442, Ser52-Ser53 in the loop between the fourth β-strand and second α-helix, and Ser82-Ser83 in the last β-strand).
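A dipeptide swap of this kind amounts to relabeling residue numbers in the assignment table. The following is a minimal sketch under the assumption that shifts are stored as a dictionary keyed by residue number and atom name; the function name `swap_dipeptides` and the toy values are hypothetical.

```python
def swap_dipeptides(shifts, first_a, first_b):
    """Exchange all shifts of residues (first_a, first_a + 1) with those of
    (first_b, first_b + 1); every other entry is copied unchanged.

    `shifts` is assumed to map (residue_number, atom_name) -> shift in ppm.
    """
    mapping = {
        first_a: first_b, first_a + 1: first_b + 1,
        first_b: first_a, first_b + 1: first_a + 1,
    }
    if len(mapping) != 4:
        raise ValueError("the two dipeptides must not overlap")
    return {
        (mapping.get(res, res), atom): ppm
        for (res, atom), ppm in shifts.items()
    }


# Toy 13CA shifts (illustrative values only) for the MrR16 dipeptides
# Val42-Leu43 and Val85-Leu86 of dataset IIIa, plus one unrelated residue.
toy = {
    (42, "CA"): 66.0, (43, "CA"): 58.0,
    (85, "CA"): 65.0, (86, "CA"): 57.5,
    (10, "CA"): 55.0,
}
swapped = swap_dipeptides(toy, 42, 85)
```

Because the two dipeptides have identical amino acid types, relabeling the residue numbers is sufficient; no atom names need to be translated.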
Chemical shift referencing errors also are common, and the resulting “artificial” chemical shift offsets are easily simulated by systematically altering chemical shifts of certain types of nuclei. Here, we evaluate two such datasets: IIIc, for which 1.0 ppm was added to all 13Cα and 13Cβ chemical shifts as the artificial chemical shift referencing error; and dataset IIId, for which 1.7 ppm was added to all 13Cα and 13Cβ chemical shifts.
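The referencing errors of datasets IIIc and IIId are uniform offsets applied to selected nucleus types, which can be sketched as follows. The dictionary layout and the name `add_offset` are assumptions made for this illustration, and the toy shifts are not experimental values.

```python
def add_offset(shifts, atoms, offset_ppm):
    """Return a copy of `shifts` with `offset_ppm` added to the listed atom types.

    `shifts` is assumed to map (residue_number, atom_name) -> shift in ppm.
    """
    atoms = set(atoms)
    return {
        key: (ppm + offset_ppm if key[1] in atoms else ppm)
        for key, ppm in shifts.items()
    }


# Toy shifts for a single residue (illustrative values only).
toy = {(1, "CA"): 56.0, (1, "CB"): 30.0, (1, "N"): 120.0}

dataset_iiic = add_offset(toy, {"CA", "CB"}, 1.0)  # 1.0 ppm 13CA/13CB offset
dataset_iiid = add_offset(toy, {"CA", "CB"}, 1.7)  # 1.7 ppm 13CA/13CB offset
```

Only the 13Cα and 13Cβ entries change; the 15N shift in the toy example is carried over untouched.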
The 15N, 13Cα, 13Cβ and 13C’ chemical shifts of GB3 and ubiquitin, as determined by ssNMR spectroscopy, were taken from the BMRB (accession codes 15283 (Nadaud et al. 2007) and 7111 (Manolikas et al. 2008), respectively). For both proteins, the high-resolution solution NMR structures (PDB entries 2OED (Ulmer et al. 2003) and 1D3Z (Cornilescu et al. 1998), respectively) were used as the experimental reference structures to evaluate the CS-Rosetta all-atom models.
The 15N, 13Cα, 13Cβ, 13C’, 1Hα and 1HN chemical shift assignments of two paramagnetic proteins, calbindin (75 residues; with a paramagnetic Yb3+ ion in the C-terminal metal binding site and Ca2+ in the N-terminal site) and ferredoxin (98 residues; with a [2Fe–2S] cofactor), were taken from BMRB entries 15594 (Barnwal et al. 2008) and 5148 (Muller et al. 2002), respectively. The experimental structure of calbindin is taken from a 1.6 Å X-ray structure (PDB entry 4ICB) of diamagnetic Ca2+-calbindin (Svensson et al. 1992); the NMR structure (PDB entry 1JQ4) of the [2Fe–2S] ferredoxin (Muller et al. 2002), for which the above NMR chemical shift assignments were obtained, is used as the experimental reference structure for this protein.
The newly extended Rosetta protein structural database, comprising a total of 9,523 proteins, was supplemented with predicted 13Cα, 13Cβ, 13C’, 15N, 1Hα and 1HN chemical shifts by the program SPARTA (Shen and Bax 2007). Then, for each 3-residue and 9-residue fragment in the query protein, selection of database fragment candidates was performed in two different ways.
200 fragment candidates with best matched backbone NMR chemical shifts and amino acid sequence patterns were selected by using a standard MFR search of the protein structural database (Kontaxis et al. 2005; Shen et al. 2008).
As indicated in Fig. 1, an exhaustive search was first conducted throughout the protein structural database by using the standard Rosetta method (Rohl et al. 2004) to find the 2,000 database fragments with the best matched amino acid sequence and sequence-derived secondary structure patterns. A second search was then performed on these 2,000 fragment candidates to select the 200 fragments with the best matched chemical shift pattern according to the chemical shift score defined in Eq. 3 of Shen et al. (2008). In that score, δi,j stands for the chemical shift of atom i (i = 13Cα, 13Cβ, 13C', 15N, 1Hα and 1HN) of residue j in the fragment; the experimental chemical shift in the target segment is compared with the SPARTA-derived chemical shift, taking into account the SPARTA-derived uncertainty, for the fragments in the protein structural database; N is the total number of chemical shifts in the fragment; and ci is the weighting factor for each atom type (1.0 for 13Cα, 13Cβ, 13C', 1Hα; 0.9 for 1HN and 15N). For all tests, proteins with significant sequence homology, as judged by a PSI-BLAST (Altschul et al. 1997) e-score < 0.05 to the target protein, were excluded from the protein structural database before fragment searching. Note that this removal is only needed for the tests carried out in this study; in real applications the presence of homologous proteins will increase the quality of the resulting structures.
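The two-stage logic of the hybrid selection can be illustrated with a short Python sketch. Several things here are assumptions rather than the published implementation: the exact score is Eq. 3 of Shen et al. (2008) and is not reproduced in this text, so a weighted, uncertainty-normalized mean squared deviation is used as a stand-in; the field names (`seq_score`, `predicted`, `sigma`), the convention that a lower sequence score is better, and the function names are all hypothetical. Only the atom-type weights and the 2,000/200 cutoffs come from the description above.

```python
# Atom-type weights quoted in the text.
WEIGHTS = {"CA": 1.0, "CB": 1.0, "C": 1.0, "HA": 1.0, "HN": 0.9, "N": 0.9}


def shift_score(experimental, predicted, sigma):
    """Stand-in chemical shift score (assumed form, lower is better).

    All arguments map (position_in_fragment, atom_name) -> value in ppm.
    Shifts missing from the experimental data simply do not contribute.
    """
    total, count = 0.0, 0
    for key, exp_ppm in experimental.items():
        if key not in predicted:
            continue
        deviation = (exp_ppm - predicted[key]) / sigma[key]
        total += WEIGHTS[key[1]] * deviation ** 2
        count += 1
    # With no shifts to compare, all candidates tie at infinity.
    return total / count if count else float("inf")


def hybrid_select(candidates, experimental, n_prefilter=2000, n_final=200):
    """Pre-filter by sequence-based score, then rank by chemical shift score."""
    by_sequence = sorted(candidates, key=lambda f: f["seq_score"])[:n_prefilter]
    # sorted() is stable, so ties keep their sequence-based order.
    by_shifts = sorted(
        by_sequence,
        key=lambda f: shift_score(experimental, f["predicted"], f["sigma"]),
    )
    return by_shifts[:n_final]


# Toy example with three one-residue "fragments" (illustrative values only).
sigma = {(0, "CA"): 1.0}
candidates = [
    {"name": "A", "seq_score": 0.1, "predicted": {(0, "CA"): 58.0}, "sigma": sigma},
    {"name": "B", "seq_score": 0.2, "predicted": {(0, "CA"): 56.0}, "sigma": sigma},
    {"name": "C", "seq_score": 0.9, "predicted": {(0, "CA"): 56.3}, "sigma": sigma},
]
best = hybrid_select(candidates, {(0, "CA"): 56.3}, n_prefilter=2, n_final=1)
```

In the toy example, candidate C matches the experimental shift exactly but never reaches the second stage because it fails the sequence-based pre-filter, so B is selected. When no experimental shifts are supplied, the stable sort leaves the sequence-based ranking unchanged, which mirrors the fallback to standard Rosetta fragment selection described in the text.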
The selected fragments, represented by their idealized backbone torsion angles and the secondary structure classification for each residue, were used in the standard Rosetta manner as inputs for a Monte Carlo assembly and relaxation process to generate ca. 10,000 Rosetta all-atom models for each protein. These all-atom models were further evaluated in terms of fitness with respect to the input chemical shift data, following the same procedure used in the standard CS-Rosetta protocol (Shen et al. 2008), contributing to the empirical energy term that is used for the selection of final all-atom models.
All CS-Rosetta structure generations were performed using Rosetta@home (http://boinc.bakerlab.org/rosetta/) supported by the BOINC server or the Biowulf PC/Linux cluster at the NIH (http://biowulf.nih.gov).
To evaluate the influence of the completeness of chemical shift assignments on the CS-Rosetta protein structure generation process, the following parameters are monitored and analyzed:
During the CS-Rosetta structure generation, the input chemical shifts serve two major functions: fragment selection and re-scoring of the Rosetta models (Fig. 1). Use of the chemical shift information during the fragment search process significantly increases the accuracy of selected fragments over the use of sequence information alone (Shen et al. 2008), and dramatically improves convergence of the structure generation process. Evaluation of the agreement between the final Rosetta-generated models and the input experimental chemical shifts also provides an important selection criterion for eliminating structures whose backbone angles have diverged from those of the original input fragments during the Rosetta optimization procedure. In practice, frequently not all chemical shifts (δ15N, δ13C', δ13Cα, δ13Cβ, δ1Hα and δ1HN) of all residues are available, depending on the resonance assignment strategy chosen and/or missing connectivities in the assignment pathway, most often resulting from conformational exchange on an intermediate time scale. The completeness of the chemical shift assignment will impact both the fragment selection and the re-scoring steps, and thereby the entire CS-Rosetta structure generation procedure. The impact of missing chemical shifts on each of these steps will be discussed below.
It is well recognized that secondary chemical shifts of different nuclei in any given residue are correlated (Supplementary Fig. S1), and this correlation can be used effectively to identify potential errors in chemical shift referencing (Wang et al. 2005). The structural information contained in the chemical shifts of the different types of backbone nuclei therefore may be partly redundant. The standard CS-Rosetta protocol utilizes chemical shifts of all backbone and 13Cβ atoms to select the best matched 3-residue and 9-residue fragments. This redundancy suggests that the absence of assignments for some of this set of six nuclei (N, HN, Cα, Hα, Cβ, C') may not significantly decrease the accuracy of the selected fragments. This issue will be evaluated below for the chemical shift combinations listed in Table 1.
Omission of a single type of chemical shift (δ13C', δ13Cβ or δ1Hα) is found to have very little adverse impact on the quality of selected fragments (Fig. S2A-C), either when using the regular MFR selection protocol or the hybrid method. There also appears to be little systematic difference in the accuracy of fragments selected with the regular MFR protocol or the hybrid method when using these sets of chemical shifts, although the individual sets of fragments selected by the two methods can differ substantially. This holds true both when considering the average backbone rmsd relative to the reference structure, and for the rmsd of the fragment most closely matching the reference structure (Fig. S2). In passing, we note that the moderate differences in the quality of the fragments are not easy to evaluate from figures such as S2, but these differences propagate during the Monte Carlo Rosetta structure generation process, dramatically impacting the yield of converged structures.
Although the accuracy of the fragments selected when omitting two types of chemical shifts (either δ13C'/δ13Cβ, δ13C'/δ1Hα, δ1Hα/δ13Cβ, or δ1HN/δ1Hα) decreases somewhat (Fig. S2D-G), this decrease is small compared to the variation in accuracy seen for different fragments along the sequence of the two proteins.
For MrR16, the quality of fragments obtained by using the chemical shift assignments of only 1HN, 15N, and 13Cα, or sets containing only δ13Cα and δ13Cβ is not much lower than for sets derived using more complete assignments (Fig. S2). As a result, the convergence of the CS-Rosetta structure generation process remains adequate and permits assembly of reasonable Rosetta models, albeit with raw Rosetta all-atom energies that are not as low as for structures obtained from using all six types of chemical shifts for fragment searching (Fig. S3). Similar results are obtained for TM1442 (Fig. 2).
Remarkably, even though the accuracy of the resulting structures decreases when just using 1HN, 15N, and 13Cα chemical shifts, or just 13Cα and 13Cβ chemical shifts, lowest energy structures remain close to the reference structure, in particular when the hybrid fragment selection method is used. A survey of the energies of the Rosetta-assembled structures and their accuracies (Fig. S3) indicates that the original MFR fragment selection results in higher yields during structure generation than the hybrid fragment selection method when assignments are relatively complete. However, for MrR16, the hybrid method outperforms the regular MFR method for datasets Id, If and Ih (Fig. S3); for TM1442 the hybrid method outperforms the regular MFR approach for datasets Ib, If and Ih (Fig. S4). For the case where no chemical shifts are available, only the standard Rosetta approach can be used. No convergence is then reached for MrR16, whereas for TM1442 the lowest energy models fall within 4 Å of the reference structure and relaxed convergence criteria are met (Fig. S5).
The calculations discussed above, and summarized in Figure 2 and Figures S2–S5, indicate that resonance assignments for all six types of nuclei are not required for success of the CS-Rosetta structure generation process. The order of importance of each type of chemical shift can be ranked as δ13Cα ~ δ13Cβ > δ1Hα ~ δ13C’ > δ15N ~ δ1HN. For proteins where all or the vast majority of these chemical shifts are available, the standard MFR fragment selection protocol tends to yield better accuracy of the selected MFR fragments and higher convergence, as well as lower energy when generating the all-atom Rosetta models. The calculations also suggest that the chemical shift assignment dataset needed for the CS-Rosetta protocol at a minimum comprises δ15N, δ1HN and δ13Cα, which also are the cornerstone nuclei during triple resonance backbone assignment, complemented by either δ13C', δ13Cβ or δ1Hα.
The standard MFR fragment selection procedure, implemented in the previously described CS-Rosetta protocol, relies primarily on the match between the experimental 13Cα, 13Cβ, 13C', 15N, 1HN and 1Hα secondary chemical shift values of each residue in any given 3- or 9-residue query fragment, and the SPARTA-generated secondary shift values for the corresponding residues in any fragment present in the structural database. The similarity in amino acid sequence is also used in this scoring process but carries a much weaker weighting. However, when most or all chemical shifts are missing for any given residue or group of residues, the relative importance of similarity in residue type increases and eventually becomes the only criterion when no chemical shifts are available at all. Clearly, the absence of chemical shift information leads to a decrease in accuracy of the fragments that can optimally be selected from any structural database (Shen et al. 2008).
For the relatively favorable situation, where residues with missing chemical shifts are distributed evenly throughout the protein sequence, the chemical shift patterns encoded in the 9-residue target fragments only sustain a small fractional loss in information content when a single residue in such a fragment is missing. Indeed, the quality of the MFR-selected fragment candidates for chemical shift assignments IIa and IIb (see Preparation of chemical shift datasets section) remains quite good (Fig. S6). For the 3-residue fragments, where the loss of assignments for one residue represents a 33% loss in information content, results are less favorable. In particular, when the backbone angles within the 3-residue fragment strongly differ from one another, i.e., when the fragment is not embedded in an α-helix or β-strand, results from the MFR search can be poor. For example, for the 3-residue TM1442 fragments containing residue Lys85 (an N-terminal helix capping residue), omission of its chemical shifts (dataset IIb) causes a large spike in the coordinate rmsd when using the regular MFR fragment search (Fig. S6B”). Nevertheless, because the adverse impact of lacking chemical shift assignments on the quality of the 9-residue fragments remains small, the Rosetta fragment assembly process remains capable of generating high quality models. This result applies for both MrR16 and TM1442 (Figs. S7 and S8), but for both proteins convergence to the correct structure is lower compared to using a complete set of chemical shift assignments.
A more realistic but also more challenging situation occurs when the unassigned residues cluster along the protein sequence. The MFR fragment selection then becomes dominated by residue type similarity between the query fragment and fragments present in the structural database. The accuracy of fragments that include such unassigned segments, selected by the standard MFR method, is severely affected (Fig. S6C,D), in particular when the missing assignments are located outside regions of secondary structure (datasets IIc and IId). Interestingly, the quality of these fragments tends to be much lower than what is achieved with the standard Rosetta fragment selection method (Fig. S5), highlighting that the simple residue similarity scoring used by the MFR method performs much worse than the far more elaborate Rosetta fragment selection protocol (Rohl et al. 2004). Unsurprisingly, the subsequent Rosetta structure assembly protocol, using standard MFR fragments as input, can fail to obtain a converged low-energy fold (Figs. S7, S8). On the other hand, for MrR16 the CS-Rosetta structure generation for dataset IIc, lacking assignments for residues 24–32, remains successful and finds a converged low-energy fold, where the backbone of the lowest energy model deviates by 1.8 Å from the experimental reference structure (Fig. S7). Even though the quality of 9-residue fragments encompassing this region with missing assignments is poor, the accuracy of the best 3-residue fragments selected remains quite good for this region, and the powerful combinatorial engine of Rosetta can exploit the presence of a relatively small subset of accurate fragments for this single region during the assembly process. For the case where two regions with missing assignments are present in the protein (dataset IIe), CS-Rosetta with standard MFR selection is no longer able to obtain converged low energy structures (Fig. 3; Figs. S7 and S8).
One way to improve the selection of suitable fragments, and thereby the CS-Rosetta structure generation process, for proteins with extended segments of missing chemical shift assignments is to take advantage of the standard Rosetta fragment selection procedure (Rohl et al. 2004), which searches for matched database fragments based on a relatively sophisticated procedure that simultaneously exploits residue type similarity and predicted secondary structure. Amino acid sequence similarity alone provides less structural information than the backbone chemical shifts, and therefore results in a wider distribution of selected peptide conformations. The average quality of Rosetta-selected fragments therefore is significantly lower than for MFR selection based on chemical shifts, but the quality of the best fragments (out of 200 selected) remains quite good, in particular for the 3-residue fragments (Shen et al. 2008). A preferred way to score the fragments therefore would directly combine, with suitable weight factors, the amino acid sequence based Rosetta fragment score with the chemical shift component of the MFR score. For technical reasons, however, this is not easily accomplished and we therefore resort to a simpler protocol which takes advantage of the strengths of both approaches. This hybrid fragment selection procedure first uses standard Rosetta to select the 2,000 database fragments (out of over 2,200,000) that are most compatible in terms of amino acid sequence, and then uses MFR chemical shift scoring to narrow down this set to the fragments that are most compatible with the experimental shifts. When complete chemical shifts are available, this hybrid method performs slightly worse than the regular MFR procedure (Figs. S2–S4). However, when significant segments in the protein lack assignments, the hybrid method remains successful at generating low energy, converged and accurate results.
For example, when using the ‘hybrid’ fragments selected with chemical shift datasets IIc-IIe, lacking chemical shifts for two extended loop regions, the Rosetta fragment assembly and relaxation protocol results in near-convergence for TM1442, yielding lowest energy models that are within 2.5 Å Cα rmsd relative to the reference structure (Fig. 3).
A potential error during conventional and/or automated backbone resonance assignments is the case where chemical shift assignments of two di- or tripeptide sequences of similar amino acid sequence, embedded between residues with similar chemical shifts, are accidentally interchanged. Below, we consider the case where assignments for two dipeptides with identical amino acid types are interchanged.
For the favorable situation where the two dipeptides are located in segments with the same secondary structure, as exemplified in dataset IIIa, the chemical shift patterns in the 3-residue and 9-residue fragments are virtually unchanged and there is essentially no adverse impact on the fragment selection, neither for the standard MFR nor the hybrid approach (Fig. S9). Clearly, generation of Rosetta structures also remains unaffected (Figs. S10 and S11).
For the case where the two misassigned dipeptides are engaged in different types of secondary structure, the incorrect chemical shift values are likely to favor selection of fragments with backbone torsion angles that deviate substantially from the true values, resulting in a significant decrease in the quality of MFR-selected fragments. This is particularly true for the 3-residue fragments (Fig. 4A, B; Fig. S9), where the fraction of erroneous assignments equals two thirds. Not surprisingly, the subsequent Rosetta fragment assembly and relaxation protocol has trouble generating well converged models. Although the lowest (re-scored) energy models exhibit folds that are essentially correct, these differ by ~2.56 Å and ~3.47 Å (Cα-rmsd) from the experimental structures of MrR16 and TM1442, respectively (Fig. 4C; Figs. S10 and S11). The re-scored all-atom energies of these models are also systematically higher than those obtained when using the correct chemical shift assignments.
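The stronger effect on 3-residue fragments follows from simple counting, which the following Python sketch makes explicit. The function name, the 60-residue chain length and the dipeptide positions are arbitrary choices for illustration; only the window lengths (3 and 9) come from the text.

```python
from typing import Iterable


def max_misassigned_fraction(n_residues: int,
                             misassigned: Iterable[int],
                             window: int) -> float:
    """Largest fraction of misassigned residues in any sliding window."""
    bad = set(misassigned)
    best = 0.0
    for start in range(n_residues - window + 1):
        count = sum(1 for i in range(start, start + window) if i in bad)
        best = max(best, count / window)
    return best


# Two interchanged dipeptides at hypothetical positions 10-11 and 40-41.
swapped = [10, 11, 40, 41]
frac_3 = max_misassigned_fraction(60, swapped, window=3)  # 2 of 3 residues
frac_9 = max_misassigned_fraction(60, swapped, window=9)  # 2 of 9 residues
```

A 3-residue window can consist of two thirds wrong shifts, whereas a 9-residue window that overlaps one swapped dipeptide still has seven correctly assigned residues to anchor the match.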
When using the hybrid fragment selection method, the impact of erroneous assignments is reduced considerably, and acceptable convergence is achieved (Fig. 4; Figs. S9–S11).
As pointed out by Wang et al. (2005), nearly 30% of the deposited chemical shift data in the BMRB have chemical shift referencing problems. Such referencing errors are most prevalent for 13Cα/13Cβ, but also are common for 13C' and 15N. Below, we evaluate the impact of 13Cα/13Cβ referencing errors. As will be shown, the fragment search procedure is relatively insensitive to moderate errors in 13Cα/13Cβ chemical shift referencing, in part because 13Cα and 13Cβ secondary shifts are anti-correlated. For example, a 4 ppm reference error could change a typical β-sheet secondary 13Cα shift of −1 ppm to an α-helical 3 ppm value. However, the +2 ppm β-sheet secondary 13Cβ shift would become +6 ppm, completely incompatible with a helical conformation, preventing the residue from being misidentified as helical. To first order, the impact of 13Cα/13Cβ referencing errors is small when both 13Cα and 13Cβ shift data are available, and manifests itself mainly as a steeper 13Cα/13Cβ chemical shift gradient when selecting fragments, and increased total energies when rescoring the energies of the Rosetta models.
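The arithmetic of the β-sheet example can be written out as a short Python check. The function name is hypothetical and the numbers are those quoted in the text; the point of the sketch is that a common offset cancels in the difference between the 13Cα and 13Cβ secondary shifts.

```python
from typing import Tuple


def apply_reference_offset(d_ca: float, d_cb: float,
                           offset: float) -> Tuple[float, float]:
    """Add the same referencing offset (ppm) to both secondary shifts."""
    return d_ca + offset, d_cb + offset


# Typical beta-sheet secondary shifts from the text: Ca = -1 ppm, Cb = +2 ppm.
before = (-1.0, 2.0)
after = apply_reference_offset(*before, offset=4.0)

# The Ca - Cb difference is unaffected by a common offset.
diff_before = before[0] - before[1]
diff_after = after[0] - after[1]
```

The shifted values are 3 ppm and 6 ppm, as stated in the text, while the difference stays at −3 ppm in both cases, so the pair of shifts still carries the β-sheet signature.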
The impact of 13Cα/13Cβ chemical shift referencing errors on CS-Rosetta structure generation was evaluated using the chemical shift assignment datasets IIIc and IIId. When a 1.0 ppm offset was added to δ13Cα/β (dataset IIIc), comparable to the average δ13Cα/β prediction errors (σ in Eq. 1) (Gong et al. 2007; Shen and Bax 2007), the accuracy of the selected fragments decreases slightly (Fig. S9), with a very small adverse impact on subsequent Rosetta structure generation (Figs. S10 and S11). The impact of chemical shift referencing errors appears to be insensitive to the type of fragment selection method used: for MrR16, standard MFR yields slightly better results (Fig. S10); for TM1442, the hybrid method is slightly favorable (Fig. S11).
When the δ13Cα/β offset error is increased to 1.7 ppm (dataset IIId), which corresponds to the approximate difference between δ13Cα/β values referenced to TMS and DSS (Wishart et al. 1995; Markley et al. 1998), convergence and accuracy of the resulting structures decrease noticeably (Figs. S10 and S11), but the folds remain essentially correct. However, when the offset error is increased to 2.7 ppm, fragment selection results are poor and no acceptable structures are obtained with the CS-Rosetta protocol (data not shown).
When the chemical shift referencing error affects only a single type of nucleus, e.g. 13Cα or 13C', an erroneous bias towards selection of helical or extended fragments can occur, resulting in poorer fragment quality and decreased performance of the CS-Rosetta protocol (results not shown). Even in these cases, 15N or 13C chemical shift referencing errors of up to 1 ppm have very little adverse effect on CS-Rosetta performance.
Chemical shift referencing errors readily can be detected by automated methods (Moseley et al. 2004; Wang et al. 2005). For this purpose, a script has been added to the CS-Rosetta package which applies reference error corrections when the referencing error exceeds the average uncertainty in the database chemical shifts (1.0 ppm for δ13Cα/β and δ13C'; 0.3 ppm for δ1Hα). These referencing corrections are based on the method described by Markley and coworkers (Wang et al. 2005), and correlations between (Δδ13Cα−Δδ13Cβ) and Δδ13Cα/β/Δδ13C'/ Δδ1H are shown in Fig. S1.
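The thresholding rule described above can be illustrated with a minimal Python sketch. It is not the script distributed with CS-Rosetta: the function name, the dictionary layout and the nucleus labels are hypothetical, the offsets are assumed to have been estimated beforehand by some external method, and the sign convention (offset = observed minus correctly referenced value) is an assumption. Only the threshold values are taken from the text.

```python
from typing import Dict, List

# Thresholds (ppm) quoted in the text; the key names are illustrative.
THRESHOLDS_PPM = {"CA": 1.0, "CB": 1.0, "C": 1.0, "HA": 0.3}


def correct_referencing(
    shifts: Dict[str, List[float]],
    estimated_offsets: Dict[str, float],
    thresholds: Dict[str, float] = THRESHOLDS_PPM,
) -> Dict[str, List[float]]:
    """Subtract an offset only when it exceeds the nucleus threshold."""
    corrected = {}
    for nucleus, values in shifts.items():
        offset = estimated_offsets.get(nucleus, 0.0)
        limit = thresholds.get(nucleus)
        apply = limit is not None and abs(offset) > limit
        corrected[nucleus] = [v - offset if apply else v for v in values]
    return corrected


# Made-up values: the CA offset exceeds 1.0 ppm, the HA offset is below 0.3 ppm.
observed = {"CA": [58.5, 60.0], "HA": [4.25]}
offsets = {"CA": 1.5, "HA": 0.25}
result = correct_referencing(observed, offsets)
```

Here the 13Cα values are shifted by 1.5 ppm while the 1Hα value is left untouched, because its estimated offset lies within the stated uncertainty.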
A situation similar to the chemical shift referencing problem discussed above can arise when chemical shifts are measured from TROSY spectra (Pervushin et al. 1998), when the displacement between the observed resonance frequency and the true chemical shift (1JNH/2 for δ15N and δ1HN) is not taken into account. However, considering that this error is much smaller than the standard error in the predicted database chemical shifts, no adjustment of the chemical shift values is required.
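To give a sense of scale, the TROSY displacement can be converted from Hz to ppm. The sketch below assumes a one-bond 1JNH coupling of about −93 Hz and a spectrometer operating at 600 MHz for 1H (about 60.8 MHz for 15N); these values are typical assumptions and are not taken from the text, and the function name is hypothetical.

```python
def trosy_displacement_ppm(j_nh_hz: float, larmor_mhz: float) -> float:
    """Magnitude of the |1J_NH|/2 displacement in ppm for one nucleus.

    larmor_mhz is the Larmor frequency of the observed nucleus in MHz,
    so that Hz divided by MHz gives ppm directly.
    """
    return abs(j_nh_hz) / 2.0 / larmor_mhz


J_NH_HZ = -93.0  # assumed typical one-bond coupling
shift_1h = trosy_displacement_ppm(J_NH_HZ, 600.0)   # 1HN dimension
shift_15n = trosy_displacement_ppm(J_NH_HZ, 60.8)   # 15N dimension
```

Under these assumptions the displacement is slightly below 0.08 ppm for 1HN and slightly below 0.8 ppm for 15N, and it decreases further at higher field.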
A larger apparent referencing error can result from deuteration effects (Venters et al. 1996; Gardner et al. 1997) on δ13Cα (with deuterium isotope shifts of −0.5 to −0.9 ppm) and δ13Cβ (−0.7 to −1.3 ppm). These isotope effects on the backbone chemical shifts are relatively uniform and mostly smaller than the 1 ppm referencing error discussed above. Although it is beneficial to apply uniform isotope shift corrections of +0.7 and +0.9 ppm to δ13Cα and δ13Cβ values, respectively, the absence of such corrections has little adverse impact on the performance of CS-Rosetta (data not shown). Nevertheless, a script has been added to the CS-Rosetta package which adjusts the δ13Cα and δ13Cβ chemical shifts by the residue-type-specific values reported by Cavanagh et al. (2007).
The backbone chemical shifts δ15N, δ13C', δ13Cα and δ13Cβ obtained by ssNMR for the proteins GB3 and ubiquitin were used as inputs for the CS-Rosetta structure generation protocols. For GB3, nearly complete ssNMR backbone chemical shift assignments, including 55 δ15N, 56 δ13C', 56 δ13Cα and 52 δ13Cβ, are taken from Nadaud et al. (2007). For the most part, these chemical shifts closely agree with values observed by solution NMR (Fig. S11). For ubiquitin, the ssNMR backbone chemical shift assignments taken from Igumenova et al. (2004) are ~90% complete, and include 65 δ15N, 65 δ13C', 67 δ13Cα and 63 δ13Cβ values, with no chemical shift assignments for residues 8–11. With the exception of several residues involved in intermolecular contacts, these chemical shifts also agree well with values observed in solution (Igumenova et al. 2004) (Fig. S12).
Except for the ubiquitin target fragments that involve the missing residues 8–11, the quality of fragments selected on the basis of ssNMR shift values is good, with little difference apparent between results from the standard MFR and the hybrid fragment selection method (Fig. 5). As expected based on the evaluations carried out above for regions with missing assignments, the regular MFR method fares poorly when selecting fragments that include residues 8–11, whereas the hybrid method shows no decrease in structural quality for this region.
Importantly, either selection method yields fragments from the ssNMR chemical shifts that suffice for generating converged, high quality all-atom models for both proteins (Fig. 5C,F). When the MFR method is used to select the fragments, the coordinate rms deviations for GB3 between the lowest energy model and the experimental solution NMR structure are 0.71 Å for the backbone atoms (N, Cα and C') and 1.28 Å for all non-hydrogen atoms. For ubiquitin these numbers are 0.69 and 1.22 Å. When the fragments are selected by the hybrid procedure, the coordinate rmsd's are slightly higher: 0.73 and 1.70 Å for backbone and all non-hydrogen GB3 atoms, respectively, and 0.86 and 1.49 Å for ubiquitin.
Considering the generally somewhat lower spectral resolution attainable by ssNMR compared to solution NMR, detailed structural studies of globular proteins by ssNMR have mostly remained restricted to relatively small systems, typically fewer than ~80 residues. Clearly, CS-Rosetta provides a powerful new complementary tool for generating structural models of such proteins once chemical shift assignments have been completed, without requiring the extensive internuclear distance information which sometimes can be difficult to obtain.
Two small paramagnetic proteins for which chemical shifts are available in the BMRB (Doreleijers et al. 2005) have been used to evaluate the applicability of CS-Rosetta to such systems: calbindin and ferredoxin. The backbone chemical shift assignments of calbindin, chelating a paramagnetic Yb3+ ion in its C-terminal metal binding site and Ca2+ in the N-terminal site, include 52 δ15N/δ1HN, 43 δ13C', 37 δ13Cα/δ1Hα and 33 δ13Cβ shifts, but no chemical shift assignments for residues 18 to 24 and 47 to 66; the completeness of the backbone chemical shift assignments is ~60% (Barnwal et al. 2008). The backbone chemical shift assignments of ferredoxin include 78 δ15N/δ1HN, 83 δ13C', 86 δ13Cα/δ1Hα and 78 δ13Cβ values, but no assignments for residues 41–50 and 80–82; the completeness of the backbone chemical shift assignments is ~80% (Muller et al. 2002).
With the absence of chemical shift assignments for long segments in each of these two proteins, the standard CS-Rosetta protocol, using MFR fragment selection, fails to converge for both proteins (Fig. S13). However, the hybrid fragment selection procedure performs much better, in particular for those target fragments involving the unassigned residues (Fig. 6A, B, D, E), permitting the structure assembly phase to be successful (Fig. 7). Interestingly, this improved performance does not result from recognition of the relatively common EF-hand and Fe-S metal-binding sites as, for testing purposes, proteins with a PSI-BLAST e-score <0.05 had been removed from the database. Subsequent manual evaluation of the 9-residue fragments covering the regions lacking chemical shifts showed the presence of six 9-residue fragments for calbindin segment 54–62, which were taken from EF-hand containing proteins that had escaped detection by the PSI-BLAST filter.
For both proteins, the Rosetta fragment assembly and relaxation procedure generates a number of good all-atom models, with the lowest energy models having backbone coordinates that differ by less than 2 Å from their respective reference structures when only including residues involved in secondary structure (Fig. 6C,F). Although the standard convergence criterion (the 10 lowest energy structures cluster within 2 Å of the lowest energy structure) is not met for either protein (Fig. S13), both structures are converged when this limit is relaxed to 3.3 Å.
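The convergence criterion lends itself to a compact check. The Python sketch below is an interpretation rather than the test used by CS-Rosetta: it assumes that each model is summarized by its energy and its precomputed rmsd to the lowest energy model, and that the lowest energy model itself (rmsd 0) counts among the 10. The function name and the example values are made up.

```python
from typing import List, Tuple


def is_converged(models: List[Tuple[float, float]],
                 n_lowest: int = 10,
                 cutoff: float = 2.0) -> bool:
    """models: (energy, rmsd in Angstrom to the lowest energy model)."""
    if len(models) < n_lowest:
        return False
    lowest = sorted(models, key=lambda m: m[0])[:n_lowest]
    return all(rmsd <= cutoff for _, rmsd in lowest)


# Made-up ensemble: ten low energy models plus one high energy outlier.
rmsds = [0.0, 1.2, 1.8, 2.4, 3.0, 1.5, 2.9, 3.1, 2.2, 1.9]
ensemble = [(-100.0 + i, r) for i, r in enumerate(rmsds)]
ensemble.append((-50.0, 8.0))

strict = is_converged(ensemble, cutoff=2.0)
relaxed = is_converged(ensemble, cutoff=3.3)
```

With these made-up numbers the ensemble fails the 2 Å criterion but passes once the limit is relaxed to 3.3 Å, mirroring the situation described for calbindin and ferredoxin.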
For calbindin, the coordinate rmsd's between the lowest energy all-atom model and the 1.6-Å X-ray structure of calbindin D9K (Svensson et al. 1992) are 1.5 and 2.1 Å for the backbone atoms (N, Cα and C') and for all heavy atoms involved in secondary structure, respectively. The Ca2+ binding loops of both metal binding sites are remarkably well formed in the CS-Rosetta structures (Fig. 7A), even with the second metal binding site lacking all of its chemical shift assignments and the absence of any restraints on metal chelation for both metal binding sites. For the first Ca2+ binding loop, a pseudo-EF-hand, the four backbone carbonyl groups are properly positioned and point towards the location where Ca2+ is found in the X-ray structure. Even the bidentate sidechain ligating group of Glu27 adopts a conformation suitable for metal chelation. For the second site, a regular EF-hand, the backbone carbonyl of Glu60 and the sidechains of Asp54 and Glu65 are well positioned for metal binding, but the sidechains of Asn56 and Asp58 point away from the position where the metal ion is observed in the X-ray structure.
For the secondary structure elements of ferredoxin, the lowest energy Rosetta model deviates from the experimental NMR structure obtained for the same protein by 2.06 Å for the backbone and by 3.54 Å for all non-hydrogen atoms. Two of the four Cys sidechains that ligate the [2Fe-2S] cluster are in close proximity, even though the loop conformations differ substantially from the experimentally determined structure (Fig. 7B).
Although previous reports have clearly demonstrated the potential of using chemical shifts to determine good quality all-atom structures for small proteins (Cavalli et al. 2007; Shen et al. 2008), these studies were based on relatively ideal cases where complete or nearly complete backbone assignments were available, in the absence of assignment errors. Our present study demonstrates that the CS-Rosetta procedure and its new variant, which uses a hybrid fragment selection procedure, are remarkably tolerant to such incompleteness and errors. Clearly, a study such as the present one, which evaluates the impact of missing or erroneous assignments, is never complete. We simply have evaluated the impact for two proteins, and have made an attempt to evaluate representative cases of missing assignments. Both proteins chosen for the current study, MrR16 and TM1442, yielded good (albeit not exceptional) results when originally studied with complete data sets, and these systems therefore are likely to be more robust to incompleteness or assignment errors than proteins which only yield borderline convergence to begin with.
The CS-Rosetta protocol uses the chemical shift information at two stages: first for fragment selection, and then again when evaluating the final full-atom models. There are two primary reasons for the improved performance of the CS-Rosetta protocol over a conceptually similar, earlier attempt to integrate chemical shift information into Rosetta (Bowers et al. 2000). First, the quality of the selected fragments has improved considerably through the use of SPARTA to "assign" better chemical shifts to a structural database. SPARTA not only uses a more advanced algorithm to assign these chemical shifts, but also benefits from a considerable expansion of entries in the BMRB for which complete chemical shift and high resolution structural information is available (Doreleijers et al. 2005). Second, a number of improvements in the Rosetta Monte-Carlo assembly process have been made in recent years, most notably the incorporation of explicit all atom refinement with a physically realistic force field (Das and Baker 2008).
Errors and incompleteness adversely affect the CS-Rosetta protocol primarily by decreasing the quality of the fragment library, and have relatively little impact on the rescoring of the final full-atom models. The hybrid CS-Rosetta protocol first limits the selection of fragments to a ~0.1% fraction of the total structural database on the basis of the standard Rosetta selection mechanism. In the next step, it uses MFR to select the 200 fragments from this ensemble that agree best with experimental chemical shifts. This reduces the impact of chemical shift errors because only fragments compatible with standard Rosetta criteria are available for selection. Moreover, in the absence of any chemical shift information, the Rosetta pre-selection of the top 0.1% fragments yields better results than the less sophisticated MFR procedure, which had been designed primarily to find fragments with similar chemical shifts and/or RDCs (Delaglio et al. 2000; Kontaxis et al. 2005). In the absence of assignment errors or missing assignments, the initial Rosetta pre-selection used in the hybrid procedure is not beneficial and actually results in a small decrease in performance. On the other hand, for cases where significant fractions of assignments are missing or ambiguous, the hybrid procedure is considerably more robust.
For all evaluations, including those of the two paramagnetic proteins, homologous proteins were first eliminated from the structural database. In practice, this is clearly disadvantageous as Rosetta no longer can take advantage of standard structural elements, such as Ca2+-ligating EF-hand sequences, present in the database. Indeed 30 proteins containing a total of 64 EF-hands were removed prior to fragment searching. Similarly, proteins containing the relatively common Fe2S2 cluster were removed prior to searching for fragments for ferredoxin assembly. While for calbindin the CS-Rosetta protocol resulted in remarkably good backbone structures for its metal binding sites, even in the absence of chemical shift information, loop conformations in ferredoxin were poor. Nevertheless, using the hybrid protocol, CS-Rosetta was able to generate the remainder of the ferredoxin structure quite well, suggesting that even for these challenging systems the method will be quite useful.
For the two proteins for which a structure was generated from solid state NMR chemical shifts, lacking 1H chemical shifts, the standard MFR-based protocol and the hybrid CS-Rosetta method performed comparably well. For both proteins, the final structures obtained from these smaller input data sets approach the quality of structures obtained from solution NMR chemical shifts, indicating that CS-Rosetta may be a particularly useful complement when working with samples in the solid state.
Although CS-Rosetta considerably reduces the amount of spectral data collection time required for structure generation compared to conventional procedures, the amount of computational time required typically is very high. For simple systems such as GB3, generation of fewer than one hundred structures may suffice to reach convergence (Shen et al. 2008), but for many other proteins as many as 10,000 models may be required. Rosetta assembly and minimization of each model takes 5–10 minutes on a single CPU, and in practice use of a large cluster or a distributed computing platform such as BOINC is required to take advantage of this technology.
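The computational cost implied by these numbers can be estimated with a few lines of Python. The function names and the cluster size of 500 CPUs are hypothetical; the model count and the per-model times are those given in the text, and perfect parallel scaling is assumed.

```python
def cpu_hours(n_models: int, minutes_per_model: float) -> float:
    """Total single-CPU time in hours."""
    return n_models * minutes_per_model / 60.0


def wall_clock_hours(n_models: int, minutes_per_model: float,
                     n_cpus: int) -> float:
    """Elapsed time in hours, assuming ideal scaling over n_cpus."""
    return cpu_hours(n_models, minutes_per_model) / n_cpus


low = cpu_hours(10_000, 5.0)     # lower bound for 10,000 models
high = cpu_hours(10_000, 10.0)   # upper bound for 10,000 models
cluster_low = wall_clock_hours(10_000, 5.0, n_cpus=500)
cluster_high = wall_clock_hours(10_000, 10.0, n_cpus=500)
```

For 10,000 models this amounts to roughly 830 to 1,670 CPU-hours, i.e. five to ten weeks on a single CPU, but only a few hours on the hypothetical 500-CPU cluster.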
We also note that the CS23D program (Wishart et al. 2008) performs very well for the test datasets used in our study (Supplementary Material). The major strength of CS23D is that it takes optimal advantage of sequence homologues present in the database during fragment selection. Such homologues were present in the structural database for all six proteins evaluated in our work (see Supplementary Material Table S2), but were excluded from the database for CS-Rosetta testing. On the other hand, based on a limited number of tests, techniques such as CS-Rosetta and CHESHIRE are believed to be superior for proteins that lack significant homology to previously solved structures.
The CS-Rosetta software package with its newly implemented hybrid fragment selection module can be downloaded from http://spin.niddk.nih.gov/bax/
A brief discussion of CS23D results for proteins discussed in this study; multiple figures detailing the quality of the selected fragments and the CS-Rosetta results for various combinations of input chemical shift data.
This work was funded by the Intramural Research Program of the NIDDK, NIH, and by the Intramural AIDS-Targeted Antiviral Program of the Office of the Director, NIH; the NIGMS, NIH, and the Howard Hughes Medical Institute (to D.B.). We also thank Rosetta@home participants and the BOINC project for contributing computing power.