Although previous reports have clearly demonstrated the potential of using chemical shifts to determine good-quality all-atom structures for small proteins (Cavalli et al. 2007; Shen et al. 2008), these studies were based on relatively ideal cases where complete or nearly complete backbone assignments were available and free of assignment errors. Our present study demonstrates that the CS-Rosetta procedure and its new variant, which uses a hybrid fragment selection procedure, are remarkably tolerant to such incompleteness and errors. Clearly, a study such as the present one, which evaluates the impact of missing or erroneous assignments, can never be exhaustive. We have evaluated the impact for only two proteins, and have attempted to cover representative cases of missing assignments. Both proteins chosen for the current study, MrR16 and TM1442, yielded good (albeit not exceptional) results when originally studied with complete data sets, and these systems are therefore likely to be more robust to incompleteness or assignment errors than proteins that yield only borderline convergence to begin with.
The CS-Rosetta protocol uses the chemical shift information at two stages: first for fragment selection, and then again when evaluating the final full-atom models. There are two primary reasons for the improved performance of the CS-Rosetta protocol over a conceptually similar, earlier attempt to integrate chemical shift information into Rosetta (Bowers et al. 2000). First, the quality of the selected fragments has improved considerably through the use of SPARTA to "assign" better chemical shifts to a structural database. SPARTA not only uses a more advanced algorithm to assign these chemical shifts, but also benefits from a considerable expansion of entries in the BMRB for which complete chemical shift and high-resolution structural information is available (Doreleijers et al. 2005). Second, a number of improvements in the Rosetta Monte Carlo assembly process have been made in recent years, most notably the incorporation of explicit all-atom refinement with a physically realistic force field (Das and Baker 2008).
Errors and incompleteness degrade the CS-Rosetta protocol primarily by decreasing the quality of the fragment library, and have relatively little impact on the rescoring of the final full-atom models. The hybrid CS-Rosetta protocol first limits the selection of fragments to a ~0.1% fraction of the total structural database on the basis of the standard Rosetta selection mechanism. In the next step, it uses MFR to select the 200 fragments from this ensemble that agree best with the experimental chemical shifts. This reduces the impact of chemical shift errors because only fragments compatible with standard Rosetta criteria are available for selection. Moreover, in the absence of any chemical shift information, the Rosetta pre-selection of the top 0.1% of fragments yields better results than the less sophisticated MFR procedure, which had been designed primarily to find fragments with similar chemical shifts and/or RDCs (Delaglio et al. 2000; Kontaxis et al. 2005). In the absence of assignment errors or missing assignments, the initial Rosetta pre-selection used in the hybrid procedure is not beneficial and actually results in a small decrease in performance. On the other hand, for cases where significant fractions of assignments are missing or ambiguous, the hybrid procedure is considerably more robust.
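The two-stage logic of the hybrid fragment selection can be illustrated with a minimal Python sketch. It is not the CS-Rosetta or MFR implementation: the `Candidate` record, the `rosetta_score` and `shift_score` fields, and the convention that lower scores are better are all assumptions made for illustration. Only the ~0.1% pre-selection fraction and the final count of 200 fragments come from the text.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    fragment_id: int
    rosetta_score: float  # hypothetical: lower = better by standard Rosetta criteria
    shift_score: float    # hypothetical: lower = better agreement with experimental shifts


def hybrid_select(candidates, preselect_fraction=0.001, n_final=200):
    """Two-stage selection: Rosetta-style pre-filter, then rank by shift agreement."""
    # Stage 1: keep only the best-scoring fraction of the database (~0.1% in the text).
    n_pre = max(1, round(len(candidates) * preselect_fraction))
    preselected = heapq.nsmallest(n_pre, candidates, key=lambda c: c.rosetta_score)
    # Stage 2: from that pool, keep the fragments that agree best with the shifts.
    return heapq.nsmallest(n_final, preselected, key=lambda c: c.shift_score)
```

Because stage 2 only sees the pre-selected pool, a fragment whose shift score looks good solely because of an erroneous assignment cannot be chosen unless it also passes the first filter, which is the robustness argument made above.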
For all evaluations, including those of the two paramagnetic proteins, homologous proteins were first eliminated from the structural database. For practical applications, this is clearly disadvantageous, as Rosetta can no longer take advantage of standard structural elements present in the database, such as Ca2+-ligating EF-hand sequences. Indeed, 30 proteins containing a total of 64 EF-hands were removed prior to fragment searching. Similarly, proteins containing the relatively common Fe2S2 cluster were removed prior to searching for fragments for ferredoxin assembly. For calbindin, the CS-Rosetta protocol resulted in remarkably good backbone structures for its metal binding sites, even in the absence of chemical shift information, whereas loop conformations in ferredoxin were poor. Nevertheless, using the hybrid protocol, CS-Rosetta was able to generate the remainder of the ferredoxin structure quite well, suggesting that even for these challenging systems the method will be quite useful.
For the two proteins whose structures were generated from solid-state NMR chemical shifts, which lack 1H chemical shifts, the standard MFR-based protocol and the hybrid CS-Rosetta method performed comparably well. For both proteins, the final structures obtained from these smaller input data sets approach the quality of structures obtained from solution NMR chemical shifts, indicating that CS-Rosetta may be a particularly useful complement when working with samples in the solid state.
Although CS-Rosetta considerably reduces the amount of spectral data collection time required for structure generation compared to conventional procedures, the computational time required is typically very high. For simple systems such as GB3, generation of fewer than one hundred structures may suffice to reach convergence (Shen et al. 2008), but for many other proteins as many as 10,000 models may be required. Rosetta assembly and minimization of each model takes 5–10 minutes on a single CPU, and in practice use of a large cluster or a central server such as BOINC is required to take advantage of this technology.
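To make the scale of this cost concrete, the following Python sketch works through the arithmetic implied by the figures above (10,000 models at 5–10 minutes each). The function names and the 500-CPU cluster size are hypothetical, and the wall-clock estimate assumes that models are generated independently and distributed evenly with no scheduling overhead.

```python
import math


def cpu_hours(n_models, minutes_per_model):
    """Total single-CPU time, in hours, to generate n_models."""
    return n_models * minutes_per_model / 60


def wall_clock_hours(n_models, minutes_per_model, n_cpus):
    """Idealized wall-clock time, assuming independent models and no overhead."""
    # Each CPU works through its share of models one after another.
    batches = math.ceil(n_models / n_cpus)
    return batches * minutes_per_model / 60


low = cpu_hours(10_000, 5)    # about 833 CPU hours
high = cpu_hours(10_000, 10)  # about 1667 CPU hours
print(f"{low:.0f} to {high:.0f} CPU hours on a single CPU")
# Hypothetical 500-CPU cluster at 10 minutes per model:
print(f"{wall_clock_hours(10_000, 10, 500):.1f} hours wall-clock")
```

On a single CPU this corresponds to roughly 35 to 70 days of continuous computation, which is why a cluster or distributed resource is needed in practice.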
We also note that the CS23D program (Wishart et al. 2008) performs very well for the test datasets used in our study (Supplementary Material). The major strength of CS23D is that it takes optimal advantage of sequence homologues present in the database during fragment selection. Such homologues were present in the structural database for all six proteins evaluated in our work (see Supplementary Material Table S2), but were excluded from the database for CS-Rosetta testing. On the other hand, based on a limited number of tests, techniques such as CS-Rosetta and Cheshire are believed to be superior for proteins that lack significant homology to previously solved structures.