A hybrid protein structure determination approach combining sparse Electron Paramagnetic Resonance (EPR) distance restraints and Rosetta de novo protein folding has been previously demonstrated to yield high quality models (Alexander et al., 2008). However, widespread application of this methodology to proteins of unknown structure is hindered by the lack of a general strategy to place spin label pairs in the primary sequence. In this work, we report the development of an algorithm that optimally selects spin labeling positions for the purpose of distance measurements by EPR. For the α-helical subdomain of T4 lysozyme (T4L), simulated restraints that maximize sequence separation between the two spin labels while simultaneously ensuring pairwise connectivity of secondary structure elements yielded vastly improved models by Rosetta folding. Of these models, 50% have the correct fold, compared to only 21% and 8% correctly folded models when randomly placed restraints or no restraints are used, respectively. Moreover, the improvements in model quality require a limited number of optimized restraints, the number of which is determined by the pairwise connectivities of T4L α-helices. The predicted improvement in Rosetta model quality was verified by experimental determination of distances between spin label pairs selected by the algorithm. Overall, our results reinforce the rationale for the combined use of sparse EPR distance restraints and de novo folding. By alleviating the experimental bottleneck associated with restraint selection, this algorithm sets the stage for extending computational structure determination to larger, traditionally elusive protein topologies of critical structural and biochemical importance.
Decades into the structural biology revolution, tens of thousands of structures have been deposited in the Protein Data Bank (PDB), cataloging protein folds, defining motifs of catalysis, and revealing architectures of protein complexes. The overarching goal of delineating the biochemical and physiological circuitry that interconnects to form cells and organisms requires further progress on two fronts. First, the sampling of structure space has been uneven, skewed primarily towards classes of proteins amenable to analysis by the leading structural methods. Undersampled protein structure space includes proteins of high functional and pharmacological significance such as multispan membrane proteins and large, conformationally heterogeneous soluble proteins. Second, protein function often involves transitions between conformational states or shifts in the equilibrium between such states. Static crystallographic snapshots represent a limited and sometimes biased view of the conformational space of dynamic proteins. Structures trapped in the confines of the crystal lattice may not be defined mechanistically or may be distorted by non-native environments such as detergent solubilization or osmotically active molecules.
These two challenges motivated the development of both theoretical and experimental methods to accelerate the speed of structure determination and to describe protein dynamic dimensions. EPR spectroscopy in conjunction with site-directed spin labeling (SDSL) [4–5] has been extensively applied to map conformational changes in soluble [6–7] and membrane proteins [8–16] and to probe the structure of dynamic oligomers [17–18] and amyloids [19–20]. Combining residue-specific measures of solvent accessibility and local dynamics with global geometric distance restraints describing packing of secondary structures and domains, this approach provides enough restraints for modeling protein structures and their rearrangements [21–24]. High sensitivity, the absence of size limits, and the absence of restrictions on environment and/or solvent enable the evaluation of crystallographic structures and comparative models under native-like, well defined biochemical conditions.
However, this approach is intrinsically limited by the need for incorporation of spin labels into protein sequences. Compared to other restraint-based approaches such as Nuclear Magnetic Resonance (NMR) spectroscopy, this reduces the experimental throughput and therefore the practical number of obtainable restraints. Moreover, the linking arm that tethers the spin label to the protein introduces uncertainty in the interpretation of EPR parameters in terms of backbone structure. In the case of distance measurements, the translation of a precisely measured distance between spin labels to a restraint between the corresponding β-carbons (Cβ) is model dependent. Models derived from molecular dynamics simulations [25–28], from crystallographic rotamer libraries, or from simple geometric considerations have been used to rationalize the experimental EPR distances.
A general approach for protein structure determination from EPR restraints was developed by Alexander et al. It capitalizes on the de novo protein structure prediction algorithm, Rosetta [30–38], to overcome the sparseness of EPR experimental restraints. The premise of this work was that restriction of conformational space by the EPR restraints increases Rosetta’s efficiency in finding native folds. That a limited number of distances between pairs of spin labels significantly improved the quality of models put to rest concerns regarding the value of EPR distances as restraints for modeling. Experimental EPR distances were translated into Cβ-Cβ restraints using a simple cone model with virtually no restriction of spin label rotameric states.
The limited throughput of EPR methods and the ensuing restraint sparseness encourage a rational approach to the selection of spin labeled sites. Alexander et al. demonstrated the importance of high information content (defined as the ratio between sequence separation and Euclidean distance) as a criterion for restraint quality. The improvement in model quality was attributed to the third of the restraints with the highest information content. However, for proteins of unknown structure, where the Euclidean distance is not known, using the numerator (i.e. sequence separation) as a proxy for information content will cluster restraints between the ends of the primary sequence.
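As a small illustration of the information content criterion, the ratio can be computed directly once a structure is available. The sketch below is not taken from the original work; the function name and the example coordinates are hypothetical, and sequence separation is assumed to be the difference in residue numbers.

```python
import math

def information_content(res_i, res_j, xyz_i, xyz_j):
    """Sequence separation divided by Euclidean distance (residues per Å).

    res_i, res_j: residue numbers of the two labeled sites.
    xyz_i, xyz_j: Cβ coordinates in Å (hypothetical values in the example).
    """
    separation = abs(res_j - res_i)
    distance = math.dist(xyz_i, xyz_j)
    return separation / distance

# Two sites 100 residues apart in sequence but only 10 Å apart in space.
ic = information_content(60, 160, (0.0, 0.0, 0.0), (6.0, 8.0, 0.0))
```

A pair that is far apart in sequence but close in space scores highest, which is why the denominator cannot be evaluated for a protein of unknown structure.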
This paper reports the development and experimental application of a general algorithm for selection of optimized distance restraint patterns for protein structure determination. Starting from sequence information, an iterative computational approach validated by Rosetta de novo folding yielded the best scoring scheme for restraint selection. Using the α-helical domain of T4 lysozyme (T4L) as a model system, we demonstrate that restraints selected to simultaneously optimize sequence separation and pairwise connectivity of secondary structures led to high quality models. To test the robustness of the algorithm, distances were experimentally measured between pairs of spin labels at residue positions selected by the algorithm. Rosetta folding using these distances yielded high quality models as predicted.
Input parameters of secondary structure and solvent exposure predictions for the C-terminal 107 amino acids of T4L were obtained using PSIPRED and NetSurfP analyses, respectively. The ideal secondary structure definitions were obtained directly from the crystal structure of T4L (PDB ID: 2LZM). The ideal solvent exposure definitions were generated from the T4L crystal structure (2LZM) using a Rosetta neighbor count protocol. Residues with a neighbor count of 9 or fewer are defined as solvent exposed.
The Monte Carlo protocol is initiated with a random distribution of spin label pairs, which yields a total score for the distribution terms being tested. Each iteration of the Monte Carlo optimization involves random reassignment of label positions for a single pair. New label positions that improve or equal the best previous score are accepted. A typical optimization included 10,000 iteration steps and 10 optimization trajectories, after which scores converged. Restraint patterns were generated on local clusters using a Perl script.
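The acceptance rule described above can be sketched in a few lines. This is a minimal illustration in Python rather than the Perl script used in the study; the function names, the toy scoring function, and the fixed random seeds are assumptions made for the example, and the sketch does not prevent a residue from appearing in more than one pair.

```python
import random

def optimize_pairs(positions, n_pairs, score, n_steps=10_000, seed=0):
    """One Monte Carlo trajectory: reassign a single pair per step and
    accept the move if the score improves or equals the best so far."""
    rng = random.Random(seed)

    def random_pair():
        return tuple(sorted(rng.sample(positions, 2)))

    pairs = [random_pair() for _ in range(n_pairs)]
    best = score(pairs)
    for _ in range(n_steps):
        trial = list(pairs)
        trial[rng.randrange(n_pairs)] = random_pair()
        trial_score = score(trial)
        if trial_score >= best:
            pairs, best = trial, trial_score
    return pairs, best

def toy_score(pairs):
    """Placeholder score: mean sequence separation of the pairs."""
    return sum(b - a for a, b in pairs) / len(pairs)

# Residues 58-164 of T4L as candidate sites; 10 independent trajectories.
positions = list(range(58, 165))
results = [optimize_pairs(positions, 5, toy_score, n_steps=2000, seed=t)
           for t in range(10)]
best_pairs, best_score = max(results, key=lambda result: result[1])
```

Because moves that leave the score unchanged are also accepted, a trajectory can drift across plateaus of equal score instead of stalling at the first pattern it reaches.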
The Sequence Separation score (SSS) is calculated by taking the natural log of the number of amino acids separating the two spin labels in each restraint pair (di), averaging over all restraint pairs (r), and normalizing to the natural log of the sequence length (g) to yield a value between 0 and 1 (Eq. 1).
Thus, the sequence separation term effectively applies a penalty function for pairs separated by a small number of amino acids. This penalty logarithmically decreases with increased label separation. The logarithmic scaling is a modification of the original information content measure. We found that the improvement in model quality measures becomes less dependent on sequence separation as di increases (data not shown).
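The verbal definition of the Sequence Separation score translates into a short function. The sketch below follows the description (mean of ln di over the r pairs, divided by ln g); taking di as the difference in residue numbers is an assumption, and the function name is hypothetical.

```python
import math

def sequence_separation_score(pairs, seq_length):
    """Mean natural log of the pair separations, normalized by ln(seq_length).

    pairs: list of (i, j) residue numbers, one tuple per restraint.
    seq_length: number of residues g in the target sequence.
    Assumes every pair is separated by at least one residue.
    """
    logs = [math.log(abs(j - i)) for i, j in pairs]
    return (sum(logs) / len(logs)) / math.log(seq_length)

# For g = 100, a separation of 10 residues already earns half the maximum
# score, which illustrates the diminishing return of the log scaling.
half = sequence_separation_score([(1, 11)], 100)
full = sequence_separation_score([(1, 101)], 100)
```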
The Secondary Structure term distributes the spin labels evenly among the secondary structural elements (SSE). First, an ideal number of spin labels per SSE (Q) is calculated by dividing the number of spin labels (l) (twice the number of restraints) by the number of SSEs (s). We define Q′ = div(l, s) and Q″ = Q′ + 1. Note that the floor Q′ and ceiling Q″ are acceptable integer values for Q. Further, we define remainder of l/s as R = mod(l, s). An optimal spin label distribution will have Q″ labels in R SSEs, and Q′ labels in all the others.
The Secondary Structure score (SSSE) has two equally weighted components, SSSE(L) and SSSE(S). The first component, SSSE(L), is the average percentage of labels positioned in each SSE up to the ideal value, Q″. Thus,
where li = number of labels in the ith SSE. As defined, this component favors placement of labels into SSE during the optimization trajectory.
The second component of the score, SSSE(S), is derived from the fraction of SSEs that contain exactly the ideal number of spin labels:
where E′ is the number of SSEs with Q′ labels and E″ is the number of SSEs with Q″ labels. While SSSE(L) determines progress in achieving an optimal spin label placement during the Monte Carlo optimization, SSSE(S) is needed to arrive at precisely the correct number of spin labels for every SSE (data not shown). The two scores (Eq. 2 and Eq. 3) are averaged to yield the total SSSE term with values between 0 and 1.
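One possible reading of the two Secondary Structure components is sketched below. The display equations are not reproduced in this text, so the exact forms are assumptions: the first component is taken as the mean of min(li, Q″)/Q″ over all SSEs, and the second as (E′ + E″)/s. The function and argument names are hypothetical.

```python
def secondary_structure_score(labels_per_sse, n_labels):
    """Average of a fill component and an exact-count component.

    labels_per_sse: number of spin labels placed in each SSE (l_i).
    n_labels: total number of spin labels l (twice the number of restraints).
    """
    s = len(labels_per_sse)
    q_floor = n_labels // s          # Q'
    q_ceil = q_floor + 1             # Q''
    # Component L: fraction of the ideal count reached in each SSE, capped at 1.
    fill = sum(min(li, q_ceil) / q_ceil for li in labels_per_sse) / s
    # Component S: fraction of SSEs holding exactly Q' or Q'' labels.
    exact = sum(1 for li in labels_per_sse if li in (q_floor, q_ceil)) / s
    return 0.5 * (fill + exact)

# Four helices and six restraints (12 labels), as in the simplified example.
even = secondary_structure_score([3, 3, 3, 3], 12)
clustered = secondary_structure_score([6, 6, 0, 0], 12)
```

Under these assumptions the evenly labeled pattern scores 0.875 and the clustered one 0.25, so the term rewards spreading labels across all helices.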
Element Connection (SEC) favors patterns that connect each pair of SSEs with restraints. The ideal number of connections for each SSE pair (C) is defined by the ratio between the number of restraints (r) and the number of SSE pairs (p), p = (s(s − 1))/2, where s = number of SSEs. We define C′ = div(r, p) and C″ = C′ + 1. In this term, floor C′ and ceiling C″ are acceptable integer values for C. In addition, we define remainder of r/p as M = mod(r, p). An optimal restraint distribution will have C″ restraints in M SSE pairs, and C′ restraints in all the others.
Like the Secondary Structure term, SEC is a composite of two equally weighted component scores, SEC(R) and SEC(C). SEC(R) is the average percentage of restraints in each SSE pair up to the ideal value, C″. Thus,
where ri = number of restraints in the ith SSE pair. This component favors placement of restraints into SSE pairs during the optimization trajectory.
The second component of this term, SEC(C), is derived from the fraction of SSE pairs that contain exactly the ideal number of restraints:
where F′ is the number of SSE pairs with C′ restraints and F″ is the number of SSE pairs with C″ restraints. As in the Secondary Structure term, the component scores of this term are complementary, with SEC(R) measuring progress toward the optimal restraint placement and SEC(C) determining the correct number of restraints for every SSE pair. The two scores (Eq. 4 and Eq. 5) are averaged to yield the total SEC term with values between 0 and 1.
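The Element Connection term can be sketched in the same way. As the display equations are not reproduced here, the forms below are assumptions that mirror the verbal definitions: SEC(R) is taken as the mean of min(ri, C″)/C″ over all SSE pairs, and SEC(C) as (F′ + F″)/p. The function name and the input format are hypothetical.

```python
from collections import Counter
from itertools import combinations

def element_connection_score(restraint_sse_pairs, n_sse):
    """Average of a fill component and an exact-count component.

    restraint_sse_pairs: one (sse_a, sse_b) index tuple per restraint,
        naming the SSEs that hold its two labels.
    n_sse: number of secondary structure elements s.
    """
    sse_pairs = list(combinations(range(n_sse), 2))
    p = len(sse_pairs)                       # p = s(s - 1)/2
    r = len(restraint_sse_pairs)
    c_floor = r // p                         # C'
    c_ceil = c_floor + 1                     # C''
    counts = Counter(tuple(sorted(pair)) for pair in restraint_sse_pairs
                     if pair[0] != pair[1])  # ignore labels in the same SSE
    per_pair = [counts.get(sp, 0) for sp in sse_pairs]
    fill = sum(min(ri, c_ceil) / c_ceil for ri in per_pair) / p
    exact = sum(1 for ri in per_pair if ri in (c_floor, c_ceil)) / p
    return 0.5 * (fill + exact)

# Four helices, six restraints: every helix pair connected once,
# versus all six restraints placed between helices 0 and 1.
triangulated = element_connection_score(list(combinations(range(4), 2)), 4)
redundant = element_connection_score([(0, 1)] * 6, 4)
```

With these definitions the fully triangulated pattern outscores the redundant one, which is the behavior the term is meant to enforce.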
The Label Density score, SLD, imposes equal distribution of spin labels along the sequence. For this purpose, spin label positions are treated as a vector (a_0, a_1, …, a_l, a_(l+1)), where a_0 is the N-terminus, a_(l+1) is the C-terminus, a_1, …, a_l are the positions of the spin labels, and l = number of spin labels. An optimal interval between spin labels (I) is the integer quotient of the sequence length (g) and the number of intervals (n), where n = l + 1: I = div(g, n). The score utilizes a harmonic penalty function. A normalization function, f(x) = (x + 1)^(−1), is applied to rescale values between 0 and 1. Thus the term is defined as:
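Since the defining equation is not reproduced in this text, the following is only one plausible form of the Label Density score. It assumes that the harmonic penalty is the squared deviation of each interval from I, that the penalties are averaged over the n intervals before f(x) is applied, and that the termini sit at positions 0 and g. All names are hypothetical.

```python
def label_density_score(label_positions, seq_length):
    """Normalized harmonic penalty on deviations from the ideal interval.

    label_positions: sequence positions of the l spin labels.
    seq_length: sequence length g.
    """
    n = len(label_positions) + 1                 # number of intervals
    ideal = seq_length // n                      # I = div(g, n)
    # a_0 ... a_(l+1): termini plus sorted label positions (assumed 0 and g).
    bounds = [0] + sorted(label_positions) + [seq_length]
    penalty = sum((b - a - ideal) ** 2
                  for a, b in zip(bounds, bounds[1:])) / n
    return 1.0 / (penalty + 1.0)                 # f(x) = (x + 1)^(-1)

evenly_spaced = label_density_score([20, 40, 60, 80], 100)
bunched = label_density_score([1, 2, 3, 4], 100)
```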
Rosetta simulations were performed in Rosetta++ [30–33]. Standard Rosetta procedures were used that are described in detail elsewhere. In these coarse-grained simulations, residue side chains are regarded as centroid superatoms. All T4L homologs were excluded from the fragment database prior to modeling in order to simulate structure determination of a novel protein fold as closely as possible. Models were obtained in independent simulations on a cluster in Vanderbilt University’s Advanced Computing Center for Research & Education (ACCRE). For each simulation, 1,000 models were created using the restraints selected by the algorithm for the α-helical subdomain of T4L (residues 58–164). In the algorithm optimization phase, Cα root mean squared deviation (RMSD) distributions and model quality measures for residues 70–164 were reported for all 1,000 models resulting from Rosetta folding. Residues 58–69 were excluded from RMSD analysis as these residues link the α-helical subdomain to the excluded β-strand subdomain and tended to vary in our models due to the absence of the β-strand domain. Cα-RMSDs were used due to the coarse-grained nature of the modeling. In the experimental implementation phase, models were additionally filtered by lowest energy and restraint violation scores.
EPR distance restraints were implemented in Rosetta in a RosettaNMR [43–44] protocol as described previously. Briefly, distance restraints are used as an additional penalty in the Rosetta energy function. This penalty is zero if the Cβ-Cβ distance (dCβ) of the restraint residues falls within the range specified. If this distance falls outside this range, a quadratic penalty function is applied. The boundary range used was based on the motion-on-a-cone model developed by Alexander et al. This model yielded a function describing the relationship between the experimentally measured spin label distance (dSL) and the dCβ. The dSL defines the range allowed for dCβ (dSL−12.5 Å to dSL+2.5 Å), which corresponds to the most probable relative spin label orientations. For simulated restraints, the crystallographic dCβ is used as the experimental distance (i.e. dSL − dCβ = 0 or a parallel spin label orientation).
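The flat-bottomed restraint described above can be written as a short function. This is a sketch of the functional form only, not the RosettaNMR implementation; the unit weight on the quadratic term and the function name are assumptions.

```python
def restraint_penalty(d_cb, d_sl, lower_offset=12.5, upper_offset=2.5):
    """Zero inside [d_sl - 12.5, d_sl + 2.5] Å, quadratic outside.

    d_cb: Cβ-Cβ distance in the model (Å).
    d_sl: measured (or simulated) spin label distance (Å).
    """
    lower = d_sl - lower_offset
    upper = d_sl + upper_offset
    if d_cb < lower:
        return (lower - d_cb) ** 2
    if d_cb > upper:
        return (d_cb - upper) ** 2
    return 0.0

# For a 25 Å spin label distance the allowed Cβ-Cβ window is 12.5-27.5 Å.
inside = restraint_penalty(20.0, 25.0)
too_long = restraint_penalty(30.0, 25.0)
too_short = restraint_penalty(10.0, 25.0)
```

The asymmetric window reflects the fact that the spin labels usually point away from each other, so the Cβ atoms tend to be closer than the labels themselves.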
Cysteine residues were systematically introduced into a cysless T4L construct through double point mutations at restraint positions identified by the algorithm using the QuikChange™ Site-Directed Mutagenesis Kit (Stratagene) as previously described. Sample preparation has been described elsewhere [39,45]. Briefly, T4L mutants were sequenced, transformed into K38 cells, and expressed in Luria Broth (LB). All mutants were purified using cation exchange chromatography, labeled with a 5-fold excess of MTSSL (S-(2,2,5,5-tetramethyl-2,5-dihydro-1H-pyrrol-3-yl)methyl Methanesulfonothioate spin label, Toronto Research Chemicals) at room temperature for 2 hours, desalted and concentrated. A total of 21 double mutants (Table 1) resulted in the restraints used for the current analysis.
Of the 21 restraints (Table 1), 19 distances were found to be within the distance range appropriate for double electron-electron resonance (DEER) distance measurement [46–48]. DEER measurements were performed on a Bruker 580 pulsed EPR spectrometer operating at X-band (10 GHz) using a standard four-pulse protocol. Experiments were performed at 83 K. Sample concentrations were 150 μM in a MOPS/Tris buffer (9 mM MOPS, 6 mM Tris, 50 mM NaCl, 0.02% (w/v) sodium azide, 0.1 mM EDTA) with 20% (v/v) glycerol as a cryoprotectant and a sample volume of 50 μl. Spin echo decays were baseline-corrected and analyzed by Tikhonov regularization [49–50] to determine average distances and distributions in distance (Appendix A). For all data, the selected α parameter corresponds to the elbow of the L-curve.
For the 2 pairs with distances too short for DEER analysis, distance distributions were determined from the continuous wave (CW) EPR spectra using the CWdipFit program developed by Peter Fajer and colleagues (http://www.sb.fsu.edu/~fajer/Programs/CWdipFit/cwdipfit). For each pair, fully labeled and underlabeled samples were prepared. Fully labeled samples were prepared as described above. Preparation of the underlabeled samples included incubation with 0.5× MTSSL for 1 hour at room temperature followed by addition of a 20-fold excess of a diamagnetic MTSSL analog, (1-Acetyl-2,2,5,5,-tetramethyl-Δ3-pyrroline-3-methyl) Methanethiosulfonate (Toronto Research Chemicals). The fully labeled samples display distance-dependent dipolar coupling, while the underlabeled samples represent the EPR spectrum in its absence. CWdipFit assumes Gaussian-shaped distance distributions between spin labels and utilizes a Monte Carlo/SIMPLEX algorithm to fit dipolar coupled spectra using the underlabeled spectra as a proxy for the sum of singles [51–52]. The dipolar coupled spectra and fits are shown in Appendix A.
The overall strategy, illustrated by the flowchart in Figure 1, uses the primary sequence, secondary structure, and solvent exposure definitions as input parameters. For secondary structure and solvent exposure, predicted and ideal (defined by the crystal structure) definitions were compared to assess the impact on model quality. The algorithm relies on a Monte Carlo search to optimize the restraint distribution terms that place pairs of spin labels along the sequence (Supplementary Fig. 1). Briefly, a Sequence Separation term, defined as the number of intervening amino acids between two spin labels in a pair, was included as an approximation for information content. To balance its tendency to cluster spin labels at the N- and C-termini, three terms favoring uniform sequence coverage were investigated. A secondary structure element (SSE) connection term (Element Connection) evenly connects all pairs of secondary structures, in this case 7 α-helices, with restraints, effectively introducing a triangulation strategy. Alternatively, a Label Density term, which distributes spin labels along the sequence at equal and regular intervals, was included. Finally, we tested the efficacy of a Secondary Structure term that confines spin labels to segments of secondary structure, avoiding loops and termini. Term combinations and weight ratios were evaluated for their effectiveness in selecting informative restraints for Rosetta folding (Supplementary Fig. 2 and 3). The combination of Sequence Separation and Element Connection terms at a 1:1 weight ratio consistently yielded restraint patterns that resulted in the highest quality models by Rosetta folding. Figure 2 illustrates how an initial random distribution of labels is shuffled to maximize the Sequence Separation and Element Connection scores.
In the algorithm development phase described above, the term combination and relative weight were determined using simulated EPR distances. For this purpose, the distance between the β-carbons of each pair of residues, dCβ, was obtained from the crystal structure (2LZM) and used as an experimental restraint. To simulate the uncertainty associated with interpretation of distances between spin labels, the corresponding restraint was allowed a range of dCβ −12.5 Å to dCβ +2.5 Å based on the motion-on-a-cone model described previously and in the Methods. Models with dCβ distances outside this range are penalized in the Rosetta energy score.
The output of the restraint-assisted Rosetta folding consisted of 1,000 models. Quality measures defined by the models' Cα-RMSD to the crystal structure were used as indicators of improvement in the Rosetta sampling of conformational space. To avoid perturbation due to spin label incorporation, the algorithm excluded residues predicted to be buried. This did not affect the quality of models generated by Rosetta (Supplementary Fig. 4). In contrast, the use of predicted secondary structure resulted in a significant decrease in model quality (Supplementary Fig. 4). Therefore, for the purposes of evaluating the effectiveness of the algorithm, secondary structure definitions were based on the crystal structure.
The α-helical subdomain of T4L (residues 58–164) was selected as a model system for this analysis. T4L has been extensively investigated by spin labeling [53–54] and was the target of a previous study to assess the potential of EPR restraints to increase the efficiency of conformational space sampling by Rosetta. The 107 amino acid target region is well within the size limit for efficient structure prediction by Rosetta de novo folding. For the analysis presented here, we excluded structures homologous to T4L from the fragment library to mimic protein structure prediction of novel protein folds. Under these conditions, Rosetta folding in the absence of restraints consistently yields about 8% correctly folded models, leaving sufficient dynamic range to evaluate the impact of EPR restraints.
Following selection of the terms and their relative weights described above, we assessed the degree to which optimized restraint patterns improve the quality of T4L models predicted by Rosetta. For this purpose, 10 sets of 21 restraints were used in conjunction with Rosetta to generate 1,000 T4L models. An equivalent number of models was generated by folding without restraints as well as in the presence of 21 randomly selected restraints. Consistently, models obtained using optimized restraints had vastly better quality measures (Fig. 3). A left shift in the RMSD distribution reflects the presence of a major population of models with RMSD below 7.5 Å (Fig. 3A). It is generally accepted that models with an RMSD below 7.5 Å share the overall fold of the native structure. Thus, using optimized restraints, 54.4% of Rosetta models achieve the general fold compared to 21.0% and 8.0% of models if randomized or no restraints are used, respectively (Fig. 3B).
Optimized restraints also lead to a significant increase in the percentage of models with Cα-RMSDs below 3.5 Å, reflecting more effective sampling of conformational space by Rosetta (Fig. 3C). These models, being closest to the native structure, are ideal candidates for subsequent high resolution refinement. Using an RMSD cutoff of 3.5 Å as a criterion, 1.7% of models generated by incorporation of optimized restraints are considered high quality. To achieve 1 Å resolution, a starting set of at least 2,000 such models is needed, which can be generated within a computationally reasonable time frame. In contrast, only 0.2% of models generated using randomized restraints fulfilled the 3.5 Å RMSD criterion. Thus, to achieve high resolution, one million models are needed, which requires substantially more computational resources. If no restraints are used, the computational cost becomes prohibitive, as only 0.04% of models have less than 3.5 Å RMSD, therefore requiring millions of models. Furthermore, EPR restraints allow selection of correct topology models for refinement.
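Taking the 2,000-model target and the quoted percentages at face value, the size of the model pool each protocol would need follows from simple division. The sketch below is a back-of-the-envelope calculation, not a figure reported in the study; the function name is hypothetical, and exact fractions are used to avoid floating point rounding.

```python
import math
from fractions import Fraction

def models_needed(target_high_quality, percent_high_quality):
    """Smallest pool expected to contain the target number of models
    below the 3.5 Å cutoff, given the percentage that pass it."""
    fraction = Fraction(percent_high_quality) / 100
    return math.ceil(Fraction(target_high_quality) / fraction)

pool_sizes = {
    "optimized restraints": models_needed(2000, "1.7"),   # 117,648
    "random restraints": models_needed(2000, "0.2"),      # 1,000,000
    "no restraints": models_needed(2000, "0.04"),         # 5,000,000
}
```

On this estimate the optimized restraints reduce the required sampling by roughly an order of magnitude relative to random restraints.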
The choice of 21 restraints above was dictated by detailed analysis of the dependence of model quality on restraint number. For this purpose, the Rosetta folding protocol of Fig. 1 was applied while successively increasing the number of restraints, followed by assessment of model quality. Note that 21 restraints are required to fulfill all pairwise connections between the 7 helices of the T4L C-terminal domain. Therefore, for restraint numbers larger than 21, the algorithm was modified to ensure that the additional restraints duplicating existing secondary structure connectivities are evenly distributed.
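The figure of 21 restraints follows from the pair count p = s(s − 1)/2 defined in the Methods. The short sketch below works through the arithmetic and shows how any surplus restraints would be spread over the helix pairs using C′ and M; the function name is hypothetical.

```python
from math import comb

def connection_plan(n_restraints, n_sse):
    """Return (p, C', M): the number of SSE pairs, the base number of
    restraints per pair, and the number of pairs receiving one extra."""
    p = comb(n_sse, 2)                 # p = s(s - 1)/2
    c_floor, m = divmod(n_restraints, p)
    return p, c_floor, m

seven_helices = connection_plan(21, 7)   # (21, 1, 0): one restraint per pair
with_surplus = connection_plan(28, 7)    # (21, 1, 7): 7 pairs get a second
eight_helices = connection_plan(28, 8)   # (28, 1, 0)
```

The last case corresponds to the 8 predicted helices discussed in the supplementary figures, for which 28 restraints are needed to connect every pair.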
Fig. 4A demonstrates that increasing the number of restraints leads to a rapid increase in the percentage of models having the correct fold (Cα-RMSD below 7.5 Å). This effect is pronounced with as few as 5–10 restraints. The trend levels off in the region of 20–22 restraints suggesting that redundant connections between secondary structures add little information (Fig. 4A). In contrast, a more stringent quality measure, the percentage of models with RMSD below 3.5 Å, hardly improves until the number of restraints is well above 10 (Fig. 4B). This lag reflects the significantly lower probability that these models are sampled in the absence of restriction on the search space. Indeed, this number remains rather unaffected by the introduction of additional random restraints.
The percentage of models with the correct fold plateaus at approximately 60 percent. The rest typically fulfill the restraints but have incorrect folds. This is not surprising given the soft interpretation of the restraints within a wide error margin (15 Å) by the cone model. It is likely that this limitation also accounts for the relatively limited percentage of high quality models, i.e. with RMSD below 3.5 Å. Interestingly, the incorrectly folded models score worse in Rosetta’s knowledge-based potential (see below) allowing for selection of correctly folded models by energy score. In addition, the improved overall quality with few restraints (Fig. 4) provides a plausible explanation for the surprisingly good performance of random restraints in Fig. 3.
The optimization of the algorithm used simulated distances between residue pairs. As described above, this approximation centers the distribution at the dCβ while the experimental distribution is centered on the distance separating the two spin labels. The offset between these two values is determined by the relative orientation of the labels and represents the major source of uncertainty in interpretation of EPR distance restraints. To assess the consequences of this approximation and validate the optimization strategy, we carried out Rosetta folding of T4L using experimentally determined distances for a set of spin label pairs selected by the algorithm described above. Double cysteine mutants were constructed and the corresponding proteins purified and spin labeled as described in the Methods. All pairs except two were in the distance range suitable for DEER analysis. Spin echo decays were baseline-corrected and analyzed by Tikhonov regularization to yield distance distributions as described in the Methods and illustrated in Appendix A. For the short range pairs (86C/112C and 127C/155C), spectral simulation was used to extract a Gaussian distribution of distance from the CW-EPR spectra (Appendix A).
The position of the pairs is mapped onto the T4L crystal structure in Fig. 5. Table 1 reports the average distance between the spin label pairs as well as the width of the distance distribution. Compared to the dCβ, the deviations show the expected pattern of larger spin label distances. The distributions are predominantly narrow despite the surface exposed location of the spin labeled sites. Thus, even though most spin labels are mobile as evidenced by the EPR lineshapes (data not shown), it appears that the sampled rotameric states are restricted.
Fig. 6 demonstrates that Rosetta folding of 10,000 models using the experimental distances leads to improvement in model quality measures that follow the same trends of Fig. 3. These include a left shift in the Cα-RMSD distribution, an increase in the fraction of models with the correct folding topology (RMSD < 7.5 Å), and more importantly of the percentage of high quality models (RMSD < 3.5 Å). However, these improvements underperform those expected from simulated distances. The origin of this underperformance can be rationalized by comparing the upper bound of the simulated and experimental restraints. Experimentally determined distances tend to be larger than the dCβ thereby increasing the upper bound. Thus, conformational space is less restrained leading to a reduced model quality.
The models generated by incorporation of experimental restraints into Rosetta folding were sorted based on their Rosetta energy and restraint violation scores. While models of vastly different RMSDs have similar Rosetta energy or restraint violation scores, only models with low RMSDs have low scores in both criteria. Fig. 7 demonstrates the improvement in model quality when a Rosetta energy score threshold of below 30 and a cumulative restraint violation score threshold of less than 2.5 Å were applied. This resulted in an enrichment factor of 7.2 for models with RMSDs below 3.5 Å, retaining 44 of the 61 original models. Thus the combination of these two scores can identify the subset of models with topologies closest to the native structure.
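The two-score filter described above can be sketched as follows. The thresholds are those quoted in the text; the definition of the enrichment factor as the ratio of the high quality fraction after and before filtering is an assumption, and the model records are made-up values used only to exercise the code.

```python
def filter_and_enrich(models, energy_cutoff=30.0, violation_cutoff=2.5,
                      rmsd_cutoff=3.5):
    """Keep models below both score thresholds and report the enrichment
    of models with RMSD below rmsd_cutoff among the survivors."""
    kept = [m for m in models
            if m["energy"] < energy_cutoff and m["violation"] < violation_cutoff]
    frac_all = sum(m["rmsd"] < rmsd_cutoff for m in models) / len(models)
    frac_kept = sum(m["rmsd"] < rmsd_cutoff for m in kept) / len(kept)
    return kept, frac_kept / frac_all

# Hypothetical models: RMSD (Å), Rosetta energy, cumulative violation (Å).
models = [
    {"rmsd": 2.9, "energy": 12.0, "violation": 1.0},
    {"rmsd": 3.2, "energy": 25.0, "violation": 2.0},
    {"rmsd": 8.1, "energy": 20.0, "violation": 1.5},
    {"rmsd": 9.4, "energy": 28.0, "violation": 0.5},
    {"rmsd": 10.2, "energy": 45.0, "violation": 1.0},
    {"rmsd": 11.0, "energy": 15.0, "violation": 6.0},
    {"rmsd": 12.3, "energy": 50.0, "violation": 7.5},
    {"rmsd": 7.7, "energy": 33.0, "violation": 3.0},
    {"rmsd": 6.5, "energy": 40.0, "violation": 2.0},
    {"rmsd": 13.8, "energy": 10.0, "violation": 4.0},
]
kept, enrichment = filter_and_enrich(models)
```

In this toy set, neither score alone separates the low RMSD models, but requiring both thresholds removes most of the misfolded ones, mirroring the behavior reported for the experimental restraints.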
The requirement for incorporation of spin labels into protein sequences shapes the methodology of spin labeling in two fundamental ways. First, the experimental throughput is limited leading to sparse restraints. Second, the arm linking the spin label to the protein backbone introduces an uncertainty in the interpretation of these restraints. The algorithm presented in this paper advances the methodological blueprint of spin labeling and EPR spectroscopy by optimizing the information content of EPR distance restraints and consequently alleviating the experimental bottleneck.
The experimental implementation of this strategy presented here charts a roadmap for future improvements. As expected, using the cone model of Alexander et al. for interpretation of the EPR distances significantly compromises the quality of the experimental data. Narrow distance distributions at a number of sites imply a tighter limit on the distance range than the 15 Å assumed in the cone model. Furthermore, the shape of the distribution (Appendix A) is in stark contrast to the flat scoring potential implemented in the Rosetta protocol. The consequence of these approximations is that topologies with Cα-RMSDs as large as 12 Å fulfill the EPR restraints. We are developing a probability function to describe the offset distribution between the dSL and dCβ (Hirst et al., this issue). Furthermore, explicit modeling of the spin label should limit the uncertainty associated with the unknown orientation of the spin label relative to the backbone. It has been demonstrated that molecular dynamics simulations can reproduce average distances between spin labels [27–28]. Though more computationally intensive, these approaches will enhance Rosetta model quality, specifically increasing the fraction of models below 3.5 Å RMSD.
The performance of the algorithm is also degraded when predicted rather than actual secondary structures are used. The origin of this effect is the inaccurate prediction of the number of secondary structures, which for a fixed number of restraints alters the required pairwise connectivities. In the context of the application of this approach to a protein of unknown structure, the location and length of secondary structure can be experimentally determined and/or verified through nitroxide scanning experiments [56–58].
That the many approximations did not hinder the identification of the correct fold by Rosetta reflects the robustness of its energy function. Similarly, a few EPR restraints lead to a measurable improvement in the quality of the folds highlighting the critical role of these restraints in reducing Rosetta’s conformational search space. These findings reinforce the rationale of using de novo folding to balance the sparseness of the EPR restraints and their intrinsically lower quality.
Although the algorithm developed in this paper is general, our ultimate goal is to develop a suite of tools to determine structure of membrane proteins. While Rosetta has been successfully used to generate constrained models of membrane proteins [59–61], it is likely to be less robust given the limited number of folds and topologies in the protein data bank. Though this may be partially mitigated by the restricted diversity of membrane protein fold imposed by the membrane environment, the number of EPR restraints needed to obtain high quality models is likely to be larger. Furthermore, the rule of one restraint per pair of secondary structures may have to be modified for the longer helices found in these proteins. In this context, redundant restraints may prove important for longer helices common in transmembrane proteins. We expect that additional algorithm terms to optimize the distribution of redundant restraints will be developed. Nevertheless, this algorithm represents a first step in this direction.
In our simplified 4 helix, 6-restraint system, each arrow represents a spin label and arrows of the same color correspond to the same spin label pair. Sequence Separation, which maximizes the number of amino acids between spin labels in a pair, is a proxy for information content. Three sequence coverage terms were tested: Element Connection, Secondary Structure, and Label Density as defined in the methods section.
Sequence coverage terms were combined with Sequence Separation to generate simulated restraint patterns. These were then incorporated into Rosetta to fold T4L C-terminal domain. The outcomes were evaluated by model quality measures, Cα-RMSD <7.5 Å (purple) and Cα-RMSD <3.5 Å (green). The best combination was found to be Element Connection and Sequence Separation (black dotted box).
Effect of varying Sequence Separation weight relative to Element Connection on model quality measures, Cα-RMSD <7.5 Å (purple) and Cα-RMSD <3.5 Å (green). The best weighting ratio was 1:1, denoted with an asterisk.
(A) Comparison of predictions to crystal structure definitions for secondary structure and solvent exposure. (B) Effects of excluding buried residues, predicted and ideal. (C) Comparison of predicted (Pred) and Ideal secondary structures for 21 restraints. Predicted secondary structure yielded 8 helices rather than the 7 found in the crystal structure. Therefore to account for the increase in connectivities for 8 helices, we included the effects of using 28 restraints.
The authors would like to acknowledge Derek Claxton for assistance with EPR data collection and EPR distance analysis, Ping Zou for assistance with EPR data collection, and Richard Stein for assistance with EPR distance analysis. This work was conducted in part by using the resources of the Advanced Computing Center for Research and Education at Vanderbilt University, Nashville, TN. K. K. and H. S. M. were supported by NIH grant R01-GM77659. K. K. and J. M. were supported by NIH grant R01-GM080403. N. S. A. was supported by NIH grant F31MH086222.
Baseline-corrected spin echoes or CW spectra (86/112 and 127/155) along with corresponding distance distributions. The experimental data is shown in black, the fits are shown in red.