Previous successes in high-resolution de novo
structure prediction using Rosetta have relied on generating low-resolution models on a large number of sequence homologs along with the target sequence in order to successfully sample the near-native region of the energy landscape. With the large amount of CPU time available through Rosetta@home, which consists of over 150,000 computers roughly half of which are available for use at any given time, large-scale sampling runs without using information from sequence homologs can be successful. To determine how well structures can be predicted using single sequence information and to set a baseline for subsequent experiments, we investigated the performance of Rosetta on a test set of 32 small alpha-helical and alpha-beta protein domains by generating many independent models using different random number seeds starting from an extended chain. Models were generated using the standard Rosetta Monte Carlo and gradient-based energy-minimization strategy, which consists of a low-resolution conformational search followed by full-atom refinement. Trajectories were also started from the native structure using the full-atom refinement protocol to obtain an estimate of the energy in the native region4
The results of these calculations are grouped into three different categories in . The first category includes proteins for which Rosetta, with this level of conformational sampling, is successful in producing an accurate (root mean square deviation (RMSD) of α-carbons to the native structure less than 2 Å) lowest-energy model. 9 of the 32 proteins in our test set were in this category (1aiu, 1b72, 1di2, 1dtj, 1elw, 1pgx, 1r69, 256b, and 2reb). The second category includes proteins for which the native state was not sampled but there was an energy gap between the models and refined native structures. The majority of proteins are in this category (1acf, 1a19, 1a68, 1bm8, 1bq9, 1cc8, 1ctf, 1dcj, 1iib, 1mky, 1n0u, 1opd, 1tif, 1ubi, 2chf, 2ci2 and 4ubp). A number of proteins within this category had lowest-energy models that were less than 3.5 Å RMSD from the native and two had core RMSDs of less than 2.0 Å (1bq9, 1.85 Å, and 1dcj, 1.15 Å, ). The last category includes proteins that had incorrect lowest-energy models that were lower in energy than the refined natives (1a32, 1hz6, 1ig5, 1tig, 1scj and 5cro). These proteins likely reflect inaccuracies in the current Rosetta energy function, the influence of crystal packing interactions on the experimentally determined structure, or missing interactions with a bound ligand. Energy versus RMSD plots of one example from each category are shown in .
Structure prediction performance and sampling
Successes and failures in large scale de novo protein structure prediction calculations using Rosetta@home
To quantify the dependence of structure prediction accuracy on the amount of sampling, we extracted random subsets of models from the control runs, and determined the average RMSD of the lowest-energy models as a function of subset size. The results for category 1 and 2 proteins are shown in , respectively. The majority of proteins in the category 1 group required a sample size of less than 100,000 for an average RMSD of the lowest-energy models of less than 2 Å. Two of these proteins, 1elw and 2reb, required a sample size of less than 100, and on the other extreme, 1b72 and 1di2 required a sample size of over 1 million. Among the category 2 group, the average RMSDs of 9 proteins generally decreased with increased sampling ().
Dependence of structure prediction accuracy on the number of independent trajectories
Sampling in feature space
The category 2 proteins for which sampling is insufficient even with Rosetta@home are the major focus of this paper. How much sampling is required to successfully solve the structures of these proteins computationally? There is an obvious limit to how well this question can be answered by continued brute-force sampling, and instead we have developed an alternative approach. As shown in , a protein conformation can be described in part by a string of features. The features used in this study include backbone torsion bins with five possible values for each residue position, ABEGO5
, representing different regions in ramachandran space (and cis-peptide torsion angles (O)), secondary structure with three possible values, strand (E), helix (H) and loop (L), and beta-contacts represented by the pair of contacting residues, the orientation of the pairing (parallel or antiparallel), and the pleating.
For this feature space representation to be useful, there must be a way to invert the projection from a three-dimensional structure onto a feature string, i.e. to go from the feature string back to a three-dimensional structure. In particular, the native feature string must provide sufficient information to determine the native three-dimensional structure. Rosetta structure prediction calculations in which torsion angles (or the other features) are constrained to lie within the discrete feature bins in principle can provide such a mapping from feature strings back to three-dimensional structures. To investigate the extent to which the native feature string encodes the native three-dimensional structure in this sense, we carried out a second set of structure generation calculations in which the native torsion bin and secondary structure features for every position were enforced. The results of this test are listed in (columns 2 and 5). For 29 of the 32 proteins, the lowest energy models had RMSDs less than 3 Å; in 22 of these the RMSD was less than 2Å. Thus, once the overall region of conformational space is indicated by the native torsion bin and secondary structure feature strings, Rosetta conformational search can generally identify the native minimum. As shown in , the low energy models superimpose well with the native structure.
Rosetta generates accurate high-resolution structures when native feature strings are constrained
Since the native feature string provides sufficient information to sample the native three-dimensional structure using Rosetta, we can now formulate the probability of sampling the native state in structure prediction calculations as the product of the probability of sampling the native feature string in unconstrained trajectories with the probability of sampling the native state given the native feature string, i.e. P(native state) = P(native feature string) * P(native state | native feature string). The second (conditional probability) term can be estimated by determining the frequency of sampling the native state in calculations such as those in in which the native feature string is enforced. An example of such a constrained sampling curve is shown in (dashed line).
The above formulation allows us to recast the sampling problem in feature space: the amount of sampling required to solve a given structure can be related to the frequency of sampling the native feature string. For almost all proteins, the native values of most features are sampled frequently in Rosetta trajectories. For example, unbiased trajectories for 1di2 frequently sample the native torsion bin features for all but one nonterminal residue (). This is not surprising as one of the central ideas in the Rosetta approach is to model as accurately as possible the distribution of local conformations populated by each segment of the protein chain. However, for some proteins, particularly those that require large amounts of sampling to locate the native state, we find one to three features whose native values are almost never sampled in unconstrained runs (such as the torsion bin for residue 23 in ) yet are almost always present in low-energy near-native structures. A previously described example of such a feature is a rarely sampled kinked helix present in the native structure (residue F65 in 1pva) and enriched in low-energy low-RMSD models produced by Rosetta6
Native torsion bin features are frequently sampled with Rosetta
As described in the Materials and Methods, we systematically searched for rarely sampled features which when constrained yielded a much greater frequency of near-native structures. Such “linchpin” features were identified for 16 proteins in our test set and are listed in . Columns 5 and 6 compare the frequencies of these features in the overall sample population to the frequencies in low-energy low-RMSD models from the control runs. The frequencies in the overall control run population (column 5) ranged from 0.41 to 0.00046, while in contrast the frequencies in low-energy low-RMSD models (column 6) were greater than 0.80 for most of the proteins.
For each protein with identified linchpin features, the probability of generating a low-energy low-RMSD structure was estimated using P(native structure) = P(native linchpin features) * P(native structure | native linchpin features); P(native linchpin features) is the frequency of sampling the native values of the linchpin features in the unconstrained population and P(native structure | native linchpin features) is the frequency of sampling the native state when the linchpin features are constrained at their native values. The validity of this approximation can be assessed directly for the proteins for which the native structure was sampled in unbiased runs (category 1), since P(native structure) can be measured directly (it is simply the frequency of low-energy low-RMSD structures in the unbiased runs; ). These directly observed values are compared to the values obtained using the above approximation in (column 10 = column 5 * column 8). For the majority of proteins, the predicted value was reasonably consistent with the observed value--the observed probabilities of 8 out of 11 proteins were within a factor of 3 of the modeled values (as described in the Materials and Methods, discrepancies exist since P(native structure|feature) is somewhat underestimated in constrained runs due to artifacts introduced by the constraint).
As illustrated in , the increases in the frequency of sampling the native state when constraining a small number of linchpin features to their native values are often quite dramatic. For some cases, such as ribosomal protein l7/l12 (1ctf) and translation initiation factor IF3 (1tif), rare but critical native beta-contact, torsion, and secondary structure features were never simultaneously sampled in the control runs. The undetectably low frequency of sampling critical combinations of rare features was estimated in such cases by determining the frequency of one of the features and multiplying by the frequency of the other features in calculations where the first feature was constrained: P(feature 1 and feature 2) = P(feature 1) * P(feature 2 | feature 1). For 1ctf and 1tif the predicted frequencies of sampling the native linchpin features simultaneously were 0.0000036 and 0.0000080, respectively—in these cases the origin of the sampling bottleneck is clearly evident at the feature space level. For the two proteins, enforcing the linchpin features reduced the amount of sampling predicted to be necessary for a successful structure prediction by nearly 280,000 and 130,000 fold.
The P(native structure) estimates range widely for the different proteins in . At one extreme is recA (2reb), with an observed value of 0.000181, which corresponds to sampling the native state approximately 1 out of every 6000 unbiased runs. At the other extreme, for 1ctf and 1tif, the frequency of sampling the native state in unbiased runs is estimated to be 1.70×10−9 and 1.84×10−8, respectively. The corresponding estimates of the number of trajectories necessary for successful structure prediction for these two proteins are 588 million and 54 million, respectively. Clearly this result cannot be determined by brute force computation! The estimated P(native structure) provides an answer to the question of how much computer power is required to solve the structure of a given protein by de novo structure prediction, and in a sense provides a part of the solution to the prediction problem.
Structural Context of Linchpin Features
The estimated probabilities for successful predictions highlight the limitations of brute-force sampling using Rosetta—the amount of computing required may be considerably out of range of current capabilities. On the other hand, since linchpin features represent rare structural features that are important for sampling near-native conformations, understanding their structural context may help in the development of improved sampling methods. The structural context of the identified linchpin features for each protein is shown in (linchpin features are labeled with bold italicized text). Three general characteristics stand out and are described in detail below.
Structural context of linchpin features
Functional regions ()
Many linchpin features were found within or close to functional regions of the protein (highlighted in red in ). The proteins with linchpin features located within regions important for function include the DNA-binding domain of Mbp1 (1bm8), Rubredoxin (1bq9), the Nova-2 K-homology RNA-binding domain (1dtj), and CheY (2chf). In 1bm8, the positive phi torsion feature of N41 is located within the turn of a helix-turn-helix DNA binding motif7
(). N41 is involved in capping the first helix of the motif and makes the hydrophobic interaction between residues A40 and F42 possible. A beta-contact feature is also located at the termini of a conserved segment that contains the motif. In 1bq9, a torsion and beta-contact feature involve two of four cysteines that make up its metal-binding site (). The torsion feature of E43 in 1dtj is located within a variable loop that, along with a highly conserved GxxG loop motif, is part of a vise-like binding site for RNA8
(). The cis-peptide torsion feature of K108 in 2chf is located in a highly conserved position involved in the active site (). In chymotrypsin inhibitor 2 (2ci2; ), YhhP (1dcj; ), and IIBcellobiose (1iib; ), functional residues are located in positions between paired parallel strands associated with short-range beta-contact linchpin features. A highly exposed loop that contains the reactive site9
in 2ci2 lies between the paired strands of the beta-contact feature, Pair_28_46, and the R46 side-chain of this feature interacts with the reactive site and may be functionally important (). In addition to the beta-contact feature in 1dcj, the positive phi torsion feature of G8 is also located near E13, which lies within a strictly conserved CPxP helix capping motif and is critical for function10
(). In 1iib, the paired strands of the beta-contact feature, Pair_53_75, bring together three strictly conserved residues suggested to be part of the active site, Q57, P56, and Y8211
(). Functional residues were also located close to linchpin features in ubiquitin (1ubi; ) and translation initiation factor IF3 (1tif; ).
Regions that form late in folding ()
For proteins whose folding pathway and transition state have been experimentally characterized and whose transition state is structurally polarized with regions that are significantly formed and disrupted, linchpin features were found in regions of the protein that form late in folding: i.e. regions with low Φ-values. In protein G (1pgx), the lowest Φ-values are in the first beta turn and the loop connecting the helix with the second hairpin; these two regions are in contact in the three dimensional structure (). The linchpin features G8 and D39 lie in the first beta turn and the connecting loop, respectively. The torsion angles of D39 position the adjacent residue V38 which packs on G8; taken together these data suggest the formation of the first beta hairpin and packing on the connecting loop occur late in protein G folding and are bottlenecks in Rosetta simulations. In ubiquitin (1ubi), the positions with the lowest Φ-values12
() are located in the C-terminal region of the protein, which includes the linchpin residues R54 (Tor_54_B) and L67 (Pair_5_67). Both R54 and L67 are involved in important tertiary interactions; the backbone CO of R54 caps the N-terminal end of the major helix through non-local backbone hydrogen bonds, and the side-chain of L67 is part of the protein core. Positions with the lowest Φ-values in barstar13
() flank two linchpin residues, Q55, which has a positive phi torsion feature and is involved in capping the third helix, and V4 from the beta-contact feature, Pair_4_52. The backbone conformation of Q55 is stabilized by a perpendicular aromatic interaction between residues W53 and F56, and backbone hydrogen bonds that are involved in adjacent secondary structure elements (). The tight transition between the strand and helix along with the positive phi torsion feature suggest that this part of the backbone may be under conformational strain. In CheY (2chf) the linchpin residue K108 is in the C-terminal portion of the protein that was found experimentally to form late in folding and where mobility may be important for function ()14; 15
Both in Rosetta trajectories and in reality, the regions described above likely do not form early in folding because they are energetically unfavorable in the absence of stabilizing non-local interactions. In Rosetta these interactions may not form later because they require a concerted “clicking in” which stochastic Monte Carlo moves may not hit upon; in this case the interactions would be missing in the final structures and would be detected as linchpin features by our analysis.
Irregular strands ()
Linchpin features were identified within or adjacent to irregular strands in dsRNA binding domain (1di2), ribosomal protein l7/l12 (1ctf), translation initiation factor IF3 (1tif), and protein G (1pgx). In 1di2, there is a single beta-bulge in an otherwise regular beta-sheet and the torsion feature at the beta-bulge position was identified as a linchpin feature (). For 1ctf, a beta-contact feature (Pair_6_40) in combination with a torsion angle and secondary structure feature at K43 was identified as a linchpin feature. The three-stranded beta-sheet in 1ctf is highly irregular, particularly in the strands involved in Pair_6_40. Beta bulges exist in the first and second strands at K7 and L42, respectively, and hydrogen bonding is disrupted between the strands at the position following K43 (). For 1tif, a beta-contact feature (Pair_5_44) along with a torsion and secondary structure feature at G25 were identified as a linchpin feature (). A beta-bulge exists at the position adjacent to G25, and the corresponding edge strand contains a prominent bend. Pair_5_44 consists of a beta pairing between a very short edge strand that consists of only two backbone hydrogen bonds and an irregular strand that contains a beta-bulge. The edge strand in 1pgx has an irregular left-handed twist and convex curve near the torsion feature, Tor_8_B (). A common characteristic of the linchpin features described above is that the irregular local structures they lie in are primarily edge strands. Irregular edge strands are commonly found in proteins to avoid edge-to-edge aggregation16