|Home | About | Journals | Submit | Contact Us | Français|
A complete understanding of how proteins fold, i.e. self-assemble to their biologically relevant “native state,” remains an unattained goal1. Computer simulation, validated by experiment, is a natural means to elucidate this. There is over a million-fold range in folding rates, suggesting a possible diversity in mechanisms between slow and fast folding proteins2. Very fast (microsecond timescale) folding proteins3,4 appear to fold via a large number of heterogeneous, parallel paths5–7, potentially key for folding on such fast timescales. Does the folding of much slower proteins change this picture?
To date, the slowest-folding proteins folded ab initio by all-atom molecular dynamics simulations with fidelity to experimental kinetics have had folding times in the range of nanoseconds to microseconds. These include the designed mini-protein Trp-cage (~4.1 μs)8, the villin headpiece domain (~10 μs)9, a fast-folding variant of villin (<1 μs)7, and Fip35 WW domain (~13 μs)10. In this communication, we report simulations of several folding trajectories, each from fully unfolded states, of the 39-residue protein NTL9(1–39), which experimentally has a folding time of ~1.5 milliseconds11.
Trajectories were simulated via the Folding@Home distributed computing platform12 at 300K, 330K, 370K and 450K from native, extended, and random-coil configurations using an accelerated version of GROMACS written for GPU processors13, for an aggregate time of 1.52 ms. GPUs play a key role here, allowing for dramatically longer trajectories than previously possible. The AMBER ff96 forcefield14 with the GBSA solvation model15 was used, a combination previously shown to give good results folding Fip35 WW domain10, and shown to exhibit a good balance of native-like secondary structure for a set of small helical and beta sheet peptides studied by replica exchange17.
We find that the native state (taken from the N-terminal domain of the crystal structure of ribosomal protein L918) is stable in this forcefield at 300K, exhibiting decreasing stability with increasing temperature (Figure 1a). RMSD-Cα distributions after 10 μs show well-defined native and collapsed unfolded basins near 3Å and 5Å, respectively. Of the ~3000 trajectories started from unfolded (extended and coil) states at 370K (Figure 1b), two reach an RMSD-Cα < 3.5Å and eight reach an RMSD-Cα < 4Å. No productive folding trajectories were observed at lower temperatures, consistent with the enhanced forward folding rate expected by Arrhenius kinetics. Higher temperature trajectories (450K) exceed the melting temperature of NTL9 in the forcefield.
The observed number of folding events n is consistent with expectations from a simple model of parallel uncoupled folding simulations19 in which folding is modeled as a two-state Poisson process: <n> = ∫M(t)k exp(−M(t)kt)dt, where M(t) is the number of simulations that reach time t (Figure 1b) and k is the experimental folding rate (~640/sec)11. This theory predicts (on average) ~1.8 folding trajectories for the amount of sampling performed, in agreement with the two folding trajectories found in practice. Posterior distributions of folding rates given the amount of simulation time and number of folding trajectories were computed using a Bayesian approach16, which yield expectation values within an order of magnitude of the experimental folding rate.
In addition to native-like conformations, we see near-native configurations, which show heterogeneity in hydrophobic packing, most notably in alternative side chain arrangements in the beta-sheet structure (Figure 2). Most common of these is a non-native hydrophobic core involving residues I4, I18 and I37 (which normally contact the C-terminal helix in the full-length protein) with F5 solvent-exposed.
In order to describe the kinetics and mechanistic aspects of folding, we employ a new paradigm for sampling the global free energy landscape of folding, using Markov State Models (MSMs). MSM approaches, by automatically identifying a set of kinetically metastable states (such as foldons20) and efficiently sampling transitions between these states, can model long-timescale kinetics from much shorter trajectories21–24.
Our strategy for simulating slow-folding proteins is first to generate an initial series of kinetically connected states from both the folding and unfolding directions, and then to use adaptive resampling techniques25 to produce statistically converged estimates of metastable basins and the transition rates between them. In the remainder of this communication, we report progress toward the first goal, by constructing an MSM from the entire set of 370K trajectory data26,27, which we will use to seed future rounds of transition sampling. While additional rounds of adaptive sampling could likely aid in increasing the quantitative power of this model, there are several notable observations which can be made with the current data set.
Key to accurately identifying metastable states is the clustering of trajectory conformations into microstates fine-grained enough to be used for lumping into groups of maximally metastable macrostates26. 100,000 microstate clusters were calculated using an approximate k-centers algorithm28, each with an average radius of 4.5Å RMSD-backbone. Lag times ranging from 1 to 32 ns were used to build a series of MSMs. The implied time scales predicted by these models (obtained by diagonalizing the rate matrix) show a clear spectral gap separating the slowest relaxation time scale from the rest, indicative of single-exponential kinetics (see Figure S1). The implied time scale of the model levels off beyond a lag time of ~10 ns to an implied time scale of ~1 ms, close to the experimental folding time.
An important strength of MSMs is their ability to gain insight at coarser scales by “lumping” the kinetic transitions into a simpler model with fewer states. To gain a mesoscopic view of the folding free energy landscape, we lumped our 100,000- microstate MSM into a 2000-macrostate model. In this view, we find that the metastable states are diffuse collections of conformations over which multiple possible folding pathways can occur, indicating a vast heterogeneity of folding substates that need to be understood in greater detail. At the same time, we can identify highly populated “native” (state n) and “unfolded” (state a) macrostates that dominate the observed relaxation rates (Figure 3 and Figure S2).
The ten pathways with the highest folding flux from macrostate a to n were calculated by a greedy backtracking algorithm (see SI) from the macrostate transition matrix using transition path theory29,30 (TPT). The diversity of pathways demonstrates the power of the MSM approach: although we observe only a few folding trajectories directly, a network of many possible pathways can be inferred from the overlapping sampling of local transitions.
While NTL9(1–39) folds quickly for a two-state folder, it is similar in size to many ultrafast (sub-millisecond) folders that appear to exhibit so-called “downhill” folding. Hence, we would like to understand the structural features that limit the overall folding rate. As in a macroscopic two-state model, the highest-flux pathways in our mesoscopic model are a→m→n and a→l→n direct routes from disordered to structured macrostates, reminiscent of nucleation-condensation. These pathways by themselves, however, account for only ~10% of the total flux, and the structural diversity seen in all pathways is reminiscent of more hierarchical folding models such as diffusion-collision. Thus, we sought to more fully study the 15 macrostates transited by the top ten folding pathways.
To examine structural changes along the folding reaction, we considered three main native structural elements: the central helix (α), the pairing of strands 1 and 2 (β12), and the pairing of strands 1 and 3 (β13). To quantify the extent of native-like structuring for each of these elements we calculated Qα, Qβ12 and Qβ13, respectively (see SI for details). The Q-value is a number between 0 and 1 that quantifies the extent of native-like contacts. We then examined, for each macrostate, the Q-values in relation to the pfold value (committor), a kinetic reaction coordinate. The pfold value is computed from the macrostate transition matrix24,29,30.
This analysis yields several key insights into the folding mechanism of NTL9(1–39) on the mesoscale. We find the “unfolded” state a is compact, and contains a baseline level of residual native-like structure, with Qα near 0.5, and Qβ12 and Qβ13 near 0.2. In general, across the fifteen macrostates studied, Q-values increase as pfold values increase, although the relative balance of Qα, Qβ12 and Qβ13 varies, indicating pathway heterogeneity: i.e. native-like structures can form in different orders (Figures S4–S6). An exception to this, however, is observed for β12 strand pairing. Only for macrostates with pfold > 0.5 (states g-n) does appreciable β12 strand pairing occur (Figure 4). This suggests that the formation of a local strand pair (β12), rather than a nonlocal strand pair (β13), is rate-limiting. This effect is not predicted by strictly topological models of folding in which loop closure entropy loss dominates31, but instead may result from sequence-specific details. Unlike the β13 strand pair, which has a small interaction surface stabilized by hydrophobic contacts, the β12 hairpin contains seven of the protein's eight lysine residues, and three of its five glycine residues in a flexible loop region, features which may imbue β12 with larger barriers to folding. This proposed role of β12 is also consistent with the large changes in kinetics and stability seen experimentally for mutations in the β12 hairpin11.
It is natural to compare our results with previous unfolding simulations of NTL9(1–39) K12M by Snow et al.32. In that work, a detailed characterization of the transition state ensemble required the definition of strand-pairing reaction coordinates corresponding to β12 and β13 formation. In our MSM analysis, no such pre-definition is required. Snow et al. also note the difficulty in resolving kinetic intermediates not captured by the chosen order parameters. Indeed, our structural analysis can resolve subtle kinetic intermediates within the native basin, corresponding to alternative rearrangements of the β12 hairpin loop (Figure S7).
The above results suggest that existing forcefield models using implicit solvent are indeed accurate enough to fold proteins ab initio at long time scales (milliseconds), opening the door to simulating more structurally complex proteins. Moreover, our work demonstrates that there need not be a single pathway or single, dominant mechanism for the folding of a given protein: since the theories proposed for how proteins fold are based on broadly relevant physical principles, it is natural to imagine that multiple mechanisms could be simultaneously present, but that the sequence of the protein, coupled with the chemical environment would control the balance to which each mechanistic pathway is seen.
We thank the NSF for support through FIBR grant EF-0623664, the NIH through R01-GM062868 and Simbios U54-GM072970, and NSF award CNS-0619926 for computer resources.