In order to describe the kinetics and mechanistic aspects of folding, we employ a new paradigm for sampling the global free energy landscape of folding, using Markov State Models (MSMs). MSM approaches, by automatically identifying a set of kinetically metastable states (such as foldons
20) and efficiently sampling transitions between these states, can model long-timescale kinetics from much shorter trajectories
21–24.
Our strategy for simulating slow-folding proteins is first to generate an initial series of kinetically connected states from both the folding and unfolding directions, and then to use adaptive resampling techniques
25 to produce statistically converged estimates of metastable basins and the transition rates between them. In the remainder of this communication, we report progress toward the first goal, by constructing an MSM from the entire set of 370K trajectory data
26,27, which we will use to seed future rounds of transition sampling. While additional rounds of adaptive sampling could likely aid in increasing the quantitative power of this model, there are several notable observations which can be made with the current data set.
Key to accurately identifying metastable states is the clustering of trajectory conformations into
microstates fine-grained enough to be used for lumping into groups of maximally metastable
macrostates26. 100,000 microstate clusters were calculated using an approximate
k-centers algorithm
28, each with an average radius of 4.5Å RMSD-backbone. Lag times ranging from 1 to 32 ns were used to build a series of MSMs. The implied time scales predicted by these models (obtained by diagonalizing the rate matrix) show a clear spectral gap separating the slowest relaxation time scale from the rest, indicative of single-exponential kinetics (see
Figure S1). The implied time scale of the model levels off beyond a lag time of ~10 ns to an implied time scale of ~1 ms, close to the experimental folding time.
An important strength of MSMs is their ability to gain insight at coarser scales by “lumping” the kinetic transitions into a simpler model with fewer states. To gain a mesoscopic view of the folding free energy landscape, we lumped our 100,000- microstate MSM into a 2000-macrostate model. In this view, we find that the metastable states are diffuse collections of conformations over which multiple possible folding pathways can occur, indicating a vast heterogeneity of folding substates that need to be understood in greater detail. At the same time, we can identify highly populated “native” (state
n) and “unfolded” (state
a) macrostates that dominate the observed relaxation rates ( and
Figure S2).
The ten pathways with the highest folding flux from macrostate
a to
n were calculated by a greedy backtracking algorithm (see
SI) from the macrostate transition matrix using transition path theory
29,30 (TPT). The diversity of pathways demonstrates the power of the MSM approach: although we observe only a few folding trajectories directly, a network of many possible pathways can be inferred from the overlapping sampling of local transitions.
While NTL9(1–39) folds quickly for a two-state folder, it is similar in size to many ultrafast (sub-millisecond) folders that appear to exhibit so-called “downhill” folding. Hence, we would like to understand the structural features that limit the overall folding rate. As in a macroscopic two-state model, the highest-flux pathways in our mesoscopic model are a→m→n and a→l→n direct routes from disordered to structured macrostates, reminiscent of nucleation-condensation. These pathways by themselves, however, account for only ~10% of the total flux, and the structural diversity seen in all pathways is reminiscent of more hierarchical folding models such as diffusion-collision. Thus, we sought to more fully study the 15 macrostates transited by the top ten folding pathways.
To examine structural changes along the folding reaction, we considered three main native structural elements: the central helix (α), the pairing of strands 1 and 2 (β
12), and the pairing of strands 1 and 3 (β
13). To quantify the extent of native-like structuring for each of these elements we calculated
Qα,
Qβ12 and
Qβ13, respectively (see
SI for details). The
Q-value is a number between 0 and 1 that quantifies the extent of native-like contacts. We then examined, for each macrostate, the
Q-values in relation to the
pfold value (committor), a kinetic reaction coordinate. The
pfold value is computed from the macrostate transition matrix
24,29,30.
This analysis yields several key insights into the folding mechanism of NTL9(1–39) on the mesoscale. We find the “unfolded” state
a is compact, and contains a baseline level of residual native-like structure, with
Qα near 0.5, and
Qβ12 and
Qβ13 near 0.2. In general, across the fifteen macrostates studied,
Q-values increase as
pfold values increase, although the relative balance of
Qα,
Qβ12 and
Qβ13 varies, indicating pathway heterogeneity: i.e. native-like structures can form in different orders (
Figures S4–S6). An exception to this, however, is observed for β
12 strand pairing. Only for macrostates with
pfold > 0.5 (states
g-
n) does appreciable β
12 strand pairing occur (). This suggests that the formation of a
local strand pair (β
12), rather than a
nonlocal strand pair (β
13), is rate-limiting. This effect is not predicted by strictly topological models of folding in which loop closure entropy loss dominates
31, but instead may result from sequence-specific details. Unlike the β
13 strand pair, which has a small interaction surface stabilized by hydrophobic contacts, the β
12 hairpin contains seven of the protein's eight lysine residues, and three of its five glycine residues in a flexible loop region, features which may imbue β
12 with larger barriers to folding. This proposed role of β
12 is also consistent with the large changes in kinetics and stability seen experimentally for mutations in the β
12 hairpin
11.
It is natural to compare our results with previous unfolding simulations of NTL9(1–39) K12M by Snow et al.
32. In that work, a detailed characterization of the transition state ensemble required the definition of strand-pairing reaction coordinates corresponding to β
12 and β
13 formation. In our MSM analysis, no such pre-definition is required. Snow et al. also note the difficulty in resolving kinetic intermediates not captured by the chosen order parameters. Indeed, our structural analysis can resolve subtle kinetic intermediates within the native basin, corresponding to alternative rearrangements of the β
12 hairpin loop (Figure S7).