|Home | About | Journals | Submit | Contact Us | Français|
Quantitatively accurate all-atom molecular dynamics (MD) simulations of protein folding have long been considered a holy grail of computational biology. Due to the large system sizes and long timescales involved, such a pursuit was for many years computationally intractable. Further, sufficiently accurate forcefields needed to be developed in order to realistically model folding. This decade, however, saw the first reports of folding simulations describing kinetics on the order of milliseconds, placing many proteins firmly within reach of these methods. Progress in sampling and forcefield accuracy, however, presents a new challenge: how to turn huge MD datasets into scientific understanding. Here, we review recent progress in MD simulation techniques and show how the vast datasets generated by such techniques present new challenges for analysis. We critically discuss the state of the art, including reaction coordinate and Markov state model (MSM) methods, and provide a perspective for the future.
Understanding protein folding via molecular simulation has been an aspiration of computational chemists ever since Anfinsen uncovered the surprising fact that proteins folded to a unique structure[1–3]. Applying simulation to folding appeals for many reasons. Folding is rapid and complex, requiring atomic-level resolution at nanosecond timescales for a complete detailed picture – outside the hard limits of temporal and spatial resolution of most experimental techniques . Furthermore, the complexity of protein native states and the inherent physical heterogeneity in the folding process have frustrated the search for microscopic physical theories of folding, though some advanced phenomenological approaches have been proposed [5–9]. Thus atomic-level simulations of folding, possessing intrinsically high resolution, have been aggressively pursued with the hope of surmounting these difficulties.
Three main problems must be overcome to achieve useful simulations of protein folding: accurate models (forcefields), sufficient sampling, and robust data analysis. Forcefield development has received much attention from the field, and has been extensively discussed [10–15]. Although more remains to be done to build and validate even more accurate models, forcefields capable of folding proteins in good agreement with experiment already exist (Fig. 1). Instead, the major challenge in producing reliable simulations of folding has been harnessing enough computer power to produce sufficient sampling to study folding. Because classical simulations must integrate Newton’s equations of motion with femtosecond timesteps (10−15 s), folding simulations require ~1012 timesteps to reach millisecond timescales. This expense is compounded by large system sizes (~105 atoms for explicit solvent simulations) and the need to witness many events for statistical confidence, making the computational effort required to study folding via simulation enormous.
Very recently, advances in hardware, software, and sampling techniques have made millisecond simulations possible. In 2010, using an aggregate of 1.5 ms of data, Voelz et. al. reported the first simulation describing the folding of a millisecond folder in implicit solvent, from data generated by the distributed computing network Folding@home [16,17]. Later that year, Bowman et. al. reported a millisecond timescale in the folding dynamics of lambda repressor in explicit solvent . More recently, using 30 milliseconds of aggregate data, Voelz et. al. studied the folding of ACBP on the 10 millisecond timescale, revealing that an experimentally observed folding intermediate was in fact a complex, heterogeneous ensemble of structures . Finally, with the advent of ANTON, a computer specialized for protein simulation, the first single trajectory of millisecond length was reported near the end of 2010 , making it possible to predict folding times of up to 100 µs from a single trajectory.
While challenging, generating enough sampling in an accurate forcefield does not constitute the end of the road for a folding simulation. Another major challenge is gaining scientific insight from the simulation – turning data into knowledge. Insights gained from simulations have already begun to shape the protein folding field, through connection to experiment and analysis of the simulations themselves [18,19,21–25]. These preliminary studies have revealed that the analysis of simulation data is difficult, with cases where certain techniques have led researchers to believe results inconsistent with their raw simulation data [20,26,27]. The simulation community needs to develop general, robust, and easy to use data analysis tools to continue towards the goal of understanding folding.
In this review, we briefly explain how the sampling problem has been overcome, and why we can expect the future to yield even longer simulations more efficiently. As sampling becomes less of an issue, a new challenge in folding simulations raises its head – given the massive amount of data extensive sampling provides, how does one make sense of it all? MD simulations are a high-dimensional time series, and therefore present a “Big Data” challenge [28,29]. We expect the techniques of data analysis will be the new limiting factor in the quest to understand folding through molecular simulation.
Over the past decade, the system sizes and time scales accessible to protein simulations have grown exponentially (Fig. 2). This gain has been achieved through progress on three main fronts: efficient parallelization of MD codes, specialized hardware, and statistical analysis of multiple independent trajectories. While there are many techniques designed to accelerate dynamic sampling via biasing the system in some way (e.g. replica exchange, metadynamics, aMD, string method, etc. [30–34]) each has advantages and disadvantages, and detailed discussion of these methods is beyond the scope of this review. Here, we focus on unbiased molecular dynamics simulations capable of describing realistic system kinetics.
Traditionally, such calculations were accelerated by dividing up single, long simulations across many processors. This tactic is inherently challenging, since each individual machine must communicate with one another, and as the number of machines grows so does the necessary time spent communicating, rather than actually simulating. While significant advances in mitigating communication costs have been made [35–38], such costs nonetheless place a hard upper limit on the efficiency of such techniques.
A second avenue pursued to enhance sampling has been hardware development. GPU technology, adapted from the video gaming industry, has resulted in tremendous acceleration of simulations at very low cost [39–46]. GPU simulations, like their CPU counterparts, are bounded by an upper limit of possible parallelism. Shaw and co-workers overcame these difficulties with ANTON, a special purpose machine, that has generated single trajectories one hundred times longer than previously reported simulations . ANTON combines specialized computer chips and a fast network that allow it to generate simulations with unprecedented speed.
A different approach, Markov state models (MSMs), have recently become a practical alternative to overcoming computational difficulties associated with traditional single trajectory simulations [48–52]. In this paradigm, independent short simulations are generated and then aggregated in a statistical fashion, resulting in a complete model of the system dynamics. The MSM effectively pieces together this complete model from independent parts (trajectories), with each trajectory describing one small part of protein phase space – similar to how a complete picture in a jigsaw puzzle emerges by connecting many individual pieces. When combined in an MSM, one can predict kinetic phenomena on timescales much longer than the individual trajectories used to build the model.
Through this mechanism, the MSM facilitates efficient use of resources, since machine-level parallelism can be employed until it becomes inefficient due to communication costs, and then further parallelism can be gained by running independent trajectories on different machines. Furthermore, because the MSM framework naturally partitions the simulation into many independent parts, and it can provide feedback about which areas of protein phase space are undersampled. Simulations can then be intelligently placed to increase sampling in the areas where it is most needed, through a process called adaptive sampling, avoided wasteful simulation of processes that have already been witnessed with statistical confidence [53–55].
The advances in sampling techniques, along with historical exponential increases in achievable sampling methods (Fig. 2), make us hopeful that protein folding simulations will become routine calculations for commodity hardware within a decade. Indeed, one can now simulate the folding of small proteins in explicit solvent at a rate of up to 100ns/day/GPU [44–46], such that a cluster of 100 GPUs can produce MSMs with the ability to predict the millisecond time scales in only three months. Given such extensive sampling, the next challenge simulators must face is that of data analysis – the process of turning information into knowledge. A successful analysis method should be able to reduce simulation data to its essential scientific features, without oversimplifying. It should let the data tell the story, discovering things that the investigator didn’t think to look for and mitigating any biases she might have. Currently, the analysis techniques employed by the simulation community fall into two broad classes: 1) methods focused on finding reaction coordinates and associated transition states and 2) Markov State Models (Fig. 3).
In the reaction coordinate paradigm, one looks for a single coordinate capable of describing the progress from unfolded to folded structures, and builds a model for kinetics along that coordinate [56–59]. This usually culminates in finding one or more transition state ensembles, usually defined to be the structures along the coordinate that have a 50% probability of proceeding to fold or unfold . These structures are usually assumed to be kinetically relevant in the same way transition states in physical organic chemistry are, such that their geometry reflects the kinetics of the process. These methods are appealing because they reduce information down to a single coordinate – discarding orthogonal degrees of freedom – and identify specific structures (the transition states) to investigate.
Markov state models (MSMs), on the other hand, represent folding as first-order kinetics between a set of discrete states. Automatic methods exist to employ simulation to identify a set of states and parameterize discrete-time master equation describing dynamics on that state space [61–63]. MSMs simplify the simulation analysis by discarding very fast dynamics, below the so-called lag time. Because the lag time is tunable, this allows for multiple levels of resolution, from fine (order nanoseconds) to coarse (microseconds or longer). This tunable nature allows the same model to be simultaneously quantitatively accurate (high resolution) and comprehendible (low resolution), all within a common theoretical framework.
These two techniques do not always yield the same results; there has been significant disagreement between investigators studying the same simulation datasets with different techniques. One such disagreement is at the very heart of understanding how proteins fold – the question of whether folding occurs via a single, dominant pathway or many independent routes. This is one of the most basic questions in the study of folding, and has been actively debated for almost two decades [20,26,64–69].
For instance, in the folding of a WW domain, Shaw et. al. employed a state of the art technique to construct a reaction coordinate for the folding process [20,57]. From that coordinate, they concluded that the folding of the WW domain was mechanistically homogenous, always beginning with the first hairpin – this was in contrast to the previous WW simulations of Noé et. al. . While the coordinate reproduced the correct folding time and was validated by a committor analysis (gave the correct probability that a give structure would fold before completely unfolding) [71,72], it failed to detect a parallel pathway, where the second hairpin of the WW domain folded first [26,27]. Later, the same technique was employed to analyze the folding of 12 small proteins, once again leading to the conclusion of predominantly single, serial folding pathways [67,68]. However, when MSM analysis was applied to these 12 simulations, a richer picture emerged  that suggested two-state models were inappropriate for half of the simulated systems.
In the above examples, the reaction coordinate scheme Shaw et. al. employed was capable only finding the highest-flux folding path, and ignored the others. In fact, reaction coordinates have great difficulties dealing with many of the features that make simulations appealing, such as their ability to elucidate parallel paths. Simulations provide a way to get information about the entire folding process, while reaction coordinates lead one to focus just on transition state(s). Further, reaction coordinates, by construction, discard potentially interesting dynamics orthogonal to the coordinate.
MSMs are not susceptible to these drawbacks, and are able to capture either simple or complex phenomena. While MSM techniques are well developed, some challenges do remain. In particular, the optimal manner in which to partition configuration space [74,75], choose a lag time, and perform adaptive sampling remain unknown. Effective heuristics are currently available for these problems [51,62,63], but systematic errors are often observed in MSMs, and improvements are certainly possible. Answering these challenges is the next step in analysis methods development, and should go a long way towards generalizing MSM techniques beyond protein folding.
Given that the sampling at millisecond timescales has been possible for only two years, and analysis methodology is still immature, unambiguous scientific results learned from atomic simulation have thus far been modest. It will be a major challenge in the next five years to turn advances in sampling and accuracy into scientific insight about how proteins fold.
Despite this relative immaturity, atomistic simulation has already begun to influence our view of protein folding. Detailed comparisons to experiment have been performed for many specific proteins, including villin, NTL9, WW domains, lambda repressor, ACBP, and the 12 fast folding proteins studied by Shaw [16,18–20,22,26,68,70,76,77]. Universally accepted generalities amongst these specific protein simulations have not yet emerged, though some have been suggested, for instance that folding kinetics might be hub-like [78–80], that folding proceeds via parallel paths [26,73], and that the unfolded state of proteins is compact . These hypotheses are by no means completely vetted, and require validation through additional simulations and experiment.
The list of potential questions the folding field might hope to address through simulation is long. A few of the most exciting include
Much effort has been poured into advancing molecular simulation, and in this decade the fruits of that effort are coming to bear. Hopefully with continued progress in sampling and forcefields, combined with powerful analysis techniques, simulation can play a key role – alongside experiment and theory – in discovering how proteins fold.
DS and VSP acknowledge support from the Simbios NIH Center for Biomedical Computation (NIH U54 Roadmap GM072970), VSP acknowledges NIH (R01GM62828). TJL was supported by an NSF GRF, KAB was supported by a Stanford Graduate Fellowship.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.