Recent years have witnessed a tremendous explosion in computational power, which in turn has resulted in great progress in the complexity of the biological and chemical problems that can be addressed by means of all-atom simulations. Despite this, however, our computational time is not infinite, and in fact many of the key problems of the field were resolved long before the current levels of computational power existed. This review will begin with a brief historical overview of the use of multiscale simulations in biology, and then present some key developments in the field, highlighting several cases where the use of a physically sound simplification is clearly superior to a brute-force approach. Finally, some potential future directions will be discussed.
In 1964, Seymour Cray’s “CDC 6600”, with its 1 megaflop performance, revolutionized the world of scientific computing, becoming the world’s first successful supercomputer.1 Since then, computer power has increased astronomically, culminating (at present) in the development of petaflop supercomputers specially designed for, and dedicated to, molecular dynamics simulations and studies of proteins, such as IBM’s Blue Gene Project,2 RIKEN’s “MDGRAPE-3”3 supercomputer, and, more recently, D. E. Shaw’s Anton supercomputer.4 Such a tremendous increase in computational power has resulted in a corresponding increase in the size of the biological systems that can be explored with explicit simulations, as well as in the length of the simulations that can be performed, as is the case, for instance, with a recent attempt at an all-atom ab initio simulation of protein folding for a millisecond folder.5 However, even with the existence of petaflop supercomputers, our available computer time is not infinite, and thus, in many cases, the use of brute-force computational approaches is not the solution. Additionally, as shall be outlined in this review, many biologically important problems have already been successfully resolved by the use of physically sound simplifications, even before the existence of such powerful supercomputers. In fact, there exist cases where one neither can nor should approach the problem without the use of a simplified model. Here, of course, a key question is what level of simplification to employ in order to be able to accurately model the problem at hand without sacrificing too much of the physics of the system, while also taking into account the available computational power and its ability to give a convergent result at that point in the history of the field.
The earliest example of such coarse graining in biology was demonstrated in 1975, with the introduction of a simplified model for protein folding.6 Since then, simplified models have been successfully used to address ever more challenging problems. This review will first provide a brief historical background, and then present key developments in the field, illustrating examples where a physically meaningful simplified approach is clearly superior to the use of a brute-force approach. Particular emphasis will be placed on approaches that allow one to move back and forth between the coarse-grained and explicit models, such as the reference potential and renormalization approaches. Finally, we will point out some potential future directions for the field. It is worth mentioning that, due to space limitations, and considering the breadth of the field, it is not possible to take up all historical issues in this work, and therefore we direct interested readers to some other detailed discussions of the topic (e.g. ref. 7–10, amongst others) for further information.
One of the key problems in biology in the 1970s centered around what later came to be known as the “Levinthal Paradox”.11 That is, in 1968, Levinthal12 pointed out that, with the large number of degrees of freedom in an unfolded polypeptide chain, the number of possible conformations the protein can attain is astronomically large, and sampling all available conformations would take longer than the age of the universe. However, proteins routinely fold on a timescale of seconds or less, and trying to rationalize this apparent paradox became a problem of major importance in the field.
In 1975, Levitt and Warshel (LW) attempted to attack this fundamental problem (keeping in mind the computational power available at that time), and, in order to do so, they introduced a model6 in which the protein side chains are replaced by spheres with an effective potential that implicitly represents the average potential of the solvated side chains, while the main chain is represented by virtual bonds between the Cαs. This model was used in simulations that were initialized from an extended starting conformation, in which the only information about the native protein used during the simulation was that all the torsional angles α were set to 180°, with the exception of residues 48 to 58, in which α was set to 45° (i.e. a helical configuration). The simulations involved “thermalization” using the normal modes of the system (see ref. 6), and thus allowed for effective movement on the landscape of the system. Remarkably, this drastically simplified approach was able to locate native-like structures while starting from the unfolded state, making it (probably) the first realistic dynamical treatment of large amplitude motion in proteins, as well as the first physically based solution to the Levinthal paradox.6 A further useful simplification suggested at the same time was a model that kept the helices of the simplified model in a fixed helical configuration.13 This drastic simplification appeared to provide very interesting insights into the packing of helices, and into the role of helix–helix interactions in forming the overall structure of helical proteins.
Significantly, later in the same year as our model, Gō and coworkers14 introduced another CG model for protein folding. This model, which has come to be referred to as a “lattice model” (in contrast to our approach, which has been termed an “off-lattice model”), considers the system as a chain of non-intersecting units of a given length on a 2D square lattice. However, at least in its initial incarnation, this model sacrificed too much physics to be considered a realistic model of a protein, and thus could not be used to explore the protein folding puzzle before the introduction of major changes. That is, while this is clearly an interesting approach, amongst other problems, constraining the system to a 2D square lattice limits the correspondence between this simplified 2D model and the actual complex structure of a protein, making it potentially too primitive to produce a real protein topology. Note that the problems with this approach have been discussed elsewhere, by us as well as by other workers.15,16 Nevertheless, despite this, the model explored some interesting formal issues, such as the partition function of the simplified model, and has been used by some of the key workers in the field,17–21 while, in contrast, other works that tried to be more realistic used the LW model.
The discussion above illustrates the fact that different simplifications are needed for different problems, and that the selection of which model to use depends on the question being asked. Here, both models provided tremendous insight into the folding process on the corresponding landscapes and timescales,16 as well as encouraging both further experimental studies and the initial attempts at all-atom simulations (e.g. ref. 22 and 23). At this point, it is worth commenting on the common assumption that being able to run long trajectories (e.g. a millisecond trajectory for an ion channel, or a single microsecond trajectory that explores protein folding)24 presents, in and of itself, a major breakthrough. That is, while such studies are clearly interesting and technically exciting, it is essential to remember that, in order to understand e.g. selectivity, one must run multiple simulations and explore the effect of different parameters on the overall current (i.e. ion flux).25 Of course, with increasing computational power, it will be possible to generate multiple long trajectories with all-atom models, as was done in a recent protein folding study which did in fact attempt to examine multiple folding trajectories for a millisecond folder.5 However, the computational resources expended in doing this are out of the reach of the majority of researchers in the field, and by the time such computational power is standard, it is most likely that most of the key problems will have been resolved by the use of simpler models.
Of course, when modelling a complex system, a key issue is what level of simplification should be applied, and how simple is “too simple”? As illustrated above, this depends firstly on what the question of interest is, secondly on how deep an understanding of the problem one hopes to attain, and finally (and perhaps most critically) on what level of computational resources one has available. However, it should be noted that, despite the contemporary widespread use of simplified models, when they were first introduced, such models were not easily accepted by the scientific community. An excellent example of the early problems in accepting (and comprehending) such simplified models by those whose background was from more rigorous schools can be illustrated by considering a referee report on one of the very early papers in this field.26 This paper introduced the first physically valid microscopic model for chemical processes in water, using a soft-sphere dipolar model for the crucial solvation effects. However, at that point in the field, a work with two water molecules and an ion27 would be considered very reasonable, presumably because it had a complete Hamiltonian, whereas, according to one referee, the soft-sphere dipolar model did not “present a complete Hamiltonian” and thus was not justified. The point here is not the opinion of this particular referee on the topic, but rather the difficulty in accepting the fact (which now looks obvious) that a system with a simplified model for each water molecule, with a physical representation of the boundary between the explicit system and the bulk by surface constraints, and with a continuum model for the bulk, is, in fact, an excellent model for microscopic studies of solvation effects (see also refs 28 and 29). However, a preoccupation with minute details can prevent one from realizing this point, thus making one unable to study the relevant system for many years.
Ultimately, when deciding what approach and level of simplification is suitable for studying the problem at hand, the most critical question remains that of what the relationship between the corresponding simplified and explicit models actually is. Strategies for addressing this issue are described below.
As mentioned in Section II, when using a simplified model, it is essential to use a model that manages to capture the physics of the all-atom model in a meaningful way, and, to assess this, it is essential to be able to understand what the relationship between the two models actually is. In this section, we will present two general approaches that allow us to move from simplified to full and full to simplified models, namely the “reference potential” and “renormalization” approaches respectively, and subsequent sections will show practical examples of the implementation of these approaches.
As CG models are, by nature, approximate treatments, it is important to be able to move from the CG model to a more realistic explicit model. One solution30 to this problem was presented some time ago, with the use of a simplified model as a reference potential for explicit calculations of folding free energies (note that other workers have since explored related strategies).21,31,32 In addition to its use in protein folding studies,30 this strategy has proven successful in solving a wide range of problems, including the acceleration of QM/MM calculations,33,34 and path integral calculations of nuclear quantum mechanical (NQM) effects.35,36 Additionally, this strategy has become particularly important in recent years in light of the increased interest in the relationship between landscapes and function.21,37–40 This approach resolves the problem of obtaining an adequate correspondence between the explicit and simplified models by attempting to generate the results of the explicit model, using the simplified model as a reference potential. Specifically, following the idea introduced in ref. 30, if one considers the partition functions Qsp and Qep of the simplified and explicit models of a protein, respectively, as well as the ratio between them (i.e. Qsp/Qep), it is possible to obtain the free energy of transferring between the two surfaces by the following relationship:

ΔGsp→ep = (1/β) ln(Qsp/Qep) = −(1/β) ln(Qep/Qsp) (1)
Here, β = 1/kBT, where kB is the Boltzmann constant, and T is the absolute temperature. In order to evaluate eqn (1), we start by expressing the simplified (sp) and explicit (ep) potentials as:

Usp = Usp(R) (2)

Uep = Uep(R, r) (3)
where R represents the coordinates of the simplified model, and r represents the coordinates of the explicit side chain atoms, relative to the centers of the corresponding side chains in the simplified model (where the main chain is designated by mc). The task at this stage is to evaluate the free energy of moving from the simplified to the explicit model, i.e. ΔGsp→ep or ΔGep→sp. This can be expressed formally in terms of the ratio between the corresponding partition functions (see ref. 41):

exp(−βΔGep→sp) = ⟨exp(−β(Usp − Uep))⟩Uep (4)
where ⟨…⟩Uep denotes an average over trajectories run using the explicit potential, Uep, and the averaging is formally performed over the combined R + r coordinate set. The practical calculations are done with a mapping potential, using the relationship Um = Uep(1−λm) + Uspλm, which leads to:

ΔG(λm → λm+1) = −(1/β) ln⟨exp(−β(Um+1 − Um))⟩Um (5)
Here, ⟨…⟩Um denotes an average over trajectories run using the mapping potential Um, which is used to move the system gradually from the simplified (λ = 1) to the explicit (λ = 0) model, and:

ΔGsp→ep = Σm ΔG(λm → λm+1) (6)
Thus, the free energy of moving between the simplified and explicit models can be evaluated using a multistep perturbation approach, or alternatively by the linear response approximation42 (LRA). Now, in many applications we are not interested in the total partition function and the total free energy, but rather in the potential of mean force (PMF) of the simplified and explicit models. In such cases, we define the PMF by considering a partial partition function, i.e.:

Qsp(X′) = ∫ δ(X − X′) exp(−βUsp) dR (7)

Qep(X′) = ∫ δ(X − X′) exp(−βUep) dR dr (8)
where X is the parameter that defines the PMF, with Δgsp(X) and Δgep(X) being defined according to the thermodynamic cycle shown in Fig. 1. The free energy for the transition between the simplified and explicit PMFs is then obtained by the same strategy as in eqn (5), where we obtain the required ΔΔgsp→ep by using the mapping potential of eqn (5), and the relationship:41

Δgep(X) = Δgsp(X) + ΔΔgsp→ep(X) (9)
The power of this approach lies in the fact that the computationally expensive free energy of moving between the two relevant states in the explicit model (Δgep) is obtained by evaluating the free energy of moving between the states using the CG model (Δgsp), while the free energy of moving from the CG to the full surface (ΔΔgsp→ep) only needs to be evaluated at the end points of the simulation, thus saving significant computational time (see Fig. 1). The use of eqn (4) and/or eqn (9) in studies of protein folding and protein stability has been reported by us in ref. 30 and 41, and other workers (e.g. ref. 21, 32 and 43) have also recently explored the same direction. It is useful to address here a potential concern about the evaluation of ΔΔg using only the end-points of the mapping, where one needs to factor in the fact that the minima of the simplified and explicit models are not necessarily identical. It is, however, worth mentioning that this issue has been examined by us in detail in a related situation, namely in the challenging case of performing proper sampling of ab initio surfaces by means of hybrid QM/MM calculations.33,34 Here, we demonstrated that a classical reference potential can be used to evaluate ab initio QM/MM (QM(ai)/MM) free energy surfaces, and that, in fact, the LRA end-point approach can be quite powerful even when the two potentials and the two TSs differ greatly (see discussion in e.g. ref. 34). Additionally, in the context of QM(ai)/MM calculations, we recently developed an approach which we refer to as “Paradynamics” (see Section IV.5), which uses an automated parameter refinement approach to minimize the difference between the two potentials.44 In other words, while there is indeed a computational challenge when dealing with two potentials with different minima, with care, this is a resolvable issue.
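The mapping procedure discussed above can be conveyed with a toy one-dimensional sketch, in which the "simplified" and "explicit" models are harmonic potentials and the Boltzmann averages along the mapping potential Um are evaluated by direct quadrature on a grid, as a stand-in for the MD sampling used in practice. All function names and numerical choices here are illustrative, not the actual implementation of ref. 30:

```python
import numpy as np

def fep_mapping(u_sp, u_ep, beta=1.0, n_windows=10):
    """Free energy of moving from the simplified (sp) to the explicit (ep)
    potential via the mapping potential U_m = U_ep*(1 - lam) + U_sp*lam.
    Averages are taken by quadrature on a 1D grid instead of MD trajectories."""
    x = np.linspace(-6.0, 6.0, 4001)
    lams = np.linspace(1.0, 0.0, n_windows + 1)  # lam = 1 -> sp, lam = 0 -> ep
    dG = 0.0
    for lam_a, lam_b in zip(lams[:-1], lams[1:]):
        U_a = u_ep(x) * (1.0 - lam_a) + u_sp(x) * lam_a
        U_b = u_ep(x) * (1.0 - lam_b) + u_sp(x) * lam_b
        w = np.exp(-beta * U_a)
        w /= w.sum()                              # Boltzmann weights on U_a
        dG += -np.log(np.sum(w * np.exp(-beta * (U_b - U_a)))) / beta
    return dG

# Harmonic test case U = (k/2)x^2, for which dG_sp->ep = (1/2beta) ln(k_ep/k_sp)
u_sp = lambda x: 0.5 * 1.0 * x**2
u_ep = lambda x: 0.5 * 4.0 * x**2
dG = fep_mapping(u_sp, u_ep)
```

For this test case the exact answer is (1/2β) ln(4) ≈ 0.693; the LRA mentioned above would instead estimate the same quantity as half the sum of the average perturbation ⟨Uep − Usp⟩ evaluated on the two surfaces.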
In any case, the idea of using a reference potential has also been exploited by us in a wide range of problems, including path integral calculations of nuclear quantum mechanical (NQM) effects35,36 as well as (as mentioned above) the acceleration of QM/MM calculations.33,34,45,46
At this stage, it is worth touching on the quantum classical path strategy,35,47–49 which is another powerful approach that exploits the idea of transferring between two potentials. Specifically, this approach evaluates the nuclear quantum mechanical corrections to free energy surfaces by means of a perturbation from a classical trajectory to a path integral centroid potential. Perhaps the best testimony to the effectiveness of this approach is the fact that others have not only attempted to adopt it,48–50 but have even tried to attribute it to other workers,51 and that it has been reincarnated in closely related approaches.52
Finally, other possible applications of our reference potential approach include the evaluation of the effect of mutations on protein stability. While this can in principle be done by evaluating the folding PMF for both systems, it is much simpler to do so by means of a thermodynamic cycle of the type presented in Fig. 1, as was done in ref. 41 which used this approach to study the mutation of ubiquitin. Additionally, the reference potential idea considered above provides a promising strategy in the field of enzyme design, where it can be used to evaluate the binding free energy of rate-determining transition states. This can be done by using the cycle described in Fig. 1 and ref. 41, while focusing on the electrostatic free energy contribution, and the potential of this approach in enzyme design has been discussed in ref. 53.
Like the reference potential approach, the renormalization approach25,54–57 allows one to move between an explicit and a simplified model, thus making it possible to capture the long timescale dynamics of the full explicit model and helping to obtain accelerated free energy surfaces. That is, this approach aims to obtain the best correspondence between the free energy and dynamics of a full explicit model (obtained by running all-atom molecular dynamics simulations) and those of a corresponding implicit Langevin dynamics (LD) model of reduced dimensions (with its simplified free energy landscape). This is usually achieved by first imposing a series of constraints of different strengths on both the explicit and simplified models, which force the system to move along a given reaction coordinate on different timescales, and then adjusting the friction in the simplified model until the timescale for the motion (for each constraint) becomes equivalent in both models. For example, as demonstrated in Fig. 2, we take an explicit system (in this case a truncated model of gramicidin), evaluate the time dependence of the process of interest (in this case ion transport) for different force constants (Fig. 2a), and then force a three-dimensional LD model to reproduce the time dependence for each force constant by adjusting the frictional coefficient (Fig. 2b). After validating the LD model (by showing that it reproduces the results of the explicit model for all force constants, cf. Fig. 2c), one can run trajectories on the LD model without any force constants, and study the long timescale behaviour of the actual simplified system (Fig. 2d).
The advantage of using external constraints is that they allow one to run all-atom molecular dynamics simulations while forcing the system to undergo large structural changes within a reasonable simulation time (where larger constraints force a faster transition). The point is that the runs with the large constraint allow us to obtain the optimal effective friction (the value of which cannot be obtained by standard approaches, as discussed below), and to then use this friction for long timescale LD simulations without the constraints, a process which would otherwise require enormous computational time. This approach has been successfully and effectively used in studies of the selectivity of ion channels,25 simulations of proton transport58 and a study which critically examined the possible role of protein dynamics in enzyme catalysis,55 demonstrating the power of this approach in obtaining free energy surfaces.
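The friction-fitting step described above can be sketched in a few lines: run overdamped LD with a trial friction, measure the mean first-passage time along the coordinate, and scan the friction until the LD timescale matches that of the constrained explicit simulation. This is only a schematic, flat-potential sketch with illustrative parameter choices, not the actual renormalization protocol of ref. 25 and 57:

```python
import numpy as np

def mean_passage_time(gamma, kT=1.0, L=1.0, n_traj=100, dt=1e-3, seed=0):
    """Mean first-passage time of an overdamped Langevin particle (on a flat
    potential) from x = 0 to |x| = L, for a trial friction gamma."""
    rng = np.random.default_rng(seed)
    noise = np.sqrt(2.0 * kT * dt / gamma)   # fluctuation-dissipation step size
    times = []
    for _ in range(n_traj):
        x, t = 0.0, 0.0
        while abs(x) < L:
            x += noise * rng.standard_normal()  # zero mean force on a flat PMF
            t += dt
        times.append(t)
    return float(np.mean(times))

def renormalize_friction(target_time, gammas=(0.5, 1.0, 2.0, 4.0, 8.0)):
    """Pick the friction whose LD passage time best matches the passage time
    observed in the (constrained) explicit model, mimicking the fitting step
    illustrated in Fig. 2."""
    return min(gammas, key=lambda g: abs(mean_passage_time(g) - target_time))
```

For a flat potential, the mean first-passage time grows linearly with the friction (L²γ/2kBT), which is why matching the passage times for each constraint pins down an effective γ.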
An important issue in our validation of the renormalization approach56,57 is the selection of the optimal friction. That is, while the use of frictional models for describing biological and chemical processes has a long history and is well established (see e.g. ref. 59–62), as we recently demonstrated,56,57 the frictions obtained using standard frictional models cannot be safely used (in an LD model without a memory kernel) when the focus is on obtaining similar physics in the full and CG systems, and/or on studying long timescale biological processes. However, by “fine-tuning” the friction such that the passage time and overall time dependence obtained from simulations in different models are similar, it is possible to generate a reliable correspondence between the simplified and explicit models (note that the dynamical properties of our renormalization approach can be further refined by requiring that the relevant velocity autocorrelation functions, or other related properties, have similar behaviour in both the simplified and explicit models). This issue was discussed (and illustrated) in detail in a recent work57 which validated our renormalization approach on a truncated gramicidin system in the gas phase, presenting the first detailed systematic validation of this approach. The advantages of this particular system are that, on the one hand, it is small enough to allow for very long explicit simulations, while simultaneously being complex enough to meaningfully represent the physics of realistic ion channels. Specifically, this work focused on examining the usefulness of our approach in renormalizing steered molecular dynamics simulations,63,64 which, while popular, have the problem that it is not clear how to convert the results obtained by using the large external forces required in such simulations to the corresponding results that would have been obtained without the external force, at equilibrium.
Fig. 2 illustrates (i) the overall dependence of the first passage time through the channel on the force constant of the explicit and implicit models used in ref. 57, and (ii) long timescale trajectories of both models for a system with a low barrier, and without a pulling force. From this figure, it can be seen that not only does the renormalization give excellent agreement between the behaviour of the implicit and explicit models over a very wide range of first passage times, but also that, once the system has been renormalized, it is possible to reproduce the long timescale behaviour of the explicit model without a pulling force, and without expending the enormous computational time required to obtain such trajectories in the explicit system. Additionally, it was demonstrated that, by adjusting the effective potential in the implicit model to reproduce the same passage time as that of the explicit model under the influence of an external force, it is possible to accelerate the evaluation of free energy barriers. That is, by having a reasonable effective friction, it is possible to extract the relevant PMF without investing the time needed for regular PMF calculations. Here, it is worth mentioning the similarities in spirit between our renormalization approach and the elegant “Surrogate Process Approximation” (SPA) of Calderon and coworkers,65 which uses time series techniques66,67 in order to estimate a low-dimensional stochastic differential equation (SDE). These SDEs approximate the dynamics of an observed signal, which can come from either a computer simulation or an experiment. Recently,65 the authors applied such SDEs in order to approximate the various statistical properties associated with SMD simulations of ion transport across a channel protein. However, it is worth noting that the SPA approach is predominantly focused on evaluating the PMF, whereas, as illustrated in ref. 57 and in our recent study of the long timescale conformational dynamics of adenylate kinase,55,56 our approach is directed more towards providing long timescale properties of the system which cannot be estimated by other approaches. Ultimately, the validation of a simulation approach arises not from its formal elegance, but rather from its ability to obtain reasonable and convergent results for the systems of interest. Here, it is essential that there is some correspondence between the full and CG models, and we believe that the best way to obtain this is by simultaneously adjusting the friction and PMF of the CG model until the best agreement between the two models is obtained.
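The spirit of such surrogate SDE models can be conveyed with a minimal example: given an observed one-dimensional signal, estimate the drift and diffusion coefficients of an Ornstein–Uhlenbeck-type SDE by least squares on the increments. This is a schematic stand-in for the time series techniques of ref. 65–67, with all names and parameters being illustrative:

```python
import numpy as np

def simulate_ou(theta, sigma, dt=0.01, n=20000, seed=0):
    """Euler-Maruyama path of dx = -theta*x dt + sigma dW, serving here as
    a toy 'observed signal' (e.g. a reaction coordinate time series)."""
    rng = np.random.default_rng(seed)
    x = np.empty(n)
    x[0] = 0.0
    for i in range(1, n):
        x[i] = x[i-1] - theta * x[i-1] * dt \
               + sigma * np.sqrt(dt) * rng.standard_normal()
    return x

def fit_ou(x, dt=0.01):
    """Least-squares estimate of drift and diffusion coefficients of an
    OU-type SDE from an observed time series (a minimal surrogate model)."""
    dx = np.diff(x)
    xs = x[:-1]
    slope = np.sum(xs * dx) / np.sum(xs * xs)  # regression: dx ~ -theta*x*dt
    theta_hat = -slope / dt
    resid = dx - slope * xs                    # leftover noise increments
    sigma_hat = np.sqrt(np.var(resid) / dt)
    return theta_hat, sigma_hat
```

Once the surrogate parameters are fitted, long trajectories of the low-dimensional SDE can be generated at negligible cost, in the same spirit as the renormalized LD trajectories discussed above.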
Multiscale simulations play an increasing role in studies of complex biological problems. Here, we describe some of the advances in the field.
In recent years, significant interest has developed in MD simulations that can take changes in ionization states during the simulated process of interest into account (see e.g. refs 68–70). However, a shortcoming of current models is the fact that they do not consider the time dependence of the proton transport (PTR) process, as a result of which they cannot reproduce the proper time dependence of the response of a protein to pH changes. Therefore, in order to advance in this challenging field, we combined our earlier approach of performing time-dependent Monte Carlo (MC) simulations of PTR processes71 with the simplified protein model described in Section III in studies of pH-dependent MD.41
Our model represents the energetics of the system by means of a simplified version of the EVB approach, which uses the modified Marcus equation58 in order to take the energetics of any possible proton transfer (PT) step into account. Here, the free energies of the different protonation states are obtained from the electrostatic energy of the CG model. The MC moves are based on the electrostatic energies of the CG model, and are then scaled by the characteristic PT time in order to correspond to the rate constant that would be predicted by transition state theory. More specifically (see ref. 72 and 73), the free energy of each protonation state can be expressed by:

ΔG(m) = Σk [−2.3kBT qk(m)(pKint,k − pH)] + (1/2) Σk≠k′ Wkk′qk(m)qk′(m)
where m designates the vector of the charge states of the given configuration (i.e. m = (q1(m), q2(m), …, qn(m))). Here, qk(m) is the actual charge of the kth group in the mth configuration, and the term Wkk′qk(m)qk′(m) represents the charge–charge interactions. Finally, pKint,k is the intrinsic pKa of the kth group when all other groups are neutral. The barrier for the PT moves is then given by:58

Δg≠ = (ΔG0 + λ)²/4λ − ⟨H12⟩x≠ + (⟨H12⟩x0)²/(ΔG0 + λ)
Here, ΔG0 is the free energy of the reaction, λ is the reorganization energy, and H12 is the EVB off-diagonal term that mixes the two relevant states, whose average values at the transition state (x≠) and at the reactant state (x0) are designated by ⟨H12⟩x≠ and ⟨H12⟩x0, respectively. The first term is the expression of the regular Marcus equation (see ref. 58), which corresponds to the intersection of Δg1 and Δg2 at x≠. The second and third terms represent, respectively, the effect of H12 at x≠ and at x0.
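The barrier expression described above (the regular Marcus term, the mixing at the transition state, and the reactant-state correction) can be coded directly; as a sanity check, with the off-diagonal terms set to zero it reduces to the regular Marcus barrier (ΔG0 + λ)²/4λ. Variable names here are illustrative:

```python
def pt_barrier(dG0, lam, h12_ts=0.0, h12_rs=0.0):
    """Modified Marcus barrier for a proton transfer step:
    the regular Marcus intersection term, lowered by the off-diagonal
    coupling at the transition state (h12_ts), plus a second-order
    correction from the coupling at the reactant state (h12_rs)."""
    return (dG0 + lam)**2 / (4.0 * lam) - h12_ts + h12_rs**2 / (dG0 + lam)

# With no mixing this is the regular Marcus barrier:
marcus = pt_barrier(0.0, 40.0)  # lam = 40 kcal/mol gives a 10 kcal/mol barrier
```

In the MC scheme discussed above, such barriers would control which PT moves are accepted and on what effective timescale.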
Using the model above allowed us to convert an MC procedure to a time-dependent simulation using additional simplified modifications,58 which exploit the isomorphism between the probability obtained from the MC procedure and the probability factor from transition state theory (TST) (see ref. 41 for details). Additionally, the PT to and from the bulk is determined by the considerations described in ref. 41.
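A minimal sketch of such a titration MC is given below: protonation-state vectors are sampled by Metropolis flips using intrinsic-pKa and charge-charge terms, and each attempted flip would then be scaled by the characteristic PT time to obtain the time-dependent picture. The sign conventions, the two-group test case, and the neutral/ionized charge convention are illustrative assumptions, not the parameterization of ref. 41:

```python
import numpy as np

def state_free_energy(q, pK_int, pH, W, kT=0.592):
    """Free energy (kcal/mol) of a protonation-state vector q, from intrinsic
    pKa terms plus charge-charge interactions (toy sign conventions)."""
    c = 2.303 * kT                       # 2.3 kB T at ~300 K, in kcal/mol
    g = -np.sum(c * q * (pK_int - pH))   # intrinsic-pKa contribution
    g += 0.5 * (q @ W @ q)               # charge-charge term (W has zero diagonal)
    return g

def mc_titration(pK_int, pH, W, n_steps=50000, kT=0.592, seed=0):
    """Metropolis MC over protonation states; in a time-dependent version,
    each move would additionally be scaled by a characteristic PT time."""
    rng = np.random.default_rng(seed)
    n = len(pK_int)
    q = np.zeros(n)                      # 0 = neutral, -1 = ionized acid (toy)
    counts = {}
    for _ in range(n_steps):
        k = rng.integers(n)
        q_new = q.copy()
        q_new[k] = -1.0 if q[k] == 0.0 else 0.0
        dG = (state_free_energy(q_new, pK_int, pH, W)
              - state_free_energy(q, pK_int, pH, W))
        if dG <= 0.0 or rng.random() < np.exp(-dG / kT):
            q = q_new
        counts[tuple(q)] = counts.get(tuple(q), 0) + 1
    return {k: v / n_steps for k, v in counts.items()}
```

For a small number of groups the stationary distribution of this chain can be checked directly against the Boltzmann weights obtained by enumerating all protonation states.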
While discussing CG simulations of PT, it is important to note that renormalized simulations with the LD model have also been used to study PTR in carbonic anhydrase,74 as well as in gramicidin.58 In both cases, it was established that PTR in proteins is controlled by the electrostatic free energy barrier, and not by the Grotthuss mechanism (this issue is discussed in detail in ref. 75). However, when such simulation studies become too time consuming, it is possible to move to the MC approach, which was used in ref. 71 to study PTR in cytochrome c oxidase (CcO).
However, the renormalized LD model was sufficiently powerful to allow us to explore the selectivity of the KcsA potassium channels.25 The latter study was performed by first evaluating the free energy for the penetration of a single ion by means of the PDLD/S-LRA model, and then evaluating the ion–ion interaction “on the fly” by use of a large effective dielectric constant, where the ion–ion interaction, ΔGij, is given by:

ΔGij = 332qiqj/(εeff(rij)rij) + ΔG′(rij)

(with energies in kcal mol−1, charges in units of the electron charge, and distances in Å)
where qi and qj are the charges of the corresponding ions, rij is the distance between the two ions, and the dielectric function, εeff(r), is a distance-dependent dielectric constant. Finally, ΔG′(rij) is a complex additional interaction potential that has been discussed in detail in ref. 25, but, in brief, serves two purposes: (a) to introduce a steric repulsion term to prevent ions of opposite sign (e.g. K+ Cl− or Na+ Cl−) from “collapsing” together, and (b) to mimic the effect of the explicit water molecules in the channel, which are missing from the implicit solvation model (again, for discussion, see ref. 25). The friction constant was then evaluated by the same renormalization strategy discussed in Section III, and, as was also demonstrated in ref. 57, our treatment of the charge–charge interactions appears to provide an extremely effective way of studying long timescale processes in ion channels.
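A schematic implementation of this screened interaction is shown below, with an illustrative saturating form for εeff(r) and with the additional ΔG′(rij) term omitted. The 332 factor converts to kcal/mol for charges in units of e and distances in Å; the particular functional form and parameters of εeff are assumptions for illustration, not those of ref. 25:

```python
import numpy as np

def eps_eff(r, eps_bulk=80.0, mu=0.18):
    """Illustrative distance-dependent dielectric that grows from a small
    short-range value toward the bulk value at large separations."""
    return eps_bulk * (1.0 - 0.9 * np.exp(-mu * r))

def ion_ion_interaction(qi, qj, r):
    """Screened ion-ion interaction in kcal/mol (charges in units of e,
    distances in angstroms; 332 is the usual electrostatic conversion
    factor). The short-range term dG'(r) of the text is omitted here."""
    return 332.0 * qi * qj / (eps_eff(r) * r)
```

Evaluating this term "on the fly" for each ion pair is what makes the implicit treatment cheap enough for the long timescale LD trajectories discussed above.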
When studying light-induced electron and proton transport processes, there are two timescales of interest. In the short time limit, which ranges from femtoseconds up to hundreds of picoseconds, it is possible to use explicit all-atom simulations in order to effectively explore the nature of the corresponding primary events, as was the case in studies of the primary photochemical event in bacteriorhodopsin76 as well as of photosynthesis.77 In contrast to this, on the long timescale, which ranges from hundreds of picoseconds to seconds, CG models become important in exploring the behaviour of the system. For instance, an important challenge is posed by the simulation of electron transport (ET) processes. Here, we introduced many of the key approaches (see ref. 77) for obtaining the relevant activation barriers from direct simulations. However, as far as the present review is concerned, it is worth mentioning that the long-range ET steps can be explored by the same MC approach as that used for studying PT, but with a characteristic time reflecting the corresponding pre-exponential factor.71
Another example of this can be seen in a recent study78 which combined the QCFF/PI and EVB methods in a unified QM/MM framework, in order to examine the energetics of the primary proton transfer in bacteriorhodopsin. The advantage of such an approach is that it allows for significant sampling, which facilitates obtaining a quantitative free energy surface for the proton transfer at different protein configurations. Specifically, this study verified that the primary proton transfer is, after all, a light-induced charge separation process, rather than being driven by e.g. twist-modulated changes in the pKa of the Schiff base, changes in hydrogen bond directionality, or other non-electrostatic effects. Additionally, this study was important because it provided the first glimpse into the energetics of protein conformational changes.
In recent years, there has been great interest in studies of free energy landscapes. Here, a reduced model that still captures the main physics of the given system can be extremely helpful. An important issue when modeling such landscapes is the choice of a proper reaction coordinate. An excellent example of the use of a reduced coordinate can be seen in the use of an order parameter, ρ, to describe the folding landscape.79 Here, ρ serves as a measure of the degree of similarity of each state to the native conformation, and the energy has been demonstrated to be, on average, a decreasing function of ρ (in line with the assumption of minimal frustration). At present, however, the most effective option for modeling the catalytic landscape is arguably provided by the EVB formalism (see e.g. ref. 80 and references cited therein for discussion).
An excellent example of this is provided by a recent study39 which examined the catalytic landscape of chorismate mutase. This study used a CG model to generate the free-energy landscape of the enzyme, followed by an explicit EVB evaluation of the activation barriers for the chemical step in different regions of the landscape, which was defined in terms of the conformational and chemical coordinates. The approach introduced in ref. 39 has also been combined with the renormalization approach outlined in Section III in order to study the relationship between conformational dynamics and catalysis in adenylate kinase.55 This study, which was the first of its kind, allowed us to examine the conformational dynamics of adenylate kinase in the relevant ms timescale, and demonstrated that contrary to popular assumptions (for a discussion see ref. 81 and references cited therein) there is no evidence for dynamical contributions to enzyme catalysis.
An area in which CG models have shown great effectiveness is the study of vectorial translocation processes, where the details of the simulated system are not always clear. For instance, a recent study54 of the vectorial translocation of single-stranded DNA by translocases focused on the electrostatic interaction between the protein and the ionized phosphate groups of the DNA main chain. Here, the simplified system used the PDLD/S-LRA electrostatic potential to generate a unique free energy surface with a clear valley leading in one direction, supporting a vectorial process. LD simulations were then run on the corresponding system (Fig. 3), recovering unidirectional translocation in which the energy of ATP hydrolysis is coupled to the translocation process, thereby verifying the simple insight obtained by inspecting the surface. This study provided the first fully consistent simulations of a biological vectorial process in which the results were neither assumed a priori nor introduced by phenomenological parameters (see discussion in ref. 54).
More recently, Warshel et al. examined the challenging problem of understanding the mechanism of the insertion of transmembrane (TM) helices through the translocon.82 Here, the combination of uncertainty about the structural changes of the translocon and the complexity of the insertion process itself suggests that, despite elegant recent studies of this problem using all-atom models,83 such models are not optimal for examining it. In contrast, a simplified CG model allows for the exploration of a large range of configurational changes, as well as for drawing reasonable conclusions even in cases where the available structural information is “fuzzy”. Thus, our study aimed to address this problem,82 and, in particular, to attempt to rationalize the small apparent insertion energy, ΔGapp, of an ionized residue in the center of a TM helix.84 This was achieved by means of a CG model that provided a more sophisticated treatment of electrostatic effects than existing CG models (for technical details see ref. 82). This approach was then used to examine the energetics of each step in the insertion process, considering only the energetics of inserting the helices into the relaxed protein, without fully optimizing the protein structure or exploring its deformation free energy (the missing deformation energy should in principle make the states where the helices are outside the translocon more stable). Fig. 4 shows the free energy of the two helices of interest in the membrane as a function of helix rotation. Here, the first helix, TM1, has polar regions, while the second helix, H, carries a positively charged Arg. As can be seen from the graph, the energy decreases when the charged Arg points towards the TM1 helix, reducing the exposure of the charge to the membrane.
The energetics of the H helix and the corresponding ΔGapp were then examined in much greater detail, focusing on the case where the Arg residue is in the center of the H helix (for details see ref. 82). From this analysis, it was found that the lowest free energies correspond to the case where the insertion is assisted by the polar parts of either the translocon or the other helices, with the lowest-energy configuration giving a ΔGapp in reasonable agreement with experiment. Thus, this study provides a rationale for the observed ΔGapp of ionized residues, suggesting that the apparent insertion free energy of TM helices with charged residues probably reflects more than just the free energy of moving an isolated single helix from water into the membrane.
The QM/MM approach85 is an excellent example of the multiscaling idea in modeling complex systems. This model, with its crucial electrostatic embedding idea (see e.g. ref. 45 and 86 for reviews), has rapidly become one of the most popular approaches for studying chemical reactivity in general, and enzyme function in particular (for discussion, see ref. 45 and references cited therein). Here we will focus only on the key issue of the need for extensive configurational sampling during the course of QM/MM calculations. That is, QM/MM simulations without proper sampling are of limited value. The problems with limited energy minimization have already been discussed elsewhere;87 here, however, we would like to emphasize that since the enzyme active-site landscape is quite complex, estimating the QM/MM reaction path along a fixed reaction coordinate can reflect artificial minima. This is a particularly challenging problem when one is dealing with ab initio QM systems in QM(ai)/MM simulations. Addressing the need to properly sample ab initio surfaces has, however, led to several important advances,45,46,88–95 many of which45,46,90,91,93–95 exploit our idea46,96 of utilizing a classical potential as a reference for QM/MM calculations. Similarly to the cycle shown in Fig. 1, the key point in the use of a reference potential for QM/MM calculations is the evaluation of the free energy of moving from the reference potential to the corresponding ab initio potential. This free energy can in principle be evaluated either by a single-step free energy perturbation (FEP) approach or by means of the linear response approximation (LRA). While both approaches are viable, the LRA is particularly powerful, as it allows reasonable results to be obtained even in cases where the ab initio and simplified potentials differ significantly.
Here, the empirical valence bond (EVB) approach, which has been discussed in detail elsewhere (e.g., ref. 80 and 97), provides a particularly powerful reference potential, as it stores a tremendous amount of chemical information. Such an approach has been both effectively and successfully applied to the study of activation barriers in solution and in proteins,34,46,96,98 most notably in ref. 34, which successfully resolved the highly controversial issue (see ref. 99) of the energetics of the reference reaction for haloalkane dehalogenase (DhlA) in solution, using the EVB as a reference potential for QM(ai)/MM calculations. Specifically, this study34 clarified that the earlier EVB estimate99 of the catalytic effect and the effect of the enzyme is quantitatively correct, and it remains at present the only true QM(ai)/MM study that considers the free energy surface of the DhlA reaction in both the protein and in solution.
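The two estimators mentioned above can be sketched in a few lines. Given samples of the energy gap ΔE = E_ai − E_ref collected on the reference potential and on the ab initio potential, the single-step FEP and LRA expressions are simple averages; the gap values below are synthetic numbers used purely for illustration.

```python
import math
import statistics

# Sketch of the free energy of moving from a classical reference potential
# (e.g. EVB) to an ab initio QM/MM potential. With dE = E_ai - E_ref:
#   single-step FEP:  dA = -kT * ln < exp(-dE/kT) >_ref
#   LRA:              dA ~ 0.5 * ( <dE>_ref + <dE>_ai )

KT = 0.596  # kcal/mol at ~300 K

def fep_one_step(gaps_ref):
    """Single-step FEP estimate from gaps sampled on the reference potential."""
    return -KT * math.log(
        sum(math.exp(-g / KT) for g in gaps_ref) / len(gaps_ref))

def lra(gaps_ref, gaps_ai):
    """LRA estimate, averaging the gap over both ensembles."""
    return 0.5 * (statistics.mean(gaps_ref) + statistics.mean(gaps_ai))

gaps_on_ref = [2.1, 1.8, 2.4, 2.0, 1.9]  # dE sampled on the reference (toy)
gaps_on_ai = [1.2, 1.0, 1.4, 1.1, 1.3]   # dE sampled on the ai potential (toy)
print(fep_one_step(gaps_on_ref))
print(lra(gaps_on_ref, gaps_on_ai))
```

Note the design trade-off the text alludes to: single-step FEP converges only when the reference ensemble overlaps well with the ab initio one, whereas the LRA averages over both ensembles and therefore tolerates larger differences between the two potentials.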
Recently, we also advanced the idea of performing QM(ai)/MM calculations of solvation free energies using a classical reference potential, by means of a powerful approach that represents the solute environment by an average solvent potential, which is then added to the solute Hamiltonian.45,46 This approach has been demonstrated to reduce the computational cost of QM(ai)/MM-FEP calculations of solvation free energies by up to 1000-fold for simple systems in which the solute structure is kept fixed during the simulation.46
Finally, it is important to point out that the use of the EVB as a reference potential for QM(ai)/MM calculations is, in fact, far more effective than the subsequently developed and extremely popular “metadynamics”100,101 (see ref. 44, 45 and 56). That is, while the metadynamics approach incrementally adds a bias potential to the system’s potential, placing Gaussians in the deep wells until the sum of the two potentials becomes approximately flat (see Fig. 5a), our approach, which we now call “paradynamics”, starts with an EVB potential that is already similar to the system’s potential, and then simply refines the EVB until the difference between the system’s potential and the EVB potential approaches zero (Fig. 5b; see also ref. 44). Thus, the two approaches are formally similar; however, paradynamics requires far fewer of the computationally expensive evaluations of the QM(ai) potential.
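The contrast between the two strategies can be caricatured in one dimension: metadynamics accumulates Gaussians until the biased surface flattens, whereas a paradynamics-like scheme refines a reference potential until its deviation from the real surface vanishes. Everything in this sketch (the double-well “real” potential, the single fitting parameter, the least-squares update) is a toy construction for illustration, not the actual paradynamics protocol of ref. 44.

```python
import math

# One-dimensional cartoon contrasting the two strategies on a toy
# double-well "real" potential with minima at x = +/-1.

def e_real(x):
    return (x**2 - 1.0)**2

# Metadynamics-style: deposit Gaussians at visited minima, so that
# e_real(x) + bias(x) gradually flattens.
def meta_bias(x, centers, h=0.3, w=0.3):
    return sum(h * math.exp(-(x - c)**2 / (2 * w**2)) for c in centers)

# Paradynamics-style: start from a reference E_ref(x) = a*(x^2-1)^2 that is
# already close in form to the real surface, and refine the parameter a by
# least squares until E_ref - e_real approaches zero everywhere.
def refine_reference(a0=0.5, lr=0.05, iters=100):
    xs = [i * 0.1 - 2.0 for i in range(41)]          # grid on [-2, 2]
    a = a0
    for _ in range(iters):
        grad = sum(2 * (a * (x**2 - 1)**2 - e_real(x)) * (x**2 - 1)**2
                   for x in xs) / len(xs)            # d/da of mean residual^2
        a -= lr * grad
    return a

print(refine_reference())                # converges toward a = 1
print(meta_bias(1.0, [1.0, -1.0]))       # bias deposited in both wells
```

The point of the cartoon is the cost structure: the bias in metadynamics must be built up from scratch by many evaluations of the expensive surface, while the refinement loop starts from a cheap reference that is already approximately right and only needs enough expensive evaluations to close the residual gap.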
The use of simplified models to study biological and chemical problems has expanded dramatically since the introduction of our simplified protein folding model in 1975.6 In particular, the dramatic advances in computational power since then have greatly expanded the complexity of the problems that can be addressed by computational means. When attempting to obtain convergent results for complex systems, issues such as proper sampling become central, and it is necessary to run and examine multiple independent simulations. This means that meaningfully addressing such challenging problems as the insertion of transmembrane helices into membranes (which was recently achieved by means of a simplified model)82 remains out of the reach of all-atom simulations, even with current computational power. The use of simplified models, however, has allowed problems to be addressed (and sometimes resolved) that would otherwise have been impossible to study by means of all-atom models.
This review has aimed to introduce readers to the history of the field, starting with the introduction of a simplified model to study protein folding,6 as well as to cover some of the most important advances in the field (while emphasizing our own work). Obviously, due to the vastness and rapid growth of this field, it is not possible to cover its entire breadth in the limited space of this review; however, two recent developments are worth mentioning at this point. The first of these is the MARTINI force field of Marrink and coworkers,102 which represents a major productive effort in modeling membranes by CG models. This model uses extensive calibration of the building blocks of the CG force field against thermodynamic data, and shows reasonable behavior for lipid bilayers in terms of the stress profile across the bilayer and its tendency to form pores, as well as accurate agreement with all-atom simulations for the free energies of lipid desorption and even (to a more limited extent) flip-flopping across the bilayer. However, despite the promise of this approach, it should be noted that while taking the idea put forth in ref. 6 and extending it to membranes would have been the most logical direction three decades ago, the effects of the membrane, despite their overwhelming importance, are frequently second order as far as the function of the folded protein is concerned. In other words, even though the folding itself cannot be established without the membrane, in most cases the most important effect of the membrane on the function of membrane proteins can be captured by modelling the membrane as a grid of induced dipoles.103,104 Despite this, much early and recent effort in modelling membrane effects has instead been invested in constructing detailed membrane models,105–109 which are both computationally expensive and potentially ineffective (particularly when studying the relevant functional membrane proteins).
Here, it would have been far preferable to garner the relevant basic information using a simplified model and only then move to more complex models over time.
Another important example can be seen in a recent study5 that presented simulations of several folding trajectories of NTL9(1–39), a protein with a folding time of ~1.5 ms, which is far longer than is otherwise accessible to ab initio all-atom molecular dynamics simulations. Such a long simulation time was achieved by means of distributed molecular dynamics simulations with implicit solvent on multiple GPU processors, using the Folding@Home computing platform,110 and the resulting simulations agreed reasonably well with the experimental folding rate. Clearly, it is tempting to see such simulations as a great advance in the field. However, even though such simulations can be performed (using exhaustive computational resources) for the folding problem, where the question at hand is relatively simple, when one is dealing with the subject of function (where the problem needs to be thought about in depth, and it is crucial to explore different mechanisms), having only a brute-force result (most likely after several months of running the simulation) is unfortunately not the optimal strategy. Our point may not be intuitive, considering that it is not common in the field to explore multiple mechanisms with multiple options (a case in point being cytochrome c oxidase; see ref. 111 and the discussion therein); however, one would like to be able to contemplate the problem at hand in detail before one forgets what the original issue was while waiting for the simulation to run. We are of course not discrediting the enormous potential of advanced technology here; rather, we simply want to remind those who are in awe of such accomplishments that the actual returns in terms of improved understanding are not guaranteed.
Clearly, from this review it can be seen that brute-force simulations are not necessarily the best way to address interesting biological and chemical problems; rather, a powerful compromise between detail and computational cost can be obtained by means of physically meaningful multiscale approaches, which allow one to effectively explore complex systems and to model long-timescale processes that are otherwise out of the reach of standard computational techniques. This review has focused on approaches that use a simplified model as a reference potential for all-atom simulations, highlighting several examples where such an approach has been able to solve key problems in the field. Ultimately, it is up to the reader to judge the validity of a given simplification; however, as illustrated by the examples outlined here, we believe that the use of CG multilevel approaches is advisable in almost any simulation of biophysical systems, starting from folding and moving to enzyme design, energy and signal transduction, and clearly to larger-scale problems such as protein translocation.
SCLK would like to thank the Swedish Research Council (grant 2010-5026), and AW would like to thank the NCI (grant 1 U19 CA105010-1), NSF (grant MCB-03442276) and NIH (grant GM24492) for funding. Additionally, both authors would like to thank Nikolay Plotnikov and Spyros Vicatos for their assistance in the preparation of the artwork for this manuscript.
Shina Caroline Lynn Kamerlin
Lynn Kamerlin is currently an Assistant Professor of Organic Chemistry at Stockholm University, in a special position funded by the Swedish Research Council (VR). She obtained her PhD in 2002 from the University of Birmingham (UK) under the supervision of John Wilkie, and joined the faculty of Stockholm University in 2011 after postdoctoral training in the Boresch (2005–2007), Warshel (2007–2010) and Himo (2010) groups. Her scientific research interests center on theoretical physical organic chemistry and computational biophysics, and her non-scientific interests include both ancient and modern languages (5 spoken fluently), amateur photography, and playing the piano.
Arieh Warshel is a Distinguished Professor of Chemistry and Biochemistry at the University of Southern California. He is a member of the National Academy of Sciences and a Fellow of the Royal Society of Chemistry. Dr Warshel received his BSc in 1966 from the Technion in Israel and his PhD in 1969 from the Weizmann Institute. He has pioneered the key computer modeling approaches for simulating biological functions. These include the development of consistent treatments of electrostatic energies in proteins, the development of the QM/MM approach, the introduction of molecular dynamics simulations to studies of biological processes and the introduction of microscopic thermodynamics cycles in biological systems.