|Home | About | Journals | Submit | Contact Us | Français|
Computer simulations are indispensable tools for studying the structure and dynamics of biological macromolecules. Biochemical processes occur on different scales of length and time. Atomistic simulations cannot cover the relevant spatiotemporal scales at which the cellular processes occur. To address this challenge, coarse-grained (CG) modeling of the biological systems are employed. Over the last few years, many CG models for proteins continue to be developed. However, many of them are not transferable with respect to different systems and different environments. In this review, we discuss those CG protein models that are transferable and that retain chemical specificity. We restrict ourselves to CG models of soluble proteins only. We also briefly review recent progress made in the multi-scale hybrid all-atom/coarse-grained simulations of proteins.
Classical all-atom molecular dynamics simulations with explicit solvent and ions are routinely used to investigate the dynamics of biological macromolecules. Such simulations can reveal the mechanism of biochemical processes occurring on different scales of length and time1. With the availability of highly optimized atomistic force fields, specialized supercomputers, scalable codes, and enhanced sampling methodologies, one can study the dynamics of biomolecules in a realistic fashion and explain and predict experimental results. Most notable recent highlights are folding studies of small proteins by Schulten et al.2 and Shaw et al.3 The Schulten group has obtained a continuous trajectory of 10 μs while the Shaw group has performed 1 ms long simulation that would have been unthinkable not long ago. Other examples include simulations of proteins in cellular environments4–6 to investigate the effect of cellular-like environments on the stability and diffusive properties of biological macromolecules and simulations of very large complexes such as RNA polymerase II7 and MutSα/MutSβ8,9. Despite such successes, atomistic simulations are still several magnitudes away from being able to effectively cover the spatiotemporal scales of most cellular processes.
To address this challenge, the most promising strategy is a reductionist approach, i.e. a simplification of the underlying model that is used to describe the biological system (see Figure 1). This can be achieved by coarse-graining of the system. In coarse-grained (CG) models several atoms are grouped into a single particle (CG bead) and the interactions between these CG beads are then described by effective potentials that incorporate both energetic and entropic contributions resulting from integrating out (or averaging over) the neglected atomistic details. The coarse-graining of the system can greatly reduce the number of degrees of freedom of the system compared to a complete all-atom description of the system. As a result, less computational power is required to simulate the system. Furthermore, the interactions between CG sites are usually “softer” which allows for longer integration time steps in molecular dynamics simulations. Typical time steps that can be used with CG models are tens of femtoseconds compared to 1–2 fs used in all-atom MD simulations. CG models combined with the dissipative particle dynamics (DPD) allow time steps of 100 femtosecond or above. Interactions in CG models tend to be of shorter range than in all-atom models due to the use of effective energy terms and typically the expensive treatment of long-range electrostatics can be avoided. This results in additional significant speedups in CG simulations compared with atomistic simulations. Finally, CG simulations generate inherently accelerated dynamics since the energy landscapes in CG models are smoother compared with atomistic models. Because of these advantages over fully atomistic simulations, CG models have the potential to cover significantly larger spatial and longer temporal scales compared to all-atom models.
In the past few years there has been a flurry of CG models for proteins with varying complexities10,11. These CG models are employed for various applications, such as study of protein folding12,13 problems, protein–protein docking14, structure predictions15,16, scoring protein decoys17, structural features of viral capsid18,19, structure and dynamics of membrane or protein-membrane systems20, and influence of certain amino acid substitutions in the HIV-1 Gag polyprotein and how the resulting physical effects influence HIV-1 virion formation and structure21.
There is no unique way of coarse-graining a protein22. It depends on the levels of granularity and specific applications. The CG mapping schemes of proteins fall into two major categories: shape-based19 and residue-based approaches11. The shape-based CG models emphasize outwards appearance and are thus not suitable for studying protein folding problems or distinguishing the dynamics between the wild-type and mutant variants of a protein. Residues-based models are suited best for studying internal dynamics and effects of chemical specificity of different residues.
Depending on the CG mapping scheme and specific applications, CG force fields may range from simple effective interaction potentials to more realistic all-atom like force fields. Generally, two different approaches, namely bottom-up (structure-based coarse-graining) and top-down (thermodynamic-based coarse-graining), are followed to generate a CG force field. In the bottom-up approach, CG force fields are parameterized based on comparisons to reference atomistic simulations. Several approaches, such as inverse Monte Carlo (IMC)23, iterative Boltzmann inversion (IBI)24, force matching (FM)25–27, fluctuation matching28, minimization of relative entropy29, or conditional reversible work (CRW)30,31 are employed in bottom-up approaches to match a set of target distributions from atomistic simulations. The bottom-up approaches are also highly amenable to multi-scale approaches since the fine grained (AA) information is propagated to a lower resolution (CG) model. On the other hand, in top-down approaches, CG force fields may be parameterized against experimental data, in particular thermodynamic data such as oil/water partitioning coefficients and density. Among various bottom-up approaches, IBI and FM methods are most popular and are frequently used to parameterize CG force fields for biomolecular simulations. IBI reproduces the structure of the underlying atomistic model by means of an iterative optimization scheme through which the effective potentials between CG sites are obtained. This method is motivated by Henderson’s uniqueness theorem32 which states that, for a given pair radial distribution function, there is a unique underlying pair potential that will produce it. On the other hand, FM does not depend on the pair correlation functions. Instead, the idea of FM is to match the average force on CG beads in the CG system to that expected from the atomistic model. We are not going to discuss the underlying theories of different parameterization procedures employed in the bottom-up approaches and for an extended overview of such topics, the reader is referred to a recent review by Noid33.
A major challenge remains in developing CG models that are highly predictive with respect to (unknown) specific behavior of specific systems rather than simply recapitulating either results from atomistic simulations or the generally expected characteristics of protein biophysics. This generally requires that CG models exhibit a sufficient degree of both transferability and chemical specificity. The range of CG models is very broad given different levels of coarse-graining and various philosophies of the parameterization of the CG force fields. Here, we will focus primarily on recent advances in developing such transferable CG force fields as they show the most promise in broad future applications of CG methodologies.
This review is organized as follows: First, we give a review of popular CG models for proteins. Next, we discuss recent advances in hybrid all-atom/coarse-grained (AA/CG) simulation methodologies. Finally, we conclude with a short outlook.
Generally, a CG model involves the following four choices34: (i) resolution of the model appropriate for the size of the system and the types of questions that are going to be addressed; (ii) mapping of the CG beads to AA sites; (iii) type of interaction potential, and (iv) parameterization of the interaction potential to faithfully describe the energetics and dynamics of the biological system in comparison with experiment and AA simulations. CG models of varying complexity have been proposed for proteins since the pioneering work of Levitt and Warshel35. Their resolution may vary from one or a few beads per amino acids to near atomistic resolutions11,36–43. The coarser the description, the larger the gain in computational speedup. However, coarser models lack transferability and may not be suitable for multi-scale simulations.
Early on, one-bead models38,44,45 were most popular. For instance, Tozzini and McCammon have developed a Cα based single bead CG model to study the dynamics of flap opening in HIV-1 protease38. In their model, each amino acid is represented by a single bead placed on the Cα atom. The force field was parameterized based on the Boltzmann inversion procedure. A similar approach was adopted by Korkuta and Hendrickson45 to build the virtual atom molecular mechanics (VAMM) force field. To study the disordered state of proteins, Ghavami and co-workers42 have proposed a one-bead-per-amino-acid model. Generally, one-bead models are highly system-specific, such as the widely used Gō models, and they typically lack chemical specificity of amino acid side-chains. Furthermore, one-bead models are typically not well suited for multi-scale simulations because of their low resolution.
In contrast to other CG models, a model developed by the Voth group did not rely on a predefined potential but instead utilized the multi-scale coarse-graining (MS-CG) method25,46,47. In this approach, the CG model and interaction potential was derived via a force-matching method in which atomistic force information from reference simulations is matched to forces from CG potentials within a variational framework26,27. One of the advantages of the MS-CG approach is that it can be applied at arbitrary resolutions. The MS-CG method has been applied to the study of protein folding and dynamics13,48. However, MS-CG potentials are commonly not transferable. These models will not be described here. The reader is referred to earlier reviews on such models48–51.
Side-chain chemical specificity is essential for transferability and is the key for describing packing interactions within and between proteins. Adding one or several beads for the side-chain in addition to a single bead for the backbone can greatly improve the specificity of local interactions43,52. A detailed description of the backbone, on the other hand, allows a faithful reproduction of secondary structure propensities. In the last few years, a number of CG models were developed that have relatively high resolution, with several beads per residue, and that take into account the chemical specificity of side-chains. These models employ different types of interaction potentials but generally offer at least some degree of transferability. In the following, the main features of the most prominent examples of such models are described. Table 1 summarizes mapping schemes and different force field terms for these models.
The UNited RESidue (UNRES) model was initially developed by the Scheraga group almost two decades ago as a knowledge-based CG model53–55. However, over the years, it has undergone several refinement stages and finally evolved into a fully physics-based model. In the UNRES model53–56, a polypeptide chain is represented as a sequence of α-carbon atoms with attached united side-chains (SC), whose sizes depend on the nature of the amino acid residue and united peptide groups (P), each of the latter being positioned in the middle between two consecutive Cα atoms (Figure 2). The α-carbon atoms are linked by virtual Cα…Cα bonds of length 3.8 Å corresponding to trans peptide groups. It should be noted here that these Cα atoms are not interaction sites per se but rather they assist only in the definition of the geometry. Each SC is represented as an ellipsoid and is attached to the respective Cα atom by a virtual Cα…SC bond.
The force field is derived as a restricted free energy (RFE) function, which corresponds to averaging the energy over the degrees of freedom that are neglected in the UNRES model. The UNRES effective energy function is given by equation 157.
where w represents the relative weights of each term. The initial three terms represent non-bonded interaction energies between SC-SC, SC-P, P-P beads. The next five terms describe the bonded interaction energies corresponding to torsion, double-torsion, angle, SC rotamers, and bonds, respectively. The last two remaining terms model multi-body (correlation) interaction energies and the energy of forming disulfide bonds, respectively. Notably, the correlation term is temperature dependent. The force field is optimized based on the concept of a hierarchical protein-energy landscape58.
The potential function of UNRES is parameterized against free energies from atomistic polypeptide simulations in explicit water using a functional form that resembles atomistic force fields but with additional terms. In UNRES, solvent effects are represented implicitly. Applications of UNRES to date have been largely limited to de novo protein folding studies12,59,60 and protein structure prediction61,62. Recently, Solernou and Fernandez-Recio63 have developed a CG potential by extending the CG potential UNRES with additional terms accounting for a solvation energy contribution and an explicit electrostatic interactions to predict the protein-protein interactions.
Among all of the physics-based CG models, UNRES performs exceptionally well in folding proteins to their native-like conformations. This model has been used successfully in CASP structure prediction competition. For instance, the target T0215 was fold to a Cα RMSD of 3.5 Å from the native state. UNRES has been used to fold multi-chain proteins also. For example, UNRES was used to ab initio folding of the multi-chain protein with PDB ID 1G6U and the lowest RMSD obtained was 2.4 Å from the native state64. Their simulations reveal that individual chain folded independently to their native structure and then assembled into a complex structure.
The UNRES model is physically accurate and highly transferable but it suffers from a number of drawbacks. First, the model requires a very complex reverse-mapping scheme to convert from UNRES to all-atom models and is thus difficult to use within the context of multi-scale AA/CG simulations. Second, UNRES is neither transferable to the lipid environments nor is it easy to extend the model for other biomolecules such as nucleic acids because the necessary model re-parameterization would likely be very time-consuming.
The most popular and most widely used CG model for lipids as well as proteins is MARTINI65–67. This model has been employed in a wide range of applications68–76. The model was initially developed for lipid molecules and later was extended to proteins and several other biomolecules. The MARTINI model is based on a four-to-one mapping, i.e., on average four heavy atoms plus associated hydrogens are represented by a single CG site (Figure 2). However, ring-like molecules (e.g., benzene, cholesterol, and several of the amino acids) are mapped with a higher resolution of up to two non-hydrogen atoms to one MARTINI particle. The backbone of each amino acid is modeled with a single bead placed at the center of mass of the backbone, while up to four beads are used for SC. In contrast to UNRES, solvent is treated at the CG level in MARTINI. Four real water molecules are mapped to a single CG water bead. Recently, polarizable CG water models77,78 were also developed to be used with the MARTINI model. Ions are represented by a single CG bead, which represents both the ion itself and its first hydration shell. Based on the nature of different chemical groups, four main types of particle are assigned in the MARTINI model, namely polar (P), non-polar (N), apolar (C), and charged (Q). Within a main type, subtypes are assigned according to hydrogen bonding capabilities or the degree of polarity. In total, MARTINI has 18 particle types.
The MARTINI CG force field consists of generic bonded (bond, angle, dihedral, and improper dihedral) and non-bonded (van der Waals and electrostatics) interaction potentials. Notably, the electrostatic interaction is described by the Coulomb potential with a relative dielectric constant (εrel = 15) for explicit screening. However, a lower relative dielectric constant of 2.5 is suggested when combining MARTINI with a polarizable water model. The parameterization of this model rely on a combination of bottom-up and top-down approaches. For instance, a bottom-up approach was adopted to derive the bonded terms and were obtained from the underlying atomistic geometry or by comparing to atomistic simulations. On the other hand, the non-bonded interaction potentials were optimized based on a systematic comparison to experimental theromodynamic quantities, such as free energy of hydration and vaporization, and the partitioning free energies between oil and aqueous phases. Recently, the non-bonded parameters were further optimized79 based the binding studies of Wimley-White pentapeptides to a 2-oleoyl-1-pamlitoyl-sn-glyecro-3-phosphocholine (POPC) bilayer80 and SC analog dimerization free energies81. This led to the addition of an explicit polarization term to polar SCs such as glutamine and asparagine. The MARTINI CG model can achieve about a factor of 75–100 speedup with respect to fully atomistic simulations82.
MARTINI suffers from not being able to accurately represent amino-acid specific secondary structure propensities. As a consequence, a weak bias has to be employed to maintain secondary structures based on what is known from native structures. Therefore, MARTINI is not suitable for modeling processes in which secondary structure changes and, more generally, folding and unfolding play a substantial role. However, movements of secondary structures with respect to each other can be studied with the MARTINI model. Furthermore, although MARTINI is very successful in modeling interactions with lipid bilayers and lipid-water interfaces, it is not applicable to systems involving solid/fluid or gas/fluid interfaces because of the fluid-phase based parameterization.
It is worth mentioning here that Peraro and coworkers83 have adopted the MARTINI mapping scheme and have developed an electrostatic-consistent CG potential for proteins in which they have introduced an explicit dipole defined by three consecutive Cα beads along with the treatment of non-radial dipole-dipole interactions in dynamics. As a result, in contrast to MARTINI, this model can maintain stable secondary elements of unspecific proteins and sample conformational transitions.
The OPEP (Optimized Potential for Efficient structure Prediction)16,84–86 is a transferable CG force field for proteins developed by Derreumaux and coworkers. OPEP is an intermediate resolution model where each amino acid is represented with six beads. The backbone is retained in full atomic detail while SCs are represented by a single bead or centroid (Figure 2). However, the amino acid proline (Pro) is represented by all heavy atoms. The centroids are placed at the center of mass of the heavy atoms and calculated employing the rotamer distribution of each SC. Each SC bead is described by an appropriate van der Waals radius, and their positions are defined with respect to the backbone heavy atoms N, Cα, and C′. The OPEP energy function consists of bonded and non-bonded terms. Bonded terms include bonds, angles, dihedrals, and improper torsions (to maintain the cis or trans conformation of the peptide bond and the chirality of the amino acid). Additional terms confine the dihedrals to the most populated regions in the Ramachandran plot via harmonic barriers. The force constants and equilibrium values associated with these terms were derived from the AMBER all-atom force field87. Non-bonded interactions in OPEP consist of van der Waals interactions and explicit hydrogen bonding potentials. The hydrogen-bonding potential consists of two-body and four-body terms, and the later term was used to capture the strong cooperative nature of the amide hydrogen bonds. The van der Waals interactions are knowledge-based as van der Waals radii were calculated using a dataset of thousands of Protein Data Bank (PDB) structures. The van der Waals interactions combine backbone-backbone, backbone-SC terms with 210 SC pair interactions. The aqueous solvent is treated implicitly in OPEP. A genetic algorithm using a scoring function, which requires that the native structure has the lowest energy and that native-like structures have an energy higher than the native structure but lower than non-native conformations, was used to optimize the OPEP force field. In OPEPv4, the van der Waals term was composed of an r−8 term combined with an analytical formulation that quickly vanishes as a function of distance88. This analytical function has a steeper profile with compared to a 12–6 Lennard-Jones potential. In OPEPv4, authors have changed the εij coupling constants of some interaction pairs present in α-helices based on a statistical analysis of PDB structures. To further improve OPEP, Sterpone and coworkers89 recently derived new effective potentials for describing the ion pair interactions from all-atom PMF (potential of mean force), that feature a single minimum for Lys-Asp and Lys-Glu pairs and two minima for Arg-Asp and Arg-Glu. The OPEP force field has been used successfully for predicting protein stabilities as well as folding and aggregation studies of small peptides88,90 and the structure prediction of small proteins91.
OPEP is highly transferable among different proteins, but it is limited to aqueous solvent environments. OPEP requires an integration time-step of 1.5 fs due to the detailed backbone and hydrogen bonds. This limits the amount of sampling that can be accomplished. Furthermore, a single-bead representation of the SC puts a limit on capturing the chemical specificity of SCs specificity and creates ambiguity in when reconstructing all-atom models from the CG model. Therefore, applications within multi-scale frameworks may be limited. Furthermore, due to the lack of chemical SC specificity, this model may not be suitable for investigating protein-protein interactions.
A major challenge in computational modeling of protein-protein interactions is the de novo prediction of protein-protein interfaces. To access the large sampling space, CG simulations are often employed. Basdevant and coworkers14 have developed the physics-based simplified CG model SCORPION (Solvated COaRse-grained Protein interactION) which targets applications in protein-protein recognition. In SCORPION, the backbone of each amino acid is represented by a single bead while one or two beads are used to describe SCs depending on their size (Figure 2). The CG beads are located at the geometric center of the heavy atoms. This mapping scheme yields 29 CG types. An elastic network model (ENM) model was introduced in SCORPION to account for the internal protein flexibility. Because the ENM model maintains the internal protein structure, no bonded terms are necessary in SCORPION. Following classical all-atom force fields, non-polar and electrostatic interactions were expressed as sums of pairwise van der Waals and Coulombic interaction potentials and were optimized in a consistent bottom-up approach. Van der Waals interactions were extracted by fitting van der Waals SC pair interactions to PMFs extracted from all-atom MD simulations that were performed in vacuum with vanishing charges (i.e., no Coulomb interactions) in order to account for the purely non-electrostatic interaction. The resulting van der Waals interaction potential is softer than its atomistic counterpart and is composed of a very smooth repulsive part in 1/r6 and a Gaussian attractive part14. The CG point charges for a protein were obtained by reproducing the electrostatic potentials obtained with an all-atom model in vacuum. The motivation for carrying out the parameterization in vacuum was to obtain a model that can then be coupled to any implicit or explicit solvent model. While this makes sense in principle, this approach neglects solute polarization that is induced upon transfer from vacuum to solvent environments and that may require a polarizable version of SCORPION. As a step in that direction, the authors of SCORPION have recently developed a polarizable coarse-grained solvent (PCGS) model, which can capture such polarization effects in part and was validated successfully against solvation free energies of peptides. The resulting CG model was found to be about 100–130 times faster than the atomistic simulation.
The combined protein and water model was used successfully in mechanistic studies of protein-protein recognition. For instance, they have conducted seven long molecular dynamics simulations starting from several initial configurations to predict the structure of barnase/barstar complex. Notably, the simulations were carried out at a very high temperature (500 K) to accelerate the phase-space sampling. Within 1 to 500 ns, they observed that five of seven simulations adopted a conformation very similar to the native complex structure with RMSDs from the crystal structure varying between 0.5 and 2 Å. Furthermore, their simulations reveal that this protein-protein recognition process is mainly driven by the van der Waals and electrostatic interactions between two partner proteins. These results agree with the previous all-atom simulations92.
The current model suffers from two main drawbacks: lack of bonded interactions and the use of high temperature that affects the balance between the enthalpic and entropic contributions.
Recently, Pasi, Lavery and Ceres (PaLaCe)93 developed a new CG model for proteins for investigating the mechanical properties of proteins. This model uses a two-tier representation of the polypeptide chains: one representation is used for bonded interactions and for backbone hydrogen bonds and involves atomic beads while a different level is used for non-bonded terms, other than main chain hydrogen bonds, and involves a pseudo atom (or CG) representation of the amino acids. One or two CG beads are used for SCs while three beads (N, Cα, and C‵) are used to represent the backbone so that an explicit description of hydrogen bonds becomes possible (Figure 2). The potential energy function of PaLaCe consists of physics-based bonded and non-bonded terms with an additional term for the backbone hydrogen bonding potential, which is introduced to allow secondary structures changes. The van der Waals interactions in PaLaCe are characterized by a Lennard-Jones-like interaction but with a softer repulsive term than atomistic force fields. The Hydrogen bonding potential is composed of 12-6 Lennard-Jones and electrostatic contributions. The electrostatic component of the hydrogen bonding potential is modeled via Coulomb potential damped by a distance-dependent dielectric constant. The solvent is treated implicitly with a one-body solvation term based on circular variance, which is used to describe the occlusion of a CG atom within a protein structure. The PaLaCe potential was parameterized following a bottom-up approach by applying the Boltzmann inversion technique to conformational probability distributions derived from a large data set of protein structures followed by iterative refinement. PaLaCe was shown to maintain stable structures of folded proteins. It can also correctly model large-scale, force-induced conformational changes for the immunoglobulin-like domain of the giant protein titin. PaLaCe is reported to be faster than equivalent all-atom simulations in an explicit solvent by roughly 1000 times. The model is limited to aqueous solvent environments only.
Another transferable CG model for proteins was designed by Bereau and Deserno94. It uses an intermediate level of description of the peptide and treats solvent effects implicitly. Each amino acid is described by four beads: three beads for the backbone and one bead for SC. The CG beads for backbones are located at N, Cα and C′ while the CG bead for SCs is located at Cβ (Figure 2). For non-glycine amino acids, all SC beads are assigned the same van der Waals radius. The bonded terms are generic and the corresponding geometric parameters were taken from existing peptide models39,95,96. The spring constants are set high enough and an approximate 5% flexibility around their reference values was given to account for thermal fluctuations. The peptide bond is modeled by a single minimum centered around the trans conformation while two minima centered around cis and trans conformations are employed to model the isomerization.
The SC non-bonded interactions were described by the venerable knowledge-based potential from Miyazawa and Jernigan17 that resulted from a statistical analysis of SC contacts in a protein structure database. No explicit electrostatic interactions were included since they are implicitly accounted for in several of the other terms. The backbone bead interaction potential includes a short-range, purely repulsive Week-Chandler-Anderson (WCA) potential97 to capture locally excluded volume interactions. Furthermore, a backbone hydrogen bonding potential as a function of the relative distance and orientation of an amide and a carbonyl group was added following Irbäck et al.39, where a radial 12-10 Lennard-Jones potential is combined with an angular term. It has been shown previously98 that carbonyl and amide groups at the peptide bond form dipoles that interact with each other and that the nearest-neighbor interaction is enough to favor β over α content99. Therefore, an extra term describing dipole-dipole interactions of neighboring residues is also included in this model. A single cosine function centered around , ψ = 0 with an identical amplitude along both coordinates was employed to model the neighboring dipole potential. These terms were optimized by reproducing dihedral angle distributions of GGG and GAG tripeptides and global folding properties of a three helix bundle. The resulting CG force field was used for studying protein folding100 and aggregation94. One of the drawbacks of this model is that the stability of β-sheet structures and proteins with mixed α/β content is apparently underestimated. The model is implemented in the ESPRESSO package101.
Very recently, we developed the intermediate resolution CG model for proteins, PRIMO102,103. The backbone in PRIMO is described by three CG beads (N, Cα, and combined CO) while one to five sites are used for SCs (Figure 2). A unique feature of PRIMO is that the CG beads are mapped in such a way that atomistic models can be reconstructed from the CG beads using an analytical procedure that is extremely fast and provides near-atomistic accuracy in the reconstructed models. This allows PRIMO to be tied closely to fully atomistic models in order to maximize transferability. It also positions PRIMO as an ideal CG model for multi-scale models. Like some of the other CG models, PRIMO uses virtual sites that are constructed on the fly from the CG interaction sites in order to maintain better bonding geometries.
PRIMO is largely a physics-based model with an all-atom like force field as shown in equation 3:
The bonded terms are described by harmonic potentials as well as spline potentials to model non-harmonic bonding functions between CG beads. The interaction potential also includes a distance- and angle-dependent explicit hydrogen-bonding potential term because hydrogen-bonding cannot be captured with electrostatics alone in CG models. A CMAP104 correction term is used to provide a correct description of the backbone (,ψ) and SC (χ) distributions. PRIMO uses an implicit solvent potential to capture solvent effects. PRIMO was optimized by direct comparison with the latest all-atom CHARMM36 force field105,106 against a diverse set of decoys of peptides and proteins. There is a high energetic correspondence between PRIMO and CHARMM which provides the basis for a seamless mixing of PRIMO and CHARMM in multi-scale AA/CG simulations107. In contrast to most other similar CG models, PRIMO does not require any bias towards known structural information and is suitable for running stable MD simulations of proteins and for folding of small peptides from their extended conformations. PRIMO allows a time step of 4–6 fs and achieves a speedup of about 10–15 times faster compared to atomistic simulations. This speedup is significantly less compared to other CG models in part due to the relatively high resolution but also because of the expensive GBMV implicit solvent model108 that is currently used in combination with PRIMO. PRIMO essentially maintains the same level of transferability of all-atom force fields, including transferability to different environments. For example, it is possible to transfer to membrane environments by simply switching the GBMV108 implicit solvent model for the heterogeneous dielectric generalized Born (HDGB)109,110 implicit membrane model without further optimizations as demonstrated recently111.
In general, CG beads are considered to be isotropic. However, in the context of biomolecular systems, CG beads are anisotropic because a single CG bead represents a group of atoms. Motivated by this observation Shen and coworkers112 have recently introduced a physics-based CG model, named GBEMP, based on the anisotropic Gay-Berne (GB) and point electric multipole (EMP) potentials that are used to describe the non-bonded interactions. In GBEMP, the bonded terms consist of bond, angle and dihedral potentials. Instead of using simple harmonic potentials, the bond stretching term is described by the fourth-order Taylor expansion of the Morse potential while the angle bending term adopts a sixth-order potential. A Fourier series expansion is used to model the torsional potential. The generalized Kirkwood (GK)113 implicit solvent model was employed to describe the solvent effect in GBEMP. In an alternative to the traditional leap-frog or Verlet algorithm, Euler’s rigid body integrator is used to propagate the equation of motion.
The force field was parameterized based on a combination of experimental and all-atom data. The bonded terms were parameterized by fitting to the PMFs obtained from atomistic molecular dynamics simulations of dipeptides using CHARMM27/CMAP104 force field114,115. The parameters for torsional potentials were further optimized by iteratively matching to the experimental results for the distributions of backbone and SC torsions. The non-bonded terms were parameterized by reproducing the individual energy components of the atomistic force field. The model was validated by reproducing experimental backbone and SC conformations and stable MD simulations of two proteins. The resulting GBEMP CG force field is general and transferable, suitable for simulating biological macromolecules. The model is also suitable for “serial” or “parallel” multi-scale AA/CG simulations. The GBEMP model is about 50 times faster compared to atomistic force fields with point charge models.
Hall and coworkers116,117 have introduced an implicit solvent intermediate-resolution protein model, namely PRIME based on the discontinuous molecular dynamics (DMD)118 methodology. Here, each amino acid is described by four beads: three beads for the backbone and a single bead for SC (Figure 2). The backbone beads are comprised of the united atoms NH, CαH and CO. Furthermore, SCs are distinguished into 14 groups depending on their size, charge, hydrophobicity, polarity, and potential for SC hydrogen bonding. All forces are modeled either by hard-sphere or by square well potentials with realistic diameters. For instance, a square-well potential was employed to model the interactions between hydrophobic SCs while the backbone hydrogen bonding potential was represented by a directionally dependent square-well attraction between NH and CO. The resulting force field contains 19 energy parameters. These parameters were obtained by using a perceptron-learning algorithm119 together with a modified stochastic learning algorithm that optimizes the energy gap between 711 known native states from the PDB and decoys generated by gapless threading.
The SC representation in PRIME is too coarse to correctly capture the SC entropy and the size effect for the packing of SCs that make critical contributions to protein folding. Therefore, Ding and coworkers120 have extended this model by incorporating one or two more additional effective SC atoms. For the β-branched amino acids (Thr, Ile, and Val), two beads representing the two branches after Cβ were introduced while an additional bead was incorporated for bulky amino acids (Arg, Lys, and Trp).
Generally, AA/CG approaches may be divided into three categories121–123: 1) Serial approaches where the information from fine-grained simulations are transferred to a CG model, and subsequently only the CG model is explored; 2) a tightly coupled approach where both levels are present simultaneously; and 3) a loosely coupled approach, where fine-grained and CG levels of resolution are alternated (see Figure 3). The first type is essentially the idea behind Boltzmann inversion24 and force matching methods25,121. A typical example of the second type may be alternating simulations at the CG and all-atom levels, where the CG simulations are used to identify an ensemble of possible conformational states that are then connected via fine-grained simulations to obtain accurate conformational PMFs124–126. The third type would involve either resolution exchange simulations, with multiple replicas simulated at different resolutions127–129, or hybrid models25–27, where part of a system is represented in atomistic detail while other parts remain at the CG level122,123,130–132. Such hybrid approaches are most attractive for detailed mechanistic studies of specific parts of a large system where an explicit description at the atomistic level at the region of interest is vital but computational costs of simulating the entire complex in full atomistic detail are prohibitive.
In the first two approaches, a reverse mapping133,134 scheme is employed once at the end (type 1) or infrequently (type 3) to reconstruct the high-resolution model from the corresponding low-resolution model. In the case of tightly coupled AA/CG simulations, the conversion between AA and CG regions is more critical as it has to be done frequently and has to maintain detailed balance when going back and forth between different resolutions. In the case of hybrid AA/CG models, the treatment at the boundary between the different regions becomes a key issue. The boundary may be treated as rigid (fixed-resolution AA/CG method)135–138 analogous to the conventional hybrid quantum mechanics/molecular mechanics (QM/MM)139,140 approach or it may be open (adaptive AA/CG method)132,141. In the fixed-resolution AA/CG scheme, the coupling is continuous in time, and the atomistic and CG components of the system directly interact. A limiting case of the adaptive AA/CG scheme are the resolution-exchange methods already mentioned above.
In this section, we will focus primarily on tightly-coupled hybrid AA/CG models as they promise to be the most effective multi-scale scheme in the foreseeable future. For an extended review on the multi-scale modeling of proteins, readers are referred to recent reviews123,142–145.
Multi-scale AA/CG methods are developed to reduce the computational cost while capturing the relevant interactions at the atomic level. Since the solvent degrees of freedom are much larger compared to the solute (protein) degrees of freedom, a significant speedup can be achieved already just by representing the solvent at the CG level while the solute is described at the atomistic level. Motivated by this idea, Schulten and coworkers have proposed PACE (Protein in Atomistic details coupled with Coarse-grained Environment)146. More specifically PACE combines a united-atom protein model with solvent, water or lipids, represented with the MARTINI model. The potential energy function of PACE is given by the following equation.
The first four terms are generic bonded interaction potentials while the fifth term accounts for backbone and SC rotamers. The remaining terms capture non-bonded interactions that consist of standard Lennard-Jones 12-6 potentials for ECGW-CGW interactions between CG waters, for ECGW-UA interactions between CG water and protein sites, and for EvdW interactions between non-polar protein sites. The interactions between charged groups are modeled with a Coulomb potential using partial charges from the all-atom force field. The cross-resolution parameters were optimized by following the parameterization scheme of MARTINI, i.e., by reproducing experimental thermodynamic properties. Moreover, in the context of the hybrid force field, the interactions of atomic sites in proteins were reparameterized by using different reference data, such as PMF of polar interactions from atomistic simulations and statistical backbone potentials from a Protein Data Bank (PDB) coil library146. The resulting hybrid model was able to reproduce the statistical backbone and SC potentials of all amino acids accurately. The force field was further validated with ab initio folding simulations of a series of peptides and proteins. For instance, PACE146 has been employed to fold a 73-residue protein (α3D)147, which is a de novo designed three-helix-bundle protein. Starting from the denatured state, Han and Schulten146 folded this protein to an average Cα RMSD of 3.4 Å in three extended (10–30 μs) simulations. This result is very close to the result (3.1 Å) obtained by Shaw and coworkers in fully atomistic simulations148.
While replacing only the solvent with a CG representation limits the computational gains that can be realized, this approach competes with already established implicit solvent models. Furthermore, the initial tests indicate that the stability of native proteins may be underestimated in PACE, suggesting that the balance between the atomistic protein model and CG solvent may be challenging. Nevertheless, recent studies have extended PACE further towards protein-membrane systems149.
Zacharias et al. recently developed a multi-scale AA/CG model150 by combing a modified GROMOS united atom force field with the ATTRACT151 CG model for studying protein-protein interactions. In this model, bonded interactions are generic and are described by atomistic force field while non-bonded interactions are calculated via CG terms. Backbone atoms were retained in full atomic detail to allow accurate representations of hydrogen-bonded secondary structures and an explicit term for hydrogen bonds of main chain was incorporated to maintain a better secondary structures. In the multi-scale scheme, all SC heavy atoms were treated as virtual atoms and non-bonded SC-SC as well as main chain-SC interactions were calculated via the ATTRACT CG model instead of standard atomistic Lennard-Jones and Coulomb terms. In ATTRACT, the backbone is represented by two pseudoatoms (N and O) and SCs are represented by one or two pseudoatoms depending on their size. Effective interactions between pseudo atoms are described by soft distance-dependent Lennard-Jones-type potentials. The attractive and repulsive parameters for each pseudo atom pair were iteratively optimized based on predictions of protein–protein complexes151. Hence, the parameterization of this model represents the effective interactions between pseudo centers at protein-protein interfaces. The parameters implicitly account for solvation effects, and no explicit solvent molecules were included during simulations. Promising results were obtained for protein-protein interactions with this hybrid methodology. However, the current scheme is of limited use for studies of internal protein dynamics. Although ATTRACT can be used to investigate the dynamics of protein-protein recognition process, computational efficiency is an issue.
Rzepiela and co-workers152 have proposed a mixed model between all-atom and MARTINI CG force fields where all-atom-CG interactions are replaced by CG interactions through dummy sites. This avoids the need for reparameterization of cross-resolution terms from the CG perspective but it does not fully address the environment seen by the atomistic part. Following this protocol, Sokkar et al.153 found that the MARTINI force field couples too strongly with the atomistic model leading to complete unfolding of a test protein within a very short time.
In a similar effort, the Carloni group154 has also developed a method in which the biologically relevant region of the protein is represented at the atomic level, while the rest of the protein is treated at the CG level, by only considering Cα centroids. The atomistic region is treated using a standard molecular force field, the CG region using a simplified Go potential and the interface region is also treated with an AA force field. This AA/CG scheme shows promising results for the HIV-1 protease (HIV-1 PR) and human β-secretase (BACE). As solvent-protein interactions are modeled in terms of viscosity and environment random forces in the framework of stochastic dynamics, this current model is only suitable for simulating globular proteins with buried active sites. Further improvements are necessary for modeling the solvent flow in the AA part.
The Pantano group has recently introduced a CG model for DNA155 and later combined it with the atomistic AMBER force field156 for multi-scale AA/CG simulations of DNA157. Villa and coworkers developed a multi-scale method for simulating protein-DNA complexes158 and conducted a multi-scale simulation to investigate a complex between the lac repressor protein (Lacl) and a 107 bp DNA segment159. The Lacl complex with the operator DNA segment is modeled at the level of atomic details while a continuum model is employed for the 75 bp DNA loop. Using this multi-scale strategy, they have investigated the change in structural dynamics of the repressor due to the strained DNA loop.
Finally, PRIMO103 is another promising CG model that can be used to carry out multi-scale AA/CG simulations of proteins. Recently, Predeus and coworkers107 have employed PRIMO103 and CHARMM114 force fields simultaneously for conformational sampling of trp-cage and melittin peptides in the presence of crowder proteins. A three-component multi-scale scheme was adopted in their simulations. The peptides of interest were modeled with the atomistic CHARMM force field114 while the PRIMO force field was employed to model the crowder proteins and the solvent was treated implicitly via the GBMV108 scheme. Their simulations yielded significant populations of unfolded structures of trp-cage and melittin in the presence of cellular crowders. It should be noted here that there were no direct bonded interactions between AA and CG regions in this multi-scale AA/CG hybrid model. Both regions were interacting with each other via non-bonded interactions only. However, PRIMO can also be coupled with the atomistic force field to describe the direct bonded interactions between AA and CG atoms with minimum effort. The coupling can be achieved via two ways as shown in Figures 4 A and B. A crude way of combing PRIMO with CHARMM is to let the CG sites directly interact with the atomistic sites. One such example is shown in Figure 4C. Here, 1–14 residues of trp-cage are modeled with the CHARMM force filed while residues from 15–20 are described by the PRIMO force field. The cross-resolution parameters can be taken from the existing PRIMO or CHARMM model as shown in Figure 4D. An advantage of such a mixing scheme is that it avoids an extensive parameterization of the interfacial parameters.
A sophisticated way of achieving multi-scale coupling is shown in Figure 4B. Here, the AA part is converted to CG on the fly and let it interacts with the corresponding CG site. Furthermore, on the CG side, virtual sites are constructed on the fly and let it interacts with the corresponding atomistic bead from the AA part. This is a very robust approach and can be easily implemented as the reverse mapping in PRIMO allows a near atomistic accuracy of 0.1 Å.
During the last years we have seen the development of numerous CG models that are developed to study various biophysical processes. In this chapter, we have focused on those CG models that maximize transferability and are thus more universally applicable. We also discuss extensively multi-scale hybrid AA/CG methods that rely on transferable CG models and are starting to become useful for practical applications. While we believe, that the use of CG models and hybrid AA/CG schemes will continue to grow in the future many limitations remain as described extensively, e.g. by Saunders and Voth34. Still, even most ‘transferable’ CG models currently available are biased towards known data and therefore suffer from limitations when applied to a full range of current problems (protein dynamics, protein-protein interactions, folding etc.), when considering different environments, when combined with different biomolecules, or under varying thermodynamic conditions. One question how to further address this challenge is how to balance bottom-up and top-down approaches in developing effective CG models as both have strengths and weaknesses. A mixed approach was adopted in the development of MARTINI force field with overall good success. However, there are also fundamental questions of what degree transferability is theoretically possible at a chosen level of resolution34,160.
The coupling of AA and CG models in hybrid multi-scale schemes is still in its infancies. While first success stories have been reported, a robust strategy for tightly coupling AA and CG levels across an interface or between replicas at different resolutions is lacking. One of the outstanding issues is the preservation of detailed balance at both the AA and CG levels. Other questions are at what levels CG and AA models can be effectively combined since CG models that are too coarse are most certainly going to lead to problematic behavior when coupled directly to detailed atomistic potential.
Another topic that has remained largely unaddressed is how the dynamics resulting from CG and hybrid AA/CG models reproduces the correct kinetic behavior. For thermodynamic questions this can be neglected but as CG models are increasingly applied to cellular scales107,161,162 questions related to diffusive properties become more relevant163,164. While it is expected that CG models generally exhibit accelerated kinetics, it is much less clear whether different types of motion are affected similarly. Even more problematic is how dynamic properties may be affected when mixing AA and CG models. Therefore, a future challenge will be to develop a CG model that is dynamically accurate as well as computationally efficient34.
In this chapter, we have reviewed recent advances in developing transferable, more universally applicable CG models of proteins. While there is clear progress in recent years, many challenges remain. Another promising area is the development of multi-scale AA/CG hybrid methods that may benefit most from more transferable CG models. It is likely that with further increased accuracy and efficiency, CG and AA/CG schemes will play an even greater role in the future in addressing questions related to biomolecular dynamics and interactions on ever increasing spatiotemporal scales all the way to cellular scales.
This work was supported by National Institute of Health Grants GM084953 and GM092949.