|Home | About | Journals | Submit | Contact Us | Français|
Binding free energy calculations based on molecular simulations provide predicted affinities for biomolecular complexes. These calculations begin with a detailed description of a system, including its chemical composition and the interactions between its components. Simulations of the system are then used to compute thermodynamic information, such as binding affinities. Because of their promise for guiding molecular design, these calculations have recently begun to see widespread applications in early stage drug discovery. However, many challenges remain to make them a robust and reliable tool. Here, we highlight key challenges facing these calculations, describe known examples of these challenges, and call for the designation of standard community benchmark test systems that will help the research community generate and evaluate progress. In our view, progress will require careful assessment and evaluation of new methods, force fields, and modeling innovations on well-characterized benchmark systems, and we lay out our vision for how this can be achieved.
Molecular simulations provide a powerful technique for predicting and understanding the structure, function, dynamics, and interactions of biomolecules. Often, these techniques are valued because they provide a movie of what might be going on at the atomic level. However, simulations also can be used to make quantitative predictions of thermodynamic and kinetic properties, with applications in fields including drug discovery, chemical engineering, and nanoengineering. A thermodynamic property of particular interest is the binding affinity between biomolecules and ligands such as inhibitors, modulators, or activators. With accurate and rapid affinity predictions, we could use simulations in varied health-related applications, such as the prediction of biomolecular interaction networks in support of systems biology, or rapid design of new medications with reduced side-effects and drug resistance. In this work, we give a view of how these simulations could impact drug discovery, briefly discuss where they stand now, and then argue for benchmark systems chosen to drive and assess the advancement of these methods, helping to make them practical for drug discovery.
A major aim in the development of molecular simulations is to create quantitative, accurate tools which will guide early stage drug discovery. Consider a medicinal chemist in the not-too- distant future who has just finished synthesizing several new derivatives of an existing inhibitor as potential drug leads targeting a particular biomolecule, and has obtained binding affinity or potency data against the desired biomolecular target. Before leaving work, he or she generates ideas for perhaps 100 new compounds which could be synthesized next, then sets a computer to work overnight prioritizing them. By morning, the compounds have all been prioritized based on reliable predictions of their affinity for the desired target, selectivity against alternative targets which should be avoided, solubility, and membrane permeability. The chemist then looks through the predicted properties for the top few compounds and selects the next ones for synthesis. If synthesizing and testing each compound takes several days, this workflow compresses roughly a year’s work into a few days.
While this workflow is not yet a reality, huge strides have been made in this direction, with calculated binding affinity predictions now showing real promise (83, 19, 27, 109, 128, 18, 25, 123), solubility predictions beginning to come online (107, 99, 70), and predicted drug resistance/selectivity also apparently tractable (67, 67), with some headway apparent on membrane permeability (62, 23). A considerable amount of science and engineering still remains to make this vision a reality, but, given recent progress, the question now seems more one of when rather than whether.
Recent progress in computational power, especially the widespread availability of graphics processing units (GPUs) and advances in automation (72) and sampling protocols, have helped simulation-based techniques reach the point where they now appear to have sufficient accuracy to be genuinely useful in guiding pharmaceutical drug discovery at least for a certain subset of problems (78, 53, 109, 128, 18, 25, 123). Specifically, in some situations, free energy calculations appear to be capable of achieving RMS errors in the 1-2 kcal/mol range with current force fields, even in prospective applications. As a consequence, pharmaceutical companies are beginning to use these methods in discovery projects. The most immediate application of these techniques is to guide synthesis for lead optimization, but applications to scaffold hopping and in other areas also appear possible.
At the same time, it is clear that not all situations are so favorable, so it is worth asking what level of accuracy is actually needed. It is often suggested that we need binding free energy predictions accurate to within ~ 1 kcal/mol, but we are not aware of a clear basis for this figure beyond the fact it is a pleasingly round number that is close to the thermal kinetic energy, RT. Instead of setting a single threshold requirement for accuracy, it is more informative to consider how accurate calculations must be to reduce the number of compounds synthesized and tested by some factor, relative to the number required without computational prioritization. If one targets a three-fold reduction, the answer appears to be that calculations with a 2 kcal/mol RMS error will suffice (113, 83). Thus, one can gain substantial benefit from simulations that are good yet still quite imperfect.
More broadly, this analysis does not address the net value of computational affinity predictions in drug discovery. Costs include those of the software, computer time, and personnel required to incorporate calculations into the workflow; while benefits include the savings, revenue gains, and externalities attributable to reducing the number of low-affinity compounds synthesized and arriving earlier at a potent drug candidate. In addition, with sufficiently reliable predictions, chemists may choose to tackle difficult synthesis efforts they otherwise might have avoided, resulting in more novel and valuable chemical matter.
The present review focuses on a class of methods in which free energy differences are computed with simulations that sample Boltzmann distributions of molecular configurations. These samples are usually generated by molecular dynamics (MD) simulations (59), with the system effectively coupled to a heat bath at constant temperature, but Monte Carlo methods may also be used (75, 76, 21). In either case, the energy of a given configuration is provided by a potential function, or force field, which estimates the potential energy of a system of solute and solvent molecules as a function of the coordinates of all of its atoms. Such simulations may be used in several different ways to compute binding free energies or relative binding free energies, as detailed elsewhere (76, 20, 17, 112) and summarized below. In all cases, however, the calculations yield the free energy difference between two states of a molecular system, and they do so by computing the reversible work for changing the initial state to the final one. Two broad approaches deserve mention.
The first general approach directly computes the standard free energy of binding of two molecules by computing the reversible work of transferring the ligand from the binding site into solution. (This is sometimes called an absolute binding free energy calculation.) The pathway of this change may be one that is physically realizable, or one that is only realizable in silicoin which case it is sometimes called an “alchemical” pathway. Physical pathway methods provide the standard binding free energy by computing the reversible work of, in effect, pulling the ligand out of the binding site. Although, by definition, the pathway used must be a physical one that could occur in nature, it need not be probable, and improbable pathways, governed by an order parameter specifying how far the ligand is from the binding site, are often used (133, 138, 122, 50, 54, 8). In addition, artificial restraints may be useful to avoid sampling problems in the face of often complex barriers along the pathway (133, 122, 50, 54, 8). By contrast, alchemical pathway methods artificially decouple the ligand from the binding site and then recouple it to solution from the protein (58, 51, 45, 9, 80). Although alchemical decoupling methods may avoid clashes of the ligand with the protein that might be problematic in pathway methods for a tight binding site, they still can pose some of the same sampling challenges. For example, sampling of the unbound receptor must be adequate after the ligand is removed, and water molecules must have time to equilibrate in the vacated binding site. Given that free energy is a state function, it is not surprising that alchemical and physical pathway approaches yield apparently comparable results when applied to the same systems (65, 49, 26, 136).
The second general approach computes the difference between the binding free energies of two different ligands for the same receptor, by computing the work of artificially converting one ligand into another, first in the bound state and then free in solution (119, 76, 20, 17). Because these conversions are not physically realizable, such calculations are, again, called alchemical. These calculations can be quite efficient if the two ligands are very similar to each other, but they become more complicated and pose greater sampling problems if the two ligands are very different chemically or if there is a high barrier to interconversion between their most stable bound conformations (72). In addition, there may be concerns about slow conformational relaxation of the protein in response to the change in ligand. Nonetheless, alchemical relative free energy calculations currently are the best automated and most widely used free energy methods (83, 72, 128).
Importantly, the accuracy and precision of all of these methods are controlled by the same considerations. First, many conformations typically need to be generated, or sampled, in order to obtain an adequate representation of the Boltzmann distribution. In the limit of infinite sampling, a correctly implemented method would yield the single value of the free energy difference dictated by the specification of the molecular system and the chosen force field. In reality, however, only finite sampling is possible, so the reported free energy will differ from the nominal value associated with infinite sampling. In addition, because sampling methods are typically stochastic and the dynamics of molecular systems are highly sensitive to initial conditions (2), repeated calculations, using different random number seeds or initial states, will yield different results. The problem of finite sampling is most acute for systems where low-energy (hence highly occupied) conformational states are separated by high effective barriers, whether energetic or entropic. Second, even if adequate sampling is achievable, free energy differences may disagree substantially with experiment if the force field is not sufficiently accurate. Third, errors may also arise if the representation of the system in the simulation does not adequately represent the actual system, e.g. if protonation states are assigned incorrectly and held fixed.
Thus, in order for a free energy calculation to be reliable, it must use an appropriate representation of the physical system and an accurate force field, and it must adequately sample the relevant molecular configurations. In the case of the more widely used alchemical relative free energy approach, this means that the best results are expected when:
Beyond this domain of applicability—whose dimensions are, in fact, still somewhat vague — substantial challenges may be encountered. For example, binding free energy calculations for a cytochrome C peroxidase mutant suggest limitations of fixed-charge force fields. In this case, the strength of electrostatic interactions in a buried, relatively nonpolar binding site appears to be overestimated by a conventional fixed-charge force field, likely due to underestimation of polarization effects (103). Sampling problems are also common, with slow sidechain rearrangements and ligand binding mode rearrangements in model binding sites in T4 lysozyme posing timescale problems unless enhanced or biased sampling methods are carefully applied (81, 12, 82, 56, 37, 127); and larger-scale protein motions induced by some ligands also posing challenges (12, 68).
Although such problems need not prevent free energy calculations from being used, they can require specific adjustment of procedures and parameters based on experience and knowledge of the system at hand. Thus, a key challenge for the field is how to use insights from well-studied cases to enable automation and reduce the detailed knowledge of each system required to carry out high quality simulations.
Troubleshooting is also a major challenge. In most cases where calculations diverge substantially from experiment, the reason for the discrepancy is not apparent. Is the force field inaccurate? Would the results improve with more sampling? Were protonation states misassigned—or do they perhaps even change on binding? There might even be a software bug (30) or a human error in the use of the software. As a consequence, it is not clear what steps are most urgently needed to advance the field as a whole. In this work, we argue that many of these problems can be alleviated, and that the field will advance more rapidly, if we select a set of well-chosen benchmark systems on which free energy methods are regularly tested.
Modeling can in some cases improve rapidly, but, in our experience, rapid advances are most common when computational models undergo regular cycles of improvement, predictive testing, learning, and then further improvement. This can be particularly difficult for academic groups which may not have the resources for predictive tests; however, these are essential, since it is only in predictive tests that we can be sure we are assessing the performance of a method as it works in real life, rather than relying on knowledge of the expected outcome to inform setup of the calculations. With this in mind, the Statistical Assessment of the Modeling of Proteins and Ligands (SAMPL) blind challenges, as well as the Community Structure Activity Resource (CSAR) challenge, later replaced by the Drug Design Data Resource (D3R) grand challenges, have arisen to meet part of this need. Currently, D3R focuses on running blind challenges on protein-ligand binding with datasets from the pharmaceutical industry, allowing testing and evaluation of computational methods on systems of direct pharmaceutical relevance. SAMPL, in contrast, focuses on predictions in simpler physical settings (? ), such as small molecules in aqueous and organic phases, and small molecules binding to supramolecular hosts. Together, the SAMPL and D3R challenges roughly span the spectrum from properties we can predict now (though they may be challenging in some cases (6, 136? )) to the drug binding we want to be able to reliably predict. These challenges are vital as they provide our only opportunity, at present, to routinely see how different methods compare when attempting to compute the same properties, and they provide the beginnings of a model for how we can best advance free energy techniques: routinely testing our methods on the same, well-understood systems to learn what does and doesn’t work to improve performance. Thus, we need not just blind tests, but retrospective testing on well-understood, “benchmark” systems, detailed below.
Although tests of individual free energy methods are not uncommon today (78, 128, 18, 25, 123), the use of nonoverlapping molecular systems and computational protocols makes it difficult to compare methods on a rigorous basis. In addition, few studies are designed to identify key sources of error and thereby focus future research and development. A few molecular systems have now emerged as de facto standards for general study (Section 3). These selections result in part from two series of blinded prediction challenges (SAMPL (91), and CSAR (29) followed by D3R (40)), which have helped focus the computational chemistry community on a succession of test cases and highlighted the need for methodological improvements. However, broader adoption of a larger and more persistent set of test cases is needed. By coalescing around a compact set of benchmarks, well chosen to challenge and probe free energy calculations, practitioners and developers will be able to better assess and drive progress in binding free energy calculations. Our primary goals in this work are to explain how benchmark systems can be used to advance the field, to encourage adoption of a standard set of benchmark systems, and to propose some candidates for this set.
We envision two classes of benchmark cases: “hard” benchmarks, which are simple enough that well-converged results can readily be computed; and “soft” benchmarks, for which convincingly converged results cannot readily be generated, but which are still simple enough that concerted study by the community can delineate key issues that might not arise in the simpler “hard” cases. The following subsections provide examples of how hard and soft benchmark systems may be used to address important issues in free energy simulations.
It is crucial yet nontrivial to validate that a simulation package correctly implements and applies the desired methods (111), and benchmark cases can help with this. First, all software packages could be tested for their ability to generate correct potential energies for a single configuration of the specified molecular system and force field. These results should be correct to within rounding error and the precision of the physical constants used in the calculations (111). Similarly, different methods and software packages should give consistent binding free energies when identical force fields are applied with identical simulation setups and compositions. The benchmark systems for such testing can be simple and easy to converge, and high precision free energies (e.g., uncertainty ≈ 0.1 kcal/mol) should serve as a reference. Test calculations should typically agree with reference results to within 95% confidence intervals, from established methods (110, 35), For this purpose, the correctly computed values need not agree with experiment; indeed, experimental results are unnecessary.
As discussed above, free energy calculations require thorough sampling of molecular configurations from the Boltzmann distribution dictated by the force field that is employed. This sampling is typically done by running molecular dynamics simulations, and for systems as large and complex as proteins, it is difficult to carry out long enough simulations. Calculations with inadequate sampling yield results that are imprecise, in the sense that multiple independent calculations with slightly different initial conditions will yield significantly different results, and these ill-converged results will in general be poor estimates of the ideal result obtained in the limit of infinite sampling. Advanced simulation methods have been developed to speed convergence (118, 112), but it is not always clear how various methods compare to one another. To effectively compare such enhanced sampling methods, we need benchmark molecular systems, parameterized with a force field that many software packages can use, that embody various sampling challenges, such as high dimensionality and energetic and entropic barriers between highly occupied states, but which are just tractable enough that reliable results are available via suitable reference calculations. Again, experimental data are not required, and the point of comparison may be, at least in part, sampling measures.
Some molecular systems are small and simple enough that current technology allows thorough conformational sampling, and hence well converged calculations of experimental observables. This has long been feasible for liquids (57); for example, it is easy to precisely compute the heat of vaporization of liquid acetone with one of the standard force fields. More recently, advances in hardware and software have made it possible to compute binding thermodynamics to high precision for simple molecular recognition systems (50), as further discussed below. In such cases, absent complications like uncertain protonation states, the level of agreement with experiment reports directly on the accuracy of the force field. Thus, simple molecular recognition systems with reliable experimental binding data represent another valuable class of benchmarks. Here, of course, experimental data are needed. Ideally, the physical materials will be fairly easy to obtain so that measurements can be replicated or new experimental conditions (such as temperature and solvent composition) explored.
Enhanced sampling techniques (Section 18.104.22.168), designed to speed convergence of free energy simulations, may not be adequately tested by any hard benchmark, because such systems are necessarily rather simple. Thus, despite the fact that reliable reference results are not available for soft benchmarks, they are still important for method comparisons. For example, it may become clear that some methods are better at sampling in systems with high energy barriers, and others in high-dimensional systems with rugged energy surfaces. Developers should test methods on a standard set of benchmark systems for informative comparisons.
Although it is still very difficult to convincingly verify convergence of many protein-ligand binding calculations, it is still important to compare the performance of various methods in real-world challenges. Appropriate soft benchmarks are likely to be cases which are still relatively tractable, involving small proteins and simple binding sites. We need a series of benchmark protein-ligand systems that introduce various challenges in a well-understood manner. Systems should introduce none, one, two, or N of the following challenges in various combinations:
We envision these more complex benchmark systems proceeding through stages, initially serving effectively as a playground where major challenges and issues are explored, documented, and become well-known. Eventually, some will become sufficiently well characterized and sampled that they become hard benchmarks.
Standard benchmark systems along the lines sketched above will allow potential solutions to be tested in a straightforward, reproducible manner. For example, force fields may be assessed by swapping new parameters, or even a new functional form, into an existing workflow to see the impact on accuracy for a hard benchmark test. Sampling methods may be assessed by using various enhanced sampling methods for either hard or soft sampling benchmarks, here without focusing on accuracy relative to experiment. And system preparation tools could be varied to see how different approaches to assigning protonation states, modeling missing loops, or setting initial ligand poses, affect agreement with experiment—with the understanding that force field and sampling also play a role. Such studies will be greatly facilitated by well-characterized standard benchmarks.
At the same time, there is a possibility that that some methods will inadvertently end up tuned specifically to generate good results for the set of accepted benchmarks. In such cases, the results for systems outside the benchmark set might still be disappointing. This means the field will need to work together to develop a truly representative set of benchmarks. This potential problem can also be mitigated by sharing of methods to enable broader testing by non-developers, and by participation in blinded prediction challenges, such as SAMPL and D3R, which confront methods with entirely new challenge cases.
No molecular systems have been explicitly accepted by the field as benchmarks for free energy calculations, but certain host molecules (see below) and designed binding sites in the enzyme T4 lysozyme have emerged as particularly helpful and widely studied test cases. Here, we describe these artificial receptors and propose specific host-guest and T4 lysozyme-ligand combinations as initial benchmark systems for free energy calculations. We also point to several additional hosts and small proteins that also have potential to generate useful benchmarks in the future (Section 4). The present focus is on cases where experimental data are available and add value, rather than ones chosen specifically to test conformational sampling methods, where experimental data are not required (Section 2.1).
Chemical hosts are small molecules, often comprising fewer than 100 non-hydrogen atoms, with a cavity or cleft that allows them to bind other compounds, called guests, with significant affinity. Hosts bind their guests via the same basic forces that proteins used to bind their ligands, so they can serve as simple test systems for computational models of noncovalent binding. Moreover, their small size, and, in many cases, their rigidity, can make it feasible to sample all relevant conformations, making for “hard” benchmarks as de- fined above (Section 2.1). Furthermore, experiments can often be run under conditions that make the protonation states of the host and guest unambiguous. Under these conditions, the level of agreement of correctly executed calculations with experiment effectively reports on the validity of the force field (Section 22.214.171.124). For a number of host-guest systems, the use of isothermal titration calorimetry (ITC) to characterize binding provides both binding free energies and binding enthalpies. Binding enthalpies can often also be computed to good numerical precision (50), so they provide an additional check of the validity of simulations. A variety of curated host-guest binding data is available on BindingDB at http://bindingdb.org/bind/HostGuest.jsp.
Hosts fall into chemical families, such that all members of each family share a major chemical motif, but individuals vary in terms of localized chemical substitutions and, in some families, the number of characteristic monomers they comprise. For example, all members of the cyclodextrin family are chiral rings of glucose monomers; family members then differ in the number of monomers and in the presence or absence of various chemical substituents. For tests of computational methods ultimately aimed at predicting protein-ligand binding affinities in aqueous solution, water soluble hosts are, arguably, most relevant. On the other hand, host-guest systems in organic solvents may usefully test how well force fields work in the nonaqueous environment within a lipid membrane. Here, we focus on two host families, the cucurbiturils (36, 85); and the octa-acids (more generally, Gibb deep cavity cavitands) (41, 52), which have already been the subject of concerted attention from the simulation community, due in part to their use in the SAMPL blinded prediction challenges (93, 91, 136).
The cucurbiturils (Figure 1) are achiral rings of glycoluril monomers (36). The first characterized family member, cucurbituril, has six glycoluril units, and subsequent synthetic efforts led to the five-, seven-, eight- and ten-monomer versions, cucurbit[n]uril (n=5,6,7,8,10) (71), which have been characterized to different extents. Of note, the n=6,7,8 variants accommodate guests of progressively larger size, but are consistent in preferring to bind guests with a hydrophobic core sized to fit snugly into the relatively nonpolar binding cavity, along with at least one cationic moiety (though neutral compounds do bind (134, 63)) that forms stabilizing interactions with the oxygens of the carbonyl groups fringing both portals of the host (71). Although derivatives of these parent compounds have been made (64, 124, 3, 24), most of the binding data published for this class of hosts pertain to the non-derivatized forms. A fairly extensive set of data is available in BindingDB at http://bindingdb.org/bind/HostGuest.jsp.
We propose cucurbituril (CB7) as the basis of one series of host-guest benchmark systems (Figure 1, Tables 1 and and2).2). This host is convenient experimentally, because it is reasonably soluble in water; and computationally, because it is quite rigid and lacks acidic or basic groups. In addition, it has attracted particular interest because of the high binding affinities of some guests, exceeding even the tightest-binding protein-ligand systems (71, 102, 86, 15). Finally, CB7 is already familiar to a number of computational chemistry groups, as it figured in two of the three SAMPL challenges that included host-guest components (93, 91), and it is currently the focus of the “hydrophobe challenge” (108).
Despite the simplicity of CB7, calculations of its binding thermodynamics are still challenging, with several known complexities:
For CB7, we have selected two sets of guests that were studied experimentally under uniform conditions (50 mM sodium acetate buffer, pH 4.74, 298K) by one research group (71, 15). Each series is based on a common chemical scaffold, making it amenable to not only absolute but also alchemical relative free energy calculations (Section 1.3). One set is based on an adamantane core (Table 1), and the other on an aromatic ring (Table 2). These systems can be run to convergence to allow detailed comparisons among methods and with experiment. Their binding free energies range from -5.99 to -17.19 kcal/mol, with the adamantane series spanning a particularly large range of free energies.
Sampling of the host appears relatively straightforward in CB7 as it is quite rigid and its symmetry provides for clever convergence checks (50, 88). Due to its top-bottom symmetry, flips of guests from “head-in” to “head-out” configurations are not necessary to obtain convergence (33). However, sampling of the guest geometry can be a challenge, with transitions between binding modes as slow as 0.07 flips/ns (88), and flexible guests also presenting challenges (88). As noted above, water sampling can also be an issue, with wetting/dewetting transitions occurring on the 50 ns timescale (105).
Salt and buffer conditions are also key. In addition to the strong salt-dependence of binding (87), acetic acid (such as in a sodium acetate buffer) can compete with guests for the binding site (86). This may partially explain systematic errors in some computational studies (94, 54). Indeed, the difference between 50 mM sodium acetate buffer and 100 mM sodium phosphate buffer impacts measured binding free energies by 2.5–2.8 kcal/mol (94, 91). Cationic guests could also have substantial and differing interactions with the counterions in solution as well, potentially lowering affinity relative to zero-salt conditions (91). Thus, one group found a 6.4–6.8 kcal/mol dependence on salt concentration (54), possibly impacting other studies as well (88)
Despite these issues, CB7 appears to be at the point where careful studies can probe the true accuracy of our force fields (50, 39, 135), and the results can be sobering, with RMS errors in the binding free energies as high as 8 kcal/mol (50, 88). More encouragingly, the values of R2 values can be as high as 0.92 (50). Some force fields appear relatively worse than others (54, 92). Calculated values are in many cases quite sensitive to details of force field parameters (88, 87, 92). For example, modest modification of some Lennard-Jones parameters yielded dramatic improvements in calculated values (135), and host-guest binding data has, accordingly, been suggested as an input for force field development (135, 50, 39). Water structure around CB7 and calculated binding enthalpies also appear particularly sensitive to the choice of water model (105, 33, 39), and water is clearly important for modulating binding (97). The water model also impacts the number of sodium ions which must be displaced (in sodium-based buffer) on binding (39, 50).
Despite its apparent simplicity, CB7 is still a challenging benchmark that can put important issues into high relief. For example, in SAMPL4, free energy methods yielded R2 values from 0.1 to 0.8 and RMS errors of about 1.9 to 4.9 kcal/mol for the same set of CB7 cases. This spread of results across rather similar methods highlights the need for shared benchmarks. Potential explanations include convergence difficulties, subtle methodological differences, and details of how the methods were applied (91). Until the origin of such discrepancies is clear, it is difficult to know how accurate our methods truly are.
The octa-acids (OA) (Figure 1) are synthetic hosts with deep, basket-shaped, hydrophobic binding sites (41). The eight carboxylic acidic groups for which they were originally named make these hosts water-soluble, but do not interact directly with bound hosts; instead, they project outward into solvent. Binding data have been reported for the original form of this host (OA) (41) and for a derivative with four added methyl groups at equivalent locations in the entryway, where they can contact a bound guest (TEMOA) (38, 116). (Note that OA and TEMOA have also been called OAH and OAMe, respectively (136).) Additional family members with other substituents around the portal have been reported, as has a new series in which the eponymic carboxylic groups are replaced by various other groups, including a number of basic amines (52). However, we are not aware of binding data for these derivatives. In view of these other hosts, however, we propose the more general name Gibb deep cavity cavitands (GDCCs) for this family of hosts. The binding cavities of the GDCCs are fairly rigid, though less so than the cucurbiturils. Some simulators report “breathing” motions that vary the diameter of the entry by up to 8 Å(77); and, in some studies, the benzoic acid “aps” around the entry occasionally ip upward and into contact with the guest (137, 120), though this motion has not been verified experimentally. Additionally, the four priopionate groups protruding into solution from the exterior base of the cavity are all flexible.
The octa-acids tend to bind guest molecules possessing a hydrophobic moiety that fits into the host’s cavity and a hydrophilic moiety that projects into the aqueous solvent. Within these specifications, they bind a diversity of ligands, including both organic cations and anions, as well as neutral compounds with varying degrees of polarity (42, 44). Compounds with adamantane or noradamantane groups display perhaps the highest affinities observed so far, with binding free energies ranging to about -8 kcal/mol (117). Much of the experimental binding data comes from ITC, so binding enthalpies are often available.
Two experimental aspects of binding are particularly intriguing and noteworthy. First, the binding thermodynamics of OA is sensitive to the type and concentration of anions in solution. Although NaCl produces relatively modest effects, 100 mM sodium perchlorate, chlorate and isothiocyanate can shift binding enthalpies by up to about 10 kcal/mol and free energies by around 2 kcal/mol (43). These effects are due in part to binding of anions by the host; indeed, trichloroacetate is reported to bind OA with a free energy of −5.2 kcal/mol (115), and competition of other guests with bound anions leads to entropy-enthalpy tradeoffs. Second, elongated guests can generate ternary complexes, in which two OA hosts encapsulate one guest, especially if both ends of the guest are not very polar (42).
As a core benchmark series for this family, we propose two sets which formed part of the SAMPL4 and SAMPL5 challenges, based on adamantane derivatives (Table 3) and cyclic (aromatic and saturated) carboxylic acids (Table 4) binding to hosts OA and TEMOA with free energies of −3.7 to −7.6 kcal/mol. These cases offer aqueous binding data with a reasonably broad range of binding free energies, frequently along with binding enthalpies; the hosts and many or all of their guests are small and rigid enough to allow convincing convergence of binding thermodynamics with readily feasible simulations; and, like the cucurbiturils, they are already emerging as de facto computational benchmarks, due to their use in the SAMPL4 and SAMPL5 challenges (91, 136).
Issues deserving attention when interpreting the experimental data and calculating the binding thermodynamics of these systems include the following:
As noted, two different host conformational sampling issues have been observed, with dihedral transitions for the proprionate groups occurring on 1–2 ns timescales (77)); motions of the benzoic acid aps were also relatively slow (137, 120) though perhaps thermodynamically unimportant. Guest sampling can also be an issue, at least in TEMOA (136), and this hosts’s tight cavity may also have implications for binding entropy (137).
Salt concentration strongly modulates binding affinity, at least for anions, and the nature of the salt also plays an important role (16). Co-solvent anions can also increase or decrease binding depending on their identity (43). Some salts even bind to OA themselves, with perchlorate (43) and trichloroacetate (115) being particularly potent, and thus will compete with guests for binding. Computationally, including additional salt beyond that needed for system neutralization changed binding free energies by up to 4 kcal/mol (120).
Naively, protonation states of the guests might seem clear and unambiguous. But since OA can bind guests of diverse net charges, the protonation state may not always be clear. One study used absolute binding free energy calculations for different guest charge states, coupled with pKa calculations, and found that inclusion of pKa corrections and the possibility of alternate charge states of the guests affected calculated binding free energies by up to 2 kcal/mol (120). As noted above, experimental evidence also indicates major pKa shifts on binding so that species such as acetate, formate and others would bind in neutral form at neutral pH (126, 115). Even the host protonation state may be unclear; while OA is often assumed to have all eight carboxylic acids deprotonated at the basic pH of typical experiments, the four at the bottom are in close proximity, and these might make hydrogen bonds allowing retention of two protons (32). Thus, there are uncertainties as to the host protonation state (91, 32), which perhaps also could be modulated by guest binding. Several groups used different methods but the same force field and water model in SAMPL5, with rather varied levels of success because of discrepancies in calculated free energies (136, 10, 8). However, some of these issues were resolved in follow-up work (8), bringing the methods into fairly good agreement for the majority of cases (137, 10).
Although we seek ultimately to predict binding in systems of direct pharmaceutical relevance, simpler protein-ligand systems can represent important stepping stones in this direction. Two model binding sites in T4 lysozyme have been particularly useful in this regard (Figure 2). These two binding sites, called L99A (89, 90) and L99A/M102Q (130, 47) for point mutations which create the cavities of interest, are created in artificial mutants of phage T4 lysozyme, and have been studied extensively experimentally and via modeling. As protein-ligand systems, they introduce additional complexities beyond those observed in host-guest systems, yet they share some of the same simplicity. The ligands are generally small, neutral, and relatively rigid, with clear protonation states. For most ligands, substantial protein motions are absent at room temperature and ambient pressure, allowing calculated binding free energies to apparently converge relatively easily. However, like host-guest systems, these binding sites are still surprisingly challenging (80, 81, 82, 12, 56, 37, 68). In addition, precise convergence is sometimes difficult to achieve, and it is in all cases essentially impossible to fully verify. As a consequence, these are “soft benchmarks” as defined above (Section 2.1). The importance of the lysozyme model sites is also driven by the relative wealth of experimental data. It is relatively easy to identify new ligands and obtain high quality crystal structures and affinity measurements, allowing two different rounds of blind predictions testing free energy calculations (82, 12).
These binding sites do exhibit some surprising experimental complexities which make them interesting ongoing topics of study, such as the fact that the L99A site is empty of water when ligands are not bound (96, 66, 22), yet the protein can undergo pressure-induced filling (22, 66) or denaturation (96), which can be inhibited by binding of ligand (96, 66). Pressure may also cause the protein to populate an excited state (73, 61) (but see (129)) which is already present to a very limited extent at equilibrium (11). Still, as noted below, these issues do not seem to dramatically impact our ability to calculate binding free energies at standard temperature and pressure, probably in large part because these are effects which come into play only at high pressures (96, 66, 73), though as we discuss below, some ligands do induce a protein conformational change which affects the same helix as the proposed excited state (74). It seems likely that the conformational hetereogeneity observed experimentally will make lysozyme even more of a valuable benchmark system, as test cases here can range from simple to challenging, depending on the ligand and the pressure.
The L99A site is also called the “apolar” cavity. It is relatively at and elongated, and binds mostly nonpolar molecules such as benzene, toluene, p-xylene, and n-butylbenzene: basically, a fairly broad range of nonpolar planar five- and six-membered rings and ring systems (such as indole). The polar version, L99A/M102Q, introduces an additional point mutation along one edge of the binding site, providing a glutamine that introduces polarity and the potential for hydrogen bonding. It still binds a variety of nonpolar ligands such as toluene (though not benzene). One small downside of these binding sites is that the range of affinities is relatively narrow: about −4.5 to −6.7 kcal/mol in the apolar site (89, 82), and about −4 to −5.5 kcal/mol in the polar site (12). Thus, even the strongest binders are not particularly strong, and the weakest binders tend to run up against their solubility limits. Still, these sites offer immensely useful tests for free energy calculations.
For both sites, fixed charge force fields seem to yield reasonably accurate free energies, with RMS errors between 1–2 kcal/mol, and some level of correlation with experiment, despite limited dynamic range (28, 82, 12, 37, 125). System composition/preparation issues also do not seem to be a huge factor. Instead, sampling issues predominate:
Tables 5 and and66 introduce proposed benchmark sets for the apolar and polar cavities, giving ligands potentially amenable to both absolute and relative free energy calculations, and spanning the range of available affinities. Co-crystal structures are available in most cases, and the PDB IDs are provided in the tables. The selected ligands span a range of challenges and levels of difficulty, ranging from fairly simple to including most of the challenges noted above. Essentially all of them have been included in at least one prior computational study, and some have appeared in a variety of prior studies. Additional known ligands and non-binders are available, with binding affinities available for 19 compounds in the L99A site (31, 89, 82) and 16 in L99A/M102Q (130, 47, 12). Because of the extent of the sampling challenges in lysozyme, binding of most ligands will currently constitute a soft benchmark, though long-timescale simulations to turn these into hard benchmarks may already be feasible.
Early work on the lysozyme sites focused on the difficulty of predicting binding modes (80, 82, 12) because of the slow interconversions noted above. Docking methods often can generate reasonable poses spanning most of the important possibilities (80, 82, 12, 48) but do not accurately predict the binding mode of individual compounds (82, 12, 48). Thus it appears necessary to consider the possibility of multiple binding modes; this is also important since some ligands actually populate multiple binding modes (12). In a number of studies, candidate binding modes from docking are relaxed with MD simulations, then clustered to select binding modes for further study. It turns out an effective binding free energy for each distinct candidate binding mode can be computed separately (80) and combined to find the population of each binding mode and determine the overall binding free energy. However, this is costly since each candidate binding mode requires a full binding free energy calculation.
Relative binding free energy calculations do not dramatically simplify the situation. Introduction of a ligand modification can leave the binding mode uncertain (e.g., introducing a chlorine onto phenol leaves at least two possible binding modes even if the binding mode of phenol is known) (12). A naïve solution is to consider multiple possible binding modes in relative free energy calculations (12), but this generates multiple results; determining the true relative binding free energy requires additional information (83). Enhanced sampling approaches provide one possible solution to the binding mode problem. Particularly, with λ or Hamiltonian exchange techniques, ligands can easily switch between binding modes when they are non-interacting unless they are restrained, and then moves in λ space can allow transitions back to the interacting state. Thus, approaches employing this strategy can naturally sample multiple binding modes (37, 125).
While sidechain sampling has been a significant challenge, it is possible to use biased sampling techniques such as umbrella sampling to deliberately compute and include free energies of sampling slow sidechain rearrangements (81). However, this is not a general solution, since it requires knowing what sidechains might rearrange on binding and then expending substantial computational power on sampling free energy landscapes for these rearrangements. An apparently better general strategy is including sidechains in enhanced sampling regions selected for Hamiltonian exchange (56, 60) or REST (127), allowing sidechains to be alchemically softened or torsion barriers lowered (or both), to enhance sampling at alchemical intermediate states. With swaps between λ values, enhanced sidechain sampling at intermediate states can propagate to all states, improving convergence (56, 127).
Larger protein conformational changes in lysozyme have received less attention, partly because until very recently they seemed to be a peculiar oddity only rarely observed; i.e., for ligands 4,5,6,7-tetrahydroindole and benzyl acetate in the polar site (12). However, recent work noted above highlighted how a helix in the apolar cavity can open to accommodate larger ligands (74). Timescales for this motion appear to be on the order of 50 ns, so it can pose sampling challenges, even for relative free energy calculations (68). Including part of the protein in the enhanced sampling region via REST2 provides some benefits, but sampling these motions will likely prove a valuable test for enhanced sampling methods.
This work has so far presented a small set of benchmark systems for binding free energy calculations, and has highlighted some of the ways in which they have already proven their utility. However, the scope of these sets is still quite limited. More, increasingly diverse, host-guest systems will help probe the strengths and weaknesses of force fields, and to drive their improvement. At the other end of the spectrum, we need more complex and challenging benchmark sets for proteins including simple models, like T4 lysozyme as well as candidate drug targets. And there may be community interest in test systems specifically selected to challenge sampling algorithms, without reference to experimental data.
Several candidate hosts and proteins are worth mentioning in this regard. Among host-guest systems, there is a particularly extensive experimental literature on cyclodextrins (46, 101), and they are tractable computationally (50, 132). As to artificial protein binding sites, the two variants of the CCP protein model binding site (34, 4, 5, 103, 95, 106) offer a modest increase in difficulty relative to the T4 lysozyme sites discussed above. And thrombin and the bromodomains appear to be promising examples of candidate drug targets for inclusion in a growing set of benchmark systems. Thrombin is a serine protease that has received prior attention from free energy studies (127, 128, 14). Experimental data exhibits interesting trends (7) that can partly be explained by simulations (14); but challenges remain (13). Bromodomains may also be interesting, especially given that relatively high accuracies have been reported, relative to experiment. At the same time, binding modes may be non-obvious and the diversity of ligands could pose problems for relative free energy calculations (1). Other systems will undoubtedly emerge as promising benchmarks as well, and we seek community input to help identify these.
In order to provide for updates of this material as new benchmark systems are defined, and to enable community input into the process of choosing them, we have made the LaTeX source for this article on GitHub at http://www.github.com/mobleylab/benchmarksets. We encourage use of the issue tracker for discussion, comments, and proposed updates. We plan to incorporate new material via GitHub as one would for a coding project, then make it available as preprints via bioRxiv. Given substantial changes to this initial version of the paper, it may ultimately be appropriate to make it available as a “perpetual review” (84) via another forum allowing versioned updates of publications.
Binding free energy calculations are a promising tool for predicting and understanding molecular interactions and appear to have enough accuracy to provide substantial benefits in a pharmaceutical drug discovery context. However, progress is needed to improve these tools so that they can achieve their potential. To achieve steady progress, and to avoid potentially damaging cycles of enthusiasm and disillusionment, we need to understand and be open and honest about key challenges. Benchmarks are vital for this, as they allow researchers in the field to rigorously test their methods, arrive at a shared understanding of problems, and measure progress on well-characterized yet challenging systems. It is also worth emphasizing the importance of sharing information about apparently well thought-out and even promising methods that do not work, rather than sharing only what does appear to work. Identifying and addressing failure cases and problems is critically important to advancing this technology, but failures can be harder to publish, and may even go unpublished, even though they serve a unique role in advancing the field. We therefore strongly encourage that such results be shared and welcomed by the research community.
Here, we proposed several benchmark systems for binding free energy calculations. These embody a subset of the key challenges facing the field, and we plan to expand the set as consensus emerges. Hopefully, these systems will serve as challenging standard test cases for new methods, force fields, protocols, and workflows. Our desire is that these benchmarks will advance the science and technology of modeling and predicting molecular interactions, and that other researchers in the field will contribute to identifying new benchmark sets and updating the information provided about these informative systems.
DLM appreciates financial support from the National Institutes of Health (NIH; 1R01GM108889-01) and the National Science Foundation (NSF; CHE 1352608). MKG thanks the NIH for partial support of this work through grant R01GM061300. The contents of this publication are solely the responsibility of the authors and do not necessarily represent the official views of the NIH or the NSF.
We also appreciate helpful discussions with a huge number of people in the field, including a wide variety of participants at recent meetings such as the 2016 Workshop on Free Energy Methods in Drug Discovery. Conversations with John Chodera (MSKCC), Chris Oostenbrink (BOKU), Julien Michel (Edinburgh), Robert Abel (Schrödinger), Bruce Gibb (Tulane), Matt Sullivan (Tulane), and Lyle Isaacs (Maryland) were particularly helpful. We thank David Slochower (UCSD) for a critical reading of the manuscript.
D.L.M. is a member of the Scientific Advisory Board for OpenEye Scientific Software.
M.K.G. is a cofounder and has equity interest in the company VeraChem LLC.