Search tips
Search criteria 


Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
J Comput Chem. Author manuscript; available in PMC 2010 August 1.
Published in final edited form as:
PMCID: PMC2783641

Absolute free energies estimated by combining pre-calculated molecular fragment libraries


The absolute free energy — or partition function, equivalently — of a molecule can be estimated computationally using a suitable reference system. Here, we demonstrate a practical method for staging such calculations by growing a molecule based on a series of fragments. Significant computer time is saved by pre-calculating fragment configurations and interactions for re-use in a variety of molecules. We employ such fragment libraries and interaction tables for amino acids and capping groups to estimate free energies for small peptides. Equilibrium ensembles for the molecules are generated at no additional computational cost, and are used to check our results by comparison to standard dynamics simulation. We explain how our work can be extended to estimate relative binding affinities.


The use of a reference system for free energy calculations has a long history in physics and chemistry.1 The basic idea is to employ a reference system (“ref”) for which the absolute free energy is available, and which is as similar as possible to the physical system of interest (“phys”):


where Fx is the absolute free energy of model “x,” and ΔFref→phys is the free energy difference between the systems. In essence, this paper is about practical choices for the both the reference system and strategies for calculating ΔFref→phys when the physical system is a large molecule.

Historically, Stoessel and Nowak applied the reference-system strategy to a molecular system for the first time, using a solid harmonic reference system in Cartesian coordinates.2 Zuckerman and Ytreberg extended that work in two ways designed to improve overlap between the reference and physical systems:3 (i) by using internal coordinates; and (ii) by using a more flexible, numerically exact reference system based on histograms from a short dynamics simulation, rather than an artificial analytically tractable reference state. Huang and Makarov also employed the reference-system approach embodied in (1), but in a different way.4

Other efforts to calculate absolute free energies for molecular systems have been ongoing for years in the groups of Meirovitch58 and Gilson912 and more approximately using harmonic and quasi-harmonic methods.13 The work of Meirovitch builds on long-standing polymer-growth methodologies for estimating partition functions which date to the work of Rosenbluth and Rosenbluth.14 The original Rosenbluth work was generalized for higher efficiency and more realistic models by many workers.1527 Ideas from these polymer-growth sampling methods also inform the present work, in the sense that each of our fragments is a kind of “monomer”.

The present paper significantly extends the previous work by Ytreberg and Zuckerman3 by estimating absolute free energies for molecules built up gradually from molecular fragments. Larger molecules can be treated, compared to our previous paper3, because a series of staged intermediate systems are adopted. In essence the free energy difference of Eq. (1) is sub-divided (staged) into a sum of easy-to-calculate terms. Staging increments are highly tunable, based on the choice of fragment sizes and even by selection of subsets of interactions as detailed below. The use of fragments in other types of molecular mechanics calculations has a long history.2830

A key contribution of this work is a practical strategy of pre-calculation which minimizes the number of energy terms which need to be computed at each stage. Specifically, for each fragment, a statistical library — i.e., an ensemble of configurations and their energies — is stored; we have also used such libraries in Monte Carlo sampling31. Additionally, for each covalently bonded fragment pair, we store the full interaction energy (based on all atoms) for every possible pair of configurations. Such storage is quite practical on typical modern computers with > 1 GB of RAM. During production simulations it is only necessary to compute interactions between fragments separated by one or more other fragments. Needless to say, the stored libraries and interaction tables can be re-used in future simulations of the same or different molecules. The pre-calculation strategy, which has early conceptual roots3234, appears to represent a significant practical advance over earlier polymer-growth calculations. The use of non-statistical libraries has been popularized in the Rosetta protein folding program.35

There is substantial flexibility in the division of a molecule into fragments. We have used single amino acids as fragments in this study, but larger segments and even different interaction subsets as detailed below – may also be practical. The fragment-based approach could also be used to study protein-ligand binding, by growing small molecules into receptor binding pockets and estimating the free energies. This can be seen as a statistical mechanical generalization of fragment-based ideas developed earlier.29,30

Our results, which employ single amino-acid fragments, are extremely encouraging. The data indicate that absolute free energies for small peptides can be calculated rapidly and reliably. Specifically, high-precision free energy estimates, with fluctuations of ~ 0.3 kcal/mole, are obtained for 52-atom tetra-alanine in less than an hour of single-processor computing time, with a simple dielectric "solvent". We check our data by comparing the equilibrium ensembles (obtained simultaneously with the free energy estimates) with independent Langevin simulations. As a further check, in one case, the free energy results are verified by an independent calculation using different fragments.

The remaining sections of the paper describe the methods, results, and our conclusions. Our methods section provides full details for performing the calculations, including the generation of fragment libraries and interaction tables. We also correct a technical error in our earlier study.3 Our results describe both the free energy values and the analysis of the equilibrium ensembles. Our discussion section describes possible improvements to the method and extension to the estimation of relative binding affinities using absolute free energies.


Our basic approach is to calculate the free energy of the physical system of interest based on the difference from a known reference system, as in Eq. (1), and also to stage the calculation using molecular fragments. The fragments not only permit the gradual staging of the calculation but also a tremendous savings of computer time based on the storage of (i) fragment configurations, (ii) energies internal to each fragment configuration, and (iii) interaction energies between covalently bonded fragments. The low cost and high precision of the resulting estimates suggests we are far from the practical limit of the approach in the present implementation. However, a number of improvements to the implementation appear to be within easy reach, as described in our Discussion. All fragment libraries used in the present calculations are available at our website (

A. Model and systems

All calculations employ a standard atomistic forcefield, OPLS-AA36 at T = 298K. In the present report, our fragments are individual amino acids and capping groups. For simplicity in this initial investigation, we model the solvent by a simple uniform dielectric constant ε = 60. We compute free energy estimates for alanine dipeptide (Ace-Ala-Nme), di-alanine (Ace-(Ala)2-Nme), and tetra-alanine (Ace-(Ala)4-Nme). Following standard conventions, Ace is Acetyl (CH3-CO), Ala is Alanine (NH-CH(CH3)-CO), and Nme is N-methylamide (NH-CH3).

B. A simple example

Consider the calculation of the configurational free energy of alanine dipeptide based on a division into three fragments (Ace, Ala, Nme) which can be denoted (A, B, C) respectively (see Fig. 1). We will discuss this example qualitatively, to give the reader a sense of what to expect. Full details and all equations will be presented in succeeding subsections.

Figure 1
Stages for calculating the absolute free energy of a molecule by combining three fragments, based on Eq. (9). Connecting lines schematize full interactions between fragments, including both bonded and non-bonded atomistic terms, (a) The first intermediate ...

In advance, we calculate statistical libraries of configurations for each fragment, which are constant-temperature OPLS-AA ensembles based only on the atoms within the given fragment. The libraries additionally include the six degrees of freedom necessary for joining the fragments, based on the use of “dummy atoms” as described below. During the library generation process, the absolute free energy for each fragment is also calculated using a reference system based on independent internal coordinates, as described previously.3 A typical library will contain 10,000 configurations. We also pre-calculate every possible interaction energy between covalently bound fragments — i.e., a table of 108 interaction energies for the A-B and B-C fragment pairs.

The calculation proceeds as schematized in Fig. 1, where the presence of a line connecting two fragments indicates that all interactions between the fragments is included. The reference system (not shown) consists of fully independent coordinates, so that the fragments are not yet constructed. The first intermediate consists of the three non-interacting fragments, which include, however, all interactions within each fragment. Thus the fragment free energies, which are calculated and stored in advance, properly include the interactions among all degrees of freedom internal to each fragment. Other interactions are added in three stages: A-B interactions first, followed by B-C, and completed by A-C couplings.

In the first intermediate stage, the absolute free energies for the individual fragments are retrieved from disk. (They are initially calculated following reference3 as detailed below.) Next, A-B interactions are added by a standard free energy difference calculation. Specifically, an ensemble of non-interacting A-B configurations is generated by random combination of fragments from the A and B libraries, and the resulting energy change is exponentially averaged in the usual way — via Eq. (6) below. The energy changes due to the combination do not need to be calculated in our scheme, however, because they have been tabulated in advance. Additionally, the now interacting A-B fragments are “resampled”37 to correspond to the full potential for all degrees of freedom in both fragments. The details of resampling are given below — see Eq. (8) — but the bottom line is that one obtains 10,000-configuration ensemble of the partially grown molecule consisting of the A-B fragments.

The calculations then proceeds as if there are two fragments, A-B and C. The two libraries are joined combinatorially but only accounting for the B-C interactions at this stage. The A-C interactions will be handled at a later stage. Once again, the free energy change is calculated and the ensemble is resampled to reflect B-C interactions. The resulting ensemble contains 10,000 configurations of the full molecule reflecting all interactions except those between fragments A and C.

In the final stage sketched in Fig. 1, the A-C interactions are added in a standard free energy difference calculation based on the the ensemble of the previous stage. However, a standard energy call (for the whole molecule) is not required to save CPU time. Rather, our code only computes energy terms specific to the A and C fragments — i.e., electrostatic and van der Waals interactions between atoms of A and those of C. Once the energy changes have been obtained, the full free energy is rapidly estimated. Resampling into the fully interacting ensemble can also be performed rapidly without additional energy calculations.

It is not difficult to imagine generalizing this example to systems with more fragments. It is also worth noting that, strictly speaking, the final stage was not necessary. That is, we could have added the A-C interactions simultaneously with the B-C combination since the full molecular configurations were constructed at that point. These choices illustrate the flexibility intrinsic to staging the calculation with fragments, as we detail further in our Discussion. Additional staging flexibility results, of course, from the initial choice of the fragments — i.e., smaller fragments lead to staging in finer increments.

C. Basic formalism

Our calculation of the absolute free energy Fphys for a molecule divided into fragments is based on standard, straightforward equations. The only novel aspect of the formalism is our particular choice of stages based on the addition of fragments and/or inter-fragment interactions. Although our heavy reliance on pre-calculated information has very significant practical implications, it does not affect the formalism.

Definition of the free energy

The fundamental object of interest is the absolute classical free energy Fphys for an implicitly solvated molecule. The molecule is taken to consist of N atoms, and its internal-coordinate configurations are denoted by x = (x1,x2,…,x3N−6). The potential energy function will be a standard forcefield (here, OPLS-AA36), possibly augmented by an implicit solvation model; the full potential energy including any solvation will be denoted by Uphys(x). The free energy, which is a functional of Uphys, is defined by the dimensionless configurational partition function at temperature T = 1/kBβ via


where the Jacobian is given by


with Ji=xi2 for bond lengths, Ji = sin(xi) for bond angles, and Ji = 1 for dihedrals. Kinetic energy terms have already been integrated out. Both the dimensionless character of the partition function in Eq. (2) and the angstrom-based normalization result from a particular choice for the standard concentration C° (defined in references3,10) — or equivalently, for the volume containing our implicitly solvated molecule. In particular, we have chosen C° [equivalent]2(1 Å) 3N−3Qp/σ, where Qp=i=1N(2πmikBT/h2)3/2 results from the momentum integrals, h is Planck’s constant, and σ is the molecule’s symmetry number. We note that our chosen standard concentration varies based on the molecule (i.e., based on the number and masses of its atoms), and also eliminates the temperature dependence of Qp. However, in almost every application of interest (see Discussion, below), the absolute free energy calculated here ultimately will be used to estimate a free energy difference and eliminate any artifacts due to C°.

Our single-molecule formulation, as noted, allows for “implicit solvation” using an effective solvent term in Uphys that is solely a function of the internal molecular coordinates x. The present calculations employ a simple uniform dielectric constant (ϵ = 60). In our Discussion, we address the minor technical issues involved with using a more realistic implicit solvent model.

One issue of dimensionality is worth emphasizing. Although there are 3N − 6 internal coordinates for a molecule consisting of N atoms, the integral of Eq. (2) has dimensionality of length to the power 3N − 3. This is because N − 1 bond lengths remain in the full set of internal coordinates x, each of which contributes three powers of length. Put another way, of the six excluded rigid-body/center-of-mass coordinates, the three orientation angles are dimensionless; more specifically, the angles integrate to the factor of 8π2 included in the definition of C°

Staging the free energy calculation

As illustrated in the example of Sec. IIB, we will calculate the free energy in a series of stages. These can be understood most easily by adding and subtracting the free energies corresponding to k intermediate models,



where ΔFj = FjFj−1 and Fj[Uj] is defined in analogy to Eq. (2) for the intermediate models defined by Uj. The Uj potentials will be specified below.

All free energy difference calculations will be performed here using the “perturbation” formulation.38 Explicitly, for two arbitrary potential energy functions Ua and Ub, one has



where the subscript a denotes an average performed over configurations distributed according to the Ua ensemble and Na is the number of configurations in that ensemble. Eq. (7) is used to estimate the free energy differences required in Eq. (5), and it is exact in the limit Na → ∞.

We emphasize that succeeding intermediates are constructed to have progressively narrower distributions as more interactions are added, as in the alanine dipeptide example. In other words, we ensure good overlap and reliable ΔF estimates by proceeding in the generalized “insertion” direction.39,40

Resampling to obtain staged equilibrium ensembles

As our calculation proceeds through the various stages, we will require the correspoinding equilibrium ensembles for each stage, primarily for use in Eq. (7). These are obtained by “resampling,” the process of converting an ensemble for one distribution into another by eliminating, duplicating, and/or adjusting the weights of configurations in the original distribution.37 In our case, we primarily use elimination of configurations from a larger ensemble (e.g., all combinations of fragments A and B) to create a smaller one (e.g., the interacting A-B ensemble); we do not adjust weights. More specificially, to resample an ensemble of configurations xa generated according to Ua into a Ub ensemble, the original configurations are randomly selected with probability proportional to the ratio of Boltzmann factors,


Operationally, we select configurations by forming a cumulative distribution function (cdf) based on the normalized set of ratios (8), and then choosing from this cdf as many times as desired.

D. Choice of intermediate models

As already noted, the set of intermediate models {Uj} can be chosen in a variety of ways. In the present study, we employ k intermediates for a molecule divided into k fragments. This was exemplified for alanine dipeptide, which is divided into three fragments.

The present study uses a uniform staging strategy for all molecules examined, as exemplified in Fig. 1 and Fig. 2. The reference system consists of all coordinates fully independent — both within and between fragments — as in our previous work.3 The first intermediate stage adds interactions within fragments, so that one has true molecular fragments but no interactions between fragments. We then add interactions between neighboring, covalently bound fragments — i.e., among all the atoms in the neighboring fragment pair — one fragment pair at a time. The final stage of this simple scheme involves the addition of all remaining interactions, which occur solely between non-adjacent fragments. The result is a molecule with atoms interacting fully according to a standard forcefield and possibly continuum solvent model.

Figure 2
Stages used in the free energy calculation of a four-fragment molecule, corresponding to Eq. (10). The initial stages proceed in analogy to Fig. 1, with pair-wise interactions added one at a time for neighboring (“bonded”) fragments. In ...

Because interactions among previously non-interacting components are added at every stage, it is expected that the configuration space will become increasingly narrow. Such a progressive narrowing justifies the use of the perturbation expression (7).

To explicitly illustrate the staging scheme employed here, consider the case of a molecule divided into the three fragments A, B, and C, as in Fig. 1. We denote by uiref the reference potential for internal coordinate xi, where the full set is x = (x1, x2,…). For the fragments, we let Uy be the potential energy for all interactions internal to fragment y, and Uyz is the potential energy for all interactions between the y and z fragments. A three-fragment molecule would be staged as follows:


The choice of the reference potentials {uiref} is guided by the forcefield, as detailed below in Sec. IIG. Building on our earlier work,3 we will use independent internal coordinates with simulation-guided distributions.

A four-fragment molecule, such as di-alanine (Ace-(Ala)2-Nme) schematized in Fig. 2, would be staged according to:


As described in the Discussion, it is also possible to stage the final (“non-bonded”) pairwise interactions separately.

We anticipate that significant optimization can be obtained by adjusting fragmentation and staging schemes. While our Discussion, below, describes more gradual staging strategies, the present initial report is limited to the single staging strategy given above.

E. The non-interacting reference system

The computation of the (absolute) reference free energy Fref is perhaps the most technically involved step of the calculation. The remaining free energy differences in the decomposition of Fphys in Eq. (5) are estimated using a simple, standard method. For the reference free energy, however, great care must be taken with the normalization and Jacobian factors of the chosen probability distributions. Indeed, our previous report3 includes an error in this regard, as explained at the end of this subsection.

As described in our discussion of staging (Sec. IID), the reference system for all molecules studied here consists of the set of non-interacting internal coordinates. The reference potential energy function will be constructed, following our previous work3, so that the reference partition function is normalized to one. That is, we will construct our reference model Uref so that


where the same standard concentration as in Eq. (2) has been used implicitly. (From this point forward, we will omit writing the length units, but they should be implicitly associated with every bond-length integration.) The motivation for the unit normalization of Zref is that application of a logarithm leads to the simplifying value,


for every system.

While there are many ways to construct Uref to satisfy the required normalization of Eq. (11), we primarily use the strategy of employing independent internal coordinates as in our earlier work.3 As usual, the full set of 3N − 6 internal coordinates, x = (x1, x2,…,x3N−6) consists of N − 1 bond lengths, N − 2 bond angles, and N − 3 dihedrals. So long as the distribution of each individual coordinate is normalized when integrated with the appropriate Jacobian factor J, the full distribution will be normalized.

Because total reference energy is given by a simple sum of independent terms,


the desired normalization (11) is ensured by enforcing


where the inclusion of inverse length units is understood for bond-angle integrals. In words, then, each individual potential uiref must include suitable normalization as described in Sec. IIG.

(As detailed in Sec. IIG, peptide ϕ and ψ angles were, in fact, sampled together from a single distribution based on a pairwise energy function uϕψref. These angles are independent from all other coordinates, however. We emphasize that this exception does not alter the basic formalism, which has been simplified very slightly for clarity.)

It is very useful to observe that normalization of the coordinate distributions via Eq. (14) can be achieved either using standard analytic forms — e.g., Gaussians — or via numerical histogramming procedures.3 Thus, there is great flexibility in the choice reference distributions embodied in the reference potentials. In addition to forcefield terms, prior knowledge, such as from a simulation, can be used in constructing the set {uiref}. The reference potentials chosen for the present study are described below in Sec. IIG on library construction.

One word of warning is appropriate here. Although it is possible to describe the internal configuration of a molecule using additional bond angles to substitute for dihedrals in some cases, the Jacobian for such a description appears not to be well-defined. Therefore, it is necessary to use the standard description with N − 2 bond angles and N − 3 dihedrals. Unfortunately, we were unaware of this point during our original study,3 and therefore an erratum will be prepared correcting the resulting numerical errors.

F. First intermediate: non-interacting fragments

The first intermediate stage adds only localized interactions to the non-interacting reference model, as illustrated in Fig. 1 and Fig. 2. Specifically, once a molecule is divided into fragments (A, B, C, …), the first intermediate includes only interactions occuring within fragments. The fragments exactly divide all coordinates so that we can write x = (xA, xB,…) and the potential energy function for this stage is given by


where Uy includes all interactions from the full forcefield (OPLS-AA, in our case) among the fragment coordinates xy for y = A, B,…. Importantly, the fragment potential Uy includes all non-bonded interactions — electrostatic, van der Waals — among the atoms of the fragment. (Sec. IIG on our libraries describes the treatment of connecting “dummy” atoms.)

The free energy for this stage — i.e., F1[U1] for use in the key equation (5), recalling Fref = 0 — can be calculated by using the standard perturbation relation (7). For such a computation, one would use Ua = Uref and Ub = U1 along with an ensemble distributed according to the Boltzmann factor of Uref.

In practice, once the libraries are generated, no calculation of energies needs to be done. As detailed below the libraries are generated (just once, for repeated use in many systems) based on the Uref distribution. Thus, during the library generation process, it is a trivial matter to calculate the absolute free energy for each fragment using Eq. (7). Thus, individual fragment free energies Fy are calculated in advance that exactly sum to the first-stage free energy:


Further, the independent-coordinate distributions are subsequently resampled based on Eq. (8) to generate the library distributions — i.e., ensembles for the Uy Boltzmann factors — for use in subsequent stages.

G. Construction of fragment libraries

As just described, fragment libraries are critical to the calculation of the free energy of the first intermediate stage, F1. The libraries also greatly facilitate computations for the rest of the intermediates.

In general terms, fragment configurations are generated by sampling internal coordinates according to the independent probability distributions which constitute the reference system. The generated configurations are then used to calculate fragment free energies, Fy for y = A, B,…. The configurations are also reweighted into an ensemble distributed according to the full forcefield for xy, the degrees of freedom internal to fragment y. Typically, such a procedure can be applied only to systems with a sufficiently small number of degrees of freedom. For large systems with enough correlated degrees of freedom, there tends to be insufficient overlap with the reference system of independent coordinates. That is, only a tiny fraction of the reference-distributed configurations will be important in the interacting ensemble. Therefore, the choice of the generating probability is essential for the efficient generation of libraries.

We found that for fragments the size of alanine residues, rather simple probability functions were sufficient for generating tens of thousands of (statistically independent) configurations in weeks of single-CPU time. This is a negligible cost because once a library is generated, it can be used in multiple simulations.

Different coordinate types are best sampled with different distributions, as is suggested by the forcefield terms. In our initial tests we employed the same functions and parameters as in the OPLS-AA forcefield — i.e. bonds and angles were described by harmonic functions and dihedrals by a linear combination of several cosine functions. However, this choice of energy functions and parameters did not work well because of strong correlations between coordinates. Better results were obtained when we extracted parameters from a short Langevin simulation of alanine dipeptide, using harmonic functions for bonds, angles and “stiff” dihedrals such as in the relatively planar peptide groups. So for bonds, angles and “stiff” dihedrals the reference potential used in Eq. (14) is


where ûi is simply a harmonic function and zi is a constant that enforces the normalization of uiref. The harmonic function is described by


where [x with macron]i is the equilibrium position and σi2 is the variance of a corresponding coordinate. The normalization constant is a single-coordinate partition function


where inverse length units are implied for all bond-angle integrals.

For “soft”, rotatable dihedrals — such as ϕ, ψ and χ angles in amino acids — we extracted histograms from a Langevin simulation of alanine dipeptide, as described in Ref.3 Thus, a two-dimensional (correlated) probability function was used for the (ϕ, ψ) dihedral pair. In particular, the joint histogram for the (ϕ, ψ) dihedrals consisted of 360 × 360=129,600 bins with a width of one degree. Histograms for “soft” dihedrals employed 6,000 bins with a width of 0.06 degree.

Coordinates were sampled according to their probability distributions by constructing a numerical cumulative distribution function (cdf) for each:


For the histogrammed variables (ϕ, ψ and χ), the cdf will exactly reproduce the distribution in the histogram. For variables with harmonic energy functions, we used a finely discretized numerical cdf to generate a close approximation to the desired distribution. Specifically, bond-length and bond-angle cdf’s employed 3,000 bins with a bin-width of 0.001 Å and 0.06 degree respectively. Histograms for dihedrals used 6,000 bins with a width of 0.06 degree. To sample from a given cdf, a uniform random number (0,1] is generated and equated to the cdf to find the corresponding xi value. Within each bin, the xi value is chosen uniformly. Parameters for all fragment distributions are available in the Supplementary Material.

Based on the distributions just described, internal coordinates were sampled independently (except for pairwise sampling of ϕ and ψ dihedrals) using an in-house program written in C. Generated configurations were converted to Cartesian coordinates and the potential energy was evaluated using the “analyze” module of the Tinker software package.41 Based on these values and the known reference energies, the individual fragment free energies were calculated using Eq. (7). A simple resampling procedure37 was used to generate a fragment ensemble distributed according to the forcefield; see Eq. (8). Only a small fraction ([greater, similar] 10−4) of reference-ensemble configurations remain after resampling, requiring extensive sampling of the reference ensemble and weeks of CPU cost, as mentioned earlier.

For this study, we generated libraries consisting of 10,000 configurations. All fragment libraries were sampled according to OPLS-AA forcefield at T = 298K, with a simple dielectric constant (ε = 60) modeling solvent. The choice of dielectric constant was motivated by the reasonable behavior observed in separate Langevin simulations of poly-alanine systems (data not shown).

As noted earlier, for all possible 108 covalent (neighboring) pairings of fragments, we also tabulated the interaction energies from the forcefield, accounting for all atoms in the fragment pair. Suitable corrections for dummy atoms (see below) were made. In other words, for a simple two-fragment system, all interactions are stored.

Use of dummy atoms

Because fragments are sampled independently from each other, the six degrees of freedom that specify the relative orientation of neighboring fragments are included with the fragments. For this purpose “dummy” atoms are used to provide the extra coordinates. When fragments are joined, the dummy atoms are simply removed because they are replaced with the “real” fragment atoms. Although fragment combination is simplest when dummy atoms do not interact with real atoms, we chose to have dummy atoms interact with the fragment atoms for better overlap with subsequent ensembles. Interacting dummy atoms can mimic the neighboring fragment atoms and improve the efficiency of the method. When fragments are joined, however, the interaction energy of the dummy atoms must be substracted from the full fragment energy because they are replaced with the neighboring fragment atoms. When interacting dummy atoms are employed care must be taken to avoid intoducing extra degrees of freedom (e.g., certain bond lengths and angles) because they can affect the ensemble distributions.

The dummy atoms used at the N-terminus of a fragment are carbonyl C, carbonyl O and terminal alpha-C with valence set to one. The dummy atoms used at the C-terminus are amide N, amide H, and terminal alpha-C with valance also set to one. The dummy atoms were assigned the same forcefield parameters as used in the corresponding fragment atoms.

H. The second and subsequent intermediates: adding neighboring fragment interactions

Returning again to the scheme embodied in Eq. (5), as well as in Fig. 1 and Fig. 2, the next intermediates add interactions between neighboring fragments. These can be considered the “bonded” interactions in the space of fragments, but non-bonded interactions among all atoms in the neighboring pair are included. Explicitly, the models for the remaining intermediates are described by


where Uyz is the full interaction energy — based on the forcefield and solvent model — between fragments y and z.

Formally, it is clear what needs to be done. The ensemble of the previous stage j − 1 should be used to calculate ΔFj using the perturbation relation (7) with Uj−1 and Uj.

Again, however, possession of the libraries and interaction tables leads to dramatic practical implications. For instance, by construction, the energy U2U1 is simply the pre-stored energy UAB; similarly U3U2 = UBC These tabulated energies are used directly in Eq. (7) without the need for additional energy calls. The required ensembles for each stage are generated by the rapid resampling procedure of Eq. (8). In this way, one readily generates the free energy differences ΔF2, ΔF3,… required for the evaluation of Fphys via Eq. (5).

Caution is required when the molecule of interest contains repeated fragment pairs. While the same libraries can be used for the repeats, say at intermediate stages j and m, the corresponding values of ΔFj and ΔFm will be different in general. To see the reason, consider the case of the tetra-alanine peptide studied below. The term ΔF2 corresponds to including the interaction of the already combined fragments Ace-Ala with the next Ala. Note that the free energy difference ΔF2 is calculated via Eq. (7) using the Ace-Ala ensemble as the “a” system. By contrast, consider the calculation of ΔF3 for the addition of the next Ala — now to the Ace-(Ala)2 ensemble. Although the free energy change will be based upon the identical (tabulated) interactions, the associated Boltzmann factors in Eq. (7) will be weighted differently — i.e., occur with different frequencies — due to the differing initial “a” ensembles. In turn, this will lead to different free energy changes, so that ΔF3 ≠ ΔF2.

I. The final free energy difference: non-neighboring interactions

As described in the master scheme of Eq. (5) and illustrated in Fig. 1 and Fig. 2, the final calculation needed to obtain Fphys entails the inclusion of all remaining interactions in the forcefield and solvent model. These interactions, excluded until now, occur between atoms in non-neighboring pairs. As described in Sec. II D, for a molecule of k fragments, the full physical potential energy function (i.e., the forcefield) can be written as the difference from the final (kth) intermediate:


where the sum is over non-neighboring pairs of fragments — i.e., AC, AD, BD, ….

In this case, the necessary energy terms for use in the calculation of ΔFk→phys via Eq. (7) must be calculated. They cannot readily be stored in advance, due to the combinatorial explosion of possible configurations. For instance, with libraries of 104 configurations, there are 1012 possible configurations for three fragments, which is beyond the range of current commerical machines.

J. Generating an equilibrium ensemble without additional energy calls

The physical ensemble, distributed according to the Boltzmann factor of the forcefield, can be generated by resampling the Uk ensemble — the last intermediate — using Eq. (8). In this case, the “a” ensemble corresponds to Uk and the “b” ensemble to the full forcefield and (implicit) solvent model. Because all energy terms have already been calculated, no additional energy calls need to be made. The necessary resampling computation is extremely fast compared with preceding stages of the protocol.

K. Checking the code and estimating uncertainty

Although the formalism governing the present study is mostly straightforward, our in-house computer program not only needs to reproduce standard forcefield results, but also requires complicated “dissections” of various subsets of forcefield terms. We therefore performed three types of checks on our code. (i) We checked that the forcefield energy for full molecular configurations exactly reproduces the results reported in Tinker (data not shown). This verifies that we have correctly accounted for our dummy-atom energy terms, (ii) Using our previously developed “structural histograms” for analyzing configuration-space distributions,42,43 we checked that the equilibrium ensembles produced during our free energy calculations agree with independent Langevin simulations. This data is shown in the Results section, and generated as explained below, (iii) Finally, we performed a check to ensure that our final free energy values are independent of the choice of fragments. This data, for two- and three-fragment decompositions of alanine dipeptides is also shown in the Results section.

Statistical error

Statistical uncertainties were calculated by running 20 independent computations for every free energy value reported. Twice the standard deviation among these 20 values is reported, which quantifies the scale of expected statistical error for a single simulation. The repeated simulations were run using 20 independent sets of libraries for the various fragments — i.e., the calculation was started all the way at the beginning in each repeat. However, because the overlap between various stages is the limiting factor in the quality of the free energy results, rather than our fairly large libraries, we anticipate similar error estimates would be obtained for one set of libraries.

Analyzing equilibrium ensembles/distributions

In two previous studies,42,43 we have developed methods for comparing equilibrium distributions for molecular systems of arbitrary complexity. The central idea is to employ a “structural histogram” which simply classifies (divides) configuration space into a number of “bins” (regions). Two correct simulations should yield the same results for the fractional populations of the bins, within statistical error, regardless of whether the bins correspond to physical states/free energy basins. (Furthermore, the statistical uncertainty in the population estimates can be used to quantify the “effective sample size”.)43 In the present work, we compare equilibrium distributions from fragment combination and from standard Langevin simulations based on structural histograms. The particular histograms employed in the present study have five bins derived from a Voronoi construction;44 the reference structures for the Voronoi procedure are derived from the equi-probability scheme described in reference.43 Although the resulting bins are not exactly equally probable, each is guaranteed to represent a contiguous region in configuration space due to the Voronoi construction.


The absolute configurational free energy Fphys was calculated for the monomer, dimer, and tetramer alanine peptides: alanine dipeptide (Ace-Ala-Nme), di-alanine (Ace-(Ala)2-Nme), and tetra-alanine (Ace-(Ala)4-Nme). For alanine dipeptide, the free energy was estimated based on two different fragment sets as a check on our code. Twenty independent calculations for every Fphys estimate were performed to quantify uncertainty, as described above. Additionally, every free energy calculation also yields an equilibrium ensemble, which is compared to independent Langevin simulations.

The results are very encouraging in every regard, and rather rapid as reported at the end of the Results section. The amount of memory used, which is a key to the present calculations, is also reported.

The results reflect the uniform protocol adopted here. First, absolute free energies for non-interacting fragments are calculated. Then free energy changes resulting from interactions among covalently bound fragments are added (“bonded” terms, in the space of fragments), one at a time in sequence. Finally, all remaining interactions are added, which account to (“non-bonded”) interactions among all non-sequential fragment atoms. The final free energy values reflect the full OPLS-AA forcefield36 as implemented in Tinker.41

A. Alanine dipeptide using two different fragmentations

Because of the complexity of the fragmentation procedure and the lack of reference standards for absolute free energy values, we wanted to ensure our code was introducing no artifacts. We were particularly concerned about the interacting dummy atoms which introduce “temporary” energy terms, that must be corrected for properly at every combination stage. We find good agreement between free energy estimates based on two- and three-fragment decompositions.

“Standard” three-fragment decomposition

In our standard decomposition for the present study, we separate peptide and amino acid groups. For alanine dipeptide (AD), then, the three standard fragments are Ace, Ala, and Nme, and the corresponding stages for the free energy calculation are given in Eq. (9). Recalling our convention that Fref [equivalent] 0, the free energy terms from Eq. (5) can be written as


where Fy is the absolute free energy (including dummy atoms) for fragment y and ΔFx→y indicates the free energy change of combining fragments x and y (which includes all bonded and non-bonded terms, as well as the correction of dummy terms). Finally, ΔFnonbonded denotes the free energy change in going from an ensemble where sequentially separated fragments do not interact to a fully interacting ensemble (in this case, the Ace-Nme interactions are added).

Two-fragment decomposition

As an alternative decomposition, we used Ace-Ala as one fragment and Nme as the other. Importantly, the Ace-Ala library and absolute free energy were not generated from a combination of the two smaller libraries, but instead from a ground-up calculation based on independent coordinates as described in the Methods section.

In this case, then, the free energy terms from Eq. (5) become


where it is notable that in the two-fragment case, all interactions are included in the libraries and interaction tables. In other words, no energy calls at all are needed.

Comparison of free energies

There is essentially perfect agreement between free energies estimate via the two independent decompositions, which provides a reassuring check on our computer program. The full results are given in Table I. Notably, the two-fragment decomposition has a higher variance, which probably results from a decreased “precision” in the pre-generated Ace-Ala ensemble. In the composite pre-generated Ace-Ala ensemble, the whole configuration space is represented by 104 configurations, whereas when Ace and Ala from separate 104-member libraries are combined, there is a much denser coverage of configuration space.

Table I
Comparison between the absolute free energy for alanine dipeptide estimate using two different fragmentation schemes. Our “standard” three-fragment decomposition (Ace, Ala, Nme) is compared to a two-fragment grouping (Ace-Ala, Nme). The ...

Equilibrium ensemble compared to standard simulation

Our free energy computation produces an equilibrium ensemble through repeated resampling procedures at each stage, as explained in the Methods section (Sec. IIJ). As a further check on our data, we compare the equilibrium ensembles generated from our fragment combination procedure to those produced by long Langevin simulations performed in Tinker.41 The results, shown in Fig. 3(a), indicate that our computation is indeed producing correct equilibrium ensembles. The graph shows the populations of different regions of configuration space, which was divided up using a Voronoi procedure explained above (Sec. IIK). The alanine dipeptide equilibrium distribution was generated from the three-fragment protocol, and the 1 µsec. Langevin simulation (20×50 nsec) was performed in Tinker using a friction constant of 10.0 ps−1 at T = 298K.

Figure 3
Comparison of equilibrium distributions from fragment combination and Langevin simulation. The graphs show the fractional population in different regions of configuration space, as described in Sec. II K. Three peptides are considered: (a) alanine dipeptide, ...

B. Di-alanine

Using the same libraries as for the alanine monomer above, we now calculate the absolute configurational free energy for the di-alanine peptide (Ace-(Ala)2-Nme). The staging used is described in Eq. (10), which corresponds to the following free energy terms for use in Eq. (5):


It is important to note that ΔFAla→Ala for di-alanine differs in principle from the term with the same name in Eq. (24) for alanine-dipeptide because the prior ensemble is different. This point was explained in Sec. IIH. However, in practice, the difference among the three systems is not statistically significant.

The free energy values are once again calculated with high precision: fluctuations are a fraction of one kcal/mole. The data for all free energy terms is given Table II, where we see a significant change in the ΔFnonbonded term, reflecting the increased number of attractive interactions in this larger molecule (compared to alanine dipeptide).

Table II
Free energy terms used in calculating the absolute free energy for di-alanine and tetra-alanine. The table gives free energy values in kcal/mole, as well as two standard deviations (in parentheses) based on 20 independent calculations.

Similarly, the agreement among bin populations for di-alanine in Fig. 3(b) is good, which provides an independent reason for having confidence in the free energy results. The Langevin simulation for di-alanine was performed with exactly the same parameters as for alanine-dipeptide.

C. Tetra-alanine

Our results are of high precision (~ 0.1 kcal/mole standard deviation) even for tetra-alanine. The staging follows our standard procedure, with the only subtlety in the present case is that the addition of every Ala residue is different, because the “growing” ensemble is different in every case. Thus we consider the first alanine (Alal) separate from the second (Ala2), and so on.


Our data for each of these terms is given in Table II. Although the different alanine additions are based on different ensembles, the results show they are statistically indistin-guishable in this case. However, the “non-bonded” term ΔFnonbonded again is significantly different from the previous molecules, as expected.

In comparing the equilibrium distributions from fragment combination and Langevin simulation, once again there is good statistical agreement. For Langevin simulation of tetra-alanine, all parameters were set as before, except for a friction constant of 5.0 ps−1, which does not alter the equilibrium distribution. The contrast between the large fluctuations in the bin populations pi and the high precision of Fphys in Table II reveals an important lesson: sampling is harder than free energy calculation. Future work will describe efforts to improve efficiency for fragment-based equilibrium sampling.

D. Timing and memory usage

The calculations were reasonably inexpensive, taking 20 minutes for alanine dipeptide, 30 minutes for di-alanine, and 50 minutes for tetra-alanine using one processor of an Intel Xeon 3.20 GHz machine. Concerning memory, a single library containing 10,000 configurations requires 11 MB for Ala, 12 MB for Ace-Ala complex and 5.7 MB for Ace and Nme. An interaction table containing 108 pair-wise interactions uses 1.3 GB.


A. The overall strategy and results

Overall, the precision of our free energy estimates was very high, which we attribute to two related factors. First, our ensembles in the reference and intermediate stages were of good statistical quality — i.e., characterized by a large effective sample size (data not shown).43 Second, there was good overlap between the stages, which indeed contributed to maintaining the sample size throughout the stages. The overlap is present by design, as interactions were always added between stages. The addition of interactions or, equivalently, correlations among degrees of freedom is guaranteed to reduce the entropy.45 This progressive narrowing of configuration space is consistent with Kofke’s proposal to calculate free energy differences in the “insertion” direction.39,40 For larger systems, however, one expects limitations to maintaining the sample size using the present protocol, as explained below.

We note that our present approach differs from our previous work3 primarily because the earlier study did not employ staging. Effectively, it was a single-fragment calculation from the perspective of the present study.

B. Application of fragment combination for estimating relative protein-ligand affinities

Because the fragment combination procedure can be applied to fragments of small molecules, and not just to peptides as in the present report, the approach can be applied to calculate approximate relative affinities. That is, one can grow a ligand into the binding pocket of a protein receptor and calculate its free energy. A number of different approximations can be imagined. Most simply, the receptor can be held rigid and the ligand grown in the fields (van der Waals and electrostatic) of the receptor. In a better approximation, the binding-site side-chains can be grown along with the ligand. One can expect affinities based on such free energy calculations to be superior to their docking counterparts because entropy is included. To produce a relative affinity estimate between two ligands, the respective solvation terms would need to be included as usual.12

C. Efficiency of fragment combination for equilibrium sampling

As we have already noted, the fragment combination protocol we have described produces equilibrium ensembles simultaneously with free energy estimates. It is natural to wonder whether such ensembles are produced more efficiently than by standard dynamics simulation especially given that small peptides have been shown to have multi-nanosecond relaxation times.43 In fact, as we will document in a separate publication, fragment combination can lead to sampling that is faster by several orders of magnitude. However, somewhat more sophisticated resampling schemes and different fragment sizes are useful in reaching the highest levels of efficiency, as will be reported.

D. Use of implicit solvent models

It is interesting and important to consider the additional costs which would be entailed by using a standard implicit solvent model, such as GBSA.46 First, both the libraries and the interaction tables would need to be regenerated using the implicit solvent model. Although this could take several weeks of single-CPU time, it needs to be done only a single time for a given model. The second cost is for additional solvent calculations not included in the libraries and tables. We hope to report on the staging and computational expense in a forthcoming publication.

E. Relaxation simulations for large systems

The protocol described in this report will encounter challenges with larger systems, as our results hint. In the present study, the equilibrium ensemble generated at one stage, say j, is used to calculate the incremental free energy difference to the next stage, j + 1. To continue the process, the ensemble at stage j + 1 is produced by resampling ensemble j as described in the Methods section. However, it is possible the resampled ensemble will contain a small number of distinct configurations in an important part of configuration space. Such configurations will have high weight prior to resampling and thus the problem can be diagnosed by noting whether any configurations are resampled multiple times. Clearly, duplicated configurations will not be statistically independent and lead to increased statistical error in free energy estimates.

One solution to this problem would be to relax duplicated configurations — i.e. to perform short equilibrium simulations to create distinct configurations. The statistical justification of such an approach is somewhat technical47 and will be described in future work as required.

F. Alternative staging using partial interactions

Additional incremental stages can be added by considering only subsets of interactions. For instance, in the case of di-alanine (Ace-(Ala)2-Nme) which is composed of four fragments (A, B, C, D), there are three sets of non-bonded interactions: AC, AD, and BD. Our present implementation adds all three in a single stage, but they could be added one at a time. Undoubtedly, in larger systems, such finer staging will be necessary and probably should be required by relaxation of duplicated configurations as just described.


We have extended our earlier work on computation of absolute free energies for molecular systems 3 in two ways: (i) by staging the calculation using molecular fragments; and (ii) by pre-calculating “statistical libraries” of fragment configurations, intra-fragment energies, and inter-fragment interactions. For a series of test systems – the alanine monomer, dimer, and tetramer — we were able to compute extremely precise free energies, with fluctuations [double less-than sign] 1 kcal/mole. The calculations were quite fast, furthermore, with the slowest requiring less than an hour of single-processor computer time. The speed results from employing (infinitely re-usable) pre-calculated library configurations and interactions, which pre-empt expensive energy calculations. Our statistical libraries of amino-acid and capping-group fragments are available on our website (

A future application of potential importance in computational biochemistry is the estimation of binding affinities of small molecules to proteins. We hope to develop fragment libraries suitable for small molecules — following ideas developed long ago29,30 — to be used in computing free energies within a potentially flexible protein binding site. Protein flexibility could be included using the side-chains of the binding site as fragments in the calculations.

Another application closely related to the present report is the use of library-based fragment combination for equilibrium sampling. While the calculations reported here already yield equilibrium ensembles, we are actively studying more efficient schemes based on advanced resampling approaches37,48 and a range of fragment sizes. This work will be reported in a separate publication.

Supplementary Material

Supp data


We greatly appreciate the advice of Divesh Bhatt, Ying Ding, Marty Ytreberg, and Bin Zhang, as well as the financial support of the NIH (Grants GM076569 and GM070987) and NSF (Grant MCB-0643456).

Contributor Information

Xin Zhang, Department of Physics & Astronomy, University of Pittsburgh, Pittsburgh, Pennsylvania 15260.

Artem B. Mamonov, Department of Computational Biology, School of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania 15213.

Daniel M. Zuckerman, Department of Computational Biology, School of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania 15213.


1. Frenkel D, Smit B. Understanding Molecular Simulation. San Diego: Academic Press; 1996.
2. Stoessel JP, Nowak P. Macromol. 1990;23:1961.
3. Ytreberg FM, Zuckerman DM. J. Chem. Phys. 2006;124:104105. [PubMed]
4. Huang L, Makarov DE. J. Chem. Phys. 2006;124:64108. [PubMed]
5. Meirovitch H. J. Chem. Phys. 1992;97:5803.
6. Meirovitch H. J. Chem. Phys. 1999;111:7215.
7. Cheluvaraja S, Meirovitch H. Proc. Natl. Acad. Sci. U. S. A. 2004;101:9241. [PubMed]
8. White RP, Meirovitch H. Proc. Natl. Acad. Sci. U. S. A. 2004;101:9235. [PubMed]
9. Head MS, Given JA, Gilson MK. J. Phys. Chem. B. 1997;101:1609.
10. Chang C-E, Potter MJ, Gilson MK. J. Phys. Chem. B. 2003;107:1048.
11. Chang C, Gilson M. J. Am. Chem. Soc. 2004;126:13156. [PubMed]
12. Chen CC, W, Gilson M. Biophys. J. 2004;87:3035. [PubMed]
13. Go N, Scheraga H. J. Chem. Phys. 1969;51:4751.
14. Rosenbluth MN, Rosenbluth AW. J. Chem. Phys. 1955;23:356.
15. Wall FT, Erpenbeck JJ. J. Chem. Phys. 1959;30:634.
16. Meirovich H. J. Phys. A: Math. Gen. 1982;15:L735.
17. Meirovich H. Phys. Rev. A. 1985;32:3699. [PubMed]
18. Garel T, Orland H. J. Phys. A: Math. Gen. 1990;23:L621.
19. Garel T, Niel JC, Orland H, Smith J, Velikson B. J. Chem. Phys. 1991;88:2479.
20. Velikson B, Garel T, Niel JC, Orland H, Smith JC. J. Comput. Chem. 1992;13:1216.
21. Bascle J, Garel T, Orland H, Velikson B. Biopolymers. 1993;33:1843. [PubMed]
22. Grassberger P. J. Phys. A: Math. Gen. 1993;26:2769.
23. Grassberger P, Hegger R. J. Phys.- Condens. Mat. 1995;7:3089.
24. Grassberger P. Phys. Rev. E. 1997;56:3682.
25. Bastolla U, Frauenkron H, Gerstner E, Grassberger P, Nadler W. Proteins. 1998;32:52. [PubMed]
26. Zhang JL, Liu JS. J. Chem. Phys. 2002;117:3492.
27. Zhang J, Lin M, Chen R, Liang J, Liu JS. Proteins. 2007;66:61. [PubMed]
28. Gibson KD, Scheraga HA. J. Comput. Chem. 1987;8:826.
29. Miranker A, Karplus M. Proteins. 1991;11:29. [PubMed]
30. Leach AR. Molecular Modelling: Principles and Applications. Pearson Education Limited; 2001.
31. Mamonov AB, Bhatt D, Cashman DJ, Zuckerman DM. submitted. [PMC free article] [PubMed]
32. Wall FT, Rubin RJ, Isaacson LM. J. Chem. Phys. 1957;27:186.
33. Alexandrowicz Z. J. Chem. Phys. 1969;51:561.
34. Macedonia MD, Maginn EJ. Mol. Phys. 1999;96:1375.
35. Rohl CA, Strauss CEM, Misura KMS, Baker D. Method Enzymol. 2004;383:66. [PubMed]
36. Jorgensen WL, Maxwell DS, Tirado-Rives J. J. Am. Chem. Soc. 1996;118:11225.
37. Liu JS. Monte Carlo strategies in scientific computing. Springer; 2004.
38. Zwanzig RW. J. Chem. Phys. 1954;22:1420.
39. Peter DAK, Cummings T. Mol. Phys. 1997;92:973.
40. Kofke DA, Cummings PT. Fluid Phase Equilibr. 1998;150–151:41.
41. Ponder JW, Richard FM. J. Comput. Chem. 1987;8:1016.
42. Lyman E, Zuckerman DM. Biophys. J. 2006;91:164. [PubMed]
43. Lyman E, Zuckerman DM. J. Phys. Chem. B. 2007;111:12876. [PMC free article] [PubMed]
44. Voronoi G. Journal für die Reine und Angewandte Mathematik. 1907;133:97.
45. Cover TM, Thomas JA. Elements of Information Theory. 2nd Edition. Wiley; 2006.
46. Qiu D, Shenkin PS, Hollinger FP, Still WC. J Phys. Chem. A. 1997;101:3005.
47. Neal RM. Stat. comput. 2001;11:125.
48. Fearnhead P, Clifford P. J. R. Stat. Soc. B. 2003;65:887.