|Home | About | Journals | Submit | Contact Us | Français|
Computational molecular design is a useful tool in modern drug discovery. Virtual screening is an approach that docks and then scores individual members of compound libraries. In contrast to this forward approach, inverse approaches construct compounds from fragments, such that the computed affinity, or a combination of relevant properties, is optimized. We have recently developed a new inverse approach to drug design based on the dead-end elimination and A* algorithms employing a physical potential function. This approach has been applied to combinatorially constructed libraries of small-molecule ligands to design high-affinity HIV-1 protease inhibitors [M. D. Altman et al. J. Am. Chem. Soc. 130: 6099–6013, 2008]. Here we have evaluated the new method using the well studied W191G mutant of cytochrome c peroxidase. This mutant possesses a charged binding pocket and has been used to evaluate other design approaches. The results show that overall the new inverse approach does an excellent job of separating binders from non-binders. For a few individual cases, scoring inaccuracies led to false positives. The majority of these involve erroneous solvation energy estimation for charged amines, anilinium ions and phenols, which has been observed previously for a variety of scoring algorithms. Interestingly, although inverse approaches are generally expected to identify some but not all binders in a library, due to limited conformational searching, these results show excellent coverage of the known binders while still showing strong discrimination of the non-binders.
The computational analysis of protein-ligand complexes is important in a number of fields. Current methods allow binding sites to be located1 and natural substrates to be identified for a given protein.2,3 They also permit the rational design of inhibitors, and in the future will likely allow potential drugs to be screened for harmful side effects. Whilst these methods are extraordinarily useful, a number of difficulties need to be addressed that limit current approaches. The scoring functions used to evaluate and rank compounds are approximate and may be inapplicable to varied libraries of compounds.4 This can lead to predictions of both false positives and false negatives. False positives are a problem in all computational methods. However, whilst false negatives are not desirable, it could be argued that from a design perspective they are less problematic than false positives, where the aim is to predict a set greatly enriched in high-affinity compounds. Compound scoring is also inextricably linked to conformational sampling, and inadequate sampling can limit the effectiveness of these methods. Recent work from this laboratory has focused on the development of a method for molecular design of inhibitors using an inverse approach with physical potential functions.5 The approach considers a library of potential ligands created combinatorially from a set of scaffolds and side groups. Each chosen scaffold is first placed systematically in the binding site in many different acceptable conformations and orientations. All scaffolds contain a number of attachable positions where side groups can be substituted, generally at hydrogen atoms. Every attachable position is associated with a library of side groups, each member of which exists as a set of discrete rotameric conformations. This problem framing is analogous to that used for inverse protein design.6-15 We use a pairwise decomposable energy function with the dead-end elimination (DEE)8,16 and A*13 algorithms to prune poor scoring compounds and poses to produce an energetically ordered list of the best computed binders. This ranked list is then re-evaluated using more sophisticated energy functions, which need not be pairwise additive, yielding a set of compounds predicted to bind tightly. This method has been applied to the rational design of HIV-1 protease inhibitors, which have shown high affinity binding in experimental validation.5 Here we evaluate the method using a test system, which has been used with other design approaches, and look for problems and potential solutions.
Small engineered binding sites have proven very useful for the study of protein-ligand binding. In the 1990s, a number of such sites were engineered. A hydrophobic binding site was created in a stable mutant of T4 lysozyme, L99A.17,18 Ninety-one different compounds were tested for binding affinity, providing information on the relative specificity of the site. This data was then used in computational studies.19-21 Later work yielded a double mutant, L99A/M102Q, that was able to bind small polar ligands.20,22 A charged binding site was also created from the protein cytochrome c peroxidase (CCP) by a mutation near the active site, W191G.23 The resulting pocket is buried within the protein close to the heme group and was found to bind a variety of small, typically singly-charged ligands. The protein and the binding site can be seen in Figure 1. This CCP mutant has been used subsequently to examine binding specificity, both experimentally24 and computationally.25,26 This has proven useful in identifying problems in molecular docking, in particular with the scoring functions that are employed, but inverse design methods have, thus far, not been tested in this way. The CCP site is particularly attractive due to the availability of large amounts of data. There are forty-three known binders and thirty-five of these have been crystallized with the protein at high resolution. Here we report the application of our inverse design approach to the engineered W191G CCP site with a set of small-molecule scaffolds and side groups. The aim was to identify areas in which the algorithm performs well and to find deficiencies and potential remedies.
Parallel designs were carried out in unbound and bound structures of the CCP W191G site. The unbound structure was from PDB ID 1CMQ23 at 2.30 Å resolution. The ligand-bound structure in complex with pyridine-2,6-diamine was selected from PDB ID 2ANZ26 at 1.75 Å resolution. The relatively large ligand in the bound state yields a cavity that can accommodate ligands of varying sizes. The use of unbound and bound structures permits the evaluation of induced fit effects. However, we are more concerned with the results from the bound structure because the inverse design method would commonly be used with a ligand-bound structure.
The two protein structures were then prepared as follows. Coordinates for the protein, the heme, the iron, and all the water atoms were taken from the PDB. The ligand was removed from the bound state. The protein and water hydrogen-atom positions were built using the HBUILD facility of the CHARMM27 program package with the CHARMM22 energy function. All asparagine and glutamine residues were then checked manually for potential hydrogen bonding and analyzed by NQ-Flipper.28 For 1CMQ, residues Q20, N24, N208, N220, Q240, and Q292 were altered by swapping the coordinates of the nitrogen and oxygen atoms to improve the hydrogen bonding patterns. For 2ANZ, residues Q20, N24, N208, Q240, and Q292 were altered. Histidine residues were checked for orientations and protonation state in the same manner, which resulted in all being assigned as delta protonated in each case and none being flipped. The residues lysine, arginine, aspartate, glutamate, cysteine, and tyrosine were also analyzed in this manner to check their protonation state. There was no evidence of any unusual protonation states and thus all lysine and arginine residues were assigned as positively charged, all aspartate and glutamate residues were assigned as negatively charged, and all cysteine and tyrosine residues were assigned as neutral. In the ligand-bound structure 2ANZ, the active site water molecule 1000 was retained within the site, as it is tightly bound and is present in all known complexes. This water was also retained in previous studies.25,26 In the unbound structure the retained water is number 308. The heme group present in CCP contains a ferric iron atom but the CHARMM22 heme parameters correspond to a ferrous iron atom. Calculations were thus performed to determine the partial charges on the heme atoms in this case. Calculations were performed on the entire heme group from the structure 2ANZ along with the iron-coordinating water molecule (1415) and the iron-coordinating histidine residue His175 capped with an acetylated N-terminus and an N-methylamido C-terminus. Hydrogen atoms were initially built using Gaussview.29 The partial atomic charges on the water and the protein atoms were constrained to the standard CHARMM22 charges to prevent charge transfer and then calculations were performed with Jaguar.30,31 The restricted Hartree-Fock model and the 3-21G basis set was used for all the atoms except the iron atom, for which a 6-31G+ basis set was used to account for d-orbitals. The system was geometry optimized to refine the hydrogen-atom positions and charges for the heme group were calculated using a RESP fit.32,33 The hydrogen-atom positions were then rebuilt using HBUILD27, consistent with the changes to the conformations of asparagine and glutamine residues. After minimization of the hydrogen-atom positions, the resultant structures were then rotated into a new coordinate frame and the protein atoms were assigned PARSE charges34 for use with Delphi.35 Precomputations for a grid-based estimation of van der Waals and electrostatic interactions were carried out. The methodology for creating these grids is discussed in reference 5.
The sets of scaffolds and side groups describing the chemical and conformational space of potential ligands was constructed from a set of known binders and non-binders published by the Shoichet laboratory based on work performed for this site.26 The data can be found at http://shoichetlab.compbio.ucsf.edu/ccp_binders_decoys.xls and the structures are included here as supplementary figures 1 and 2. Each molecule was stripped to a core plus a number of side groups, yielding the scaffold shown in Figure 2. There were thirty-five core pieces, and these were combined with the twelve side groups shown in Figure 3, to give a total of 5843 possible compounds. This scheme provided for the computational construction of every binder and non-binder from the original data. Due to the combinatorial nature of ligand construction, some molecules not in the original data set and not previously characterized were also modeled and studied here. Both the scaffolds and the side groups were initially created using Gaussview29 with a hydrogen placed at each attachment site. The molecules were then geometry optimized with Gaussian 0336 using the restricted Hartree-Fock model employing the 3-21G basis set. The resulting structures were then re-evaluated with a single-point Gaussian 03 energy calculation using the restricted Hartree-Fock model with the 6-31G* basis set. The partial atomic charges were then assigned using a RESP fit.32,33 Previous work has shown that this scheme performs well compared to performing both the minimization and single-point calculation with the 6-31G* basis set.37 Protonation states were assigned as in previous studies.26 A systematic set of conformations was created for each scaffold and for each side group by rotating each rotatable torsion angle in increments of 30°. Van der Waals radii were scaled by 0.75 and each conformation where the scaled radii of any two non-bonded atoms overlapped was discarded. For all of the work with small molecules, CHARMm22 torsional parameters, van der Waals parameters and van der Waals radii were used throughout.38 This yielded an ensemble of conformations without steric clashes for the molecular building blocks of the chemical library.
The general methodology for carrying out the design has been discussed previously5 and is displayed as a flowchart in Figure 4. Here we will briefly sketch the approach and stress differences from previous work. The first step was to define the active site. This was done by defining a box encompassing the entire site and placing spheres at each grid point in a grid within this box with a resolution of 0.25 Å. Any spheres overlapping with the protein were removed from the grid and thus the resultant shape followed the surface of the active site and limited where atoms could be placed within the design. This differs from previous work in which the active site was defined by the coordinates of substrates. The next stage involved creating van der Waals and electrostatic grids for grid-based energy calculations. The scoring function employed in the combinatorial search procedure contains three primary components, a van der Waals packing term, a screened electrostatic interaction term, and desolvation penalties for both the designed ligand and the receptor. Grids for van der Waals energies are computed by placing a particular parameterized CHARMM atom type at each grid-point and computing its van der Waals interaction energy with the rest of the receptor. This is repeated for every atom type. To derive the van der Waals binding energy for a given molecule, the energetic contribution of each atom is calculated by trilinearly interpolating energies from the surrounding eight points of the appropriate grid. The target shape, in which the design is performed, is fixed throughout the combinatorial search portion of the algorithm, and calculations of grid-based potentials for the evaluation of electrostatic interaction and desolvation are performed using the linearized Poisson—Boltzmann equation. As shown previously, the electrostatic component of the binding free energy given a fixed shape for the bound and unbound states, fixed charges on the receptor, and a set of basis points within the ligand can be written as the sum of the screened electrostatic interaction term and desolvation penalties for both the designed ligand and the receptor.39 The basis points within the ligand are set to a regular cubic lattice and the electrostatic binding energy of any molecule can be approximated by trilinearly projecting each partial atomic charge to the grid points. This approximation estimates the electrostatic binding free energy of the molecule within the target shape, rather than the correct molecular surface of the ligand derived from radius parameters. Keeping the target shape constant allows for precomputation at each grid-point and is the basis for the fast grid-based electrostatics and solvation approximation.
In addition to van der Waals and electrostatic solvation terms, additional components of the score during the combinatorial search include a bump check against the target shape, ensuring that any molecule outside it has an infinite energy, as well as functional group/scaffold and functional group/functional group bump checks to ensure that designed molecules are not self-intersecting. Bump checks were chosen over traditional molecular mechanics internal energies because it is unclear when designing a small molecule how much internal strain is paid upon synthesis rather than binding. In order to use existing implementations of combinatorial search algorithms such as DEE and A*, the scoring function must be pairwise decomposable in functional group conformation and identity. This means that the total energy of a given scaffold with added functional groups consists of a constant term, a sum over the self energy contributions of each functional group alone, and a sum over contributions for each pair of functional groups. The constant term includes the grid-based van der Waals and electrostatic binding energy of the blunt scaffold, as well as the constant receptor desolvation term. The self energy for each attached functional group in a discrete conformation is the grid-based van der Waals contribution for all atoms, and the self electrostatic binding energy of the functional group can be computed through grid projection. In addition to electrostatic interaction and self desolvation, the indirect desolvation between the functional group and the scaffold must also be added to the self energy. Any functional group geometry that fails a bump check with the scaffold or with the shape is removed from further consideration in the combinatorial search. The contribution of a pair of functional groups to the binding free energy only contains the indirect solvation effects between them. If the two functional groups fail a bump check and clash, their pair energy contribution is infinite. The pairwise energy decomposition presented above is crafted such that the energy sums to same value as if the entire molecule, complete with scaffold and functional groups, was evaluated with the grid-based van der Waals and electrostatics/desolvation functions.
We used a van der Waals grid spacing of 0.125 Å and an electrostatic grid spacing of 0.5 Å. A finer grid size was used for the van der Waals component because the potential varies more rapidly in space and higher resolution was necessary to capture the steep repulsion when an atom of the ligand approaches the protein. On the other hand, the electrostatic potential is smoother and a lower resolution grid was sufficient. The van der Waals component is relatively inexpensive to compute at finer resolution in comparison to the Poisson--Boltzmann electrostatics. Each conformation of each scaffold was then subject to systematic placement in the active site with a translational enumeration of 0.25 Å and rotational enumeration such that the maximum arc length of atoms from the centroid swept out a distance of 1.0 Å between orientations. Scaffold poses were discarded if their calculated van der Waals binding energy was greater than zero or any two non-bonded atoms 0.75 scaled radii overlapped. The DEE/A* algorithm was then run on the set of low energy scaffold poses, which placed all side groups in all conformations and combinations on each scaffold pose. The torsion joining the scaffold to the side group was enumerated at 10°. The rigid binding free energy was computed for all the species created with the low resolution energy function detailed in Table I. This yielded an energy ranked list of molecules with the guarantee that no solutions were missed. For every scaffold pose, only the lowest-energy conformation of each particular designed molecule was retained. The ranked list was then pared down to a computationally feasible size by applying an energy cutoff. Compounds greater than 25.0 kcal/mol above the lowest energy result were removed. However, every copy of each binder and non-binder was retained to provide an estimate of binding affinity for each one. The resultant set of molecules was then re-analyzed using the medium resolution energy function detailed in Table I. At this stage another energy cutoff was applied and compounds greater than 15.0 kcal/mol above the lowest energy result were removed. All binders and non-binders were again retained. This final set of molecules was then evaluated using the two high-resolution energy functions, one of which includes a geometry optimization using CHARMM.27 The minimization is performed on the ligand in a rigid protein structure for 1,000,000 steps using the adopted basis Newton-Raphson method. Minimization did not radically alter any of the ligand conformations, with the RMSD between the initial and minimized structures varying between 0.24 Å and 1.16 Å. The exact details of these two energy functions can again be seen in Table I. These calculations provided a predicted binding free energy for each known binder and non-binder in both the unbound and the bound state as well as some molecules that have not previously been reported. These predictions were then compared with experiment to assess the enrichment power of the method.
The aim of this work is to assess the ability of the design methodology to accurately separate binders from non-binders. We used a guaranteed discrete search method to systematically examine poses and conformations of a selected set of scaffolds and combined these combinatorially with a selected set of side groups to compute a binding free energy for all the known binders and non-binders plus a new set of untested compounds. The analysis presented here evaluates the ability of the method to accurately predict the structure of the complexes and the ability to discriminate binders from non-binders. After analysis with the four scoring functions, the results of all the compounds were compiled and the predicted binding free energies are given for known binders in Table II and non-binders in Table III.
The calculated relative binding free energies of all the predictions were between 0 and -30 kcal/mol. These are not absolute binding free energy values due to neglect of some factors, such as translational and rotational entropy. Forty-one of the forty-three known binders had a predicted binding free energy better than -15 kcal/mol using the energy minimization scheme in both the bound and unbound state structures. In both cases, the compound indoline (B9) was removed from consideration during the scaffold placement stage because its predicted binding free energy was greater than 0.0 kcal/mol. Repeating the design for indoline with a translational interval of 0.1 Å and rotational interval of 0.5 Å caused the scaffold to be retained and the minimization scheme predicted a binding free energy of -22.9 kcal/mol using the bound structure. For the compound quinoline (B41), no favorable placements were found in either case, even with a translational interval of 0.1 Å and a rotational interval of 0.25 Å. This molecule is the largest of the known binders and does not fit within either of the sites tested using tractable searches with this method. Unfortunately, the crystal structure with quinoline is unavailable for study. We made one deviation from the protocol described above. The small molecules dimethylamine (N16) and methylamine (N17) dominated the initial designs in the bound structure, with over 120,000 placements for dimethylamine and over 90,000 placements for methylamine. This compares with approximately 13,000 placements for the next most prevalent scaffold. This dominated the design without providing significant new information about the algorithm; these two compounds were thus analyzed separately with a translational interval of 0.5 Å.
As expected, the minimized energies were uniformly lower than the unminimized high resolution energies. The mean difference is 2.5 kcal/mol, but differs by as much as 8.6 kcal/mol in the case of indoline. This is a highly significant difference in this context and suggests that the minimization scheme may be important to properly rank potential binders. This is shown in Table IV and discussed later in this section. To illustrate the ability of the method to place molecules correctly, Figure 5 shows the predicted and experimental positions of 2,4-diaminopyridine (B6) bound within the site. The hydrogen bonding interactions are well predicted and the search and energy function clearly are able to identify good interactions. This undoubtedly contributes to the high scoring of the true binders.
These results will now be analyzed to determine the ability of the algorithm to discriminate binders from non-binders. This property is known as enrichment and is a highly desirable ability for any design approach. Recent work suggests that a good way to examine enrichment is with a receiver operating characteristic (ROC) curve plot.40 A ROC curve is calculated by first ranking all the predicted compounds in order of increasing binding free energy. Compounds are then selected in order from this list and the fraction of true positives is plotted against the fraction of false positives. Figure 6 shows ROC curves for all four scoring functions for the design in the bound structure. We chose to look at the bound structure as this reflects how our inverse design method would usually be used. However, the results for the unbound structure are very similar. The minimized high resolution scoring is the best, but selects the non-binder 4-aminosorcinol (N3) very early, causing worse enrichment scores of intermediate compounds. 4-aminosorcinol is discussed later in this section. Despite this, the minimization scheme is best able to partition the data into binders and non-binders. Using a cutoff of -16.5 kcal/mol, the predicted set of binders contains forty-two binders and only three non-binders, which is very good enrichment and compares favorably with previous studies.22 However, also consistent with other studies was an inability to reproduce the experimentally determined binding affinities. The set of known binders has been experimentally tested24,26 and covers a narrow range of affinities (Kd0.006 mM to 4.1 mM). Unfortunately, currently used scoring functions yield predicted affinities that do not correlate well with experimental affinity41 and this is also the case here. It is interesting to note that the energies predicted using the unbound structure of the CCP mutant (1CMQ) yield very similar results to the bound structure (2ANZ). The R2 correlation between the predictions is 0.895. In this case there are no major conformational changes upon binding and relatively little induced fit in the great majority of complexes. However, it is important that the design methodology is not sensitive to slight differences in structure.
We also assessed the importance of the SASA and desolvation terms by calculating ROC curves for the set of predicted binders and non-binders with lowest total energy excluding first the SASA term and then the desolvation term (data not shown). The enrichment without the SASA terms yields an identical ROC curve and the enrichment without the desolvation term slightly alters the order of some binders and non-binders but essentially has the same enrichment power. This is an interesting result as it suggests that these terms may be unnecessary in this particular case. However, it is expected that both SASA and desolvation terms would be significantly more important in analyzing molecules of differing sizes and charge distributions and thus must be included in this scoring function.
Previous computational protein design work employing a similar set of energy functions found that the predicted electrostatic contribution to binding affinity correlated better with the experimental binding free energies than did the predicted total interaction.15 The van der Waals contribution was actually misleading, enriching for bulkier groups at the expense of smaller but more electrostatically optimal groups. We evaluated whether this finding also applies in this case by plotting the enrichment using only the predicted electrostatic binding free energy (data not shown). However, in this case the total binding free energy provides a slightly better enrichment, although the electrostatic contribution alone is nearly as good. This may reflect the difference between analyses at large, solvated protein interfaces and within small, buried binding sites. Figure 7 shows a plot of the total electrostatic portion of the binding free energy against the total non-electrostatic portion. It is interesting to note that the total electrostatics free energy shows excellent discrimination between binders and non-binders. This finding contributes to an ongoing discussion about the relative importance of terms within physical energy functions, such as the one employed here. The question is still open, and requires experimental evidence and computational analysis to inform the debate.
We also tested the ability of the method to correctly predict the pose of each ligand in the binding site, by comparison of the energy minimized structure with the actual crystal structure. We chose to use two metrics for this purpose. The first is the root mean squared deviation (RMSD) of the heavy atom positions for the predicted and crystal structure atom positions. In all but one case (2ANZ), the crystal complexes are not the same as the structures used to generate the results. We thus aligned each structure with the test structure using the McLachlan algorithm42 as implemented in the program ProFit (http://www.bioinf.org.uk/software/profit/). The alignment was performed on only protein backbone atoms, but the entire protein and all heteroatoms, including the crystallographic ligands, were rotated. The rotated ligands could then be used as a reference to compare with the predictions made by the design method. The RMSD of the protein backbone atoms had an average of 0.26 Å with a maximum of 0.87 Å. We also used another metric which is perhaps more applicable to the methodology. We termed this the distance matrix similarity. This value was calculated by taking the crystal structure and flagging all contacts between protein and ligand atoms that are below a defined distance (5.0 Å in this case). The same pair distances were then measured for the predicted protein-ligand complex. The similarity was defined as the RMSD of these inter-atomic distances for all contacts. This metric assesses the similarity of the contacts made by the ligands and helps to avoid the problem of comparing predictions in non-native structures. Both metrics are given in Table II. The metrics are highly correlated, with a regression coefficient of 0.964, but there are a few deviations. The most significant is the case of benzimidazole (B40), where the similarity metric is lower than expected. Analysis shows that this is due to a major loop repositioning that accommodates the large ligand. A similar rearrangement may also be important in allowing quinoline (B41) to bind. The alignment of predicted structures and crystal structures for all thirty-five compounds with available data are shown in Figure 8. The majority of compounds were predicted well, with RMSD values below 2.0 Å. The RMSD results from inverse design were very similar to those obtained by docking in a previous study.22 However, in that case each ligand was docked into its own crystal structure, which is likely to improve results.
To examine the effect of sampling, we repeated the design with an increased translational interval of 0.5 Å. This led to a reduced run time for the algorithm but the results highlighted a problem. Unsurprisingly, indoline and quinoline were again removed at the scaffold placement stage; additionally, six known binders were present within the initial results but with poor energies (compounds B3, B15, B20, B33, B39, and B40 in Table IV). The predicted binding free energies were also predicted to be poor at both medium and high resolution. However, using the energy minimization scheme, these compounds recover improved interactions and then eventually score very well. Within a normal design, compounds predicted to be so poor at medium resolution may be eliminated and not be analyzed at the high resolution of detail. These six compounds could thus have been lost. This is again due to the large size of these molecules and thus difficulty in placing them within the binding site. Finer sampling of conformational space, while computationally more expensive, produces better recovery of known binders and is expected to provide better coverage in new discovery projects.
Despite the successes using the inverse design method, there are a common set of problems that led to less accurate structural predictions. These include ring flips that interchange carbon atoms with heteroatoms and neglect of structural waters that are relevant for some ligands. Compounds B12, B15, B25, and B31 illustrate the recovery of a ring-flipped version of the correct structure (Figure 8). Examination of these cases shows that the correct orientation is always a viable structure but commonly has a computed affinity that is 2-3 kcal/mol worse than the incorrect orientation. These energy differences arise mainly from van der Waals interactions. Analysis of these results suggests that sampling is not the problem and that it is likely the van der Waals potential, particularly involving sulfur, is responsible.
The structures for compounds B18 and B43 were both not well predicted due to the presence of structural water molecules observed in their co-crystal structure that were absent in the designs. Figure 9 shows the crystal structure and predicted positions for 1H-imidazol-2-ylmethanol (B18). The actual pose has three good hydrogen bonds, one with a structural water molecule. Our calculations did not include this water and the molecule makes one good and two poor hydrogen bonds, one which is made by the water in the true structure. The effect of structural water molecules on predicted ligand binding geometry has been recognized as an important issue in molecular design, along with the role of induced fit and binding site relaxation. We have recently developed a method for placing structural water molecules during the design process.43 When this methodology was applied to this case, the algorithm was able to find the correct placement for the ligand and the water molecule, as shown on the right of Figure 9. The predicted binding free energy is -22.8 kcal/mol compared to -21.5 kcal/mol when using the normal design scheme. The RMSD of the ligand is 0.55 Å compared to 2.82 Å with the normal design scheme. This shows that structural discrepancies can result not from a problem with the sampling scheme or scoring function directly, but due to a neglect of structural waters. This is not an uncommon problem, but one that can be solved with inverse design methods as shown here, albeit at some computational cost.
Having analyzed the results for how well the algorithm is able to discriminate binders from non-binders, it is useful to look at the cases when it makes errors. In each case, the lowest energy minimized score is taken as the prediction of the binding free energy. There are four non-binders we will analyze in this way. The molecules methylamine and dimethylamine are both predicted to bind very well using this method, due to favorable van der Waals and electrostatic interactions. The issue here is likely an inadequate desolvation penalty. Both molecules are highly charged and may well form strong hydrogen bonds with the solvent. The continuum electrostatic treatment used to compute the ligand electrostatic desolvation penalty may not accurately treat this effect. The errors for methylamine and dimethylamine could also be attributable to these relatively small molecules not adequately filling the active site and thus leaving vacuous space when bound. One other issue that arises is with the molecule toluene (N2), which is a known non-binder but is grouped with the binders based on its computed affinity of -18.2 kcal/mol. In this case the difficulty may be an assumption of the model, that binding takes the ligand from a fully solvated state to a bound state. Toluene is only very sparingly soluble in water (approximately 0.53 g/L) and is unlikely to be dissolved and dispersed in water. The prediction that toluene has a better interaction with the protein than with the solvent may well be correct, but it is still not a true binder and may adopt an incompletely solvated unbound state. The most problematic compound is the non-binder 4-aminosorcinol (N3). In its protonated form it is predicted to have a binding free energy of -26.2 kcal/mol. This is only 3 kcal/mol worse than the compound with the lowest predicted energy. The experimental binding affinities in the original study were determined at pH 4.5 and pKacalculations with Jaguar suggest that the nitrogen atom will be protonated below pH 6.1. However, we performed calculations on the unprotonated version of 4-aminosorcinol as a test. The predicted binding free energy in this case is -17.5 kcal/mol, still placing it in the category of predicted binders. This compound also appeared to cause consternation when it was first tested and was predicted then as a binder.26 A likely explanation for the problem with 4-aminosorcinol is a poorly predicted desolvation penalty. It is known that atomic charges for amines calculated by commonly employed quantum mechanical (QM) methods44,45 can lead to estimations for hydration free energies that vary as much as 5.0 kcal/mol from experiment37 or by as much as 8.4 kcal/mol for charged amines and anilinium ions.46 One of the three incorrectly predicted non-binders in this test set contains a charged amine group and one contains an anilinium ion and two phenol groups. Predictions of hydration free energies are also poor for phenols37. Recently reparameterised force-fields such as OPLS-AA47 and GROMOS48,49 have proven more effective at dealing with solvation calculations for neutral amines. We thus made predictions of the electrostatic portion of the desolvation energy for ammonium (NH4+), methylammonium (MMA+), dimethylammonium (DMA+), trimethylammonium (TMA+), anilinium, and phenol using partial atomic charges from Gaussian,36 PARSE,34 and OPLS-AA.47 The calculations were performed using one solvated calculation and one vacuum calculation with a salt concentration of 0.145 M. The experimental desolvation energies were taken from a previous study.50 The desolvation cavity terms were calculated by multiplying the molecular surface area calculated using CHARMM by the empirical constant 0.005 kcal/mol/Å2.51 The results in Table V show that the Gaussian scheme used in this work (highlighted in italics) is not especially accurate at predicting the solvation energies for phenol or charged amines, and this is particularly true for anilinium. These problems are compounded for the molecule 4-aminosorcinol, which might be construed as consisting of two phenols and one anilinium and could thus be underpredicted by over 13 kcal/mol. The OPLS-AA charges perform better than the Gaussian charges but are still inaccurate for the anilinium ion. The PARSE charges do poorly at dielectric 4 but very well at dielectric 2, though they were parameterized at dielectric 2. The Gaussian scheme also performs better at dielectric 2. This result is initially promising but unfortunately does not solve the problem. When the calculations on 4-aminosorcinol were repeated using a dielectric value of 2, the results predict very similar binding free energies (-26.2 kcal/mol compared with -27.2 kcal/mol at dielectric 4) due to the balance between desolvation penalty and the electrostatic interaction.
A number of difficulties clearly remain with the design methodology, as four false positives are predicted amongst the set of binders. The issue of solubility can be addressed relatively easily by estimating the water solubility of each predicted binder. There are many accurate techniques designed for this purpose.52 This issue of correctly predicting desolvation penalties for phenols, charges amines and anilinium ions is more problematic and cannot be easily solved with our current methodology. In future studies, it would be beneficial to flag such troublesome groups and fix the problem with a more empirical method. However, that is a non-physical solution and this study again highlights the need for more accurate modeling of these species, for improved scoring functions and greater enrichment in drug discovery.
Despite problems with a few individual compounds, the majority of predictions are excellent. It is thus useful to look at the compounds designed for which experimental data is not present. Defining the cutoff for binding as -16.5 kcal/mol, there are 17 new compounds predicted (Table VI). Many of these predictions seem sensible, in that they exhibit similarity with known binders. Compounds K and N contain three and four nonmethyl rotatable bonds, which might decrease the affinity of these compounds. Cyclopentane (Q) is also predicted as a binder due to favorable van der Waals and hydrophobic interactions. However, like toluene, it may have solubility issues that diminish its binding in water. Compound P contains two phenol groups and based on what we have learned in this study, we might perform further analysis on this molecule before fully classifying it as a predicted binder. Compounds H, L, and M are all 2-thiolimidazole derivatives and are unlikely to be protonated at pH 4.5 and are thus likely to be non-binders, like their parent compound. This highlights the importance of pKa values and pH of binding assays. This is another problem that can be addressed relatively easily by analyzing the suggested binders using post-processing techniques to predict protonation states.52,53 However, care must be taken when making these predictions, as protonation states are known to change upon binding54 and this may markedly affect affinity. Despite these issues, the remaining eleven compounds contain no protonated amines, phenol groups, or anilinium groups, which have been problematic, and are excellent predictions for true binders. This analysis highlights some of the lessons we have learned in this study and how they can be used to eliminate false positives.
We have recently developed a molecular design approach that employs the DEE and the A* algorithms to design potential inhibitors for drug discovery. The aim of this work was to investigate the performance of the algorithm by assessing its ability to score and rank a set of known binders and non-binders. The extensive experimental data available for this system allowed us to identify cases where the algorithm performed well and cases where it performed less well, with the aim of making further improvements. The study highlighted a number of issues with the process of separating binders from non-binders.
The first issue is with sampling. It is clear that to capture compounds that bind well, they must be placed favorably within the binding site. The number of true positives increased with increased sampling and to capture all the known binders required a translational interval of 0.25 Å. In cases such as dimethylamine and methylamine where this generated too many placements to feasibly search, new techniques will be required to prune the solutions found to a reasonable size, whilst retaining enough to ensure good coverage of the space.
The second issue involves subtleties with the scoring function used. The low, medium, and high resolution functions all do a good job of predicting binding proclivity, but minimization is often able to find significantly better poses in many cases. Our results show that a combination of a fine level of sampling and a hierarchical energy function can yield excellent results. For the work here, the minimization scheme led to further improvement. Using this, we were able to select a set of the top scoring compounds, which included forty-two of forty-three binders and only three non-binders. This is an excellent enrichment.
The third issue is with the scoring of charged amines, anilinium ions, and phenols. Previous studies have shown that it is difficult to correctly model the solvation of these molecules and to predict their affinity for this site. Most analyses conclude that the problem is due to incorrectly estimated partial charges and that this is a result of the failure of commonly used QM methods in this particular case. Calculations have shown that the errors in solvation can be miscalculated by up to 8.4 kcal/mol for charged amines. This is a large enough error to account for the misclassification of dimethylamine and 4-aminosorcinol as binders.
The study also highlights the importance of two other issues that have been identified by others previously. Firstly, it may be important to make an estimate of the solubility for any predictions made. A compound may have a good predicted binding energy and truly fit the steric and electrostatic requirements of the site but be too insoluble in water to actually qualify as a binder. In this case, toluene (N2) may be misclassified for this reason. Secondly, it is important to ensure that molecules are assigned the correct protonation states. The binding free energy of a neutral molecule and its charged state may be markedly different. This is shown here computationally by the difference between predicted binding energies for 4-aminosorcinol (-17.5 kcal/mol) and its protonated state (-26.2 kcal/mol). The problem is also highlighted by the prediction of protonated analogues of 2-thiolimidazole as binders. 2-thiolimidazole is not protonated and is a non-binder. However the design creates it by combining imidazole, which is protonated, with a thiol side chain.
The design method detailed in this paper also performs well when compared to other methods tested using this site. Direct comparison of enrichment is not possible due to differences in approach, but the enrichment is very good, as it was in the previous study26. Recovery of the crystal structure pose yields predictions which are very similar to previous studies, with an average RMSD of 1.09 Å compared with 1.04 Å25,26 and 1.06 Å.26 However, it should be noted that the approach used here uses a discrete rather than continuous search and is tailored for enrichment rather than structural recovery.
From the results of this study, it can be concluded that the approach used for searching conformational space and for compound scoring is generally very good and can provide excellent enrichment, despite problems with a few individual compounds. It is encouraging that inverse design methods can perform as well as forward design methods in enriching for true binders and recapitulating known binding modes.
We thank Brian Shoichet for making the experimental data freely available and Catherine Drennan for helpful discussions. The work was partially supported by the National Institutes of Health (GM065418 and GM066524).