3.1. Model of the relationship between NRC and RMSD
The convergence of the GMDH RR criterion was achieved at the complexity level of three. The best regression equation obtained for Eq. 3
is depicted in Eq. 5
is the number of rotatable bonds; nnara
is the number of atoms in nonaromatic rings; NRC0
is the actual number of conformers generated at RMSD0
=2.0 Å; and NRC is the total number of conformers produced at RMSD.
For Eq. 5
, the training set yielded an R2
of 0.92 and an RSE of 0.56 with N being 3,414 and the test set yielded an R2
of 0.91 and an RSE of 0.57 with N being 3,352. The variable nha
was not included in the equation by the GMDH procedure, presumably because the NRC0
variable already encodes information about the size of the molecule. The variable nr
is primary in determining the influence of the molecule's degrees of freedom on the NRC value. The measure of pseudo-rotors, nnara
, adds a significant contribution due to the flexibility of ring systems.
After the form of the model was determined, coefficients were estimated more precisely using the combined training and test sets, yielding Eq. 6
with an R2
of 0.92 and an RSE of 0.57, where N is 6,766.
Examination of Eq. 6
suggests an “effective number of rotors” relationship, in that five non-aromatic ring atoms are roughly equivalent to a single rotatable bond. Re-running the regression using nreffective
, where nreffective=nr+nnara/5,
in place of the parameters nr
in Eq. 6
yields Eq. 7
with an R2
of 0.92, an RSE of 0.56, and N being 6,766.
displays the predicted versus actual values of ln(NRC) for Eq. 7
for the combined training and test sets. The distributions of number of heavy atoms and of effective rotors for the combined set () demonstrate the diversity of size and flexibility of the compounds used for model development.
Figure 1 Plot of predicted versus actual ln(NRC) using the model in Eq. 7. Each dot represents one of the 6,766 structures in the combined training and test sets. The solid line depicts an ideal fit of predicted to actual values.
Distributions of the number of heavy atoms and effective rotors for the combined training and test sets.
To demonstrate how Eq. 7
performs, Figs. through provide predicted and actual numbers of conformations, depicted as ln(NRC), as a function of RMSD for typical chemical structures.
Figure 3 Comparison of ln(NRC), actual and predicted, using Eq. 7, as a function of RMSD for PubChem CID 2941020, depicted above, which contains 32 non-hydrogen atoms, nine rotors, and six “non-aromatic” ring atoms. Effective rotors: 10.2.
Figure 5 Comparison of ln(NRC), actual and predicted, using Eq. 7, as a function of RMSD for PubChem CID 520865, depicted above, which contains 11 non-hydrogen atoms, four rotors, and zero “non-aromatic” ring atoms. Effective rotors: 4.0.
As expected, the NRC value grows faster or slower depending on the molecule's flexibility as a function of RMSD, with the rate of change of NRC depending mostly on the number of effective rotors. One can see, for example, that for molecules with four effective rotors, the ln(NRC) value increases much slower than for molecules with 5.4 or 10.2 effective rotors. The molecule with the PubChem Compound identifier (CID) 1212923 (33 heavy atoms) is nearly the same size as CID 2941020 (32 heavy atoms). These two molecules have three (ln(NRC0)=1.1) and six (ln(NRC0)=1.79) conformers at RMSD=2.0Å, respectively. However, due to their different number of effective rotors, 5.4 for CID 1212923 versus 10.2 for CID 2941020, already at RMSD=0.6Å the respective NRC values are 121 (ln(NRC)=4.8) and 1,536 (ln(NRC) = 7.33) conformers.
Applying the model in Eq. 7
to the MMDB data set, we use prediction intervals instead of single point predictions, with the prediction intervals for NRC being calculated according to Eq. 8
where ln (NRC)calc
is a value calculated by Eq.7
is the residual standard error of the regression; n
is the number of degrees of freedom equal to that used for the RSE calculation; t
is the critical value for the Student's distribution; and α is the significance level. In this study, we use α=0.1 to obtain a 90% prediction interval. For Eq. 7
=0.56 and t0.1;6763
=1.645, therefore, the 90% prediction interval is:
3.2. Analysis of NRC and coverage for MMDB dataset
The diversity of the MMDB dataset in terms of molecular size and flexibility is shown in .
Distributions of the number of heavy atoms and effective rotors for the 14,504 MMDB ligands.
shows the distribution of RMSDX-ray values for the 14,504 MMDB ligands versus RMSD resolution used.
Figure 7 Upper graph: Plot of the RMSD resolution of all 14,504 conformer models in the MMDB data set versus the smallest RMSD from the experimentally derived 3-D “biologically active” conformer to a single conformer in the conformer model. Each (more ...)
The graphs suggest a statistical dependence between the average RMSD of X-ray structure reproduction and the average ensemble size across the range of ensemble RMSD resolution of coverage. The correlation between average ln(NRC) and average RMSDX-ray is −0.98. Therefore, conformational coverage as represented by the ensemble size is deemed to be a determining factor in providing reproduction of the X-ray structure. also shows that, despite the average RMSDX-ray being lower than the RMSD value, in many cases RMSDX-ray is greater than RMSD. In these cases, the conformational space coverage is insufficient in the region of the X-ray conformation.
shows the success rates of X-ray structure reproduction for the MMDB ligands falling into the indicated effective rotor ranges. It also shows prediction intervals of the average ln(NRC) using Eq. 9
and the actual average ln(NRC) values. The numbers of ligands for which the averages were calculated are shown in .
Figure 8 Curves: actual average ln(NRC) values. Intervals: prediction intervals of the average ln(NRC). Bars: success rates of X-ray structure reproduction. The dotted horizontal line marks the 80% threshold of X-ray structure reproduction. Each individual panel, (more ...)
Count of structures in the MMDB dataset binned by effective rotor range and RMSD valuea
For compounds with 0.2-12 effective rotors, the graphs in show that the predictions are of reasonable accuracy, given that the average actual ln(NRC) data lie within the prediction interval. However, for compounds with 12.2 – 20 effective rotors, the predicted NRC values significantly exceed the actual NRC values at low RMSD resolutions, as indicated by the average actual ln(NRC) being below the prediction interval. Furthermore, this effect appears to correspond to a significant decrease in the reproducibility of X-ray structures within generated conformer ensembles. These major differences can be explained by limitations placed on and within the Omega application, relative to the maximum number of conformers in ensembles. In addition, this may suggest that one may need to consider different parameter settings when attempting to generate high quality ensembles at low RMSD values for flexible compounds.
The Omega parameters, MAXCONFS=10,000 and NUMCONFS=KEEPCONFS=25, used in this study allow for a maximum of 250,000 conformers (ln(NRC)=12.43). However, this maximal total conformer value is an upper limit and can only be reached: if all initial 25 conformers generated by random geometry are unique; if there is no overlap with respect to minimum RMSD spacing of conformations between the conformations produced during torsion driving of the initial 25 conformers; and if the torsion driving step reaches the maximum of 10,000 produced conformations.
For highly flexible molecules, including those with more than 12.0 effective rotors, it is possible that the 10,000 limit on the maximum number of conformers during torsion driving is reached. For example, among compounds in the range of 12 - 20 effective rotors, there are many dinucleotides, such as flavinadenine dinucleotide (FAD). These molecules have several hundred conformers at an RMSD=2.0 Å and, with decreasing RMSD value, their NRC grows extremely fast, almost certainly hitting the 10,000 conformation numeric limitation at the lower RMSD values examined in this study. Hitting the maximum conformer limit during torsion driving implies a truncation of the explored conformational space, likely creating gaps in coverage, which decrease the reproducibility of X-ray structures within generated conformer ensembles and decrease the NRC.
Conformational exploration is inherently combinatoric with respect to the number of degrees of freedom. It is conceivable that a limit of 100 000, 1 000 000, or even a much larger value of maximum conformations will still be inadequate as an upper limit of the number of conformations produced during torsion driving. Consider the earlier example of FAD that has thirteen rotatable bonds. If only three torsion values are considered for each rotatable bond, this yields 313 = 1,594,323 potential conformations to be explored. A simplistic algorithmic approach that iterates over the allowed torsion angles will rapidly hit a “reasonable” conformation limit for molecules like FAD. More advanced approaches and techniques are clearly needed to rigorously explore the expansive conformational space of highly flexible molecules.
The results in the graphs of clearly demonstrate the strong effect of compound flexibility on the reproducibility of bound ligand X-ray structures, with respect to RMSD. If one sets a statistical threshold of an 80% success rate (four out of five) of reproduction for a given RMSD value (indicated in the graphs by a dotted line), then X-ray conformation reproduction will be successful at virtually any RMSD value for compounds in the flexibility range of zero to four effective rotors using the methodology employed in this study. A transition occurs at 4.2-6.0 effective rotors where the success rate dips below 80% for the lowest RMSD value studied, 0.6 Å. As the compound flexibility further increases, the success rate of X-ray conformation reproducibility becomes progressively worse, even at more “moderate” RMSD values of 1.0 Å or greater. The apparent systematic deterioration in the ability to routinely reproduce the X-ray conformation as a function of increasing flexibility and decreasing RMSD may suggest an aggregate lower bound RMSD resolution possible for a molecule of given flexibility due to the methodology employed within Omega. Considering that this trend is consistent across the full range of flexibility and our choice of parameters does not restrict the conformation exploration of moderately flexible structures by Omega, this is likely the result of a combination of insufficient sampling of the torsion angles used in the Omega ring and torsion knowledge base and the use of an inappropriate or inadequate forcefield, suboptimal energy thresholds, or other inherent additive or accumulative systematic errors specific to the implementation of Omega.
shows the prediction intervals for the average NRC as a function of effective rotor range for the entire MMDB dataset. The italicized values show the progressive breakdown in the ability to routinely reproduce the X-ray structure as a function of RMSD and of flexibility, i.e., where conformational coverage “holes” at a particular RMSD resolution may be an issue. The values in italics and boldface indicate where prediction overestimates actual NRC, presumably due to Omega conformation production limits being reached, discussed previously.
Prediction intervals for average NRC as a function of RMSD and effective rotor range for the MMDB dataseta
As shown in , the conformational space of less flexible molecules is small and may be represented by a relatively small number of conformers for RMSD values all the way down to 0.6 Å. Obviously, the lower bound for any estimated average NRC is 1, which is indeed found for all RMSD values for the least flexible compounds (nreffective = 0.2 – 2). This implies that, in many cases, a single conformer may be sufficient to reproduce the biologically active structure of the least flexible molecules within the highest resolution of coverage studied.
In contrast, the conformational space of more flexible molecules becomes large quickly, especially at smaller RMSD values. The NRC prediction intervals given in provide only conservative estimates of the actual interval size, considering Eq. 7
is not based on the results of an exhaustive conformational search. However, even these estimates show that, to represent the conformational space of flexible molecules with high resolution of coverage, a large number of conformers may need to be generated. For those italicized domains in where the success rate of X-ray structure reproduction was low, NRC values reported are obviously too low for regular coverage. In other words, larger numbers of conformations should be expected to achieve regular coverage at those resolutions.
Interestingly, can help estimate the accuracy of bioactive structure reproduction expected if one uses conformer ensembles of a particular size. For example, using ensembles of up to 500 conformers, one might expect accuracy of reproduction for compounds to be: 0.6 Å with 0.2 - 8 effective rotors; 0.8 Å with 8.2 - 10 effective rotors; ≥1.0Å with 10.2-12 effective rotors; ≥1.2Å with 12.2 – 14 effective rotors; and ≥1.4Å with more than 14 effective rotors.
These accuracy estimates are in line with results by Kirchmair, et al.25
Analyzing X-ray structure reproduction of 778 high-resolution PDB ligands by two different conformer generators, Omega and Catalyst26
, these authors demonstrated that ensembles of a size of 50-500 conformers provide an average RMSDX-ray
of <0.7Å for compounds with less than 3 rotors; 0.6 – 0.8Å for compounds with 3 - 5 rotors; 0.9 – 1.2Å for compounds with 6 – 8 rotors, 1.1 - 1.6Å for compounds with 9-11 rotors; 1.3 - 1.7Å for compounds with 12 - 14 rotors and 1.6 - 2.3 Å for compounds with more than 14 rotors.
In a number of studies, authors have used ensembles limited to 50-500 conformers25,27-29
or used RMS deviation from X-ray structures averaged over all investigated compounds, regardless of flexibility, as a criterion of the ensemble quality27-28,30
. We would contend that this is a questionable practice. Achieving low RMSDX-ray
, for example, 0.6 Å, using small-sized ensembles is trivial for inflexible structures and practically unreachable, other than by sheer luck, for compounds with more than eight effective rotors. By the same token, our findings may seem to contradict earlier conclusions31-32
that a small number of conformations are always sufficient to represent conformational spaces of drug-like molecules. As this work shows, this is only true if the minimum RMSD spacing between conformations is increased as a function of flexibility, implying a decrease in the degree of accuracy. We emphasize that the sufficiency of the conformational ensemble size very much depends on the accuracy required of the ensemble.
Our results suggest that, for the assessment of the quality/effectiveness of conformational models generated using different methodologies and applications, it may be useful to compare these models with conformer ensembles providing regular coverage of the entire conformational space of molecules, i.e., those produced by an “ideal” systematic search where there are no “holes” in conformational coverage. A methodology that would allow a significant decrease of the number of conformers without any loss of the bioactive structure reproduction accuracy should be considered to be effective in the above sense. For example, it can certainly be argued that use of an energetic threshold may significantly decrease the number of conformers in a regularly-spaced ensemble without any loss in the X-ray structure reproduction accuracy by eliminating biologically irrelevant parts of the conformational space.