1. Molecular size and flexibility of the PDB ligands
The size and flexibility of a molecule are important factors affecting the conformer model for a molecule. While the molecular size is approximated by the number of non-hydrogen atoms in the molecule, the molecular flexibility can be expressed in terms of the number of effective rotors [20
], which is given as the following:
is the number of effective rotors, NR
is the number of rotatable bonds, and NNARA
is the number of "non-aromatic" sp3
-hybridized ring atoms. Note that the value of NER
is not necessarily an integer, but, in this study, is frequently rounded to the nearest whole number. Effective rotors take into account the flexibility of rings as well as rotatable bonds. Figure shows the distributions of the non-hydrogen atom counts and the effective rotor counts (binned by whole numbers) for the experimental structures in the Molecular Modeling Database (MMDB) [21
] ligand dataset as downloaded from the PubChem Substance database (the MMDB contains only experimentally determined data found in the PDB). On average, the molecules in the MMDB ligand set have 17.1 non-hydrogen atoms and 4.9 effective rotors. Although molecules with up to 50 non-hydrogen atoms are considered in the present study, ~90% of them have 30 or less non-hydrogen atoms. In addition, as shown in panel (b) of Figure , molecules with greater than 16 effective rotors rarely occur, due to limiting the MMDB dataset to 15 rotatable bonds.
Distribution of (a) the non-hydrogen atom count and (b) the effective rotor count, binned by whole numbers, for the 25,972 experimentally determined structures in the MMDB ligand data set.
2. Parameter validation for conformer generation
] was used in the present study. It is known to be among the fastest and most accurate conformer generation programs [17
] available. In addition to high quality, it was the only commercially available program that had a non-windows-only C++ application programming interface (API) at the time of project initiation, a critical consideration given the computing environments at the National Center for Biotechnology Information (NCBI). A brief overview of the conformer generation algorithm of OMEGA is given in the Materials and Methods
section. A more detailed explanation can be found elsewhere [11
OMEGA has many adjustable parameters to generate 3-D conformations with particular attributes. Some important ones are listed in Table . To find an optimal set of values, the effects of primary parameters upon conformer generation were tested: (1) the fragment sampling rate for determination of fragment conformation, (2) the type of molecular force field used for the model building and torsion search, and (3) the size of the energy window that determines the energy range of conformers generated. As detailed in the Materials and Methods section, only a maximum of 100,000 conformations were considered for a given molecule, meaning that the conformer space of some chemical structures was not fully explored due to this "100-k limit." Therefore, the occurrence of such cases was considered while testing optimal values of parameters.
OMEGA parameters modified or non-default during conformer generation
In addition to the non-hydrogen atom pair-wise root-mean-square distance (henceforth termed simply RMSD), the Shape-Tanimoto (ST) value was also used as measure of the accuracy of the conformer models. The ST value between any two molecules A and B is given by the following equation:
are the self-overlap volume of A and B, respectively, and VAB
is the overlap volume between A and B [24
]. Note that the ST score is a molecular similarity measure ranging from 0 (for no similarity) to 1 (for identical molecules), whereas the RMSD value is a molecular dissimilarity measure ranging from 0 (for identical molecules) to infinity. Among all theoretical 3-D conformers generated for a given molecule, the one with the least RMSD and the one with the greatest ST to the experimental 3-D coordinates were considered to be the most similar to the "bioactive" conformation. Therefore, the accuracy of a conformer model generated by OMEGA was evaluated using the minimum RMSD and the maximum ST values between the experimental conformation and a single theoretical 3-D conformation in the conformer model. In further discussion, the RMSD and ST values for a conformer model
indicates the RMSD and ST values between the corresponding experimental structure and the most similar theoretical conformer in the conformer model
2.1. Effects of fragment sampling rate
Fragment sampling in the OMEGA model building stage helps to ensure that flexible ring systems are appropriately examined to find all unique ring conformations. In OMEGA, a float-point value for the "-startfact
" option determines the fragment sampling rate, using the following equation [23
where Nsamples is the number of samples, and Σ(nrb = 3) and Σ(nrb = 4) are the sums of the number of atoms in the molecule that have three and four ring bonds, respectively. To investigate effects of the fragment sampling rate upon conformer generation, the conformer models generated using the default value of 20.0 for the "-startfact" option were compared with those generated using the value of 50.0, and the results were summarized in Table . In general, using the value of 50.0 rather than the default (= 20.0) was found not to have any significant effect on the overall average RMSD and ST values between the computationally generated conformers and the experimentally determined structures. There were also no significant effects of the increased sampling rate upon the number of conformers generated and the 100-k limit counts. A similar insensitiveness to the fragment sampling rate was observed in Figure , which shows the average RMSD as a function of the non-hydrogen atom count and the binned effective rotor count. Thus, the default fragment sampling rate was deemed to be sufficient and the default value of 20.0 was used in the remainder of this study.
The average root-mean-square distance (RMSD), the average Shape-Tanimoto (ST), the average number of conformers per chemical structure, and the count of the 100-k limit cases for conformer models for all 25,972 3-D reference structures
Figure 2 Effects of the fragment sampling rate upon the average RMSD as a function of (a) non-hydrogen atom count and (b) effective rotor count. The MMFF94s_NoEstat and MMFF94s_Trunc force fields were used for the model building and torsion search stages, respectively, (more ...) 2.2. Effects of force field choice
OMEGA has several pre-defined molecular force fields and it is possible to choose different force fields for model building and torsion search, using the "-buildff
" and "-searchff
" options, respectively. Three force fields were tested: (1) the 94s variant of the Merck molecular force field (MMFF94s
], (2) MMFF94s without coulombic interaction terms (MMFF94s
), and (3) MMFF94s without coulombic interaction terms and without the attractive part of the van der Waals interaction terms (MMFF94s
). The latter two force-field variants attempt to remove a perceived bias towards conformers that lower their energy by intra-molecular
interactions, which are assumed to not be significant when making inter-molecular
interactions. This consideration is critical as conformer generation is performed in vacuum, which is a very different environment than a protein binding pocket. Varying the force-field terms allows this hypothesis to be tested as improved accuracy should be found if intra-molecular interactions are removed.
Table shows the effects of the force field employed upon the overall average RMSD and ST of the conformer models at increasing energy window values. The type of fragment force field used during the model building stage caused minor variations (less than 0.01) in the average RMSD of conformer models for energy windows 5 and 10 kcal/mol and these disappear entirely at 15 kcal/mol, indicating an overall insensitivity to the type of fragment force field used. Relative to the earlier hypothesis, this suggests that the fragments produced during model building phase were too small to have intra-molecular interactions. On the contrary, at the energy window of 5, 10, and 15 kcal/mol, the overall average RMSD of the conformer model generated using MMFF94s_NoEstat for the torsion search force-field was smaller by 0.21, 0.10, and 0.06, respectively, than those that used MMFF94s_Full, indicating that exclusion of electrostatic terms from the MMFF94s_Full increased the overall average accuracy of the conformer models significantly, but less so as energy window increased. However, almost no perceptible changes in RMSD or ST averages were found upon the additional removal of van der Waals attractive terms, as shown by nearly imperceptible changes in MMFF94s_NoEstat and MMFF94s_Trunc results.
Overall effects of the choice of the force field and energy window upon the average RMSD and ST values between the computationally generated conformations and the experimental conformations for the 25,972 3-D reference structures
One potential explanation for the higher accuracy of the conformer models generated using the MMFF94s_NoEstat and MMFF94s_Trunc force fields arises from their ability to provide more conformations than MMFF94s in the same energy window threshold. The MMFF94s_Full force field includes additional terms that can lower the energy of conformations with intra-molecular interactions. Removal of such force field terms can increase the energy of conformers with intra-molecular interactions, allowing conformers without these interactions to be represented in a conformer model. This explanation is consistent with the data in Table , which shows that MMFF94s_NoEstat and MMFF94s_Trunc produced significantly more conformers per compound on average than MMFF94s_Full. The increased number of conformers per molecule conceptually gives a better chance to have a conformer close to an experimentally determined structure, resulting in the smaller RMSD and greater ST values in Table . Because of this, however, the MMFF94s_NoEstat and MMFF94_Trunc were also found to more frequently result in 100-k limit cases. Considering that each 100-k limit case suggests a truncation of energetically possible conformations, their substantial frequency increase (by about a factor of five) as a function of increasing energy window may play a role in the reduction of RMSD differences between MMFF94s_Full force field and MMFF94s_NoEstat variants.
The overall average conformer count per molecule and 100-k limit count for different force field types and energy window sizes for the 25,972 3-D reference structures
Another potential explanation for the superiority of MMFF94s_NoEstat and MMFF94_Trunc over the MMFF94s_Full force field is that the lack of intra-molecular interaction is an important characteristic of a biologically relevant conformation of a small-molecule ligand found in its complex with the protein. When a small-molecule ligand binds to its target protein, the intra-molecular interaction in the ligand molecule that exists in its unbound state is likely to disappear because of a conformational change which enhances inter-molecular interaction between the molecule and the target protein. Regardless of the exact reason why, employing the MMFF94s_NoEstat for conformer model generation appears to be a more sensible choice than the MMFF94s_Full due to the accuracy improvement.
Overall, it appears that removal of electrostatic terms has a rather favorable effect on improving overall accuracy of reproduction of experimental bound ligand geometries. Removal of attractive van der Waals terms does not appear to have any significant effect. As such, the remainder of this study will only consider the MMFF94s_NoEstat force field variant.
2.3. Effects of energy window
In the model building step, the energy window, specified with the "-ewindow" flag, primarily limits the ring conformations allowed in the torsion search step. If the strain energy of a ring conformation is greater than the least-energy ring conformation plus the value of the energy window, then the ring conformation is discarded. When this "-ewindow" parameter is used in the torsion search step, the energy window determines the maximum energy range of conformers, relative to the global-minimum conformer. Figure shows the effect of energy windows on the overall average accuracy of the conformer models generated. When both the model building and torsion search energy windows were 1 kcal/mol, the overall average RMSD and ST of the conformer model generated were 0.69 and 0.885, respectively. On the other hand, the use of an energy window for both stages of 30 kcal/mol resulted in a substantially improved overall average RMSD of 0.39 and an ST of 0.945. This indicates that a larger energy window gives more accurate conformer models, consistent with the data in Table . The increased energy window allows more conformational diversity of a molecule to be considered, but also results in more conformers per molecule (and more 100-k limit cases), as shown in Table . Note in Figure that a rapid near-convergence in the overall average RMSD and ST occurs at the energy windows of 10 and 15 kcal/mol for model building and torsion search, respectively. The overall average RMSD and ST at these energy windows were 0.40 and 0.944, respectively. Employing bigger energy windows provided only small improvements to the overall accuracy of the conformer models. Therefore, when looking at overall average results, it initially appears reasonable to use an energy window of 10 kcal/mol for model building and 15 kcal/mol for torsion search without significant reduction in overall accuracy.
The overall average RMSD and Shape-Tanimoto (ST) values of the conformer models for all 25,972 structures as a function of the model building energy window and the torsion search energy window (in kcal/mol).
3. Accuracy of conformer models and 100-k limit cases
Figures and show the average RMSD and the average ST values of the conformer models, respectively, as a function of the non-hydrogen atom count and the effective rotor count for different energy window values. An increase in the non-hydrogen atom count and the effective rotor count causes a linear increase in the RMSD and a linear decrease in the ST values, indicating that the accuracy of conformer model decreases as a function of both the molecular size and flexibility.
Figure 4 Average RMSD values for all 25,972 3D reference structures as a function of non-hydrogen atom count [(a) and (b)] and effective rotor count [(c) and (d)]. In Panels (b) and (d), the 100-k limit cases, in which the number of conformers reached the maximum (more ...)
Figure 5 Average Shape-Tanimoto (ST) values for all 25,972 3D reference structures as a function of non-hydrogen atom count [(a) and (b)] and effective rotor count [(c) and (d)]. In Panels (b) and (d), the 100-k limit cases, in which the number of conformers reached (more ...)
The conformer models that reach the 100-k limit cases may exclude important conformational diversity due to the truncated description of the molecule. Indeed, as Figures and show better accuracy of reproduction when truncated conformer models are excluded. Figure shows effects of the 100-k limit cases upon the distributions of the non-hydrogen atom count and the effective rotor counts, and Figure displays effects of the 100-k limit cases upon the average number of conformers for a molecule as a function of energy window. As one can see in Figure , the 100-k limit cases begin appearing for molecules with moderate size and flexibility (e.g., with ~15 non-hydrogen atoms and ~7 effective rotors). As a molecule becomes bigger and more flexible, the OMEGA conformer generation hits the 100-k limit more frequently. Therefore, removal of these cases from the dataset of 25,972 3-D reference structures leaves only a relatively small number of "non-100-k cases" for >30 non-hydrogen atoms and >10 effective rotors, causing a greater variability in the average RMSD and ST in these regions of panels (b) and (d) of Figures and . Despite the increased variability, one can still see a slightly noticeable trend of improved accuracy in panel (d) of Figures and . One can also see in all panels of Figures and that the energy window for torsion search of 25 kcal/mol gives a clear improvement over 15 kcal/mol for larger values of non-hydrogen atoms and, to a lesser extent, effective rotors. As such, for the remainder of the study 25 kcal/mol will be used for both fragment model and torsion search.
Figure 6 The effect of the 100-k limit cases upon the distributions of (1) the non-hydrogen atom count and (2) the effective rotor counts. The MMFF94s_NoEstat force field and the energy window of 5 kcal/mol were used for both the model building and torsion search (more ...)
Figure 7 The average number of conformers for a molecule as a function of the effective rotor count. The first two numbers in the legend box correspond to the energy-window sizes used in the model building (first) and torsion search (second). While the solid data (more ...)
4. Prediction of RMSD accuracy of the conformer ensemble
So far, we have found ways to get the best average RMSD and ST accuracy using OMEGA as the conformer generator. Now we are faced with the problem of data reduction, as it is not practical to store or use all conformers produced when considering millions of chemical structures. Naturally, one would like to maximally reduce the conformer count without significantly sacrificing accuracy, e.g., by a minimum RMSD separation between conformers. The lower the separation RMSD used, the greater the count of conformers that must be kept. Conversely, too large of an RMSD separation may reduce the ensemble accuracy. In an ideal world, one would know a priori precisely which conformers are needed and discard the rest. In the real world, this is not possible to know; however, if one can reliably predict the RMSD accuracy of a conformer model, then one can devise an RMSD separation value that could be used as a sampling threshold with some statistical assurances that the sampled conformer model should have at least one conformer within the sampling RMSD some significantly large percentage of the time.
With the aim to find a way to derive an RMSD sampling threshold for a given chemical structure, Figures and show that the average RMSD accuracy of the theoretical conformer ensembles has a very linear correlation with both the non-hydrogen atom count and the effective rotor count. Therefore, a linear regression analysis was performed to derive an equation that predicts the RMSD accuracy of the conformer models using just the NNHA and the NER, yielding Eq. (4).
where RMSDpred is the predicted RMSD of a theoretical conformer ensemble for a molecule, and NNHA and NER are its non-hydrogen atom and effective rotor counts, respectively. The R2 value of the regression equation for all 25,972 chemical structures was 0.65 and the standard deviation of RMSDpred was 0.19. The correlation between the RMSDpred and the actual RMSD (RMSDactual) is shown in Figure .
Figure 8 Comparison of the predicted and actual RMSD of the theoretical conformer models for the 25,972 experimentally determined structures. While the predicted RMSDs in panel (a) were computed using a single equation [Eq. (4)] for all 25,972 structures, those (more ...)
By design, Eq. (4) overestimates the RMSD value for half of the experimental structures and underestimates the other half. We consider it acceptable to use an RMSD sampling value where 90% of conformer ensembles have the same or lesser RMSD accuracy value on average. This is simply Eq. (4) to which we add the first standard deviation value of 0.19 to yield Eq. (5).
To highlight this, we plotted in Figure the distribution of the difference between RMSDactual and RMSDpred using Eq. (5), binned in 0.1 increments. Figure shows that Eq. (5) yields an RMSDpred that is greater than or equal to RMSDactual more than 90% of the time.
Figure 9 Frequency distribution of the RMSD differences between the actual RMSD and the predicted RMSD values for the conformer models of the 25,972 experimental structures, binned in 0.1 increments using Eq. (5) for panel (a) and Eqs. (8) and (9) for panel (b). (more ...)
As shown in Figures and , the average RMSD increases and the average ST decreases as a function of both NER and NNHA, regardless of the inclusion of 100-K limit cases. Given how the variability increases as chemical structures become larger and more flexible, potentially due, in part, to their lower populations, it may be helpful to partition data into separate groups according to their NER and NNHA values and perform separate regression analyses. A regression analysis of the 22,587 structures with NER≤10 and NNHA≤35 yields Eq. (6), while a regression analysis of the 3,385 structures with NER>10 or NNHA>35 yields Eq. (7):
The R2 values of the regression formula Eqs. (6) and (7) were 0.52 and 0.33, respectively, and the standard deviation of RMSDpred was 0.17 and 0.29 for Eqs. (6) and (7), respectively. (Attempts to partition NER and NNHA values using different partitioning schemes had similar R2 results.) In the same manner as used to derive Eq. (5), one can add these first standard deviations to Eqs. (6) and (7) to get RMSD sampling formulas Eqs. (8) and (9), respectively, for the two individual groups. As shown in Figure , the RMSDpred values from Eqs. (6) and (7) are comparable with those from Eq. (4), despite the poor R2 values for Eqs. (6) and (7). As shown in Figure , where the RMSD sampling values from Eq. (5) are compared with those from Eqs. (8) and (9), the frequency distributions of the difference between RMSDpred and RMSDactual are similar to each other. The RMSDpred value is greater than or equal to RMSDactual for ~91.0% of the time when Eq. (5) is used, and for ~91.6% of the time when Eqs. (8) and (9) are used.
In the recent study of Hawkins et al.
], the quality of theoretical conformations from OMEGA was evaluated by comparing a set of 197 high-quality PDB ligand structures with corresponding OMEGA-generated theoretical conformers. They used the MMFF94_Trunc force field to generate conformers and then reduced their count to a maximum of 200 by sampling. The mean RMSD between the 197 ligand set and their corresponding theoretical conformers was 0.67 Å. According to their bootstrapping tests, the OMEGA-generated conformers are expected 90% of the time to have an RMSD value between 0.647-0.688 Å for chemical structures with similar properties to those tested. The average rotatable bond count and non-hydrogen atom count of this 197 ligand set were 6.3 and 24.4, respectively, indicating that they are, on average, slightly bigger and flexible than the 25,972 set used in our study (4.9 rotatable bonds and 17.1 non-hydrogen atoms on average). With these average counts as the NNHA
values in Eq. (5), RMSDpred
is predicted to be 0.71 Å, which is comparable to 0.67 Å, the mean RMSD between the 197 ligand set and the corresponding theoretical conformers. Considering that the OMEGA parameters used in both studies were not exactly identical and that we used their reported rotatable bond count rather than effective rotor count, this may suggest that Eq (5) may have applicability beyond the parameter sets used in this study.