|Home | About | Journals | Submit | Contact Us | Français|
PubChem is a free and publicly available resource containing substance descriptions and their associated biological activity information. PubChem3D is an extension to PubChem containing computationally-derived three-dimensional (3-D) structures of small molecules. All the tools and services that are a part of PubChem3D rely upon the quality of the 3-D conformer models. Construction of the conformer models currently available in PubChem3D involves a clustering stage to sample the conformational space spanned by the molecule. While this stage allows one to downsize the conformer models to more manageable size, it may result in a loss of the ability to reproduce experimentally determined “bioactive” conformations, for example, found for PDB ligands. This study examines the extent of this accuracy loss and considers its effect on the 3-D similarity analysis of molecules.
The conformer models consisting of up to 100,000 conformers per compound were generated for 47,123 small molecules whose structures were experimentally determined, and the conformers in each conformer model were clustered to reduce the size of the conformer model to a maximum of 500 conformers per molecule. The accuracy of the conformer models before and after clustering was evaluated using five different measures: root-mean-square distance (RMSD), shape-optimized shape-Tanimoto (STST-opt) and combo-Tanimoto (ComboTST-opt), and color-optimized color-Tanimoto (CTCT-opt) and combo-Tanimoto (ComboTCT-opt). On average, the effect of clustering decreased the conformer model accuracy, increasing the conformer ensemble’s RMSD to the bioactive conformer (by 0.18 ± 0.12Å), and decreasing the STST-opt, ComboTST-opt, CTCT-opt, and ComboTCT-opt scores (by 0.04 ± 0.03, 0.16 ± 0.09, 0.09 ± 0.05, and 0.15 ± 0.09, respectively).
This study shows the RMSD accuracy performance of the PubChem3D conformer models is operating as designed. In addition, the effect of PubChem3D sampling on 3-D similarity measures shows that there is a linear degradation of average accuracy with respect to molecular size and flexibility. Generally speaking, one can likely expect the worst-case minimum accuracy of 90% or more of the PubChem3D ensembles to be 0.75, 1.09, 0.43, and 1.13, in terms of STST-opt, ComboTST-opt, CTCT-opt, and ComboTCT-opt, respectively. This expected accuracy improves linearly as the molecule becomes smaller or less flexible.
The advent of combinatorial chemistry and high-throughput screening technology has made it possible to perform a rapid test of biological activity on a vast number of small molecules, generating a massive amount of biological activity data. While this explosion of information presents scientists with great opportunities to facilitate the identification of potential drug candidates and chemical probes, its benefit is enhanced when this data is combined with that of the others and made available to all. Dissemination of such information requires a public repository that collects and stores the heterogeneous data from various contributors. An example of such a repository is PubChem [1-4] (http://pubchem.ncbi.nlm.nih.gov), launched in 2004 as a component of the Molecular Libraries Roadmap Initiatives of the U.S. National Institutes of Health. PubChem archives biological activity screening data and other information from diverse data sources and offers its contents free of charge to the biomedical research community, facilitating research that benefits human health.
PubChem consists of three primary databases: Substance, Compound, and BioAssay. The PubChem Substance database contains sample descriptions provided by individual depositors and the PubChem Compound database contains the unique standardized chemical structure contents extracted from the PubChem Substance database. The PubChem BioAssay database contains biological assay descriptions and results. As of June 2012, PubChem contains more than 92 million substance descriptions, 32 million unique small molecules, 600 thousand biological assays, and 170 million biological assay outcomes (each outcome is a set of results from a substance being tested in an assay). PubChem provides search, analysis, and download tools for the efficient use of this vast amount of chemical information. Many of these tools exploit the concept of molecular similarity at some level. One method in which PubChem evaluates chemical similarity between two molecules is to use a two-dimensional (2-D) dictionary-based fingerprint  and the Tanimoto equation [6,7]:
where A and B are the respective counts of the set binary fingerprint bits for the two molecules and AB is the count of set bits in common to both molecules. Because the 2-D molecular similarity computation is very fast (typically at a rate of one million pair-wise comparisons per second per CPU core), it is appropriate for searching a large database like PubChem. However, there are many diverse chemical structures with similar biological efficacies against targets available in PubChem that can be difficult to interrelate using traditional 2-D similarity methods [8-11]. To assist in biological activity analysis of these molecules, a new layer called PubChem3D [8-15] was added to PubChem.
PubChem3D generates a 3-D conformer model description for each record in the PubChem Compound database, when it satisfies the following conditions : (1) not too large (with 50 or fewer non-hydrogen atoms); (2) not too flexible (with no more than 15 rotatable bonds); (3) has only a single covalent unit (i.e., not a salt or mixture); (4) consists of only supported elements (H, C, N, O, F, Si, P, S, Cl, Br, and I); (5) contains only atom-types recognized by the Merck Molecular Force Field (MMFF94s) [16,17]; and (6) five or fewer undefined atom (R,S) and bond (E,Z) stereo centers. This 3-D description can be employed to enhance existing PubChem search and analysis methodologies by means of 3-D similarity , helping the user identify useful structure-activity relationships that might go unrecognized by the PubChem 2-D similarity method. A diverse conformer ordering  gives a maximal description of the conformational space of a molecule when only a subset of available conformers is used. A pre-computed search per compound record gives immediate access to a set of 3-D similar compounds (called “Similar Conformers” ) in PubChem and their respective superpositions, augmenting the complementary “Similar Compounds” relationship, computed using the PubChem 2-D similarity method. Systematic augmentation of PubChem resources to include a 3-D layer provides users with new capabilities to search, subset, visualize, analyze, and download data .
All the tools and services in PubChem3D rely upon the quality and applicability of the computationally-derived 3-D conformer models of small molecules. Considering the size of PubChem, all these conformer models by necessity must be pre-computed and stored to allow the user “real-time” access to identify structurally similar conformers and to analyze biological activity patterns. Among many different conformer generation programs that exist [18-24], PubChem3D uses the OMEGA C++ toolkit [25-28] to generate conformer ensembles. In our previous study , an optimal set of adjustable parameters were determined to maximize the “accuracy” of OMEGA (i.e., the ability to reproduce experimentally-determined “bioactive” conformations, for example, found in protein-ligand complexes). Using experimentally determined structures of 25,972 small-molecule ligands found in the Protein Databank (PDB) , the effects of parameter values used in OMEGA upon the root-mean-square distance (RMSD) between the computationally-derived conformer models and their experimentally-determined bioactive conformations were analyzed in terms of the non-hydrogen (heavy) atom count (NNHA) and effective rotor count (NER), as measures of molecular size and flexibility, respectively . Note that NER is given by the following equation and takes into account molecular flexibility due to rotatable bonds and ring flexibility:
where NR is the number of rotatable bonds, and NNARA is the number of non-aromatic sp3-hybridized ring atoms. The root-mean-square distance (RMSD) accuracy of the computationally-derived conformer models was found to strongly depend on molecular size and flexibility, leading to the following formula  that estimates the worst-case RMSD accuracy of nearly all the conformer models using only the values of NNHA and NER:
where RMSDpred is the predicted upper limit of the RMSD accuracy to ensure at least 90% of conformer models generated by OMEGA using the selected PubChem parameter set for the 25,972 PDB ligands had at least one “bioactive” conformer whose RMSD distance from the experimentally determined conformation was closer than the value predicted using Equation (3).
Using this accuracy-calibrated OMEGA parameter set, PubChem3D generates up to 100,000 conformers for each chemical structure stereo configuration. However, it is still not feasible to store all the conformers in a database and use them in a very efficient way. Therefore, the conformers in each ensemble are sampled through clustering with the RMSDpred value, after rounding to the nearest 0.2 increment as defined in the following equation,
where “int( )” gives the whole number, irrespective of any remaining fraction and where this RMSD threshold is referred to as RMSDcluster to emphasize its usage for clustering purposes (rather than accuracy prediction). Each sampled conformer represents a cluster containing all conformers within the designated RMSD threshold, thus reducing the count of conformers per conformer model. If the conformer model after cluster sampling has more than a maximum of 500 conformers, it is re-clustered using an RMSDcluster value incremented by a further 0.2. This process is repeated as many times as necessary to reduce the overall conformer count to 500 or less. Although this clustering process makes the conformer models more manageable in size and better suited for a large database such as PubChem, it may be accompanied with an undesirable loss of overall accuracy of the conformer model. Therefore, in the present study, we investigated the effect of the conformer model clustering upon the accuracy of the conformer models as a follow-up to our previous study  in order to address key questions as to the performance of PubChem 3D sampled conformers to reproduce “bioactive” ligand geometries: as a function of molecular size and flexibility, with respect to the established PubChem3D similarity measures, and with an eye towards their expected performance relative to biological activity data analysis.
This study considers 47,123 small molecules with experimental 3-D coordinates available from the Molecular Modeling Database (MMDB)  deposition in PubChem (Additional file 1). The molecular connectivity of these MMDB ligands is derived from the 3-D coordinates of the protein-bound small molecules taken from PDB  records. Note that the “experimental” structures of small molecules in PDB are known to, at times, have non-trivial issues or uncertainty concerning their precise chemical identity, protein binding geometry, or crystal structure location [13,32-36]. The present study largely ignores such potential issues and considers all the 3-D ligand structures as experimental facts.
The effects of conformer ensemble clustering upon the accuracy of the conformer ensemble were analyzed as a function of NNHA and NR (as measures of molecular size and flexibility, respectively). Additionally, NER [Equation (2)] was also employed to represent molecular flexibility. Although the value of NER is not necessarily an integer, it was rounded to the nearest integer in the present study. Figure 1 shows the distributions of the values of NNHA, NR, and NER for the experimental structures for the 47,123 MMDB ligands considered. On average, the structures had 18.3 ± 10.5 non-hydrogen (heavy) atoms, 4.7 ± 3.5 rotatable bonds, and 5.4 ± 3.7 effective rotors. Approximately 90% of the ligands had less than 31 non-hydrogen atoms, 9 rotatable bonds, and 10 effective rotors.
PubChem3D generates a maximum of 100,000 conformers per compound stereo configuration for efficiency considerations. Reaching this “100-K” limit suggests a loss in conformational space considered, possibly resulting in less accurate conformer models . As shown in Figure 2, the fraction of the molecules hitting the 100-K limit increases rapidly as the molecule becomes larger and more flexible. This suggests that, beyond 25 heavy atoms and six rotatable bonds, exploration of conformational space in some PubChem3D conformer models may be truncated due to this limitation.
After generating conformers for each molecule, PubChem3D samples the conformers in each ensemble, using an RMSD threshold determined according to Equation (3) and Equation (4). Figure 3 shows the distribution of the RMSDcluster values used to cluster the conformers in the conformer ensemble for the 47,123 MMDB ligands considered in the present study. The RMSDcluster values range from 0.4Å to 2.2Å (in discrete 0.2Å increments), with an average and standard deviation of 0.75Å ± 0.33Å. Approximately 85% of the 47,123 PDB ligands have an RMSDcluster value of ≤ 1.0Å.
The distributions of the RMSD “accuracy” of the resul-ting conformer ensembles to the experimental PDB ligand geometries are shown in Figure 4 and their average and standard deviations are summarized in Table 1. It is important to note that, for the purpose of this study, the RMSD “accuracy” value of a computationally-derived conformer ensemble to the experimental “bioactive” conformation is defined as the single best (i.e., least) non-hydrogen atom-pairwise RMSD between the experimentally determined 3-D conformation PDB ligand and a 3-D conformer in the ensemble. This RMSD “accuracy” should not be confused with the clustering RMSD (RMSDcluster).
As expected, the PubChem3D conformer sampling procedure results in a loss of the conformer ensemble RMSD accuracy relative to experiment. On average, this overall loss is 0.18Å (from 0.39Å to 0.57Å). The standard deviation of this average also increases by 0.12Å (from 0.24Å to 0.36Å) and may reflect the rounding of RMSDcluster to the nearest 0.2 increment, potentially suggesting the ± 0.1 nature of such a change. In the aggregate, 90% of all conformer models in this study reflect RMSD accuracies better than 1.1Å after clustering.
In the study by Hawkins et al., an RMSD value of 1.25Å or less was employed as the definition of a “close” reproduction of the experimental conformation. They also pointed out that an RMSD of 2.0Å could have been used as a cut-off because it is a common upper bound for successful reproduction of an experimental structure in molecular docking. With these criteria in mind, the after-clustering conformer models in the present study may be considered to be of high quality, although the choice of the cut-off for “close” reproduction of the experimental structure is still arbitrary and subjective.
Figure 5 illustrates the percentage of the conformer models with accuracy better than the RMSDcluster value (i.e., with an RMSD accuracy value less than RMSDcluster) before and after the conformer sampling procedure. For comparison purposes, those conformer models with the accuracy better than RMSDcluster+0.1Å are also included to show the effects of the rounding of RMSDpred to the nearest 0.2 (i.e., RMSDcluster). It is important for two reasons to note that more than 90% of the conformer models before clustering have an RMSD accuracy better than RMSDcluster. Firstly, this shows that, although the majority of conformer models with more than 25 heavy atoms and six rotatable bonds hit the 100-K limit in conformer generation as shown in Figure 2, there is no significant adverse effect on the conformer model accuracy beyond that already reflected by Equation (4). Secondly, this indicates the resilience of OMEGA to generate biologically relevant conformers, favoring a breadth-first exploration of conformer space as a function of energy threshold (i.e., considering low-energy conformational spaces first), thus ensuring an even coverage of conformer space up to the PubChem3D 100-K conformer limit in the conformer generation phase. For the remaining cases where the conformer models before clustering are not more accurate than the RMSDcluster threshold, OMEGA does not even come close to reproducing the experimental geometry using the PubChem3D choice of parameters, considering that an increase of 0.1Å from RMSDcluster does not find many additional “missed” pre-clustered conformer mo-dels, as Figure 5 shows. Whereas potential reasons for this are numerous, one can likely attribute it to: improper perception of atom hybridization or charge state from PDB atom coordinates in the MMDB deposition (leading to an inaccurate chemical structure, as PDB records tend not to include hydrogen atom or bond order information); some combination of errors, uncertainty, or omissions in the PDB ligand information (as mentioned previously); or a general inability of OMEGA to reproduce some “bioactive” 3-D chemical structure configurations.
As shown in Figure 5, after clustering, the fraction of the conformer models with accuracy better than RMSDcluster decreased by no more than 12% in general, except for RMSDcluster=1.6Å (34%). When the conformer model accuracy was predicted in a more conservative way using the limit RMSDcluster + 0.1Å (rather than RMSDcluster), the difference between the conformer models with accuracy better than this limit before and after clustering was no more than 6%, except for RMSDcluster=1.6Å (23%), sho-wing that the realized sampling effects are local in nature. It appears that most of the structures with decreased conformer model accuracy at RMSDcluster=1.6Å are simply due to an unfortunate culmination of pronounced partition-based clustering edge-effects for a set of flexible di- and tri-phosphate containing structures. In other words, for these particular computationally-derived conformer models, a conformation most similar to the experimental structures happened to be near the boundaries of the clusters generated with the RMSDcluster=1.6Å, and therefore, they were not included in the conformer models after the clustering procedure. As a result, clustering a given conformer model using a different conformer ordering with the same clustering procedure could have yielded results closer to the pre-clustering result.
Figure 6 illustrates the cumulative % distribution of the RMSD accuracy of the conformer models for each discrete value of RMSDcluster. As mentioned above, the RMSDcluster value determination using Equation (3) was intended to ensure that 90% of conformer models have an RMSD accuracy below RMSDclusterbefore sampling; however, the RMSD accuracy after sampling to ensure 90% of conformers are found is expected to be within the range of RMSDcluster ± 0.1, when considering the effects of the rounding of RMSDcluster to the nearest increment of 0.2 [as performed in Equation (4)]. If one looks across the 90% line in panel (a) of Figure 6, the RMSD accuracies of 90% of the conformer models before clustering are smaller than RMSDcluster for the entire range in general, with almost no difference at RMSDcluster=1.4Å. The 90% levels of the after-clustering RMSD accuracies [in panel (b) of Figure 6], are within the expected range in general, except for RMSDcluster=1.4Å and 1.6Å, where the RMSD accuracy for 90% of the conformer models is not reached until 1.6Å and 1.8Å, respectively.
One readily notices for each RMSDcluster value in Figure 6 that conformer model clustering shifts the cumulative % distribution curves toward the right-hand side, indicating a decrease in the conformer model accuracy as a result of the PubChem sampling procedure. Looking at the 90% level of conformer models before and after clustering, there are some variances in the change of the conformer model accuracy, depending on the RMSDcluster value. For example, the difference between the RMSD accuracies at the 90% level before and after clustering with RMSDcluster of 0.4Å and 0.6Å is 0.1Å (0.25Å vs. 0.35Å for RMSDcluster=0.4Å, and 0.5Å vs. 0.6Å for RMSDcluster=0.6Å). However, for the RMSDcluster values between 0.6Å and 1.6Å, the corresponding differences range between 0.2Å and 0.3Å. In general, it is very reassuring to see that most of these conformer models at the 90% level are within the expected range for most RMSDcluster values. Although sampling by its very nature will increase the distance between conformers, this increase does not appear to severely impact the accuracy of the conformer models in PubChem.
Evaluation of the conformer model accuracy using RMSD is an intuitive and convenient choice, as the conformer model clustering in PubChem3D uses an RMSD value as a clustering threshold; however, in practice, PubChem3D primarily uses three measures in 3-D similarity comparison between molecules: shape-Tanimoto (ST), color-Tanimoto (CT), and combo-Tanimoto (ComboT) [37-40]. Therefore, the present study also employed PubChem3D similarity measures as additional conformer model accuracy measures. The ST [37-40] similarity measure, which quantifies the shape similarity between molecules, is defined as the following equation:
where VAA and VBB are respective self-overlap volume of the two molecules, and VAB is the overlap volume between the two molecules. The CT [37,38] similarity measure, on the other hand, evaluates the pharmacophore feature similarity between molecules, by comparing the 3-D orientation of fictitious atoms (also called feature atoms) representing six functional group types (hydrogen-bond donors, hydrogen-bond acceptors, cations, anions, hydrophobes, and rings) by means of the equation:
where the index “f” is one of the six functional-group types, VAAf and VBBf are the self-overlap volume for the functional group type “f” of the two molecules, respectively, and VABf is the overlap volume for the functional group type “f” between the molecules. The ComboT [37,38] similarity measure, which is defined as the arithmetic sum of the ST and CT scores, allows one to consider the two different similarities simultaneously. Because both the ST and CT scores range from 0 (for no similarity) to 1 (for identical molecules), the ComboT score ranges from 0 to 2 (without normalization, due to pre-existing convention).
The present study used two different approaches to compute these three 3-D similarity scores: the shape-optimized (or ST-optimized) approach and feature-optimized (or CT-optimized) approach. In the shape-optimized approach, the superposition of two molecules is optimized to have a maximum ST score and then the CT score is computed in that shape-optimized alignment. In the feature-optimized approach, the color and shape of the two conformers will be considered simultaneously to find the best superposition between them, as in the current version of ROCS . In the present paper, the shape-optimized and feature-optimized methods are denoted using the superscripts “ST-opt” and “CT-opt”, respectively. As a result, there are six different 3-D similarity scores (i.e., STST-opt, CTST-opt, ComboTST-opt, STCT-opt, CTCT-opt, and ComboTCT-opt). Along with RMSD, four of these six scores are used to analyze the accuracy of the clustered conformer models relative to the experimentally determined 3-D geometries: STST-opt, ComboTST-opt, CTCT-opt, and ComboTCT-opt.
As shown so far in this study, the conformer sampling procedure decreases ensemble accuracy to reproduce experimentally determined ligand geometries, resulting in an increase in the RMSD values. This loss in accuracy is also seen for the four 3-D similarity values as shown in Table 1. On average, whereas the clustering increases the RMSD value of the conformer ensemble by 0.18 ± 0.12Å, it decreases the STST-opt, ComboTST-opt, CTCT-opt, and ComboTCT-opt scores by 0.04 ± 0.03, 0.16 ± 0.09, 0.09 ± 0.05, and 0.15 ± 0.09, respectively. Although the ComboTCT-opt values are (in the aggregate) slightly greater than the ComboTST-opt values, the decreases of the two similarity measures upon clustering are nearly identical in magnitude to each other, suggesting a general insensitiveness of the ComboT scores to the optimization type. Perhaps more interesting is the relatively small change in the STST-opt average of 0.04, whereas the CTCT-opt average difference is more than twice as large, indicating a much greater sensitivity of CTCT-opt to clustering. This is not surprising, as shape is less discriminating than features (e.g., with a nitrogen atom and carbon atom being nearly identical from a shape perspective but completely different in their capability to make intermolecular interactions). As shown in Figure 7, after the conformer sampling procedure, 90% of all the conformer models had accuracies better than 0.75, 1.09, 0.43, and 1.13, in terms of STST-opt, ComboTST-opt, CTCT-opt, and ComboTCT-opt, respectively.
Figures 8, ,99, ,1010, ,11,11, and and1212 show the five different measures of the conformer ensemble accuracy used (i.e., RMSD, STST-opt, ComboTST-opt, CTCT-opt, and ComboTCT-opt) as a function of molecular size and flexibility. [A further breakdown of RMSD and STST-opt values as a function of molecular size and flexibility and the correlation between RMSD and STST-opt can be found in Additional file 2: Figures S1-S7]. The linear nature of these curves demonstrates a clear association of the average PubChem3D conformer model accuracy with molecular size and flexibility. Least-squares fitting to the form of “y=a+bx” for each data series in the plots from panel (d) in Figures 8, ,99, ,1010, ,11,11, and and1212 is summarized in Table 2. The least-squares fitting was also performed for the other data series in panels (a-c) of Figures 8, ,99, ,1010, ,11,11, and and12,12, but reported in Additional file 3 for brevity. As the RMSDcluster value (as well as NNHA, NR and NER) increases, all five conformer model accuracies before and after clustering linearly changed. With the notable exception of NR, all R2 values for these fits were greater than 0.90. In the case of NR, not taking into account the flexibility of rings reduces the R2 values to as low as 0.78. (In fact, the primary motivation of the development of NER was to account for “noise” in linear fits of NR such as these, to properly account for molecules that are effectively more flexible than their rotatable bond count would suggest.) While all the RMSD and STST-opt average accuracy measures did linearly correlate with the RMSDcluster value (Figures 8 and and9).9). However, for the ComboTST-opt, CTCT-opt, and ComboTCT-opt, the diffe-rence between before and after clustering accuracy values did not always linearly correlate with the RMSDcluster value [namely: Figure 10, panels (a,d); Figure 11 panels (a-d); and Figure 12 panels (a,d)]. In the case of CTCT-opt, the average differences appear to plateau just below 0.2, suggesting that there may be some maximum error as a result of the PubChem conformer clustering procedure. Echoes of this appear to be present in the difference statistics for any 3-D similarity measure that involves the CT measure (i.e., ComboTST-opt, CTCT-opt, and ComboTCT-opt).
Taking this all into account, in general, conformer model clustering increases the conformer RMSD value and decreases the four 3-D Tanimoto values, indicating the reduced accuracy of the conformer ensemble due to conformer clustering. The difference between the ensemble accuracies before and after clustering increases with the values of NNHA, NR, and NER, implying that the effects of conformer clustering become more noticeable in bigger and more flexible molecules, which is expected considering that the RMSDcluster value gets larger [Equation (3)]. As compared in Figures 9 and and11,11, the average conformer CTCT-opt accuracy values show a larger decrease upon clustering than the average conformer STST-opt values, meaning that the conformer CTCT-opt values are more sensitive to clustering than the conformer STST-opt values. However, conformer clustering decreases the average ComboTST-opt and average ComboTCT-opt values (Figures 10 and and12)12) in a similar amount, again showing the insensitiveness of the ComboT value to the optimization type. A similar insensitiveness of the ComboT value to the optimization type was also observed in our previous studies [9,11], in which the distribution of the ComboTST-opt scores between randomly selected conformers were found very similar to that of the ComboTCT-opt.
What does this all mean? The average loss of accuracy of PubChem3D conformer ensembles behaves in a predictable fashion, even after sampling, as a function of molecular size and flexibility across PubChem3D similarity measures. There is a linear degradation of accuracy to reproduce bioactive conformers both before and after sampling procedures. In general, there is a modest amount of degradation of accuracy to reproduce bioactivity as a part of this sampling procedure. Generally speaking, one expects the worst-case minimum accuracy of 90% of the PubChem3D ensembles to be (as stated previously from Figure 7) 0.75, 1.09, 0.43, and 1.13, in terms of STST-opt, ComboTST-opt, CTCT-opt, and ComboTCT-opt, respectively. This expected minimum accuracy improves linearly as the molecule becomes smaller or less flexible.
One may ask “how good or how bad are these worst-case minimum accuracies?” To answer this question, it is necessary to determine an appropriate cut-off value for a “close” reproduction of the experimental structure, and our recent study , which studied the statistical significance of the ROCS-based similarity scores, provides some clues on an appropriate choice of the cut-off values. In this study , the ROCS-based 3-D similarity scores between randomly-selected biologically-tested compounds were computed, and from the distribution of these scores, conversion tables were generated which convert a ROCS-based similarity score to the p-value of getting that particular score by randomly selecting two biologically-tested conformers. According to these conversion tables, the p-value of getting a similarity score equal to the worse-case minimum accuracy by selecting two random conformers is 0.019, 0.002, 0.003, and 0.002 for STST-opt, ComboTST-opt, CTCT-opt, and ComboTCT-opt, respectively. If the significance level (α) of 0.05 is employed, these p-values are small enough to reject the null hypothesis of getting a particular 3-D similarity score by chance. Although this interpretation also depends on the significance level one may choose, it is still true that these worst-case minimum accuracies of the conformer models (0.75, 1.09, 0.43, and 1.13, for STST-opt, ComboTST-opt, CTCT-opt, and ComboTCT-opt, respectively) are much greater than one may expect from randomly selected conformer pairs (0.54 ± 0.10, 0.62 ± 0.13, 0.18 ± 0.06, and 0.59 ± 0.14, for STST-opt, ComboTST-opt, CTCT-opt, and ComboTCT-opt, respectively) [9,11], implying structural similarity between the conformer model and the experimental structure. Also note that this interpretation is consistent with the fact that the 90% of the conformer models considered in this study have RMSD accuracies better than 1.1Å, which is much tighter than the common upper bound (RMSD 2.0Å) for successful reproduction of an experimental conformation in molecular docking, as mentioned above.
When it comes to biological activity data analysis, the present study shows that there will be a definitive upper limit to the PubChem3D conformer ensemble accuracy based on the molecular size and flexibility. While the results of the present study consider all sampled conformers, PubChem3D search and analysis tools use a diverse subset of sampled conformers, where the diverse subset selection criterion is the ComboTST-opt dissimilarity. [The reason for using the ComboT dissimilarity is that it considers both the ST and CT dissimilarity simultaneously. While the choice of the optimization type is somewhat arbitrary, our previous studies [9,11] have shown that the ComboT score is not very sensitive to the optimization type in the aggregate.] The effects of using a diverse set of sampled conformers will likely further decrease performance beyond that reported in this study. In addition, one can expect that, as the desired 3-D Tanimoto threshold increases in a given biological activity analysis, the ability to interrelate larger and more flexible molecules will decrease, not because they necessarily lack common biologically accessible conformer space, but because of the inherent similarity distance between the stored sampled conformers. This analysis also suggests that the use of a single “one-size-fits-all” similarity Tanimoto threshold for PubChem3D molecules may not be an ideal choice for conformer models sampled at different RMSD values. The results from this study suggest that conformer sampling may exacerbate the molecular size/flexibility dependency already present in conformer generation software . Smaller and less flexible molecules in PubChem3D will have a tighter conformational sampling (with a smaller spacing between conformers) than larger and more flexible molecules, and therefore, can interrelate more molecules at a given Tanimoto value. In addition, smaller and less flexible molecules will have fewer sampled conformers in their respective conformer ensemble and will likely have less of a reduction in accuracy due to the use of a diverse subset. As a result, a search using a smaller and less flexible molecule as a query is likely to return more 3-D similar molecules than a search using a larger and more flexible query molecule. Furthermore, even if a large or flexible molecule is used as a 3-D similarity query, an increasing proportion of returned results are likely to be smaller or less flexible as the Tanimoto value is increased. This potential bias towards conformer models with smaller sampling distances may be worth further consideration and study to develop a more reliable 3-D similarity-based biological activity analysis method.
Like any experimentally-derived measurements, the crystal structures stored in PDB have uncertainties in their atomic coordinates, and the interpretation of the accuracy of a computationally-derived conformer model should take into account the positional uncertainty of the corresponding experimental ligand structure. For example, if the positional uncertainty in the experimental structure is greater than the RMSD value of the conformer model, comparison between the experimental and theoretical ligand structures are not particularly meaningful. The average positional errors in atoms in a crystal structure can be evaluated with the diffraction-component precision index (DPI) [41,42], which can be approximated as proposed by Blow , using information commonly contained in the header of a PDB file. In the study by Hawkins et al., the crystal structures with the DPI of<0.42Å were considered to be precise enough for the use as a standard dataset for validation of conformer generators, and in this way, the conformer models with the RMSD value of>0.6Å (= √2 × DPI ) were guaranteed to be meaningful predictions.
Although the present study did not focus on potential issues concerning the experimental uncertainties of PDB structures [13,32-36], it is still valuable to test the conformer model accuracy against a set of highly-reliable experimental structures. Therefore, a set of 157 high-quality ligand molecules (Additional file 4) was constructed from a set of 197 PDB structures recommended in the study by Hawkins et al.  (see the Methods section). The distribution of the RMSD and ComboTCT-opt accuracies for these 157 structures are shown in Figure 13, and their average and median values are compared in Table 3, with those from the 197 molecules considered in the study by Hawkins et al.  [The ComboTCT-opt accuracy is identical to the Tanimoto Combo in their study]. As shown in Table 3, when going from the 47,123-ligand set to 157-ligand set, the average RMSD value of the conformer models increased by 0.08Å (from 0.57Å to 0.65Å) and the average ComboTCT-opt accuracy decreased by 0.06 (from 1.61 to 1.55). These differences do not seem very meaningful, considering the standard deviations for the RMSD and ComboTCT-opt accuracies of the two sets.
Note that the average RMSD value of the 157-ligand set differed only by 0.02 from that of the 197-ligand set from the study of Hawkins et al. (0.65Å vs. 0.67Å) . The difference in the ComboTCT-opt accuracy between the two sets were 0.01 (1.55 and 1.56 for the 157- and 197-ligand sets, respectively). Considering that our study used OMEGA parameters different from those used in their study, the conformer model accuracies from the two studies do not seem very different.
In the present study, conformer ensembles for 47,123 PDB ligand molecules from MMDB were computationally generated using the PubChem3D approach. The accuracy of reproduction of the conformer models was investigated in comparison to the experimentally-derived structures as a function of the RMSD and the PubChem3D similarity scores (i.e., STST-opt, ComboTST-opt, CTCT-opt, and ComboTCT-opt). The PubChem3D conformer sampling procedure increased the RMSD value of the conformer ensemble by 0.18 ± 0.12Å, and decreased the accuracy of the STST-opt, ComboTST-opt, CTCT-opt, and ComboTCT-opt accuracies by 0.04 ± 0.03, 0.16 ± 0.09, 0.09 ± 0.05, and 0.15 ± 0.09, respectively (see Table 1), indicating a decrease in the conformer ensemble accuracy in general. For all five accuracy measures (RMSD, STST-opt, ComboTST-opt, CTCT-opt, and ComboTCT-opt), the conformer model accuracies before and after clustering linearly decreased with the increase in the RMSDcluster value (as well as NNHA, NR and NER), with R2 values to fit these curves greater than 0.91 (see Figures 8, ,99, ,1010, ,1111 and and1212 and Table 2).
Whereas the change in the CTCT-opt accuracy (0.09 ± 0.05) upon clustering was much greater than the STST-opt average difference (0.04 ± 0.03), the ComboTST-opt and ComboTCT-opt changes had similar average and standard deviations (0.16 ± 0.09 vs. 0.15 ± 0.09). This implies that, in general, while the CTCT-opt accuracy is more sensitive to the clustering than the STST-opt accuracy, the effect of the clustering upon the ComboT accuracy is not sensitive to the optimization type. Similarly, while the rate of the decrease of the STST-opt accuracy with the increase in molecular size and flexibility was much slower than that of the CTCT-opt accuracy (Figure 9vs. Figure 11), the ComboTST-opt and ComboTCT-opt accuracies decreased at a similar rate (Figure 10 vs. Figure 12).
This study shows that there is a definitive limit in the ability of the PubChem3D sampled conformer models to reproduce the bioactive conformations found in PDB ligands. This study also suggests that larger and more flexible molecules may be less able to interrelate with other larger and more flexible molecules at a given Tanimoto value than smaller and less flexible molecules do. [This is also supported by our recent study  on the PubChem 3-D neighbors. The PubChem 3-D neighbors (also known as “similar conformers”) are defined as any two compounds that are structurally similar (with STST-opt ≥ 0.8 and CTST-opt ≥ 0.5), and it was found that compounds without 3-D neighbors occur more frequently among larger compounds than among smaller compounds. In addition, smaller molecules tend to have more 3-D neighbors than larger molecules]. As a result, one may want to consider such effects when performing a 3-D similarity search or 3-D biological activity data analysis. Specifically in the case of 90% of the PubChem3D conformer models, in general, one can expect the worst-case minimum accuracy to be 0.75, 1.09, 0.43, and 1.13, in terms of STST-opt, ComboTST-opt, CTCT-opt, and ComboTCT-opt, respectively (see Figure 7). These values are expected to linearly improve as the molecules considered become smaller and less flexible. In addition, these values may become worse if a diverse subset of sampled conformers is used.
The experimental 3-D structures of small molecules were downloaded from the Molecular Modeling Database (MMDB) ligand dataset [45,46] as available from the PubChem Substance database at the National Center for Biotechnology Information (NCBI) (as of July 1, 2010). Ligands too small or too big were discarded by limiting the non-hydrogen atom count to 6 – 50. Ligands too flexible (with an effective rotor count greater than 15) were also eliminated. This filtering stage resulted in a set of 47,123 3-D non-unique, organic (i.e., carbon containing) 3-D experimental reference structures, where a 3-D conformer model could be generated. The distributions of molecular size and flexibility of the dataset are depicted in Figure 1.
To test effects of the experimental uncertainties upon the evaluation of the conformer model accuracies, a subset of the 47,123-ligand set, which contains 157 “high-quality” ligand structures, was constructed in the procedures described below.
(1) Select the PubChem Substance records associated with the MMDB records for the 197 PDB structures determined in the study by Hawkins et al. . These PDB structures were determined by considering the local quality of fit of the ligand to its density, as well as global level metrics of the protein structure. Because some of the 197 PDB structures had multiple ligands, there were 265 PubChem Substance records associated with these protein structures. Note that, because their study provided a list of the PDB identifiers (without a unique ligand identifier), it was difficult to determine what ligands were actually included in the 197-ligand set. Therefore, next filtering steps similar to those used in their study were taken subsequently.
(2) Select the PubChem Substance records that are neither too rigid nor too flexible (3 ≤ NR ≤ 16), and that are neither too small nor too large (8≤ NNHA ≤ 50). This filtering stage resulted in 200 PubChem Substance records.
(3) Select the PubChem Substance records with good “local” quality of fit to the density. Hawkins et al. used three metrics for this purpose: the real-space correlation coefficient (RSCC) , the real-space R-value (RSR) , and the occupancy-weighted B-factor (OWAB). In the present study, the same criteria as used in their study (RSCC > 0.9, RSR<0.2, and 1<OWAB<50) were applied, after downloading these data from the electron density server (EDS) [49,50]. After this step, 176 structures were remained.
(4) Some of the remaining 176 PubChem Substance Records were associated with identical PubChem Compound Records, or had the same three-letter PDB ligand codes, implying that they were the same ligand molecule. In these cases, the one with the largest RSCC value was retained, and the others were removed. After removing the redundancy, there were 164 structures remained.
(5) When any pair of the remaining 164 structures had the PubChem 2-D similarity score of > 0.9 (computed using the PubChem 2-D subgraph fingerprints  and the Tanimoto equation [6,7]), the one with the largest RSCC value was retained and the other was removed. [In the study of Hawkins et al. , the LINGOS method  was used instead of the PubChem fingerprint to remove too similar molecules.] There were 157 ligands remained after this final filtering stage.
The conformer ensemble for each molecule in the dataset was generated using the OMEGA software  from the OpenEye Scientific Software, Inc. The OMEGA application performs conformer generation in two primary stages: fragment generation and torsion driving. The fragment generation stage splits the input structure into smaller pieces that are energy minimized and conformationally sampled to get diverse 3-D representations for each molecule fragment. The torsion driving stage reassembles and iterates over the fragments from the first stage using particular rule-based torsion angles that depend on the molecular environment between connecting fragments. More detailed description of the OMEGA software is given elsewhere [18,52].
OMEGA has many adjustable parameters to generate conformations with particular attributes, and the optimal set of parameter values used for the present study was based on our recent study . The Merck Molecular Force Field (MMFF94s) without coulombic terms (MMFF94s_NoEstat) was used with the "startfact" value of 20. The energy window value of 25kcal/mol was employed for both model building and torsion driving stages. The values used for other parameters were identical to those used in the previous study .
As pointed out in a recent review by Scior et al., because adequate conformational space coverage is an important requirement for reliable 3-D similarity computations, it would be desirable to consider as many conformations per molecule as possible. However, because it would require tremendous computational resources, it is inevitable to find a compromise between computational cost and conformational coverage. The PubChem3D conformer generation procedure generates a maximum of 100,000 conformers for each chemical structure. As demonstrated in our previous study , this limit may not be enough for very flexible and large molecules, resulting in truncation of conformational search. However, in the same study , it was shown that the 100-K limit does not cause a noticeable decrease in the “average” conformer model accuracy for smaller and less flexible molecules (i.e., with NNHA≤35 and NER≤15). Therefore, this 100-K limit seems adequate for these molecules in general.
After conformer models were produced, a data reduction was performed whereby conformers were sampled to identify a random set of conformers that have a minimum RMSD distance to each other. This minimum RMSD distance was determined by rounding the RMSDpred value [in Equation (3)] to the nearest 0.2 increment i.e., Equation (4)]. The conformers in each conformer ensemble were down-sampled using a partition-based clustering scheme, as described in our previous study , with the RMSD as a distance threshold between conformers (that is, RMSDcluster) and the lowest-energy conformer in each partition as an initial “seed” structure for clustering of that partition. The centroid of each cluster was selected as the representative conformer of that cluster to construct a smaller conformer model with 500 conformers or less. If the conformer model had more than 500 conformers after sampling, it was re-clustered with the RMSDcluster value incremented by a further 0.2. This re-clustering process was repeated as many times as necessary to reduce the overall conformer count to be 500 or less. Note that, because the lowest-energy conformer in each partition was used as an initial seed, low-energy conformers are more likely to be included than high-energy conformers when all partitions are combined together for next round of clustering. As a result, the final conformer model sampled though clustering is more likely to include low-energy conformers than high-energy conformers. All RMSD values computed in this study used the OEChem OERMSD function with: “overlay” turned on to allow rotation/translation to yield the lowest possible RMSD value; and “automorph” detection turned on to allow proper treatment of symmetrically equivalent atoms, except when its use resulted in excessive run-time [an extremely rare event (at a rate of about 1 in 10,000) generally caused by large, nearly symmetric molecules].
The accuracy of the clustered ensembles was estimated using five different accuracy measures: RMSD, STST-opt, ComboTST-opt, CTCT-opt, and ComboTCT-opt. The latter four accuracy measures were computed using ROCS [37,38,52]. Note that the generated conformer model had up to 500 conformers, and the accuracy of the conformer model was evaluated by selecting the best conformer that was closest to the experimental structure (that is, the one with the smallest RMSD value or the largest ROCS-based similarity values).
The authors declare that they have no competing interests.
EEB generated the data. SK wrote the first manuscript. Both EEB and SK analyzed the data, and SHB reviewed the final manuscript. All authors read and approved the final manuscript.
List of the 47,123 small-molecule ligands considered in the study. This file contains a list of the PC Substance ID for small molecules considered in this study.
Distribution of the conformer ensemble accuracies. This file contains figures that show the distributions of the RMSD and the STST-opt accuracies of the conformer models as a function of NNHA, NR, and NER (Additional file 2: Figures S1-S6) and correlation between the two accuracy measures (Additional file 2: Figure S7).
Least-squares fitting of the plots of conformer model accuracyvs. NNHA, NR, NER, andRMSDcluster. This file contains results of linear least-squares fitting of the plots (in Figures 8, ,99, ,1010, ,11,11, and and12)12) of the average conformer model accuracies vs. each of three independent variables [i.e., the non-hydrogen atom count (NNHA), the rotatable bond count (NR), and the effective rotor count (NER)] to the form of y=a+bx.
List of the 157 high-quality ligands selected from the 47,123 ligands. This file contains a list of the PC Substance ID for 157 high-quality ligands selected from the 47,123 molecules.
This research was supported in part by the Intramural Research Program of the National Library of Medicine, National Institutes of Health, U.S. Department of Health and Human Services.