1. Conformer generation
Conformers were generated for chemical structures in the PubChem Compound database [15] as described in the Materials and Methods section. This resulted in 16,482,382 3-D conformer ensemble models (as of February 2008) and 1,465,813,269 diverse conformers (an average of 89 conformers per compound). The distributions of the non-hydrogen atom count, rotatable bond count, sampling RMSD, and conformer volume (rounded to the nearest integer) for these are shown in Figure . The mean and standard deviation of the non-hydrogen atom count were 24.5 ± 6.8, with a mode of 26 (1,033,645 compounds). The mean and standard deviation of the rotatable bond count were 5.5 ± 2.6, with a mode of 6 (2,432,059 compounds). The mean and standard deviation of the sampling RMSD for the conformer ensembles were 0.82 ± 0.20 Å, with a mode of 0.8 Å (6,939,072 conformer ensembles). The mean and standard deviation of the conformer volume were 297 ± 64 Å³. The most common volume among the conformers was 307 Å³ (10,920,699 conformers), and 99% of the conformers had a volume between 130 and 487 Å³. In further analyses, we focused on the conformers whose volumes were between 75 and 575 Å³, corresponding to 99.99% of all conformers.
Figure 1. The distribution of non-hydrogen atom count, rotatable bond count, conformer ensemble sampling RMSD, and conformer volumes (rounded to the nearest integer) of 1,465,813,269 conformers generated from 16,482,382 molecules in the PubChem Compound database.
2. Generation of reference shapes per volume
The shape diversity of a particular conformer volume may be ascertained by clustering the conformers of that volume with a given shape diversity threshold (STthresh), which controls the "minimum" distance between any two clusters, and then counting the resulting reference shapes. Each reference shape represents a cluster centroid together with all conformers whose ST similarity to it is within STthresh. [Note that STthresh is the "maximum" ST value between clusters, since the ST score is a similarity measure, not a dissimilarity measure.] If the clustering is performed with the same STthresh value across a volume range, the shape diversity as a function of molecular volume may be evaluated from the growth in the number of reference shapes. However, with a constant STthresh value, each increase in molecular volume may result in very rapid growth of the shape space and, hence, of the number of reference shapes per volume. This is not entirely desirable: N reference shapes must be compared against K conformers (with N < K), so the computational cost of clustering grows roughly as the square (or worse) of the total reference shape count, especially when this count is large. The reference shape count must therefore be kept to a manageable size for the computation to remain tractable.
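The role of the threshold can be illustrated with a minimal sketch. This is not the clustering procedure used in this work (that is described in the Materials and Methods section); it is a simple greedy pass in which a shape becomes a new reference only if its similarity to every existing reference is below the threshold. The `tanimoto` function is a toy set-overlap stand-in for the shape Tanimoto, and the names `Shape` and `pick_reference_shapes`, as well as the tie rule (a similarity equal to the threshold joins an existing cluster), are assumptions made for illustration only.

```python
from typing import FrozenSet, List, Sequence

# Hypothetical stand-in: a "shape" is the set of grid cells it occupies.
Shape = FrozenSet[int]


def tanimoto(a: Shape, b: Shape) -> float:
    """Set-overlap Tanimoto, used here only as a toy proxy for the ST score."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def pick_reference_shapes(shapes: Sequence[Shape], st_thresh: float) -> List[Shape]:
    """Greedy pass: keep a shape as a reference only if it is less similar
    than st_thresh to every reference chosen so far (assumed tie rule)."""
    references: List[Shape] = []
    for shape in shapes:
        if all(tanimoto(shape, ref) < st_thresh for ref in references):
            references.append(shape)
    return references


# Toy data: 15 overlapping windows of 10 cells, each shifted by 2 cells.
toy_shapes = [frozenset(range(start, start + 10)) for start in range(0, 30, 2)]

loose = pick_reference_shapes(toy_shapes, st_thresh=0.5)  # 8 reference shapes
tight = pick_reference_shapes(toy_shapes, st_thresh=0.9)  # 15 reference shapes
```

In this toy data set, lowering the threshold from 0.9 to 0.5 roughly halves the number of reference shapes, which is the qualitative behavior described above: a smaller STthresh merges more conformers into each cluster.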
To avoid excessive computational expense, we took an alternative approach (as described in Figure ), in which the clustering for a given volume was performed with a dynamic STthresh value chosen so that the resulting reference shape count was less than or equal to a fixed number (chosen to be 200). In this manner, the number of reference shapes per volume was kept effectively constant (an increase of STthresh by 0.01 would push the reference shape count above 200), while the growth of shape space as a function of volume is manifested as a decrease in STthresh. The detailed clustering procedure is explained in the Materials and Methods section, and the PubChem Compound IDs of the resulting reference shapes can be found on the PubChem FTP site: ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound_3D/ReferenceShapes/
Partition-clustering scheme used for generating the reference shapes for a given volume.
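The idea of a dynamic threshold can be sketched as follows. This is only an illustration under stated assumptions, not the partition-clustering scheme of this work: the threshold starts at 1.00 and is lowered in 0.01 steps until a simple greedy clustering yields no more than `max_refs` reference shapes. The set-based `tanimoto`, the greedy pass, and the function name `dynamic_st_thresh` are all hypothetical.

```python
from typing import FrozenSet, List, Sequence, Tuple

Shape = FrozenSet[int]  # toy stand-in for a conformer shape


def tanimoto(a: Shape, b: Shape) -> float:
    """Set-overlap Tanimoto, a toy proxy for the shape Tanimoto (ST)."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def greedy_references(shapes: Sequence[Shape], st_thresh: float) -> List[Shape]:
    """Keep a shape as a reference if it is below st_thresh to all references."""
    refs: List[Shape] = []
    for shape in shapes:
        if all(tanimoto(shape, ref) < st_thresh for ref in refs):
            refs.append(shape)
    return refs


def dynamic_st_thresh(shapes: Sequence[Shape], max_refs: int = 200) -> Tuple[float, List[Shape]]:
    """Lower the threshold from 1.00 in 0.01 steps until the greedy
    clustering produces at most max_refs reference shapes."""
    if max_refs < 1:
        raise ValueError("max_refs must be at least 1")
    for level in range(100, -1, -1):  # integer hundredths avoid float drift
        st_thresh = level / 100
        refs = greedy_references(shapes, st_thresh)
        if len(refs) <= max_refs:
            return st_thresh, refs
    raise AssertionError("unreachable: a threshold of 0 yields at most one reference")


# Toy data: 15 overlapping windows; a cap of 5 stands in for the cap of 200.
toy_shapes = [frozenset(range(start, start + 10)) for start in range(0, 30, 2)]
st_thresh, refs = dynamic_st_thresh(toy_shapes, max_refs=5)
```

For this toy input the search stops at a threshold of 0.42 with 5 reference shapes; one step higher (0.43) would give 8 references and exceed the cap, mirroring the remark that an increase of 0.01 would push the count above the limit.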
Figure shows the STthresh value and the reference shape counts as a function of the conformer volume. The STthresh value decreases gradually and uniformly over the 75-575 Å³ range, from 0.92 (for V = 75 Å³) to 0.47 (for V = 558 Å³). In fact, this decrease is so smooth that one can predict the STthresh value in the volume range 75-575 Å³ using only the conformer volume (Equation 2), with an R² value of 0.997.
The Shape Tanimoto value used as a shape diversity threshold (STthresh) and the resulting reference shape counts as a function of volume.
In Equation 2, V is the conformer volume and STthresh is the shape Tanimoto value for the given volume that yields 200 or fewer reference conformers. The slope of the STthresh curve shows that the increase in the cluster distances becomes slower as the conformer volume increases; however, this slowdown may be an artifact of the input, for a relatively simple reason. This study considered only chemical structures found in PubChem and was restricted to those with 50 or fewer non-hydrogen atoms. Furthermore, the distribution of the non-hydrogen atom count peaked at 26. Conceivably, STthresh would decrease more rapidly if the count of chemical structures in PubChem continued to increase with non-hydrogen atom count across the entire range, rather than peaking at 26. The net effect of this input artifact is that the true STthresh curve may be more linear than the one shown in Figure . We expect that the entire curve may shift and appear more linear as more theoretically possible and diverse chemical structures are considered; however, we believe the trends detailed in this work should still hold, unless noted otherwise. Irrespective of this explanation, one should consider the curve shown in Figure a conservative estimate of the absolute growth of shape space.
The reference shape count per volume ranged from 83 (for V = 92 Å³) to the maximum allowed value of 200 (for V = 380 Å³), with an average of 147.9. Interestingly, the STthresh curve does not reflect the maximum found in Figure for conformer volume. In fact, the decrease in STthresh as a function of volume is very smooth, suggesting that the actual conformer count per volume, as shown in Figure , has little bearing on shape diversity, as shown in Figure . Put another way, the shape space of known chemicals is not nearly as diverse as chemical space: a relatively small number of reference shapes can represent a large number of chemical structure conformers.
Another interesting observation is that a small change in STthresh has a large effect on the reference shape count, as reflected in the somewhat periodic growth in reference shapes until the maximum of 200 is reached, at which point STthresh is decremented and the reference shape count is cut nearly in half. This can be roughly seen in the volume range 75-210 Å³ and again between 275 and 375 Å³. This reflects the use of 0.01 decrements in STthresh, but it also matches anecdotal evidence seen when exploring the reference shapes, where each change in STthresh by 0.01 appeared to change the reference count by about a factor of two, much as observed by Haigh et al. This is only roughly seen in the reference shape counts because two things are changing, the volume and the STthresh value, and a change in volume involves a potentially variable change in shape space.
3. Generation of unique shapes for each volume
Reference shapes generated for a given volume are guaranteed to be no closer to one another than the corresponding STthresh value; that is, the ST similarity between any two reference shapes for that volume cannot be greater than STthresh. However, two reference shapes of different volumes may still be closer than STthresh, implying that some portion of the shape space covered by reference shapes for V = V1 can also be covered by reference shapes for V ≠ V1. For this reason, we introduced the concept of "unique shapes" for a given volume, defined as a non-overlapping set of conformers that cover the shape space spanned by the conformers whose volume is smaller than or equal to that volume (that is, V ≤ V1). As illustrated in Figure , the unique shapes were classified into three groups according to the shape space they cover: (1) the "large unique shapes", which cover the shape space spanned only by the conformers of V = V1; (2) the "small unique shapes", which cover the shape space spanned only by the conformers of V < V1; and (3) the "shared unique shapes", which cover the shape space spanned by both the conformers of V = V1 and those of V < V1. When the conformer volume increases from V < V1 to V = V1, the "large unique shapes" for V = V1 account for the newly added shape space, whereas the "small unique shapes" for V = V1 represent the shape space not present at that volume. The unchanged portion of the shape space is accounted for by the "shared unique shapes" for V = V1. Figure schematically illustrates the shape space expansion upon a successive increase in the conformer volume. Note that smaller STthresh values were used for clustering as the volume increases (as represented by larger circles) to keep the number of unique shapes at a manageable size and to reflect the STthresh value used in Figure for V1.
The concept of unique shapes for V = V1, which cover the shape space spanned by the conformers whose volumes are less than or equal to V1.
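The three-way classification can be sketched in a few lines. The sketch below assumes that a unique shape "covers" a conformer when their similarity is at least STthresh; the function `classify_unique_shapes`, the one-dimensional `toy_similarity`, and the integer "shapes" are hypothetical and stand in for real conformers and the shape Tanimoto.

```python
from typing import Callable, List, Sequence


def classify_unique_shapes(
    unique_shapes: Sequence[int],
    at_v1: Sequence[int],
    below_v1: Sequence[int],
    similarity: Callable[[int, int], float],
    st_thresh: float,
) -> List[str]:
    """Label each unique shape by which conformer population it covers:
    only V = V1 ("large"), only V < V1 ("small"), or both ("shared")."""
    labels: List[str] = []
    for shape in unique_shapes:
        covers_v1 = any(similarity(shape, c) >= st_thresh for c in at_v1)
        covers_below = any(similarity(shape, c) >= st_thresh for c in below_v1)
        if covers_v1 and covers_below:
            labels.append("shared")
        elif covers_v1:
            labels.append("large")
        elif covers_below:
            labels.append("small")
        else:
            raise ValueError(f"unique shape {shape!r} covers no conformer")
    return labels


def toy_similarity(a: int, b: int) -> float:
    """Toy similarity on a 0-100 line; 1.0 means identical positions."""
    return 1.0 - abs(a - b) / 100


labels = classify_unique_shapes(
    unique_shapes=[0, 50, 100],
    at_v1=[45, 95],      # hypothetical conformers with V = V1
    below_v1=[5, 55],    # hypothetical conformers with V < V1
    similarity=toy_similarity,
    st_thresh=0.8,
)
# labels == ["small", "shared", "large"]
```

Here the shape at 0 is close only to a smaller conformer, the shape at 50 is close to conformers from both populations, and the shape at 100 is close only to a conformer of volume V1.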
Figure 5. Schematic illustration of the shape space expansion upon a conformer volume increase. Blue circles represent the shape space spanned by conformers of a particular volume (V), and black dots represent reference shapes (for the individual shape spaces).
The "unique shapes" for each volume were computed using two different clustering strategies, the "small-then-large" approach and the "large-then-small" approach, as depicted in Figure ; the detailed procedures are described in the Materials and Methods section. In the "small-then-large" approach [Figure ], the shape space of the conformers of V < V1 was explored first, at the STthresh value for V1, so that the shape space newly added when the conformer volume increases to V1 could then be identified. That is, the small and shared unique shapes for V = V1, which cover the shape space spanned by conformers of V < V1, were first generated by clustering all reference and basis shapes for V < V1, and the identified unique shapes were then re-clustered with the reference and basis shapes for V = V1 to find the large unique shapes. In contrast, in the "large-then-small" approach [Figure ], the large and shared unique shapes for V = V1 were determined first, using the previously determined reference shapes for V = V1, and the reference and basis shapes for V < V1 were then re-clustered against them to identify the small unique shapes.
Two different approaches used to generate the unique shapes between V = V1 and V<V1, depending on which shape space is clustered first.
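Why the two orderings can give different counts is easy to see in a toy setting. The sketch below is an assumption-laden illustration, not the procedure of this work: it uses a simple greedy pass in which a candidate is added as a unique shape only if it is below the threshold to every shape already kept, and it swaps the order in which the two populations are processed. The helper `extend_references`, the one-dimensional `toy_similarity`, and the integer "shapes" are hypothetical.

```python
from typing import Callable, List, Sequence


def extend_references(
    seed: Sequence[int],
    candidates: Sequence[int],
    similarity: Callable[[int, int], float],
    st_thresh: float,
) -> List[int]:
    """Start from the seed shapes and add each candidate that is less
    similar than st_thresh to every shape kept so far."""
    kept = list(seed)
    for candidate in candidates:
        if all(similarity(candidate, ref) < st_thresh for ref in kept):
            kept.append(candidate)
    return kept


def toy_similarity(a: int, b: int) -> float:
    """Toy similarity on a 0-100 line; 1.0 means identical positions."""
    return 1.0 - abs(a - b) / 100


below_v1 = [0, 30]  # hypothetical shapes with V < V1
at_v1 = [15]        # hypothetical shape with V = V1
ST = 0.8            # threshold assumed for V1

# "Small-then-large": cluster the smaller shapes first, then add V = V1.
small_then_large = extend_references(
    extend_references([], below_v1, toy_similarity, ST), at_v1, toy_similarity, ST
)
# "Large-then-small": start from V = V1, then add the smaller shapes.
large_then_small = extend_references(
    extend_references([], at_v1, toy_similarity, ST), below_v1, toy_similarity, ST
)
# small_then_large == [0, 30]; large_then_small == [15]
```

Both orderings cover the same three shapes, yet one needs two unique shapes and the other only one, because the centrally placed shape happens to cover both neighbors when it is chosen first. The toy case exaggerates the effect, but it shows that the counts depend on processing order even when the underlying shape space is identical.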
The two methods resulted in two different sets of unique shapes for each volume. The unique shape counts for both sets and the ratio between them are plotted in Figure as a function of the conformer volume. Because both methods deal with the identical shape space, they are expected to give similar numbers of unique shapes; however, since reference shapes were selected randomly, without any attempt to minimize or maximize their count, the counts cannot be expected to be identical. As shown in Figure , the unique shape counts for the two sets tended to differ by 0-10%, although their ratio varied from 0.7 to 1.3 (especially for V > 500 Å³, where the conformer populations were not as numerous). This tendency may be explained by the fact that smaller volumes involve reference and basis shapes that may be considerably closer together owing to larger STthresh values. This suggests that using the larger-volume reference shapes first resulted in a more efficient shape space description (i.e., fewer reference shapes) when considering the union of the collective shape space for the volume range. Nonetheless, Figure shows, as expected, that the total number of unique shapes gradually increases as a function of the conformer volume and its STthresh value, indicating an overall expansion of shape space across the volume range irrespective of the change in ST value used (i.e., shape space is growing faster than the ST value is decreasing as a function of volume to maintain a maximum of 200 reference shapes).
Unique shape counts. (a) The number of unique shapes generated by the "small-then-large" method and the "large-then-small" method, and (b) the ratio of "small-then-large" to "large-then-small" unique shapes as a function of conformer volume.
Figure displays the number of large unique shapes, small unique shapes, and shared unique shapes for each volume, while Figure shows their proportions of the total unique shapes, which were estimated using the following equations:
The number of unique shapes, small unique shapes, and large unique shapes generated using (a) the small-then-large method and (b) the large-then-small method.
The percentages of large, small, and shared unique shapes as a function of the conformer volume. The shared percentage is the portion of shape space not covered by either large or small unique shapes [i.e., shared = 1.0 - (large + small)].
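The relation given in the caption, shared = 1.0 - (large + small), can be written out as a short calculation. The sketch below assumes that the large and small fractions are simply the corresponding counts divided by the total unique shape count; the function name and the example counts are hypothetical and are not taken from the data in this work.

```python
from typing import Dict


def unique_shape_percentages(n_large: int, n_small: int, n_total: int) -> Dict[str, float]:
    """Convert unique shape counts to percentages of the total; the shared
    share is whatever remains: shared = 1.0 - (large + small)."""
    if n_total <= 0 or n_large < 0 or n_small < 0 or n_large + n_small > n_total:
        raise ValueError("counts must be non-negative and sum to at most n_total")
    large = n_large / n_total
    small = n_small / n_total
    shared = 1.0 - (large + small)
    return {"large": 100 * large, "small": 100 * small, "shared": 100 * shared}


# Hypothetical counts for one volume: 20 large, 30 small, 200 unique shapes in total.
percentages = unique_shape_percentages(n_large=20, n_small=30, n_total=200)
```

With these made-up counts the large, small, and shared percentages come out at about 10%, 15%, and 75%, respectively, and always sum to 100%.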
Note that the value of STthresh affects the counts of reference, basis, and unique shapes, because it determines the distance between clusters. However, the percentages of these counts plotted in Figure are essentially equivalent to the fractions of the shape space that the individual counts represent, and hence, they may be considered to be independent of the STthresh value.
There are a number of interesting observations one can make from these graphs. In Figure and Figure there is a banded behavior, indicated previously in Figure , which looks like a series of lines spaced further apart as the volume increases. This is due to the steady growth in shape space as volume increases and the use of 0.01 decrements of STthresh. Whenever the STthresh decreases by 0.01, a corresponding significant decrease in counts occurs. When the STthresh value changes less, or does not change at all, the lines appear to be wider apart, reflecting just the growth in shape space due to volume.
Another interesting observation, in Figure , is that the absolute count of large unique shapes stays relatively constant across the volume range, with a mean and standard deviation of 22.2 ± 7.8 and a mode of 24. There is a shallow maximum at a volume of 145 Å³, followed by a relatively slow overall decline over the rest of the volume range. This decline is most evident beyond a volume of 305 Å³, perhaps owing to the truncation of the shape space considered, as represented by the rapid reduction in conformer count at larger volumes and by the fact that the non-hydrogen atom count distribution peaks at 26.
Like the large unique shapes in Figure , the large and shared unique shapes in Figure show banded behavior across most of the volume range, with a mean count and standard deviation of 144.4 ± 23.7 and a mode of 140. There is a barely evident maximum at a volume of 228 Å³ and a slightly noticeable dip at 261 Å³, after which the similar narrow band of large and shared unique shapes resumes. This may suggest that the growth of the large and shared shape space is relatively constant as a function of PubChem contents.
The small and shared unique shapes completely dominate in Figure , being nearly the same as the total count of unique shapes across the entire volume range; however, the small unique shapes in Figure show a very shallow minimum at a volume of about 200 Å³ before increasing significantly as a function of volume. This may suggest that the growth of the overall PubChem shape space slows (relative to the rate of change in ST) after a point, with large unique shapes contributing less and less to the overall shape diversity across the full volume range as the total shape space that can be represented by larger shapes diminishes. One can see this to some extent in Figure , where the percentage of shared shape space is "Λ"-shaped, reaching a maximum of 73% at a volume of 217 Å³ and then steadily diminishing as the percentage of shape space of smaller shapes comes to dominate. Again, it is reasonable to suggest that this observation is an artifact of the PubChem contents and not representative of what one might find if significantly more large chemical structures, in the range of 30-50 non-hydrogen atoms, were considered (i.e., if the non-hydrogen atom count did not peak at 26 but continued to grow up to the maximum considered of 50).
To further demonstrate this, Figure shows the ratio of the fraction of large unique shapes to the sum of the fractions of large and shared unique shapes, which measures how much of the shape space spanned by the conformers of a particular volume is not shared by conformers smaller than that volume. For 75 Å³ ≤ V ≤ 100 Å³, the mean value of the ratio was 0.19, indicating that ~20% of the shape space spanned by the conformers of a particular volume in this range is unique to that volume and that the other ~80% is shared with smaller conformers. The ratio decreases with conformer volume, and the mean value for 550 Å³ ≤ V ≤ 575 Å³ was 0.11, indicating that the rate of shape space growth decreases as the conformer volume increases, relative to the PubChem chemical structure contents.
The ratio of the percent large unique shapes to the sum of the percent large and shared unique shapes.
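The ratio plotted here is a one-line calculation, sketched below. The function name is hypothetical, and the input percentages are made-up values chosen only so that the result matches the mean ratio of 0.19 reported for 75 Å³ ≤ V ≤ 100 Å³; they are not measured values.

```python
def new_shape_space_ratio(pct_large: float, pct_shared: float) -> float:
    """Share of the shape space at V = V1 (large + shared) that is
    not also covered by smaller conformers (large only)."""
    covered_at_v1 = pct_large + pct_shared
    if pct_large < 0 or pct_shared < 0 or covered_at_v1 <= 0:
        raise ValueError("percentages must be non-negative and not both zero")
    return pct_large / covered_at_v1


# Hypothetical percentages: 9.5% large and 40.5% shared unique shapes.
ratio = new_shape_space_ratio(pct_large=9.5, pct_shared=40.5)
```

A ratio of 0.19 means that about one fifth of the shape space reached by conformers of that volume is new, while the remainder is already covered by smaller conformers. Note that the small unique shapes do not enter the ratio, since they describe shape space absent at V = V1.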