J Cheminform. 2011; 3: 25.
Published online Jul 20, 2011. doi:  10.1186/1758-2946-3-25
PMCID: PMC3158422
PubChem3D: Shape compatibility filtering using molecular shape quadrupoles
Sunghwan Kim,1 Evan E Bolton,1 (corresponding author) and Stephen H Bryant1
1National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, 8600 Rockville Pike, Bethesda, MD 20894, USA
Sunghwan Kim: kimsungh/at/ncbi.nlm.nih.gov; Evan E Bolton: bolton/at/ncbi.nlm.nih.gov; Stephen H Bryant: bryant/at/ncbi.nlm.nih.gov
Received May 12, 2011; Accepted July 20, 2011.
Background
PubChem provides a 3-D neighboring relationship, which involves finding the maximal shape overlap between two static compound 3-D conformations, a computationally intensive step. It is highly desirable to avoid this overlap computation, especially if it can be determined with certainty that a conformer pair cannot meet the criteria to be a 3-D neighbor. As such, PubChem employs a series of pre-filters, based on the concept of volume, to remove approximately 65% of all conformer neighbor pairs prior to shape overlap optimization. Given that molecular volume, a somewhat vague concept, is rather effective, it leads one to wonder: can the existing PubChem 3-D neighboring relationship, which consists of billions of shape similar conformer pairs from tens of millions of unique small molecules, be used to identify additional shape descriptor relationships? Or, put more specifically, can one place an upper bound on shape similarity using other "fuzzy" shape-like concepts like length, width, and height?
Results
Using a basis set of 4.18 billion 3-D neighbor pairs identified from single conformer per compound neighboring of 17.1 million molecules, shape descriptors were computed for all conformers. These steric shape descriptors included several forms of molecular volume and shape quadrupoles, which essentially embody the length, width, and height of a conformer. For a given 3-D neighbor conformer pair, the volume and each quadrupole component (Qx, Qy, and Qz) were binned and their frequency of occurrence was examined. Per molecular volume type, this effectively produced three different maps, one per quadrupole component (Qx, Qy, and Qz), of allowed values for the similarity metric, shape Tanimoto (ST) ≥ 0.8.
The efficiency of these relationships (in terms of true positive, true negative, false positive and false negative) as a function of ST threshold was determined in a test run of 13.2 billion conformer pairs not previously considered by the 3-D neighbor set. At an ST ≥ 0.8, a filtering efficiency of 40.4% of true negatives was achieved with only 32 false negatives out of 24 million true positives, when applying the separate Qx, Qy, and Qz maps in a series (Qxyz). This efficiency increased linearly as a function of ST threshold in the range 0.8-0.99. The Qx filter was consistently the most efficient followed by Qy and then by Qz. Use of a monopole volume showed the best overall performance, followed by the self-overlap volume and then by the analytic volume.
Application of the monopole-based Qxyz filter in a "real world" test of 3-D neighboring of 4,218 chemicals of biomedical interest against 26.1 million molecules in PubChem reduced the total CPU cost of neighboring by 24-38% and, if used as the initial filter, removed from consideration 48.3% of all conformer pairs at almost negligible computational overhead.
Conclusion
Basic shape descriptors, such as those embodied by size, length, width, and height, can be highly effective in identifying shape incompatible compound conformer pairs. When performing a 3-D search using a shape similarity cut-off, computation can be avoided by identifying conformer pairs that cannot meet the result criteria. Applying this methodology as a filter for PubChem 3-D neighboring computation, an improvement of 31% was realized, increasing the average conformer pair throughput from 154,000 to 202,000 per second per CPU core.
PubChem is an open and free resource of the biological activities of small molecules [1-4]. PubChem has an integrated theoretical 3-D layer, PubChem3D [5-7], which provides a precomputed 3-D neighboring relationship called "Similar Conformers" [7] to help users locate and relate data in the archive. "Similar Conformers" identifies chemicals with similar 3-D shape and similar 3-D orientation of functional groups typically used to define pharmacophores (defined here simply as "features"), complementing a PubChem 2-D neighboring relationship called "Similar Compounds", which identifies closely related chemical analogs using the PubChem 2-D subgraph fingerprint [8]. Effectively, for each PubChem chemical structure, this 3-D neighboring relationship provides (at the time of writing) the results of a 3-D similarity search against 28.9 million compound records using three diverse conformers per molecule.
The PubChem3D neighboring uses as a measure of molecular shape similarity the shape Tanimoto (ST) [9,10], given as the following equation:
$$ST = \frac{V_{AB}}{V_{AA} + V_{BB} - V_{AB}} \qquad (1)$$
where VAA and VBB are the self-overlap volumes of conformers A and B, respectively, and VAB is the common overlap volume between A and B. The 3-D neighboring requires finding the maximum shape similarity between static compound 3-D conformations, as dictated by VAB in Equation 1, to calculate ST, a computationally intensive step. It is highly desirable to avoid this overlap computation, especially if it can be determined with certainty that a conformer pair cannot meet the criteria to be a 3-D neighbor. As such, PubChem employs a series of filters, based on the concept of volume, to effectively ignore approximately 65% of all conformer neighbor pairs during 3-D neighboring, thus dramatically accelerating processing [7].
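As a minimal sketch of how the shape Tanimoto is evaluated once the three overlap volumes are known, the following Python function simply applies the definition above. The function name and the numeric values are illustrative assumptions, not part of the PubChem3D implementation; the expensive part in practice is finding the maximal VAB, which is not shown here.

```python
def shape_tanimoto(v_aa: float, v_bb: float, v_ab: float) -> float:
    """Shape Tanimoto from two self-overlap volumes and the common overlap volume."""
    denominator = v_aa + v_bb - v_ab
    if denominator <= 0.0:
        raise ValueError("volumes must give a positive denominator")
    return v_ab / denominator


# Hypothetical volumes (in cubic angstroms), chosen only for illustration.
identical = shape_tanimoto(100.0, 100.0, 100.0)  # perfect overlap
partial = shape_tanimoto(100.0, 120.0, 90.0)     # partial overlap
print(f"identical: {identical:.3f}, partial: {partial:.3f}")
```

With these illustrative numbers the partial overlap gives an ST of about 0.69, which would fall below the ST ≥ 0.8 neighboring threshold.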
Volume, although a somewhat fuzzy concept, is quite effective as a filter for conformers dissimilar in shape and features [7]. Conceivably, aspects of molecular shape beyond volume can also be used to "recognize" when two shapes are (dis)similar. Characteristics one can readily imagine are descriptors associated with length, width, and height. Steric shape quadrupoles embody such a concept, and attempts have been made to use their differences as a shape similarity metric [11,12]. This leads to the question: can additional simple shape descriptor relationships be identified that improve upon the volume-based filtering efficacy? Or, put another way, can one place an upper bound on shape similarity by identification of some (additional) crude shape compatibility between conformers?
In this paper, we examine the use of shape descriptors as a means to rapidly identify "dissimilar" molecule shapes. As a part of this, we attempt to answer the critical questions: are vague shape descriptors representing the concepts of length, width, and height good discriminators of molecular shape? Can 3-D similarity searching speed be further accelerated using shape descriptors more sophisticated than volume? Is it possible to create a "shape compatibility" mapping indexed to shape similarity?
1. Distribution of shape descriptor components and their volume dependency
The molecular shape quadrupoles in the principal-axes frame [9,13] are given as the following:
(2) [equation image: 1758-2946-3-25-i2.gif]
where Qx, Qy, and Qz are the x, y, and z components of the quadrupole moment, respectively. The x, y, and z components are conceptually equivalent to the length, width, and height of a molecule, respectively, with the largest quadrupole component defined as Qx and the smallest as Qz, by convention. An assumption underlying this study is that there is a point whereby, if the shape quadrupole difference between two conformers is too large, they cannot meet the ST ≥ 0.8 threshold required by PubChem3D neighboring, as illustrated in Figure 1. This relationship, if it actually exists, would allow conformer pairs to be filtered out, avoiding the time-consuming shape superposition optimization step for those pairs and enhancing the throughput of the PubChem 3-D neighboring. To determine whether such a relationship can be found, the shape quadrupole differences for known 3-D "Similar Conformers" were analyzed.
Figure 1
Small changes in dimensions can result in large changes in overlap. Using a 2-D rectangle shape with constant area (0.4 in²), one can see that small changes in shape dimensions (length and width) can result in large changes in shape overlap (ST).
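The idea in Figure 1 can be reproduced with a few lines of arithmetic. The sketch below assumes two axis-aligned rectangles that share the same center, so their overlap is simply the product of the smaller length and the smaller width; the specific dimensions are hypothetical and are not taken from the figure.

```python
def centered_rectangle_st(len_a: float, wid_a: float, len_b: float, wid_b: float) -> float:
    """ST of two axis-aligned rectangles that share the same center."""
    overlap = min(len_a, len_b) * min(wid_a, wid_b)
    area_a = len_a * wid_a
    area_b = len_b * wid_b
    return overlap / (area_a + area_b - overlap)


# Hypothetical dimensions (inches); every rectangle has an area of 0.4 in^2.
AREA = 0.4
reference = (1.0, AREA / 1.0)
for length in (1.0, 0.9, 0.8, 0.7):
    width = AREA / length
    st = centered_rectangle_st(reference[0], reference[1], length, width)
    print(f"{length:.1f} x {width:.3f} in: ST = {st:.3f}")
```

Under these assumptions, shortening the length by 20% at constant area already gives an ST of about 0.67, below the 0.8 threshold used for neighboring.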
At the time of quadrupole filter project initiation (in October, 2008), 3-D neighboring of 17,143,181 unique molecules, effectively covering the CID range 1-25,000,000, had been completed using a single conformer per compound, yielding 4,182,412,802 3-D neighbors. Table 1 shows the statistics of the three quadrupole components for those 17.1 million molecules. The mean and standard deviation for Qx, Qy, and Qz were 15.01 ± 8.07 Å⁵, 3.81 ± 1.80 Å⁵, and 1.52 ± 0.65 Å⁵, respectively. Figures 2 and 3 display the distributions of Qx, Qy, and Qz, after they were binned into units of 2.5 Å⁵, 0.5 Å⁵, and 0.1 Å⁵, respectively. All three components showed strongly skewed distributions; however, most of the molecules were populated near the mean and relatively few molecules had quadrupole components much larger than the mean values.
Table 1
Quadrupole statistics
Figure 2
Quadrupole distribution. The frequency of occurrence of the three quadrupole moment components for 17.1 million molecules from the PubChem Compound database, where (a) Qx, (b) Qy, and (c) Qz were binned into units of 2.5 Å⁵, 0.5 Å⁵, and ...
Figure 3
Quadrupole interdependence. The distribution of 17.1 million molecules from the PubChem Compound database as a function of (a) Qx and Qy, (b) Qx and Qz, and (c) Qy and Qz, respectively. Qx, Qy, and Qz were binned into units of 2.5 Å⁵, 0.5 Å ...
The molecular volume and quadrupole moments are correlated with each other according to the following equation:
(3) [equation image: 1758-2946-3-25-i3.gif]
where Rg is the radius of gyration and Vmp is the monopole volume, which corresponds to the monopole in the shape multipole expansion [13]. Equation 3 implies that the size of a molecule (represented by the molecular volume) is not completely independent of its quadrupole moment. Therefore, at the beginning of this study, the correlation between molecular volume and quadrupole moment was investigated. Note that, because the molecular volume is not a measurable quantity with a clear, universally agreed definition, there are many ways to estimate it [13-18]. Therefore, in addition to the monopole volume, the PubChem 3-D information includes two other volumes computed in different ways: the analytic volume and the self-overlap volume. Of the three, the analytic volume is considered to be the most consistent with other definitions of molecular volume, but it is also the slowest to compute. For this reason, evaluation of the ST score given in Equation 1 uses the self-overlap volume, which is considerably faster to evaluate than the analytic volume; however, it typically overestimates the molecular volume by about a factor of three relative to the analytic volume, as shown in Table 2. Each compound conformer record in PubChem provides all three volumes, which can be downloaded individually from the Compound Summary pages, as a list from the PubChem Download Facility (http://pubchem.ncbi.nlm.nih.gov/pc_fetch), or in bulk from the PubChem FTP site (ftp://ftp.ncbi.nlm.nih.gov/pubchem). To avoid confusion about the three different volumes used in the present paper, we denote the monopole volume, self-overlap volume, and analytic volume as Vmp, Vso, and Van, respectively, whereas the volume in a general sense is denoted as V (without any subscript).
Table 2
Volume statistics
Figure 4 displays the distribution of the three different volumes of the 17.1 million molecules from the PubChem Compound database. In general, Vso is the largest, and Van is the smallest. As shown in Figure 5, the quadrupole moment increases with molecular size, implying that the effect of quadrupole difference between two molecules upon their shape similarity may depend on their relative molecular sizes. Therefore, the quadrupole differences of 3-D "Similar Conformer" neighbors as a function of volume need to be considered.
Figure 4
Volume distribution. The frequency of occurrence of the three different volume types, analytic volume (Van, blue), monopole volume (Vmp, red), and self-overlap volume (Vso, green), for 17.1 million molecules from the PubChem Compound database, where all ...
Figure 5
Volume-quadrupole interdependence. The distribution of 17.1 million molecules from the PubChem Compound database as a function of the molecular volume type and quadrupole component. Van [in panel (a)-(c)], Vmp [in panel (d)-(f)], and Vso [in panel (g)-(i)] ...
2. Design of 3-D neighbor filters using quadrupole moment differences
As a general premise, if two molecules with the same volume also have identical values for the quadrupole components, they are likely to be shape similar to each other. In addition, as the quadrupole moment difference deviates from zero, the maximum shape similarity is expected to decrease (see Figure 1). When the quadrupole (and volume) difference exceeds some threshold, the shape dissimilarity is such that the conformer pair cannot possibly meet the criterion to be a PubChem 3-D neighbor (ST ≥ 0.8). Therefore, if these quadrupole difference thresholds are known for a given volume pair, one may be able to preclude conformer pairs that are not sufficiently shape similar, using only knowledge of the volume and quadrupole moments.
In the present study, the quadrupole moment differences of the 4.18 billion 3-D neighbors, identified from the 3-D neighboring of 17.1 million molecules, were analyzed to find the maximum possible quadrupole differences for two molecules to be neighbors (see also the "Materials and Methods" section). The volume and quadrupole moments of the two molecules in each neighbor pair were first converted into an integer value by using the following two equations:
$$V^{\mathrm{bin}} = \mathrm{int}\left(\frac{V}{\mathrm{BinSize}}\right) \qquad (4)$$
$$Q^{\mathrm{bin}} = \mathrm{int}\left(\frac{Q}{\mathrm{BinSize}}\right) \qquad (5)$$
where superscript "bin" is used to distinguish these integers from the original, non-binned values. The denominator BinSize was 5.0 Å³ for all three volumes, and 2.5 Å⁵, 0.5 Å⁵, and 0.1 Å⁵ for Qx, Qy, and Qz, respectively. After all 4.18 billion 3-D neighbors were binned according to their $V^{\mathrm{bin}}$ and $Q^{\mathrm{bin}}$ values, the 3-D neighbor distribution for a given ($V_1^{\mathrm{bin}}$, $V_2^{\mathrm{bin}}$) pair was analyzed as a function of $\Delta Q^{\mathrm{bin}}$.
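The binning step can be sketched in Python as follows. The bin sizes are those quoted in the text; the choice of rounding down (floor) when converting to an integer is an assumption, since the rounding rule is not spelled out here, and the function and constant names are hypothetical.

```python
import math

# Bin sizes quoted in the text.
VOLUME_BIN_SIZE = 5.0                                   # cubic angstroms
QUADRUPOLE_BIN_SIZES = {"x": 2.5, "y": 0.5, "z": 0.1}   # angstroms^5


def to_bin(value: float, bin_size: float) -> int:
    """Convert a descriptor value to an integer bin index (assumes rounding down)."""
    return int(math.floor(value / bin_size))


def bin_conformer(volume: float, qx: float, qy: float, qz: float) -> dict:
    """Return the binned volume and quadrupole components for one conformer."""
    return {
        "v": to_bin(volume, VOLUME_BIN_SIZE),
        "x": to_bin(qx, QUADRUPOLE_BIN_SIZES["x"]),
        "y": to_bin(qy, QUADRUPOLE_BIN_SIZES["y"]),
        "z": to_bin(qz, QUADRUPOLE_BIN_SIZES["z"]),
    }


# Illustrative input: a hypothetical volume with the mean quadrupole values from Table 1.
print(bin_conformer(502.3, 15.01, 3.81, 1.52))
```

Working with integer bins keeps the later lookup of allowed quadrupole differences a simple table access keyed by the two binned volumes.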
To illustrate the general premise above that quadrupole deviations from zero result in a reduction in shape similarity, Figure 6 shows the neighbor count as a function of $\Delta Q_x^{\mathrm{bin}}$ for ($V_{\mathrm{mp},1}^{\mathrm{bin}}$, $V_{\mathrm{mp},2}^{\mathrm{bin}}$) = (100, 110), (100, 120), and (100, 130). As anticipated, neighbor populations are largest when $\Delta Q_x^{\mathrm{bin}}$ is near the origin, and the count decreases rapidly (a nearly linear reduction on a log scale) as $\Delta Q_x^{\mathrm{bin}}$ deviates from zero. In addition, for a given ($V_{\mathrm{mp},1}^{\mathrm{bin}}$, $V_{\mathrm{mp},2}^{\mathrm{bin}}$) pair, neighbors were observed only for a certain range of $\Delta Q_x^{\mathrm{bin}}$, indicating that this range information can be used as a filter that pre-screens non-neighbor pairs. The asymmetric distribution of the 3-D neighbors in Figure 6 with respect to the ordinate axis ($\Delta Q_x^{\mathrm{bin}} = 0$) suggests that two different filters would need to be generated: one for positive and the other for negative values of $\Delta Q^{\mathrm{bin}}$.
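A minimal sketch of how such range information could be collected and used is given below. It assumes that each known neighbor pair is supplied as binned values with molecule 1 having the smaller volume bin, and that the quadrupole difference is taken as molecule 2 minus molecule 1; both the sign convention and all names are assumptions for illustration. The sketch records only the raw observed range, separately on the negative and positive side.

```python
from typing import Dict, Iterable, Tuple

VolumePair = Tuple[int, int]
DeltaRange = Tuple[int, int]  # (most negative, most positive) observed difference


def build_range_map(neighbors: Iterable[Tuple[int, int, int, int]]) -> Dict[VolumePair, DeltaRange]:
    """Record the observed range of binned quadrupole differences per volume-bin pair.

    Each item is (v1_bin, v2_bin, q1_bin, q2_bin) for one known 3-D neighbor pair.
    """
    ranges: Dict[VolumePair, DeltaRange] = {}
    for v1, v2, q1, q2 in neighbors:
        delta = q2 - q1  # assumed sign convention
        low, high = ranges.get((v1, v2), (delta, delta))
        ranges[(v1, v2)] = (min(low, delta), max(high, delta))
    return ranges


def may_be_neighbor(ranges: Dict[VolumePair, DeltaRange], v1: int, v2: int, q1: int, q2: int) -> bool:
    """Return False when the pair falls outside every range seen for known neighbors."""
    bounds = ranges.get((v1, v2))
    if bounds is None:
        return False  # no neighbor was ever observed for this volume-bin pair
    return bounds[0] <= (q2 - q1) <= bounds[1]


# Hypothetical binned neighbor data, for illustration only.
known = [(100, 110, 6, 8), (100, 110, 6, 3), (100, 120, 5, 9)]
range_map = build_range_map(known)
print(range_map[(100, 110)])                        # (-3, 2)
print(may_be_neighbor(range_map, 100, 110, 6, 12))  # False
```

Pairs rejected by `may_be_neighbor` would skip the superposition optimization entirely, which is where the savings come from.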
Figure 6
Quadrupole difference tolerance. The distributions of the 3-D neighbors as a function of the binned quadrupole differences, $\Delta Q_x^{\mathrm{bin}}$, of the two molecules in each neighbor for ($V_{\mathrm{mp},1}^{\mathrm{bin}}$, $V_{\mathrm{mp},2}^{\mathrm{bin}}$) = (100, 110), (100, 120), and (100, 130), respectively, illustrating how the frequency ...
Figures 7, 8 and 9 show the $\Delta Q^{\mathrm{bin}}$ threshold for each quadrupole component as a function of volume for the 4.18 billion 3-D neighbors. Note that, since PubChem regularly gets additional new unique content from its contributors, there is always a possibility that the 3-D neighboring of these new records may identify previously unseen cases of $\Delta Q^{\mathrm{bin}}$ threshold. If we use these $\Delta Q^{\mathrm{bin}}$ threshold maps [see panels (a) and (b) of Figures 7, 8 and 9] as a filter during neighboring, we would preclude those 3-D neighbors. Therefore, we modified the maps [see panels (c) and (d) of Figures 7, 8 and 9], as described in the "Materials and Methods" section, to extend $\Delta Q^{\mathrm{bin}}$ difference values or to add neighboring bins where no population is found, in an attempt to mitigate any such issues in the fringe regions of the maps.
Figure 7
Monopole volume $\Delta Q_x^{\mathrm{bin}}$ shape compatibility map and filter. The absolute value of the maximum possible value of $\Delta Q_x^{\mathrm{bin}}$ for two molecules to be 3-D neighbors of each other, as a function of binned monopole volumes, $V_{\mathrm{mp},1}^{\mathrm{bin}}$ and $V_{\mathrm{mp},2}^{\mathrm{bin}}$ of molecules 1 and 2, respectively, at ST ≥ ...
Figure 8
Monopole volume $\Delta Q_y^{\mathrm{bin}}$ shape compatibility map and filter. The absolute value of the maximum possible value of $\Delta Q_y^{\mathrm{bin}}$ for two molecules to be 3-D neighbors of each other, as a function of binned monopole volumes, $V_{\mathrm{mp},1}^{\mathrm{bin}}$ and $V_{\mathrm{mp},2}^{\mathrm{bin}}$ of molecules 1 and 2, respectively, at ST ≥ ...
Figure 9
Monopole volume $\Delta Q_z^{\mathrm{bin}}$ shape compatibility map and filter. The absolute value of the maximum possible value of $\Delta Q_z^{\mathrm{bin}}$ for two molecules to be 3-D neighbors of each other, as a function of binned monopole volumes, $V_{\mathrm{mp},1}^{\mathrm{bin}}$ and $V_{\mathrm{mp},2}^{\mathrm{bin}}$ of molecules 1 and 2, respectively, at ST ≥ ...
These modified $\Delta Q^{\mathrm{bin}}$ threshold maps are designated as quadrupole filters. For simplicity, we name these filters with a capital letter "F" followed by a subscript, which represents one of the quadrupole components, and a superscript, which represents the type of volume involved. For example, filter "$F_x^{\mathrm{an}}$" denotes the Qx filter generated with the analytic volume, Van.
Given that these quadrupole filters were built using an existing set of 3-D neighbor cases, one needs to validate the extent of their efficacy. To do so, a test set of 13.2 billion conformer pairs not considered as a part of the original 3-D neighboring training set was used (see the "Materials and Methods" section). After computing the ST scores for the 13.2 billion pairs, the fractions of 3-D neighbors and non-neighbors that would have been pre-screened if the quadrupole filters were applied are summarized in Table 3.
Table 3
Accuracy of filtering as a function of volume type and quadrupole component at ST ≥ 0.80 threshold
Of the three volume types utilized, the monopole-based quadrupole filters, $F^{\mathrm{mp}}$, are arguably the best. Filter $F_x^{\mathrm{mp}}$ removed 4.78 billion pairs (36.3%), while incurring a loss of only 30 out of 24 million "potential" neighbors. [Note that the definition of a PubChem 3-D neighbor involves feature similarity as well as shape similarity, while the quadrupole filters deal only with shape similarity. As such, each of the 30 pairs filtered out had an ST score sufficient to be a 3-D neighbor, making it a "potential" 3-D neighbor.] The false negative count of 30 removed by $F_x^{\mathrm{mp}}$ is negligible, but it does show that use of such a filter will preclude some potential 3-D neighbors, in this case at a rate of 1 in 800,000.
Filters $F_y^{\mathrm{mp}}$ and $F_z^{\mathrm{mp}}$ are not as efficient as $F_x^{\mathrm{mp}}$, but could still filter out 3.92 billion pairs (29.8%) and 3.59 billion pairs (27.3%), respectively, when considered individually. If the three $F^{\mathrm{mp}}$ filters are used in series (denoted as $F_{xyz}^{\mathrm{mp}}$, and applied one after the other), 5.33 billion pairs (40.4%) could be removed with a loss of only 32 potential neighbors. The $F^{\mathrm{so}}$ filters showed similar performance to $F^{\mathrm{mp}}$, but filtered out more potential neighbors (288 for $F_{xyz}^{\mathrm{so}}$ versus 32 for $F_{xyz}^{\mathrm{mp}}$) and removed slightly fewer non-neighbors (39.1% for $F_{xyz}^{\mathrm{so}}$ versus 40.4% for $F_{xyz}^{\mathrm{mp}}$). The $F^{\mathrm{an}}$ filters showed the least loss of potential neighbors (4 for $F_{xyz}^{\mathrm{an}}$ versus 32 for $F_{xyz}^{\mathrm{mp}}$), but also removed the fewest non-neighbors (29.0% for $F_{xyz}^{\mathrm{an}}$ versus 40.4% for $F_{xyz}^{\mathrm{mp}}$).
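Applying the three component filters one after the other can be sketched as a short cascade in which a pair is dropped at the first component that rejects it. The data layout (one range table per component, keyed by the two binned volumes) and all names below are hypothetical, and the numbers are invented purely to exercise the logic.

```python
from typing import Dict, List, Sequence, Tuple

RangeMap = Dict[Tuple[int, int], Tuple[int, int]]
# A pair holds the binned volumes under "v" and the binned quadrupoles under "x", "y", "z".
Pair = Dict[str, Tuple[int, int]]


def passes_component(range_map: RangeMap, pair: Pair, component: str) -> bool:
    """Check one quadrupole component against its allowed difference range."""
    bounds = range_map.get(pair["v"])
    if bounds is None:
        return False
    q1, q2 = pair[component]
    return bounds[0] <= (q2 - q1) <= bounds[1]


def apply_in_series(pairs: Sequence[Pair], maps: Dict[str, RangeMap],
                    order: Sequence[str] = ("x", "y", "z")):
    """Apply the component filters in order; count removals at the first failing stage."""
    removed = {component: 0 for component in order}
    survivors: List[Pair] = []
    for pair in pairs:
        for component in order:
            if not passes_component(maps[component], pair, component):
                removed[component] += 1
                break
        else:
            survivors.append(pair)  # passed every component filter
    return survivors, removed


# Hypothetical range tables and candidate pairs.
maps = {
    "x": {(100, 110): (-2, 3)},
    "y": {(100, 110): (-1, 1)},
    "z": {(100, 110): (-4, 4)},
}
pairs = [
    {"v": (100, 110), "x": (6, 7), "y": (7, 7), "z": (15, 16)},   # passes all three
    {"v": (100, 110), "x": (6, 12), "y": (7, 7), "z": (15, 16)},  # rejected by x
    {"v": (100, 110), "x": (6, 7), "y": (7, 10), "z": (15, 16)},  # rejected by y
]
survivors, removed = apply_in_series(pairs, maps)
print(len(survivors), removed)
```

Because a pair stops at the first rejecting stage, the per-stage counts depend on the order in which the components are applied, while the set of survivors does not.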
Effects of the ST threshold for PubChem 3-D neighboring upon the efficiency of the quadrupole filters were also investigated by generating a set of quadrupole filters, each using a different ST threshold, ranging from 0.80 to 0.99 with an increment of 0.01. As shown in Figure 10, the fraction of molecule pairs filtered increases almost linearly as a function of the ST threshold. For the entire ST range tested, the $F_{xyz}^{\mathrm{mp}}$ and $F_{xyz}^{\mathrm{so}}$ filters showed better efficiencies than the $F_{xyz}^{\mathrm{an}}$ filter.
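One way to picture the threshold scan is to build a separate range table for every ST threshold from pairs whose ST score is already known. The sketch below does this for thresholds 0.80 to 0.99 in steps of 0.01; the input layout, the function names, and the scored pairs are assumptions made for illustration and do not reflect how the filters were actually generated.

```python
from typing import Dict, Iterable, List, Tuple

RangeMap = Dict[Tuple[int, int], Tuple[int, int]]


def st_thresholds(start: float = 0.80, stop: float = 0.99, step: float = 0.01) -> List[float]:
    """Return the ascending list of ST thresholds, rounded to two decimals."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(count)]


def build_maps_by_threshold(scored_pairs: Iterable[Tuple[int, int, int, float]],
                            thresholds: List[float]) -> Dict[float, RangeMap]:
    """For each threshold, record the range of binned quadrupole differences
    observed among pairs with ST at or above that threshold.

    Each scored pair is (v1_bin, v2_bin, delta_q_bin, st).
    """
    maps: Dict[float, RangeMap] = {t: {} for t in thresholds}
    for v1, v2, delta, st in scored_pairs:
        for t in thresholds:  # thresholds are ascending
            if st < t:
                break
            low, high = maps[t].get((v1, v2), (delta, delta))
            maps[t][(v1, v2)] = (min(low, delta), max(high, delta))
    return maps


# Hypothetical scored pairs.
scored = [(100, 110, 0, 0.95), (100, 110, 3, 0.85), (100, 110, -2, 0.81)]
thresholds = st_thresholds()
maps_by_threshold = build_maps_by_threshold(scored, thresholds)
print(maps_by_threshold[0.8][(100, 110)], maps_by_threshold[0.9][(100, 110)])
```

In this toy data the allowed range narrows as the threshold rises, which is the qualitative reason a stricter ST threshold lets the filter reject a larger fraction of pairs.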
Figure 10
Shape compatibility filtering efficiency. Performance of the Fxyz quadrupole filter to filter conformer pairs at different ST threshold values.
3. Application of 3-D neighbor filters using quadrupole moment differences
Given that filtering conformer pairs using steric shape quadrupoles is effective with minimal loss of potential 3-D neighbors, a "real world" test was made with $F_{xyz}^{\mathrm{mp}}$ to see how use of these filters in the context of PubChem 3-D neighboring improves throughput. To achieve this, comparison was made to earlier benchmarks [7] whereby a set of known drugs and other molecules of keen biomedical interest are neighbored against the 3-D contents of PubChem. Table 4 and Table 5 summarize the results of these tests.
Table 4
Acceleration of PubChem 3-D neighboring using the $F_{xyz}^{\mathrm{mp}}$ quadrupole filter
Table 5
Efficiency of the $F_{xyz}^{\mathrm{mp}}$ quadrupole filter
Considering that PubChem 3-D neighboring is a precomputed similarity search, the neighboring throughput improvements using $F_{xyz}^{\mathrm{mp}}$ are substantial, with an average improvement of 31% across the range of conformer counts per compound. Perhaps surprisingly, the $F_{xyz}^{\mathrm{mp}}$ filtering removes only 7% of the conformer pairs, yet achieves a 31% neighboring throughput improvement. This emphasizes the dramatic cost/benefit difference between the computation necessary to achieve the 7% reduction versus what is expended in its absence.
It is important to note that $F_{xyz}^{\mathrm{mp}}$ is not the first filter applied in Table 5, meaning that there are three other filters utilized before $F_{xyz}^{\mathrm{mp}}$. The filters are ordered so as to maximize the cost/benefit of each filter. To examine what happens if $F_{xyz}^{\mathrm{mp}}$ is used as the first filter, neighboring was repeated for the case of one diverse conformer per compound. When used first, $F_{xyz}^{\mathrm{mp}}$ removes 48.3% of all conformer pairs (44.8%, 2.1%, and 1.4% of conformer pairs for $F_x^{\mathrm{mp}}$, $F_y^{\mathrm{mp}}$, and $F_z^{\mathrm{mp}}$, applied in that order, respectively) versus the 7.4% shown in Table 5. The CT Feature, CT Volume, and ST Volume filters, applied in that order, remove 27.9%, 0.1%, and 0.002% of conformer pairs, respectively, when $F_{xyz}^{\mathrm{mp}}$ is applied first.
Simple molecular shape descriptors, volume and steric quadrupole moments (embodying the length, width, and height of a shape), of 4.18 billion 3-D neighbor pairs resulting from PubChem 3-D neighboring of 17.1 million single conformer molecules were analyzed. The maximum quadrupole differences between neighbor conformers were determined. This examination demonstrated a distinct dependency of shape similarity upon quadrupole variation. With some slight modification of fringe regions, the results of this analysis were turned into a computationally inexpensive, yet highly effective, set of filters capable of removing 3-D conformer pairs that cannot meet a required shape similarity, using only knowledge of the volume and steric quadrupole moments of the conformer pair. When applied in the context of shape similarity searching, these filters can significantly improve throughput by avoiding expensive superposition optimization of conformer pairs that cannot possibly meet a pre-defined shape similarity search threshold.
The filters devised were tested using a dataset of 13.2 billion compound pairs. The quadrupole filters based on a monopole volume showed the best efficacy, while the filters using an analytic volume had the lowest efficacy. For all three volume types, the Qx filters eliminated a larger portion of the compound pairs than the Qy and Qz filters. When the three filters were applied in series, they could eliminate 30-40% of non-neighbor pairs while removing a negligible number of potential neighbors. For example, the Qxyz filter based on the monopole volumes ($F_{xyz}^{\mathrm{mp}}$) could eliminate 40.4% of the 13.2 billion compound pairs with a loss of 32 potential neighbors out of 24 million at a shape Tanimoto (ST) threshold of 0.80. It was also demonstrated that this filtering efficiency improves linearly as a function of shape similarity threshold, approaching 100% efficiency at an ST threshold of 0.99. Further testing of the $F_{xyz}^{\mathrm{mp}}$ filters in the context of PubChem 3-D neighboring processing resulted in conformer pair throughput improvements of 31% on average.
In summary, the quadrupole filters developed in this study can speed up PubChem 3-D neighbor processing with a negligible loss of 3-D neighbors. However, their applicability is not limited to PubChem 3-D neighboring. The results of the present study also suggest that shape multipole moments can be applied generally to enhance the speed of 3-D similarity search methods by rapidly precluding dissimilar molecules that cannot be a result. This approach may be able to significantly speed up 3-D similarity search, especially if the 3-D shape superposition optimization is a bottleneck of the similarity search.
Materials and Methods
1. Datasets
At the time of project initiation, PubChem 3-D neighboring of 17,143,181 unique molecules (ranging from CID 1 to CID 25,000,000) had been completed using a single conformer per compound, yielding 4,182,412,802 3-D neighbors. Using the Shape Toolkit from OpenEye Scientific Software [19], the analytic volume (Van), monopole volume (Vmp), self-overlap volume (Vso), and steric shape quadrupole moments (Qx, Qy, and Qz) were computed for the theoretical conformer of all 17.1 million molecules. See Figures 2 and 4 for the distributions of the computed values.
2. Filter generation
The quadrupole filters developed for pre-screening conformer-pairs based on quadrupole differences as a function of shape similarity ST threshold were generated using the following steps:
1) The 4.18 billion 3-D neighbor pairs and their associated data were obtained from PubChem.
2) The volumes (Vmp, Vso, and Van) and quadrupole components (Qx, Qy, and Qz) of the compound pair for each 3-D neighbor were converted into integers using Equations 4 and 5 to yield the binned Vmp, binned Vso, binned Van, binned Qx, binned Qy, and binned Qz, respectively. The denominator BinSize was 5.0 Å³ for all three volume types and 2.5 Å⁵, 0.5 Å⁵, and 0.1 Å⁵ for Qx, Qy, and Qz, respectively.
3) For each of the three binned volume types, the following was performed using the 3-D neighbor pairs (in this case using the binned Vmp as an example):
a) Of the two conformers in a 3-D neighbor, the one with the smaller binned Vmp value was designated as molecule 1 and the other as molecule 2. When the binned Vmp value was the same for both, the one with the smaller binned Qx value was designated as molecule 1. If both the binned Vmp and binned Qx values were the same for both, the one with the smaller binned Qy was designated as molecule 1. If the binned Qy was also the same for both molecules, the one with the smaller binned Qz was designated as molecule 1. If all four descriptors were the same for both molecules, the one that appeared first in the pair was designated as molecule 1.
b) For each of the three binned quadrupole components, and using the binned Qx as an example:
i) 3-D neighbors were binned according to three indices: the binned Vmp of molecule 1, the binned Vmp of molecule 2, and the binned ΔQx, where molecules 1 and 2 are those determined in step 3a and ΔQx is the Qx difference between the two molecules.
ii) The neighbor count for all (binned Vmp of molecule 1, binned Vmp of molecule 2, binned ΔQx) bins was analyzed to find the maximum possible absolute value of the binned ΔQx for a given pair of binned volumes. This results in the binned ΔQx difference maps as a function of binned volume pairs [see panels (a) and (b) in Figures 7, 8 and 9].
iii) The binned ΔQx difference maps were modified, as described in the next section, to generate a final Qx filter based on monopole volumes [see panels (c) and (d) in Figures 7, 8 and 9].
4) To obtain filters effective at an ST threshold other than ≥ 0.80, first restrict the original 4.18 billion 3-D neighbor pairs to those at or above the desired ST threshold and repeat step 3.
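Steps 2 and 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the dictionary layout of a molecule, and the use of truncation in `to_bin` are assumptions made for illustration (the exact conversion is defined by Equations 4 and 5), as is the choice to bin the raw quadrupole difference rather than subtract already-binned values.

```python
from collections import defaultdict

# Bin sizes stated in step 2 (volumes in cubic angstroms, quadrupoles in A^5).
VOLUME_BIN_SIZE = 5.0
Q_BIN_SIZES = {"Qx": 2.5, "Qy": 0.5, "Qz": 0.1}


def to_bin(value, bin_size):
    """Convert a descriptor to an integer bin index.

    Assumption: truncation of value / bin_size; the rounding rule actually
    used is the one given by Equations 4 and 5.
    """
    return int(value / bin_size)


def binned(mol):
    """Return the (volume, Qx, Qy, Qz) bin indices of a descriptor dict."""
    return (
        to_bin(mol["V"], VOLUME_BIN_SIZE),
        to_bin(mol["Qx"], Q_BIN_SIZES["Qx"]),
        to_bin(mol["Qy"], Q_BIN_SIZES["Qy"]),
        to_bin(mol["Qz"], Q_BIN_SIZES["Qz"]),
    )


def order_pair(mol_a, mol_b):
    """Step 3a: order a pair by binned volume, then binned Qx, Qy, Qz.

    Tuple comparison applies the tie-breakers in that order. When all four
    bins are equal, the input order is kept.
    """
    if binned(mol_b) < binned(mol_a):
        return mol_b, mol_a
    return mol_a, mol_b


def max_difference_map(neighbor_pairs, component="Qx"):
    """Steps 3b(i)-(ii): largest binned |delta Q| per (V1 bin, V2 bin)."""
    bin_size = Q_BIN_SIZES[component]
    result = defaultdict(int)
    for mol_a, mol_b in neighbor_pairs:
        mol_1, mol_2 = order_pair(mol_a, mol_b)
        key = (binned(mol_1)[0], binned(mol_2)[0])
        delta_bin = to_bin(abs(mol_2[component] - mol_1[component]), bin_size)
        result[key] = max(result[key], delta_bin)
    return dict(result)
```

Calling `max_difference_map` once per quadrupole component and per volume type would give the nine raw difference maps described above; for step 4, the same call would be made on the subset of neighbor pairs at or above the desired ST threshold.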
3. Modification of filters
Figure 11 shows a schematic diagram describing how an original difference map is modified at a given ΔQbin value. In an original map [panel (a) of Figure 11], the (binned volume of molecule 1, binned volume of molecule 2) bins that are populated are indicated in red. Note that not all bins are populated between the minimum and maximum values of the binned volume of molecule 2 for a given binned volume of molecule 1 in the fringe area. It is likely that these bins could be occupied by 3-D neighbors in the future and simply lack an example at this time. Therefore, these bins are included in the neighbor regions [as shown in panel (b) of Figure 11] at the given ΔQbin value. Similarly, any empty bins within the range of the binned volume of molecule 1 at a given binned volume of molecule 2 are also set in the neighbor regions [panel (c) of Figure 11] for the given ΔQbin value.
Figure 11
Transformation of shape compatibility map into a filter. Schematic diagram describing modification of an original difference map at a given ΔQbin value: (a) in an original map, neighbor regions are indicated in red, (b) all empty bins between ...
This procedure is performed for all unique ΔQbin values, starting with the maximum. When a lower ΔQbin value is processed, the bins assigned to the neighbor region at greater ΔQbin values are also treated as part of the neighbor region at the ΔQbin value being considered. A pseudo-code implementation of this procedure is shown in Figure 12. All quadrupole filters resulting from this modification are available in Additional file 1.
Figure 12
Pseudo code to transform shape compatibility map into a filter.
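The pseudo-code of Figure 12 is not reproduced in this text. As a rough illustration only, the following Python sketch follows the verbal description of the modification and is not a transcription of that figure. It assumes that a raw difference map is a dictionary from (V1 bin, V2 bin) to the largest observed ΔQbin, that the second fill pass operates on the result of the first, and that bins filled at a greater ΔQbin value remain part of the neighbor region at lower values. The function names are hypothetical.

```python
def fill_gaps(bins):
    """Fill empty bins inside the populated range along each axis in turn.

    First pass: for each V1 bin, fill between the smallest and largest
    populated V2 bin. Second pass: the same along V1 for each V2 bin.
    """
    filled = set(bins)
    for axis in (0, 1):
        spans = {}
        for b in filled:
            fixed, moving = b[axis], b[1 - axis]
            lo, hi = spans.get(fixed, (moving, moving))
            spans[fixed] = (min(lo, moving), max(hi, moving))
        for fixed, (lo, hi) in spans.items():
            for moving in range(lo, hi + 1):
                filled.add((fixed, moving) if axis == 0 else (moving, fixed))
    return filled


def difference_map_to_filter(raw_map):
    """Turn a raw max-difference map into a filter with gaps filled.

    Levels are visited from the largest delta-Q bin value downwards, so each
    bin keeps the greatest level at which it belongs to the neighbor region.
    """
    result = {}
    region = set()
    for level in sorted(set(raw_map.values()), reverse=True):
        region |= {b for b, dq in raw_map.items() if dq >= level}
        region = fill_gaps(region)
        for b in region:
            result.setdefault(b, level)
    return result
```

In this sketch, a bin that was empty in the raw map receives the largest ΔQbin value at which it is enclosed by populated bins, so filling widens the filter and can only reduce the chance of discarding a true neighbor.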
4. Efficiency test of filters
To test the efficiency of the quadrupole filters devised, two sets of molecules were chosen. One set contains molecules in the PubChem CID range of 1 to 25,000,000, and the other contains those in the CID range of 25,000,001 to 25,001,000. Because a theoretical conformer was not generated for all CIDs or because compound records were not "live", the two datasets had 17,488,897 and 753 molecules, respectively. All-by-all comparison between the two sets gives 13,169,139,441 CID pairs. Using the first diverse conformer for each compound, the ST values for these 13.2 billion pairs were computed using ROCS [20] from OpenEye Scientific Software, Inc., consuming ~419 CPU days in total, and stored. These ST scores were used to estimate how many CID pairs would be filtered out when applying the quadrupole filters as a function of volume type and as a function of ST threshold, for example, as demonstrated in Table 3 and Figure 10.
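The bookkeeping behind such an efficiency estimate can be sketched as follows. This is an illustrative outline rather than the code used in the study. The tuple layout of `scored_pairs`, the function names, and the rule that a volume-bin pair absent from the filter rejects the pair are all assumptions.

```python
def passes_filter(filter_map, v1_bin, v2_bin, dq_bin):
    """Return True if the pair may still be a 3-D neighbor.

    Assumption: a (v1_bin, v2_bin) key missing from the filter means that
    no neighbor is expected for that volume combination.
    """
    allowed = filter_map.get((v1_bin, v2_bin))
    return allowed is not None and abs(dq_bin) <= allowed


def filter_efficiency(filter_map, scored_pairs, st_threshold=0.80):
    """Return (fraction of pairs removed, number of true neighbors lost).

    scored_pairs yields (v1_bin, v2_bin, dq_bin, st) tuples, where st is
    the precomputed shape Tanimoto of the pair.
    """
    total = removed = lost = 0
    for v1_bin, v2_bin, dq_bin, st in scored_pairs:
        total += 1
        if not passes_filter(filter_map, v1_bin, v2_bin, dq_bin):
            removed += 1
            if st >= st_threshold:
                lost += 1  # a real neighbor was filtered out
    fraction = removed / total if total else 0.0
    return fraction, lost
```

Because the ST values are stored once, the same scored pairs can be re-evaluated against each filter and each ST threshold without repeating the shape overlap optimization.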
5. Effect of Quadrupole filters on PubChem3D Neighboring
One aspect of this effort is to examine the change in real-world efficiency of PubChem3D neighboring processing when using quadrupole filters while computing the 3-D "Similar Conformers" relationship. To achieve this, the set of 4,218 biologically relevant chemical structures with known pharmacological actions from our earlier efforts [7] was used. These small molecules with known biological action (Query set) were neighbored against 26,157,365 compound records (Search set), representing the entire "live" PubChem3D contents as of Oct. 2010, using up to 1, 3, 5, 7, and 10 diverse conformers per compound for both compound sets. Timing and efficiency differences with our earlier work are given in Tables 4 and 5.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
SK analyzed the quadrupole differences of the 3-D neighbors, generated the quadrupole filters, and wrote the first draft. EEB supervised the project and revised the manuscript. SHB reviewed the final manuscript. All authors read and approved the final manuscript.
Supplementary Material
Additional file 1
Quadrupole filters. A zip archive of text files containing information on the maximum quadrupole differences as a function of molecular volumes.
Acknowledgements
We are grateful to the NCBI Systems staff, especially Ron Patterson, Charlie Cook, and Don Preuss, whose efforts helped make the PubChem3D project possible. This research was supported by the Intramural Research Program of the National Library of Medicine, National Institutes of Health, U.S. Department of Health and Human Services.
References
1. Bolton EE, Wang Y, Thiessen PA, Bryant SH. In: Annual Reports in Computational Chemistry. Wheeler RA, Spellmeyer DC, editor. Vol. 4. Elsevier; 2008. PubChem: integrated platform of small molecules and biological activities; pp. 217–241.
2. Wang YL, Xiao JW, Suzek TO, Zhang J, Wang JY, Bryant SH. PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res. 2009;37:W623–W633. doi: 10.1093/nar/gkp456.
3. Wang YL, Bolton E, Dracheva S, Karapetyan K, Shoemaker BA, Suzek TO, Wang JY, Xiao JW, Zhang J, Bryant SH. An overview of the PubChem BioAssay resource. Nucleic Acids Res. 2010;38:D255–D266. doi: 10.1093/nar/gkp965.
4. Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Federhen S. et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2010;38:D5–D16. doi: 10.1093/nar/gkp967.
5. Bolton EE, Kim S, Bryant SH. PubChem3D: conformer generation. J Cheminformatics. 2011;3:4. doi: 10.1186/1758-2946-3-4.
6. Bolton EE, Kim S, Bryant SH. PubChem3D: diversity of shape. J Cheminformatics. 2011;3:9. doi: 10.1186/1758-2946-3-9.
7. Bolton EE, Kim S, Bryant SH. PubChem3D: similar conformers. J Cheminformatics. 2011;3:13. doi: 10.1186/1758-2946-3-13.
8. PubChem substructure fingerprint description. ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.pdf
9. Grant JA, Gallardo MA, Pickup BT. A fast method of molecular shape comparison: a simple application of a Gaussian description of molecular shape. J Comput Chem. 1996;17:1653–1666. doi: 10.1002/(SICI)1096-987X(19961115)17:14<1653::AID-JCC7>3.0.CO;2-K.
10. Grant JA, Pickup BT. A Gaussian description of molecular shape. J Phys Chem. 1995;99:3503–3510. doi: 10.1021/j100011a016.
11. Grant JA, Pickup BT. In: Computer Simulation of Biomolecular Systems. van Gunsteren WF, Weiner PK, Wilkinson AJ, editor. Dordrecht: Kluwer Academic Publishers; 1997. Gaussian shape methods; pp. 150–176.
12. Haigh JA, Pickup BT, Grant JA, Nicholls A. Small molecule shape-fingerprints. J Chem Inf Model. 2005;45:673–684. doi: 10.1021/ci049651v.
13. Mansfield ML, Covell DG, Jernigan RL. A new class of molecular shape descriptors. 1. Theory and properties. J Chem Inf Comput Sci. 2002;42:259–273.
14. Gavezzotti A. The calculation of molecular volumes and the use of volume analysis in the investigation of structured media and of solid-state organic-reactivity. J Am Chem Soc. 1983;105:5220–5225. doi: 10.1021/ja00354a007.
15. La-Scalea MA, Menezes CMS, Ferreira EI. Molecular volume calculation using AM1 semi-empirical method toward diffusion coefficients and electrophoretic mobility estimates in aqueous solution. J Mol Struct Theochem. 2005;730:111–120. doi: 10.1016/j.theochem.2005.05.030.
16. Edward JT. Molecular volumes and Stokes-Einstein equation. J Chem Educ. 1970;47:261–270. doi: 10.1021/ed047p261.
17. Lepori L, Gianni P. Partial molar volumes of ionic and nonionic organic solutes in water: a simple additivity scheme based on the intrinsic volume approach. J Solut Chem. 2000;29:405–447. doi: 10.1023/A:1005150616038.
18. Spillane WJ, Birch GG, Drew MGB, Bartolo I. Correlation of computed van der Waals and molecular volumes with apparent molar volumes (AMV) for amino-acid, carbohydrate and sulfamate tastant molecules. Relationship between Corey-Pauling-Koltun volumes (Vcpk) and computed volumes. J Chem Soc-Perkin Trans 2. 1992. pp. 497–503.
19. ShapeTK-C++. Version 1.8.0. OpenEye Scientific Software, Inc.: Santa Fe, NM; 2010.
20. ROCS - Rapid Overlay of Chemical Structures. Version 2.2. OpenEye Scientific Software, Inc.: Santa Fe, NM; 2006.
Articles from Journal of Cheminformatics are provided here courtesy of
BioMed Central