1. Distribution of shape descriptor components and their volume dependency
The molecular shape quadrupoles in the principal-axes frame [9] are given as the following:
where Qx, Qy, and Qz are the x, y, and z components of the quadrupole moment, respectively. These components are conceptually equivalent to the length, width, and height of a molecule, respectively; by convention, the largest quadrupole component is defined as Qx and the smallest as Qz. An assumption underlying this study is that, beyond some point, the shape quadrupole difference between two conformers is too large for them to meet the ST ≥ 0.8 threshold required by PubChem 3-D neighboring, as illustrated in Figure 1. This relationship, if it exists, would allow such conformer pairs to be filtered out, avoiding the time-consuming shape superposition optimization step for those pairs and enhancing the throughput of PubChem 3-D neighboring. To determine whether such a relationship can be found, the shape quadrupole differences for known 3-D "Similar Conformers" neighbors were analyzed.
Figure 1 Small changes in dimensions can result in large changes in overlap. Using a 2-D rectangle shape with constant area (0.4 in²), one can see that small changes in shape dimensions (length and width) can result in large changes in shape overlap (ST).
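The effect described in the Figure 1 caption can be reproduced with a minimal sketch. It assumes two axis-aligned rectangles that share a center, and it takes ST to be the usual shape Tanimoto, overlap / (area_A + area_B - overlap). The function name and the example dimensions are hypothetical and are not taken from the figure.

```python
def rect_shape_tanimoto(len_a, wid_a, len_b, wid_b):
    """Shape Tanimoto of two concentric, axis-aligned rectangles (assumed model)."""
    area_a = len_a * wid_a
    area_b = len_b * wid_b
    # Concentric, axis-aligned rectangles overlap in a rectangle whose sides
    # are the smaller of the two lengths and the smaller of the two widths.
    overlap = min(len_a, len_b) * min(wid_a, wid_b)
    return overlap / (area_a + area_b - overlap)


AREA = 0.4              # constant area, as in the Figure 1 caption
REFERENCE_LENGTH = 1.0  # hypothetical reference rectangle: 1.0 x 0.4

for length in (1.0, 0.9, 0.8, 0.7):
    st = rect_shape_tanimoto(REFERENCE_LENGTH, AREA / REFERENCE_LENGTH,
                             length, AREA / length)
    print(f"length={length:.1f} width={AREA / length:.3f} ST={st:.3f}")
```

Under these assumptions, shortening the rectangle from 1.0 to 0.8 at fixed area already brings ST down to about 0.67, well below the 0.8 neighboring threshold.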
At the time the quadrupole filter project was initiated (October 2008), 3-D neighboring of 17,143,181 unique molecules, effectively covering the CID range 1-25,000,000, had been completed using a single conformer per compound, yielding 4,182,412,802 3-D neighbors. Table 1 shows the statistics of the three quadrupole components for those 17.1 million molecules. The mean and standard deviation for Qx, Qy, and Qz were 15.01 ± 8.07 Å⁵, 3.81 ± 1.80 Å⁵, and 1.52 ± 0.65 Å⁵, respectively. Figures 2 and 3 display the distributions of Qx, Qy, and Qz after they were binned into units of 2.5 Å⁵, 0.5 Å⁵, and 0.1 Å⁵, respectively. All three components showed strongly skewed distributions; most of the molecules were concentrated near the mean, while relatively few molecules had quadrupole components much larger than the mean values.
Figure 2 Quadrupole distribution. The frequency of occurrence of the three quadrupole moment components for 17.1 million molecules from the PubChem Compound database, where (a) Qx, (b) Qy, and (c) Qz were binned into units of 2.5 Å⁵, 0.5 Å⁵, and 0.1 Å⁵, respectively.
Figure 3 Quadrupole interdependence. The distribution of 17.1 million molecules from the PubChem Compound database as a function of (a) Qx and Qy, (b) Qx and Qz, and (c) Qy and Qz, respectively. Qx, Qy, and Qz were binned into units of 2.5 Å⁵, 0.5 Å⁵, and 0.1 Å⁵, respectively.
The molecular volume and quadrupole moments are correlated with each other through Equation 3, which involves the radius of gyration and the monopole volume, Vmp, the latter corresponding to the monopole in the shape multipole expansion [13]. Equation 3 implies that the size of a molecule (represented by the molecular volume) is not completely independent of its quadrupole moment. Therefore, at the beginning of this study, the correlation between molecular volume and quadrupole moment was investigated. Note that, because the molecular volume is not a measurable quantity with a clear, universally accepted definition, there are many ways to estimate it [13]. Therefore, in addition to the monopole volume, the PubChem 3-D information includes two other volumes computed in different ways: the analytic volume and the self-overlap volume. Of the three, the analytic volume is considered the most consistent with other definitions of molecular volume, but it is also the slowest to compute. For this reason, evaluation of the ST score given in Equation 1 uses the self-overlap volume, which is considerably faster to evaluate than the analytic volume; however, it typically overestimates the molecular volume by a factor of three relative to the analytic volume, as shown in Table 2. Each compound conformer record in PubChem provides all three volumes, and they can be downloaded individually from the Compound Summary pages, as a list from the PubChem Download Facility (http://pubchem.ncbi.nlm.nih.gov/pc_fetch), or in bulk from the PubChem FTP site (ftp://ftp.ncbi.nlm.nih.gov/pubchem). To avoid confusion among the three volumes used in the present paper, we denote the monopole volume, self-overlap volume, and analytic volume as Vmp, Vso, and Van, respectively, whereas the volume in a general sense is denoted as V (without any subscript).
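For orientation, a relation of the kind Equation 3 describes can be sketched as follows. This is not recovered from the source: it assumes that each quadrupole component is the second moment of a uniform shape density along a principal axis, with the origin at the shape centroid, and the symbol R_g for the radius of gyration is introduced here for illustration. The form is at least consistent with the units quoted in this section (quadrupoles in Å⁵, volumes in Å³, hence a squared length in Å²).

```latex
% Assumed definitions (hypothetical, principal-axes frame, centroid at origin):
Q_x = \int x^2 \, dV, \qquad
Q_y = \int y^2 \, dV, \qquad
Q_z = \int z^2 \, dV

% Radius of gyration under the same assumptions, with V_{mp} the monopole volume:
R_g^2 = \frac{1}{V_{mp}} \int \left( x^2 + y^2 + z^2 \right) dV
      = \frac{Q_x + Q_y + Q_z}{V_{mp}}
```

Read this way, a larger volume at a fixed radius of gyration requires a larger quadrupole sum, which is the sense in which molecular size and quadrupole moment are not independent.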
Figure 4 displays the distribution of the three different volumes for the 17.1 million molecules from the PubChem Compound database. In general, Vso is the largest and Van is the smallest. As shown in Figure 5, the quadrupole moment increases with molecular size, implying that the effect of the quadrupole difference between two molecules upon their shape similarity may depend on their relative molecular sizes. Therefore, the quadrupole differences of 3-D "Similar Conformers" neighbors need to be considered as a function of volume.
Figure 4 Volume distribution. The frequency of occurrence of the three different volume types, analytic volume (Van, blue), monopole volume (Vmp, red), and self-overlap volume (Vso, green), for 17.1 million molecules from the PubChem Compound database.
Figure 5 Volume-quadrupole interdependence. The distribution of 17.1 million molecules from the PubChem Compound database as a function of the molecular volume type and quadrupole component: Van in panels (a)-(c), Vmp in panels (d)-(f), and Vso in panels (g)-(i).
2. Design of 3-D neighbor filters using quadrupole moment differences
As a general premise, if two molecules with the same volume also have identical values for the quadrupole components, they are likely to be similar in shape. In addition, as the quadrupole moment difference deviates from zero, the maximum shape similarity is expected to decrease (see Figure 1). When the quadrupole (and volume) difference becomes greater than some threshold, the shapes are so dissimilar that the conformer pair cannot possibly meet the criterion to be a PubChem 3-D neighbor (ST ≥ 0.8). Therefore, if these quadrupole difference thresholds are known for a given volume pair, one may be able to preclude conformer pairs that are not sufficiently similar in shape, using only knowledge of the volume and quadrupole moments.
In the present study, the quadrupole moment differences of the 4.18 billion 3-D neighbors, identified from the 3-D neighboring of 17.1 million molecules, were analyzed to find the maximum possible quadrupole differences for two molecules to be neighbors (see also the "Materials and Methods" section). The volume and quadrupole moments of the two molecules in each neighbor pair were first converted into integer values by using the following two equations:
where the superscript "bin" is used to distinguish these integers from the original, non-binned values. The denominator, Binsize, was 5.0 Å³ for all three volumes, and 2.5 Å⁵, 0.5 Å⁵, and 0.1 Å⁵ for Qx, Qy, and Qz, respectively. After all 4.18 billion 3-D neighbors were binned according to their Vbin values, the 3-D neighbor distribution for a given pair of binned volumes was analyzed as a function of ΔQbin.
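The binning step can be illustrated with a short sketch. The two binning equations are not available here, so the sketch assumes that each value is divided by its Binsize and rounded down to an integer, and that ΔQbin is the binned value of the first molecule minus that of the second. The function names and the rounding rule are assumptions; only the bin sizes come from the text.

```python
import math

# Bin sizes quoted in the text: 5.0 Å^3 for all volumes,
# and 2.5, 0.5, 0.1 Å^5 for Qx, Qy, Qz.
VOLUME_BINSIZE = 5.0
QUAD_BINSIZE = {"x": 2.5, "y": 0.5, "z": 0.1}


def to_bin(value, binsize):
    """Convert a raw value to an integer bin index (assumed rule: round down)."""
    return math.floor(value / binsize)


def delta_q_bin(q_a, q_b, component):
    """Binned quadrupole difference of molecules A and B (assumed: A minus B)."""
    size = QUAD_BINSIZE[component]
    return to_bin(q_a, size) - to_bin(q_b, size)


v_bin = to_bin(502.3, VOLUME_BINSIZE)   # hypothetical volume of 502.3 Å^3
dq_x = delta_q_bin(15.01, 8.2, "x")     # hypothetical Qx values in Å^5
print(v_bin, dq_x)
```

With a 5.0 Å³ bin size, a volume bin index of 100 under this rule covers volumes from 500 Å³ up to, but not including, 505 Å³.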
To illustrate the general premise above, that quadrupole differences deviating from zero result in a reduction in shape similarity, Figure 6 shows the neighbor count as a function of ΔQbin for binned volume pairs of (100, 110), (100, 120), and (100, 130). As anticipated, the neighbor population is at its maximum when ΔQbin is near the origin, and the count decreases rapidly (a nearly linear reduction on a log scale) as ΔQbin deviates from zero. In addition, for a given pair of binned volumes, neighbors were observed only within a certain range of ΔQbin, indicating that this range information can be used as a filter that pre-screens non-neighbor pairs. The asymmetric distribution of the 3-D neighbors in Figure 6 with respect to the ordinate axis (ΔQbin = 0) suggests that two different filters would need to be generated: one for positive ΔQbin and the other for negative ΔQbin.
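One way to turn the observed range into a pre-screen is sketched below: for each pair of binned volumes, record the smallest and largest ΔQbin seen among known neighbors, then skip any candidate pair whose ΔQbin falls outside that range. Keeping separate lower and upper bounds accommodates the asymmetry about zero. The data layout, function names, and toy observations are hypothetical.

```python
def build_range_map(neighbor_records):
    """Map (vbin_a, vbin_b) -> (min dQbin, max dQbin) over known neighbor pairs."""
    ranges = {}
    for vbin_a, vbin_b, dq_bin in neighbor_records:
        key = (vbin_a, vbin_b)
        lo, hi = ranges.get(key, (dq_bin, dq_bin))
        ranges[key] = (min(lo, dq_bin), max(hi, dq_bin))
    return ranges


def may_be_neighbor(ranges, vbin_a, vbin_b, dq_bin):
    """Return False when the pair can be skipped before shape superposition."""
    bounds = ranges.get((vbin_a, vbin_b))
    if bounds is None:
        # No neighbor was observed for this volume-bin pair in the training data.
        return False
    lo, hi = bounds
    return lo <= dq_bin <= hi


# Toy observations: (vbin_a, vbin_b, dQbin) for pairs already known to be neighbors.
observed = [(100, 110, -2), (100, 110, 0), (100, 110, 5), (100, 120, 1)]
range_map = build_range_map(observed)
print(range_map)
```

A map built this way is only as complete as the neighbor set it was derived from, so an unmodified map rejects any pair that lies just outside what has been seen so far.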
Figure 6 Quadrupole difference tolerance. The distributions of the 3-D neighbors as a function of the binned quadrupole differences, ΔQbin, of the two molecules in each neighbor pair for binned volume pairs of (100, 110), (100, 120), and (100, 130), respectively.
Figures 7, 8, and 9 show the ΔQbin thresholds for each quadrupole component as a function of volume for the 4.18 billion 3-D neighbors. Note that, because PubChem regularly receives new unique content from its contributors, there is always a possibility that the 3-D neighboring of these new records will identify neighbors whose ΔQbin lies beyond the previously observed thresholds. If the ΔQbin threshold maps [see panels (a) and (b) of Figures 7, 8, and 9] were used as a filter during neighboring, those 3-D neighbors would be precluded. Therefore, we modified the maps [see panels (c) and (d) of Figures 7, 8, and 9], as described in the "Materials and Methods" section, by extending the ΔQbin threshold values or by adding neighboring bins where no population was found, in an attempt to mitigate such issues in the fringe regions of the maps.
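The actual map modification is described in the "Materials and Methods" section and is not reproduced here. As a rough illustration only, the sketch below widens every observed ΔQbin range by a fixed margin and copies the widened range into adjacent volume-bin pairs that have no entry. The `margin` parameter, the one-bin neighborhood, and the function name are assumptions made for this example.

```python
def pad_range_map(ranges, margin=1):
    """Return a more permissive copy of a (vbin_a, vbin_b) -> (lo, hi) map."""
    # Step 1: widen every observed range by `margin` bins on both sides.
    padded = {key: (lo - margin, hi + margin) for key, (lo, hi) in ranges.items()}
    # Step 2: copy each widened range into adjacent volume-bin pairs that
    # have no entry yet; existing entries are never overwritten.
    for (va, vb), bounds in list(padded.items()):
        for da in (-1, 0, 1):
            for db in (-1, 0, 1):
                padded.setdefault((va + da, vb + db), bounds)
    return padded


observed_ranges = {(100, 110): (-2, 5)}  # hypothetical observed thresholds
padded = pad_range_map(observed_ranges, margin=1)
print(len(padded), padded[(100, 110)])
```

Padding trades a little filtering power for a lower chance of rejecting a true neighbor that falls just outside the training data.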
These modified ΔQbin threshold maps are designated as quadrupole filters. For simplicity, we name these filters with a capital letter "F" followed by a subscript, which represents one of the quadrupole components, and a superscript, which represents the type of volume involved. For example, filter "F_x^an" denotes the Qx filter generated with the analytic volume, Van.
Given that these quadrupole filters were built using an existing set of 3-D neighbor cases, the extent of their efficacy needs to be validated. To do so, a test set of 13.2 billion molecule conformer pairs, not considered as a part of the original 3-D neighboring training set, is utilized (see the "Materials and Methods" section). After the ST scores were computed for the 13.2 billion pairs, the fractions of 3-D neighbors and non-neighbors that would have been pre-screened had the quadrupole filters been applied were determined; they are summarized in Table 3.
Table 3 Accuracy of filtering as a function of volume type and quadrupole component at the ST ≥ 0.80 threshold
Of the three volume types utilized, the monopole-based quadrupole filters, F^mp, are arguably the best. Filter F_x^mp removed 4.78 billion pairs (36.3%), while incurring a loss of only 30 out of 24 million "potential" neighbors. [Note that the definition of a PubChem 3-D neighbor involves feature similarity as well as shape similarity, while the quadrupole filters deal only with shape similarity. As such, each of the 30 pairs filtered out had an ST score sufficient to be a 3-D neighbor, making it a "potential" 3-D neighbor.] The false negative count of 30 for F_x^mp is negligible, but it does show that use of such a filter will preclude some potential 3-D neighbors, in this case at a rate of 1 in 800,000. F_y^mp and F_z^mp are not as efficient as F_x^mp, but could still filter out 3.92 billion pairs (29.8%) and 3.59 billion pairs (27.3%), respectively, when considered individually. If the three F^mp filters are used in a series (denoted as F_xyz^mp, and applied one after the other), 5.33 billion pairs (40.4%) could be removed with a loss of only 32 potential neighbors. Filter F^so showed similar performance to F^mp, but it filtered out more potential neighbors (288 for F_xyz^so versus 32 for F_xyz^mp) and removed slightly fewer non-neighbors (39.1% for F_xyz^so versus 40.4% for F_xyz^mp). The F^an filters showed the least loss of potential neighbors (4 for F_xyz^an versus 32 for F_xyz^mp), but also removed the fewest non-neighbors (29.0% for F_xyz^an versus 40.4% for F_xyz^mp).
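Applying the three component filters one after the other can be sketched as below. The dictionary layout, the `passes_series` helper, and the numeric bounds are hypothetical placeholders and are not values from the threshold maps.

```python
# filters[(component, volume_type)][(vbin_a, vbin_b)] -> (min dQbin, max dQbin)
# All bounds here are made-up numbers for illustration.
filters = {
    ("x", "mp"): {(100, 110): (-4, 6)},
    ("y", "mp"): {(100, 110): (-3, 3)},
    ("z", "mp"): {(100, 110): (-2, 2)},
}


def passes_series(filters, volume_type, vbin_a, vbin_b, dq_bins):
    """Apply the x, y, z filters in turn; stop at the first rejection."""
    for component in ("x", "y", "z"):
        bounds = filters[(component, volume_type)].get((vbin_a, vbin_b))
        if bounds is None:
            return False  # no entry for this volume-bin pair
        lo, hi = bounds
        if not lo <= dq_bins[component] <= hi:
            return False  # outside the tolerated dQbin range for this component
    return True


kept = passes_series(filters, "mp", 100, 110, {"x": 2, "y": -1, "z": 0})
dropped = passes_series(filters, "mp", 100, 110, {"x": 2, "y": 5, "z": 0})
print(kept, dropped)
```

A pair survives the series only if it passes every component filter, which is why a series removes at least as many pairs as any single component and can also lose slightly more potential neighbors. For scale, the loss of 30 out of 24 million potential neighbors quoted above is 30 / 24,000,000 = 1.25 × 10⁻⁶, or 1 in 800,000.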
The effect of the ST threshold for PubChem 3-D neighboring upon the efficiency of the quadrupole filters was also investigated by generating a set of quadrupole filters, each using a different ST threshold, ranging from 0.80 to 0.99 in increments of 0.01. As shown in Figure 10, the fraction of molecule pairs filtered increases almost linearly as a function of the ST threshold. For the entire ST range tested, the
filters showed better efficiencies than the
Figure 10 Shape compatibility filtering efficiency. Performance of the F_xyz quadrupole filter in filtering conformer pairs at different ST threshold values.
3. Application of 3-D neighbor filters using quadrupole moment differences
Given that filtering conformer pairs using steric shape quadrupoles is effective, with minimal loss of potential 3-D neighbors, a "real world" test is made with the quadrupole filter to see how much its use in the context of PubChem 3-D neighboring improves throughput. To this end, comparison is made to earlier benchmarks [7], in which a set of known drugs and other molecules of keen biomedical interest are neighbored against the 3-D contents of PubChem. Table 4 and Table 5 summarize the results of these tests.
Table 4 Acceleration of PubChem 3-D neighboring using the quadrupole filter
Table 5 Efficiency of the quadrupole filter
Considering that PubChem 3-D neighboring is a precomputed similarity search, one can see that the neighboring throughput improvements from using the quadrupole filter are substantial, with an average improvement of 31% across the range of conformer counts per compound. Perhaps surprisingly, the quadrupole filtering removes only 7% of the conformer pairs, yet achieves a 31% neighboring throughput improvement. This emphasizes the dramatic cost/benefit difference between the computation needed to filter out that 7% of pairs and the computation expended on those pairs when the filter is absent.
It is important to note that the quadrupole filter is not the first filter applied in Table 5; three other filters are applied before it. The filters are ordered so as to maximize the cost/benefit of each filter. To examine what happens if the quadrupole filter is used as the first filter, neighboring is repeated for the case of one diverse conformer per compound. When used first, the quadrupole filter removes 48.3% of all conformer pairs (44.8%, 2.1%, and 1.4% of conformer pairs for the Qx, Qy, and Qz filters, applied in that order, respectively) versus the 7.4% shown in Table 5. The CT Feature, CT Volume, and ST Volume filters, applied in that order, remove 27.9%, 0.1%, and 0.002% of conformer pairs, respectively, when the quadrupole filter is applied first.
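The effect of filter ordering can be explored with a small cascade that records how many pairs each filter removes when the filters are applied in a given order. The predicates and pair records below are toy stand-ins and are not the CT Feature, CT Volume, ST Volume, or quadrupole filters themselves.

```python
def run_cascade(pairs, named_filters):
    """Apply filters in order; return survivors and the count removed by each filter."""
    removed = {name: 0 for name, _ in named_filters}
    survivors = []
    for pair in pairs:
        for name, keep in named_filters:
            if not keep(pair):
                removed[name] += 1
                break  # later filters never see a pair that was already removed
        else:
            survivors.append(pair)
    return survivors, removed


# Toy data: 33 pairs with a binned quadrupole difference and a feature score.
pairs = [{"dq": d, "feat": f} for d in range(-5, 6) for f in (0, 1, 2)]
quad = ("quadrupole", lambda p: abs(p["dq"]) <= 3)
feature = ("feature", lambda p: p["feat"] >= 1)

survivors_a, quad_first = run_cascade(pairs, [quad, feature])
survivors_b, feature_first = run_cascade(pairs, [feature, quad])
print(quad_first, feature_first)
```

The survivors are the same in either order; only the share credited to each filter changes. This is consistent with the quadrupole filter accounting for 7.4% of conformer pairs when applied after three other filters but 48.3% when applied first.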