1. Distribution of shape descriptor components and their volume dependency
The molecular shape quadrupoles in the principal-axes frame [9, 13] are given as follows:
where Qx, Qy, and Qz are the x, y, and z components of the quadrupole moment, respectively. These components are conceptually equivalent to the length, width, and height of a molecule, respectively; by convention, the largest quadrupole component is defined as Qx and the smallest as Qz. An assumption underlying this study is that, beyond some point, the shape quadrupole difference between two conformers is too large for them to meet the ST ≥ 0.8 threshold required by PubChem3D neighboring, as illustrated in Figure . This relationship, if it exists, would allow such conformer pairs to be filtered out before the time-consuming shape superposition optimization step, enhancing the throughput of PubChem 3-D neighboring. To determine whether such a relationship can be found, the shape quadrupole differences for known 3-D "Similar Conformers" were analyzed.
When the quadrupole filter project began (October 2008), 3-D neighboring of 17,143,181 unique molecules, effectively covering the CID range 1-25,000,000, had been completed using a single conformer per compound, yielding 4,182,412,802 3-D neighbors. Table shows the statistics of the three quadrupole components for those 17.1 million molecules. The mean and standard deviation for Qx, Qy, and Qz were 15.01 ± 8.07 Å⁵, 3.81 ± 1.80 Å⁵, and 1.52 ± 0.65 Å⁵, respectively. Figure and display the distributions of Qx, Qy, and Qz after they were binned into units of 2.5 Å⁵, 0.5 Å⁵, and 0.1 Å⁵, respectively. All three components showed strongly skewed distributions; however, most molecules clustered near the mean, and relatively few had quadrupole components much larger than the mean values.
The molecular volume and quadrupole moments are correlated with each other according to the following equation:
where Rg is the radius of gyration and Vmp is the monopole volume, which corresponds to the monopole in the shape multipole expansion [13].
Equation 3 implies that the size of a molecule (represented by the molecular volume) is not completely independent of its quadrupole moment. Therefore, at the beginning of this study, the correlation between molecular volume and quadrupole moment was investigated. Note that, because molecular volume is not a measurable quantity with a single, universally accepted definition, there are many ways to estimate it [13-18]. Accordingly, in addition to the monopole volume, the PubChem 3-D information includes two other volumes computed in different ways: the analytic volume and the self-overlap volume. Of the three, the analytic volume is considered the most consistent with other definitions of molecular volume, but it is also the slowest to compute. For this reason, evaluation of the ST score given in Equation 1 uses the self-overlap volume, which is considerably faster to evaluate; however, it typically overestimates the molecular volume by about a factor of three relative to the analytic volume, as shown in Table . Each compound conformer record in PubChem provides all three volumes, which can be downloaded individually from the Compound Summary pages, as a list from the PubChem Download Facility (http://pubchem.ncbi.nlm.nih.gov/pc_fetch), or in bulk from the PubChem FTP site (ftp://ftp.ncbi.nlm.nih.gov/pubchem). To avoid confusion among the three volumes used in the present paper, we denote the monopole volume, self-overlap volume, and analytic volume as Vmp, Vso, and Van, respectively, whereas volume in a general sense is denoted as V (without any subscript).
Figure displays the distribution of the three different volumes of the 17.1 million molecules from the PubChem Compound database. In general, Vso is the largest, and Van is the smallest. As shown in Figure , the quadrupole moment increases with molecular size, implying that the effect of quadrupole difference between two molecules upon their shape similarity may depend on their relative molecular sizes. Therefore, the quadrupole differences of 3-D "Similar Conformer" neighbors as a function of volume need to be considered.
2. Design of 3-D neighbor filters using quadrupole moment differences
As a general premise, if two molecules with the same volume also have identical values for the quadrupole components, they are likely to be similar in shape. In addition, as the quadrupole moment difference deviates from zero, the maximum shape similarity is expected to decrease (see Figure ). When the quadrupole (and volume) difference becomes greater than some threshold, the shapes are so dissimilar that the conformer pair cannot possibly meet the criterion to be a PubChem 3-D neighbor (ST ≥ 0.8). Therefore, if these quadrupole difference thresholds are known for a given volume pair, one may be able to preclude conformer pairs that are not sufficiently similar in shape, using only knowledge of the volume and quadrupole moments.
In the present study, the quadrupole moment differences of the 4.18 billion 3-D neighbors, identified from the 3-D neighboring of 17.1 million molecules, were analyzed to find the maximum possible quadrupole differences for two molecules to be neighbors (see also the "Materials and Methods" section). The volume and quadrupole moments of the two molecules in each neighbor pair were first converted into integer values by using the following two equations:
where the superscript "bin" is used to distinguish these integers from the original, non-binned values. The denominator Binsize was 5.0 Å³ for all three volumes, and 2.5 Å⁵, 0.5 Å⁵, and 0.1 Å⁵ for Qx, Qy, and Qz, respectively. After all 4.18 billion 3-D neighbors were binned according to their Vbin and Qbin values, the 3-D neighbor distribution for a given pair of binned volumes (one Vbin value per molecule in the pair) was analyzed as a function of ΔQbin.
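The binning step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the bin sizes are the ones quoted above, but the use of floor rounding, the sign convention for ΔQbin (second conformer minus first), and all function names are assumptions made for the example.

```python
import math

# Bin sizes quoted in the text: 5.0 Å^3 for volumes; 2.5, 0.5, 0.1 Å^5 for Qx, Qy, Qz.
VOLUME_BIN_SIZE = 5.0
QUADRUPOLE_BIN_SIZES = {"x": 2.5, "y": 0.5, "z": 0.1}


def to_bin(value, bin_size):
    """Convert a value to an integer bin index.

    ASSUMPTION: floor rounding; the rounding rule is not stated in the text.
    """
    return int(math.floor(value / bin_size))


def bin_conformer(volume, qx, qy, qz):
    """Return the binned volume and binned quadrupole components of one conformer."""
    raw = {"x": qx, "y": qy, "z": qz}
    return {
        "V": to_bin(volume, VOLUME_BIN_SIZE),
        "Q": {axis: to_bin(raw[axis], QUADRUPOLE_BIN_SIZES[axis]) for axis in "xyz"},
    }


def delta_q_bins(conf_a, conf_b):
    """Binned quadrupole differences for a conformer pair.

    ASSUMPTION: the difference is taken as conf_b minus conf_a.
    """
    return {axis: conf_b["Q"][axis] - conf_a["Q"][axis] for axis in "xyz"}


# Hypothetical conformers (values chosen only for illustration).
a = bin_conformer(volume=502.3, qx=15.01, qy=3.81, qz=1.52)
b = bin_conformer(volume=553.0, qx=20.2, qy=3.3, qz=1.49)
pair_key = (a["V"], b["V"])       # volume-bin pair used to group neighbors
pair_delta = delta_q_bins(a, b)   # ΔQbin for each component
```

Grouping known neighbor pairs by `pair_key` and histogramming `pair_delta` per component gives the kind of distribution analyzed in the text.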
To illustrate the general premise above that quadrupole deviations from zero result in a reduction in shape similarity, Figure shows the neighbor count as a function of ΔQbin for binned volume pairs of (100, 110), (100, 120), and (100, 130). As anticipated, the neighbor count peaks when ΔQbin is near zero and decreases rapidly (nearly linearly on a logarithmic scale) as ΔQbin deviates from zero. In addition, for a given pair of binned volumes, neighbors were observed only within a certain range of ΔQbin, indicating that this range information can be used as a filter that pre-screens non-neighbor pairs. The asymmetric distribution of the 3-D neighbors in Figure with respect to the ordinate axis (ΔQbin = 0) suggests that two different filters would need to be generated: one for positive ΔQbin values and the other for negative ΔQbin values.
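The idea of turning the observed ΔQbin range into a pre-screening filter can be sketched as follows. This is an illustrative sketch under stated assumptions, not the procedure used in the paper: the function names and the record layout are hypothetical, and the lower and upper bounds stored per volume-bin pair stand in for the two filters (negative side and positive side) mentioned above. It also omits the later padding of fringe regions.

```python
def build_threshold_map(neighbor_records):
    """Record the smallest and largest ΔQbin seen among known neighbors.

    neighbor_records: iterable of (v_bin_a, v_bin_b, delta_q_bin) tuples,
    one per known 3-D neighbor pair (hypothetical layout).
    Returns {(v_bin_a, v_bin_b): (min_delta, max_delta)}.
    """
    thresholds = {}
    for v_a, v_b, dq in neighbor_records:
        key = (v_a, v_b)
        if key not in thresholds:
            thresholds[key] = (dq, dq)
        else:
            lo, hi = thresholds[key]
            thresholds[key] = (min(lo, dq), max(hi, dq))
    return thresholds


def may_be_neighbor(thresholds, v_a, v_b, dq):
    """Return False if the pair can be skipped before shape superposition."""
    bounds = thresholds.get((v_a, v_b))
    if bounds is None:
        # No neighbor was ever observed for this volume-bin pair.
        return False
    lo, hi = bounds
    return lo <= dq <= hi


# Toy data: ΔQbin values of known neighbors for the volume-bin pair (100, 110).
known = [(100, 110, -3), (100, 110, 0), (100, 110, 5), (100, 120, 1)]
tmap = build_threshold_map(known)
```

With this toy map, a candidate pair in bins (100, 110) with ΔQbin = 4 is kept for superposition, whereas one with ΔQbin = 6 is screened out.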
Figures , and show the ΔQbin threshold for each quadrupole component as a function of volume for the 4.18 billion 3-D neighbors. Note that, because PubChem regularly receives new unique content from its contributors, there is always a possibility that 3-D neighboring of these new records will identify neighbors whose ΔQbin values fall outside the thresholds observed so far. If the raw ΔQbin threshold maps [see panels (a) and (b) of Figures , and ] were used as a filter during neighboring, those 3-D neighbors would be precluded. Therefore, to mitigate such issues in the fringe regions of the maps, we modified the maps [see panels (c) and (d) of Figures , and ], as described in the "Materials and Methods" section, by extending the ΔQbin difference values or by adding neighboring bins where no population was found.
These modified ΔQbin threshold maps are designated as quadrupole filters. For simplicity, we name these filters with a capital letter "F" followed by a subscript, which represents one of the quadrupole components, and a superscript, which represents the type of volume involved. For example, filter "$F_x^{an}$" denotes the Qx filter generated with the analytic volume, Van.
Given that these quadrupole filters were built using an existing set of 3-D neighbor cases, their efficacy needs to be validated. To do so, a test set of 13.2 billion conformer pairs, not considered as part of the original 3-D neighboring training set, was used (see the "Materials and Methods" section). After the ST scores for the 13.2 billion pairs were computed, the fractions of 3-D neighbors and non-neighbors that would have been pre-screened if the quadrupole filters had been applied were determined; they are summarized in Table 3.
Table 3. Accuracy of filtering as a function of volume type and quadrupole component at the ST ≥ 0.80 threshold
Of the three volume types utilized, the monopole-based quadrupole filters, Fmp, are arguably the best. The most efficient of the three Fmp component filters removed 4.78 billion pairs (36.3%), while incurring a loss of only 30 out of 24 million "potential" neighbors. [Note that the definition of a PubChem 3-D neighbor involves feature similarity as well as shape similarity, while the quadrupole filters deal only with shape similarity. As such, each of the 30 pairs filtered out had an ST score sufficient to be a 3-D neighbor, making it a "potential" 3-D neighbor.] The false negative count of 30 is negligible, but it does show that use of such a filter will preclude some potential 3-D neighbors, in this case at a rate of 1 in 800,000.
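The rates quoted above follow directly from the reported counts, as the short check below shows. It uses only the rounded figures given in the text (13.2 billion test pairs, 4.78 billion removed, 30 lost out of 24 million potential neighbors); the variable names are illustrative.

```python
# Counts as reported (rounded) in the text.
total_pairs = 13.2e9          # conformer pairs in the test set
removed_pairs = 4.78e9        # pairs removed by the single component filter
potential_neighbors = 24e6    # pairs with ST >= 0.80
lost_neighbors = 30           # potential neighbors wrongly removed

# Fraction of all pairs removed; about 0.362 from these rounded counts
# (the text reports 36.3%, presumably from the unrounded counts).
fraction_removed = removed_pairs / total_pairs

# One potential neighbor is lost for every this many potential neighbors.
neighbors_per_loss = potential_neighbors / lost_neighbors

# False negative rate among potential neighbors.
false_negative_rate = lost_neighbors / potential_neighbors
```

Here `neighbors_per_loss` evaluates to 800,000, matching the "1 in 800,000" rate stated above.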
The other two Fmp component filters are not as efficient, but could still filter out 3.92 billion pairs (29.8%) and 3.59 billion pairs (27.3%) when considered individually. If the three Fmp filters are used in series (applied one after the other), 5.33 billion pairs (40.4%) could be removed with a loss of only 32 potential neighbors. The Fso filters showed performance similar to that of the Fmp filters, but, when applied in series, they filtered out more potential neighbors (288 for Fso versus 32 for Fmp) and removed slightly fewer non-neighbors (39.1% for Fso versus 40.4% for Fmp). The Fan filters in series showed the smallest loss of potential neighbors (4 for Fan versus 32 for Fmp), but also removed the fewest non-neighbors (29.0% for Fan versus 40.4% for Fmp).
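Applying component filters in series means that a pair survives only if every filter lets it through, and each pair is charged to the first filter that rejects it. The sketch below illustrates this bookkeeping; the predicates and their ΔQbin limits are made-up toy values, not the thresholds derived in the paper.

```python
def count_removed_by_stage(component_filters, pairs):
    """Apply filters in order; return (removed_per_filter, survivor_count).

    component_filters: ordered list of predicates, each returning True
    when the pair may still be a neighbor.
    """
    removed = [0] * len(component_filters)
    survivors = 0
    for pair in pairs:
        for i, keep in enumerate(component_filters):
            if not keep(pair):
                removed[i] += 1   # charged to the first filter that rejects it
                break
        else:
            survivors += 1        # passed every filter in the series
    return removed, survivors


# Toy predicates on binned quadrupole differences (hypothetical limits).
toy_filters = [
    lambda p: abs(p["x"]) <= 3,
    lambda p: abs(p["y"]) <= 2,
    lambda p: abs(p["z"]) <= 5,
]

toy_pairs = [
    {"x": 0, "y": 0, "z": 0},   # passes all three
    {"x": 5, "y": 0, "z": 0},   # rejected by the first filter
    {"x": 1, "y": 4, "z": 0},   # rejected by the second filter
    {"x": 1, "y": 1, "z": 9},   # rejected by the third filter
]

removed_counts, survivor_count = count_removed_by_stage(toy_filters, toy_pairs)
```

Because a pair rejected early never reaches the later filters, the per-filter counts depend on the order in which the filters are applied, whereas the total removed by the series does not.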
The effect of the ST threshold for PubChem 3-D neighboring upon the efficiency of the quadrupole filters was also investigated by generating a set of quadrupole filters, each using a different ST threshold, ranging from 0.80 to 0.99 in increments of 0.01. As shown in Figure , the fraction of molecule pairs filtered increases almost linearly as a function of the ST threshold. For the entire ST range tested, the Fmp and Fso filters showed better efficiencies than the Fan filter.
3. Application of 3-D neighbor filters using quadrupole moment differences
Given that filtering conformer pairs using steric shape quadrupoles is effective with minimal loss of potential 3-D neighbors, a "real world" test is made with the quadrupole filter to see how its use in the context of PubChem 3-D neighboring improves throughput. To this end, comparison is made to earlier benchmarks [7], in which a set of known drugs and other molecules of keen biomedical interest are neighbored against the 3-D contents of PubChem. Table 4 and Table 5 summarize the results of these tests.
Table 4. Acceleration of PubChem 3-D neighboring using the quadrupole filter
Table 5. Efficiency of the quadrupole filter
Considering that PubChem 3-D neighboring is a precomputed similarity search, the throughput improvements obtained with the quadrupole filter are substantial, averaging 31% across the range of conformer counts per compound. Perhaps surprisingly, the quadrupole filter removes only 7% of the conformer pairs, yet achieves a 31% improvement in neighboring throughput. This emphasizes the dramatic difference between the small cost of the computation needed to remove those 7% of pairs and the cost of processing them when the filter is absent.
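A rough back-of-the-envelope calculation makes this disproportion concrete. The sketch below is illustrative only and rests on assumptions not stated in the text: that the cost of evaluating the filter itself is negligible, and that "31% improvement" means either a 1.31-fold throughput gain or a 31% reduction in run time (both readings are shown).

```python
pairs_removed_fraction = 0.07   # fraction of conformer pairs removed (from the text)
improvement = 0.31              # reported average improvement (from the text)


def time_saved_from_throughput_gain(gain):
    """Fraction of run time saved if throughput rises by a factor (1 + gain)."""
    return 1.0 - 1.0 / (1.0 + gain)


# Reading 1 (ASSUMPTION): throughput is 1.31 times higher.
saved_1 = time_saved_from_throughput_gain(improvement)
relative_cost_1 = saved_1 / pairs_removed_fraction

# Reading 2 (ASSUMPTION): run time is 31% shorter.
saved_2 = improvement
relative_cost_2 = saved_2 / pairs_removed_fraction
```

Under either reading, the removed pairs would have cost roughly three to four and a half times as much to process as an average pair, which is consistent with their being pairs that would otherwise reach the expensive shape superposition step.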
It is important to note that the quadrupole filter is not the first filter applied in Table , meaning that three other filters are applied before it. The filters are ordered so as to maximize the cost/benefit of each filter. To examine what happens if the quadrupole filter is used first, neighboring is repeated for the case of one diverse conformer per compound. When used first, the quadrupole filter removes 48.3% of all conformer pairs (44.8%, 2.1%, and 1.4% of conformer pairs for its three component filters, applied in that order) versus the 7.4% shown in Table . The CT Feature, CT Volume, and ST Volume filters, applied in that order, remove 27.9%, 0.1%, and 0.002% of conformer pairs, respectively, when the quadrupole filter is applied first.