Subset extraction
At the time of project initiation, the PubChem Compound [
16] database contained approximately 5.3 million unique chemicals and mixtures. We focused our attention on a subset of PubChem by targeting only single-component molecules with size and flexibility below lead-like [
17] or drug-like [
18] filtering cut-offs. Our strategy was to work with a simple but relevant subset that could be incrementally updated with more challenging compounds in future studies.
The distributions of non-hydrogen (heavy) atoms and rotatable bonds for PubChem single component structures are presented in Figure . For this study, we limited our work to the first half of each distribution,
i.e., just those small molecules (less than twenty-eight non-hydrogen atoms) with low flexibility (less than six rotatable bonds). Furthermore, we removed all compounds with incomplete stereochemistry (stereo atoms or bonds), to avoid enumerating multiple stereo-configurations. We also removed ionic forms of structures, since their neutralized forms will be contained in the PubChem compound dataset. The PubChem compound subset selected is summarized in Figure . Despite restrictions, the final dataset resulted in approximately one million (1 035 040) unique PubChem compounds representing about 19% of PubChem at the time of dataset extraction. Most of the structures are drug-like organic compounds and, therefore, are well suited for the MMFF94s force field [
19] implemented in the 3D conformer generator OMEGA 1.8.1 and 2.0 Beta [
20] used for this study.
Algorithms
The alignment-recycling (AR) methodology is intended to obviate performing the optimization required to maximize the volume overlap of the query conformer to each and every conformer in a 3D conformer dataset. This is achieved by selecting representative conformers to completely cover the "shape space" of the 3D conformer dataset. The granularity of coverage is defined by an empirical cutoff named "Design-Tanimoto" (see section Reference shape selection). Each conformer in the dataset is overlaid to each representative conformer and the overlay information is retained, if the similarity with a representative conformer is of sufficient magnitude.
The empirical criterion to decide if two overlaid conformers can be considered similar is named "Transform-Tanimoto" (see section Alignment recycling). Its value greatly influences the number of reference shapes associated with each conformer. By means of analogy to a binary fingerprint, the Transform-Tanimoto threshold defines when a bit is set.
To search the dataset by shape similarity, the query fingerprint, to extend the analogy, is compared to the dataset fingerprints to find common reference shapes. The Tanimoto value computed between query and database fingerprints with
AR is not that from
Eq. 1, as is typical with 2D fingerprint methods and used by the 3D fingerprint method of Haigh
et al. [
15]. Instead, finding a common reference shape triggers computing, via
Eq. 2, the shape Tanimoto between the query conformer and database conformer, as may be performed by a typical brute-force ROCS approach. In our method, the 3D conformer overlay used in computing the shape Tanimoto is generated by simply reusing the transformation,
i.e., the rotation matrix and translation vector, from the overlay to the common reference shape. This trivial transformation, while specific to alignment of a reference conformer, when applied, can yield a relatively accurate shape overlay between the query and database conformers without the need to perform the conformer overlay alignment optimization. Usually, when the query and a database conformer are fairly similar, multiple reference shapes are found to be in common. In such cases, all reference shape alignments are reused to find a maximum shape Tanimoto between conformers.
Reference shape selection
As described in the
Method section, we implemented the clustering algorithm of Haigh
et al. [
15] to select a diverse set of reference shapes. For this study, we chose a Design-Tanimoto value of 0.75, which, according to their work, represented the best trade-off between sampling speed and granularity. This means, by definition, no pair of reference shapes has a similarity above 0.75, after diversity selection, and that every conformer in the entire dataset is associated with at least one reference shape with a shape Tanimoto similarity above 0.75.
Diverse reference shape selection for the one million compound dataset was performed in two stages. In the first stage, only a single conformer representative generated by OMEGA 1.8.1 [
20] was used. The single conformer dataset was entirely covered after the inclusion of 2 458 reference shapes. In the second stage, we sampled the conformational space of each compound using OMEGA 2.0 Beta[
20] at an RMSD of 1.0 Å. This generated approximately fifteen million (14 925 817) conformers. The distribution of conformers per compound is strongly skewed towards low values, with 50% of the compounds having six or fewer conformers and only 10% of the compounds accounting for 49% of the total conformer count. Interestingly, 99.8% of the fifteen million conformers in the second stage can be clustered at a Design-Tanimoto of 0.75 using one of the initial 2 458 reference shapes of the single conformer subset, revealing a large amount of shape redundancy in the multi-conformer models. However, the shape space of the remaining 0.2% conformers increases the number of diverse reference shapes from 2 458 to 5 534. This potentially surprising result may be a consequence of the sphere-exclusion algorithm variant used for the reference shape selection. In the attempt to cover the entire dataset shape space with a minimum number of reference shapes, the algorithm tends to leave 'holes' in the shape space, thus producing unequally sampled regions. Given the substantial redundancy of conformer shapes in the multi-conformer model dataset, it is very likely that a large fraction of the additional 3 076 reference shapes is necessary to fill these holes. There is no direct indication that the additional reference shapes resulted from anything more than sampling deficiency,
i.e., were not directly attributable to the size or flexibility of molecules. Use of more efficient sampling algorithms designed to avoid empty spaces,
e.g., DISE [
21], may lead to more efficient shape space coverage than that used in this study.
Because we aim at selecting a diverse set of shapes, the reference conformers appear to represent particular structural features to a greater extent than are present in the entire dataset. For example, only 20% of the PubChem dataset contain chiral centers; however, 33% of the reference shapes contain a chiral center. Similarly, a (non-exhaustive) trend is found between the dataset and reference conformers for triple bonds (8% versus 37%), lack of aromatic atoms (6% versus 21%), and presence of a ring system with more than six atoms (3% versus 28%). As a consequence, reference shapes generated from structures with less common features tend to cluster fewer database conformers than those coming from compounds with more common features.
Alignment-recycling (AR)
AR takes advantage of information created upon comparison of the reference shapes to a conformer during shape fingerprint generation. When a conformer is overlaid on a reference shape, and the computed shape Tanimoto is above the Transform-Tanimoto, the data required to reproduce that alignment are saved (Figure ). Such information has the form of a three-by-three rotation matrix and a translation vector. In contrast to ROCS, AR can only occur when the query conformer structure is found to have a reference shape in common with a database conformer. The alignment between the query conformer and database conformer is determined using the retained rotational matrices and translational vectors relative to that reference shape.
The procedure to align the query conformer, Q, and database conformer, D, is the following (as depicted in Figure ). The three-by-three rotation matrix and the translation vector to overlay the database conformer D on the reference shape R are merged into a single four-by-four affine transformation matrix (MRD). Similarly, one can construct the four-by-four affine transformation matrix MQR by using the transpose of the three-by-three rotational matrix and the minus of the translational vector from the overlay of the query conformer Q on the reference shape R. The matrix MQD is produced by the matrix multiply of MQR with MRD. Conformer D is aligned on conformer Q by multiplying the coordinate vector of each atom of D with MQD. In some aspects, the method is conceptually similar to structural alignments performed in 3D-QSAR methodologies for which all the conformers of the dataset are aligned on the same reference template. In our case, the reference template is a reference shape pre-selected during the initial diverse selection.
Each time an alignment is attempted after transformation matrices combination, the quality of the alignment is evaluated by a single point shape Tanimoto estimation via a Gaussian Grid approximation similar to ROCS, as detailed in the Methods section. The final number of matrix multiplications and alignments depends on the Transform-Tanimoto value as well as the number of reference shapes in the vicinity of the query conformer.
In practice, a single combination of transformation matrices cannot guarantee a result close to an optimal structural alignment. Some conformers may have different optimal alignments with a reference shape due to structural symmetry; however, the presence of multiple reference shapes greatly increases the chance of finding an alignment very close to the analytical maximum overlap solution. A convenient property of the method is that similar structures tend to have more reference shapes in common than dissimilar ones, thus far more CPU time is dedicated to the alignment of similar structures than for dissimilar structures.
Finding the right Transform-Tanimoto
Similarity searches often require a threshold as a simple criterion to prune the hit list. The threshold value is somewhat subjective although a reasonable range of useful values can be deduced from the literature involving ROCS. Rush
et al. [
11] mention a general rule-of-thumb that a shape Tanimoto value greater than 0.75 provides visual shape similarity, although they used a 0.85 threshold to select their ZipA-FtsZ protein-protein inhibitors. According to a regression plot from Bostrom
et al. [
22] and our own in-house experience, a RMSD cut-off of 1.0 Å used during conformational sampling with OMEGA 2.0 Beta roughly corresponds to a shape Tanimoto between 0.75 and 0.85. In their virtual screening study, Muchmore
et al. [
13] found a melanin-concentrating hormone receptor 1 antagonist with nanomolar IC50 at a shape Tanimoto above 0.80. Taking these studies into account, our range of interest in finding similar shapes is limited to ROCS shape Tanimoto between 0.75 and 1.0, alignments with lower similarity values were not considered for this work.
The suitable Transform-Tanimoto value, which determines if two structures have a reference shape in common and enables alignment via matrix multiplication, was determined empirically. For that, we performed a set of random overlays using both ROCS and alignment-recycling. Our objective was to keep the Transform-Tanimoto value as high as possible to limit the possible number of matrix combinations and, in doing so, save substantially on CPU time. We started by setting the Transform-Tanimoto value to the Design-Tanimoto value, i.e., 0.75. When applying the AR technique, alignment cases where two conformers do not share a common reference shape are assigned a shape Tanimoto value of zero. Because the initial Transform-Tanimoto threshold was not providing the quantity of hits to be consistent with the brute-force approach, primarily due to not finding appropriate reference shapes in common, we progressively decreased the Transform-Tanimoto value by 0.01.
The relation between ROCS and alignment-recycling at several Transform-Tanimoto values is plotted on Figure . The plots are based on ~1.3 million (1 283 211) alignments with a ROCS shape Tanimoto in the range 0.75–1.0. This subset is part of a training set of thirty million random shape-overlays, generated by comparing 2 000 random conformers against 15 000 random conformers from the fifteen million PubChem conformer dataset. The particular nature of the distribution is unveiled by binning the data every 0.01 shape Tanimoto and plotting the isocontour lines at commonly used thresholds for proportion estimation. The scale on the side of each plot gives an indication of the proportion of the data points between each isocontour. All the data points are contained between the minimum and the maximum of each bin, the other isocontour lines (i.e. first and last percentile, decile, and quartile) highlight the intrinsic distribution of the data among each bin. The plots indicate that there is no hard shape Tanimoto limit between finding and not finding a common reference shape between ROCS and AR, but rather some probabilistic distribution. For example, at 0.75 Transform-Tanimoto, 25% of the alignments with a 0.75 ROCS shape Tanimoto do not share an associated AR reference shape. This proportion decreases to less than 10% at a Transform-Tanimoto of 0.74, and less than 1% at 0.73. By means of comparison, an AR reference shape is always found in common for the Transform-Tanimoto values 0.75, 0.74, and 0.73 at ROCS shape Tanimoto values of 0.89, 0.86, and 0.82, respectively. Although we could have further decreased the Transform-Tanimoto to even lower values, in order to further decrease or eliminate the chance of not finding an AR reference shape in common, we felt that a Transform-Tanimoto equal to 0.73 produced satisfying results.
A more detailed comparison on the relative performance of AR at 0.73 Transform-Tanimoto is plotted on Figure . A negative difference shows that the ROCS overlay is better than the recycled overlay. In contrast to ROCS, AR does not aim at finding the exact global alignment solution. Instead, AR attempts to provide an alignment that is very close with little difference in terms of the shape Tanimoto and graphical display. Consequently, AR overlays are 0.01 shape Tanimoto less than the ROCS overlays 25% of the time (Figure : point A); however, this difference is visually minor as shown in Figure . A decrease of 0.03 shape Tanimoto (Figure : point B), as shown in Figure , brings some visual separation to the AR and ROCS alignments. At a difference of 0.05 shape Tanimoto (Figure : point C), there is a clear visual difference between the overlays, although they are still qualitatively the same (Figure ). The degree of alignment quality that may be required by a user depends strongly on the intended use of the alignment, and AR alignments could certainly be used as a very good starting point for subsequent shape overlap optimization, e.g., using ROCS.
The distribution in Figure also indicates that about 1% of the time alignment-recycling performs relevantly better (shape Tanimoto difference > 0.01) than ROCS. One possible explanation for this observation is that ROCS gets locked into a local minimum during overlap optimization. A more likely explanation is differences in the numerical precision of the ROCS Grid method versus ours (see section Gaussian shape overlay). Overall, the chance of getting a poor AR alignment, as compared to one produced by ROCS, is relatively rare, when using a 0.73 Transform-Tanimoto value and considering the full 0.75–1.0 shape Tanimoto range.
Comparing speed and hit lists
The test set used here for speed and hit list comparison contained 65 compounds extracted from a dataset of leads and drugs from Oprea
et al. [
17]. Each test set compound was represented by a single random low-energy conformer. Together, the 65 conformers span a diverse range of shapes derived from simple structures,
e.g., salicylic acid, to fairly complex ones,
e.g., morphine. The CPU time required to query the fifteen million conformer dataset using the various methods is shown in Table . To compute ROCS shape overlays for the entire conformer dataset takes, on average, 2.3 hours, while the time required to perform
AR screening at 0.73 Transform-Tanimoto is about 1.3 minutes. This represents more than a 100-fold speedup.
| Table 1CPU time to query the fifteen million conformer database with a single conformer |
This increase in throughput is not surprising. For each conformer, we are only ever optimizing the overlay to the query for the 5 534 reference shapes. Also, the conformer database reference shape fingerprints are quite sparse, having only 1, 40, or 141 reference shapes set at minimum, average, or maximum, respectively. In contrast, ROCS requires optimizing the overlay of the query conformer to all fifteen million database conformers. As a means of comparison, CPU times required to search the dataset at Transform-Tanimoto values equal to 0.74 and 0.75 are also shown in Table . These timings indicate a two- and four-fold decrease, respectively, directly related to a substantial decline in the number of reference shapes considered during screening. This also suggests that each additional 0.01 decrease in the Transform-Tanimoto will increase the AR method CPU requirement by a factor of two.
The overlap of AR using a 0.73 Transform-Tanimoto value (AR-0.73) and ROCS hit lists were examined to see if the AR-0.73 method produces results similar to ROCS using the shape Tanimoto similarity thresholds 0.75, 0.80, and 0.85. Figure compares the count of compound hits using both methods. As shown in Table , AR-0.73 consistently produced ~20% fewer hits than ROCS on average, when using identical shape Tanimoto search thresholds. According to Figure , the AR-0.73 shape Tanimoto is, on average, 0.01 less than that resulting from an optimized alignment using ROCS. This suggests that a fairly small decrease in the AR-0.73 screening shape Tanimoto threshold, relative to that of ROCS, should bring the hit count, with similar alignment quality, into sync. Figure shows how the relative count of hits grows as the AR-0.73 screening threshold is decreased, relative to ROCS. Figure also shows that a similar count of query hits may be obtained at AR-0.73 shape Tanimoto values equal to 0.740, 0.792 and 0.844 for ROCS shape Tanimoto equal to 0.75, 0.80 and 0.85, respectively. Comparable hit counts, however, do not imply commonality of hit lists.
| Table 2Average compound hit list size resulting from querying the fifteen million conformer database with a single conformer |
Figure shows the ability of AR-0.73 to reproduce a growing percentage of the ROCS compound hit list as the AR-0.73 screening threshold is decreased, while keeping the ROCS screening threshold constant. As Figure shows, however, that simply decreasing the AR-0.73 screening threshold only improves the union of the two compound hit lists to a point, after which diminishing returns sets in and the hit list overlap becomes worse. This result is expected considering decreasing the AR-0.73 screening threshold results in both ROCS hits missed by AR-0.73 and AR-0.73 hits that would be found by ROCS, if the ROCS screening threshold was not kept constant.
In each case, the AR-0.73 shape Tanimoto values 0.740, 0.792 and 0.844, originally highlighted as providing a similar number of compound hits, appear to also provide the best trade-off in maximizing the reproducibility of the ROCS hit lists for shape Tanimoto cut-off values equal to 0.75, 0.80 and 0.85. Using these shape Tanimoto values when comparing the two methods compound hit lists, one can expect the average similarity between the AR-0.73 and ROCS hit lists to be in the range of 81–87% (Table ). The remaining 13–19% hit compounds that were not part of both hit lists belonged to one of the following four categories: hits found by AR-0.73, due to the use of a lower shape Tanimoto threshold, that would have been found by ROCS if the thresholds had been equal (5.7–6.1%); hits found by AR-0.73 but just missed by ROCS, due to finding a suboptimal solution during the maximization of the volume overlap or variation in the grid numeric precision (0.6–2.3%); hits missed by AR-0.73, but found by ROCS, due to the inability to find a reference shape in common (0.0–0.2%); and hits missed by AR-0.73, but found by ROCS, due to suboptimal volume overlap using alignment-recycling only, i.e., without overlap maximization optimization (6.7–10.1%). Regarding this last category, the AR-0.73 missed hits were only narrowly missed, with the missed hits having average shape Tanimoto values of 0.731, 0.787, and 0.840, just 0.009, 0.005, and 0.004 below the AR-0.73 similarity thresholds of 0.740, 0.792, and 0.844, respectively. This shows that even though the hit list intersection appears to decrease slightly with increasing shape Tanimoto value, the missed hits are increasingly proximate to the shape Tanimoto threshold.
| Table 3Comparison between ROCS and AR-0.73 hit lists when using reduced similarity thresholds for AR-0.73 |
The observed correction for maximum overlap of AR-0.73 and ROCS hit lists as a function of shape Tanimoto appears to be linear. If this relationship holds across the entire range of ROCS shape Tanimoto values of 0.75 to 1.0, one could employ Eq. 3 to select the appropriate AR-0.73 shape Tanimoto cut-off to use for a desired ROCS shape Tanimoto value to achieve maximum overlap of results.
where STAR-0.73 is the suggested optimum AR-0.73 shape Tanimoto value to use for a corresponding shape Tanimoto value, STROCS, in the range of 0.75 to 1.0.