1. Description of "Similar Conformers" neighboring relationship
PubChem uses two 3-D similarity measures to determine whether two molecules are "Similar Conformers". One of these is the shape Tanimoto (ST) for shape similarity [10], given by Eq. (2). The second similarity measure, defined by Eq. (3), is the color Tanimoto (CT) [10], which quantifies the 3-D overlap of fictitious "color" atoms, each representing the 3-D location of a particular pharmacophore feature type: hydrogen-bond donor, hydrogen-bond acceptor, cation, anion, hydrophobe, or ring. The ST and CT values range between 0 (for no similarity) and 1 (for identical).
$$ST = \frac{V_{AB}}{V_{AA} + V_{BB} - V_{AB}} \qquad (2)$$
where V_{AA} and V_{BB} are the respective self-overlap volumes of conformers A and B, and V_{AB} is the overlap volume of conformers A and B.
$$CT = \frac{\sum_{f} V_{AB}^{f}}{\sum_{f} V_{AA}^{f} + \sum_{f} V_{BB}^{f} - \sum_{f} V_{AB}^{f}} \qquad (3)$$
where the index f runs over the six independent fictitious feature atom types and, for each type, V_{AA}^{f} and V_{BB}^{f} are the respective self-overlap volumes and V_{AB}^{f} is the overlap volume of conformers A and B.
Pair-wise shape and feature comparison of conformers takes two basic steps: (1) optimization of the shape superposition between two 3-D chemical structures, to find their maximum shape overlap in terms of ST, and (2) a single-point CT computation at that maximum shape overlay. PubChem 3-D "Similar Conformers" neighbors are identified as any pair-wise conformer superposition with ST and CT values of ≥0.8 and ≥0.5 (actually ≥0.795 and ≥0.495, after floating point number rounding is considered), respectively.
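The two measures and the neighbor test can be sketched in a few lines of Python. This is a minimal illustration, not PubChem's implementation: it assumes the usual volume-overlap form of the Tanimoto, V_AB / (V_AA + V_BB - V_AB), with the CT summing each volume over the six feature types, and it takes the overlap volumes as given numbers. The function names and the dictionary layout are hypothetical.

```python
from typing import Mapping

# The six feature ("color") atom types named in the text.
FEATURE_TYPES = ("donor", "acceptor", "cation", "anion", "hydrophobe", "ring")


def tanimoto(v_aa: float, v_bb: float, v_ab: float) -> float:
    """Volume-overlap Tanimoto: V_AB / (V_AA + V_BB - V_AB)."""
    denom = v_aa + v_bb - v_ab
    return v_ab / denom if denom > 0 else 0.0


def color_tanimoto(self_a: Mapping[str, float],
                   self_b: Mapping[str, float],
                   overlap: Mapping[str, float]) -> float:
    """CT with each volume summed over the feature types (assumed form)."""
    v_aa = sum(self_a.get(f, 0.0) for f in FEATURE_TYPES)
    v_bb = sum(self_b.get(f, 0.0) for f in FEATURE_TYPES)
    v_ab = sum(overlap.get(f, 0.0) for f in FEATURE_TYPES)
    return tanimoto(v_aa, v_bb, v_ab)


def is_conformer_neighbor(st: float, ct: float,
                          st_min: float = 0.795,
                          ct_min: float = 0.495) -> bool:
    """Apply the rounding-adjusted thresholds quoted in the text."""
    return st >= st_min and ct >= ct_min
```

With these definitions, identical conformers give a value of 1 and non-overlapping conformers give 0, matching the stated range of both measures.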
An important issue with 3-D neighboring is the number of conformers considered. Although PubChem generates a conformer ensemble for each molecule, consisting of up to 500 sampled conformations, it is not practical to consider all of these for 3-D neighboring. Therefore, a selection of diverse conformers for each compound is considered for the purposes of 3-D neighboring. A detailed description of how the diverse conformer set is derived can be found in the Materials and Methods section (See "Diverse conformer concept").
It is important to note that 3-D neighboring using a single conformer per compound has a one-to-one correspondence between compound pairs and conformer pairs. When using multiple conformers per compound, it is possible that only a subset of possible conformer pairs per compound pair may satisfy the 3-D neighboring criteria. For clarification, a 3-D conformer neighbor pair is defined as any conformer pair with ST ≥ 0.8 and CT ≥ 0.5. If there is at least one conformer neighbor pair among all possible conformer pairs from a given compound pair, a compound neighbor pair results. In this work, a 3-D neighbor implies a 3-D compound neighbor. If further clarification is necessary, the terms 3-D compound neighbors and 3-D conformer neighbors are used.
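The distinction between conformer neighbors and compound neighbors can be made concrete with a short sketch. It assumes a hypothetical `score` callable that returns the (ST, CT) pair for two conformers; the toy score table below is invented purely for illustration.

```python
from itertools import product


def conformer_neighbor_pairs(confs_a, confs_b, score, st_min=0.8, ct_min=0.5):
    """Return every conformer pair (a, b) that meets both thresholds."""
    pairs = []
    for a, b in product(confs_a, confs_b):
        st, ct = score(a, b)
        if st >= st_min and ct >= ct_min:
            pairs.append((a, b))
    return pairs


def is_compound_neighbor(confs_a, confs_b, score):
    """Two compounds are neighbors if at least one conformer pair qualifies."""
    return bool(conformer_neighbor_pairs(confs_a, confs_b, score))


# Toy data: two diverse conformers per compound, invented (ST, CT) scores.
toy_scores = {
    ("A.1", "B.1"): (0.70, 0.60),
    ("A.1", "B.2"): (0.85, 0.40),
    ("A.2", "B.1"): (0.82, 0.55),
    ("A.2", "B.2"): (0.60, 0.30),
}
toy_pairs = conformer_neighbor_pairs(
    ["A.1", "A.2"], ["B.1", "B.2"], lambda a, b: toy_scores[(a, b)]
)
```

In this toy case only one of the four conformer pairs qualifies, yet that single pair is enough to make the two compounds 3-D compound neighbors.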
2. The distribution of 3-D neighbors
At the time of writing, 26,153,061 PubChem Compound records (CIDs) have a "Similar Conformers" neighboring relationship using the first two diverse conformers per compound. This neighboring identified 6.62 billion unique compound neighbor pairs and 8.16 billion unique conformer neighbor pairs. The average compound neighbor count per compound, after exclusion of self-neighbor pairs, is 253. Figure shows the frequency of neighbor count per compound, cumulative % CID count, and cumulative % 3-D neighbor count. Although some CIDs have more than 30,000 neighbors, 21.9 million CIDs (87.5%) have fewer than 1,000 neighbors, and 1.12 million CIDs (4.27%) have no neighbor beyond self. This rather skewed distribution of the neighbor count per CID is reflected in the plot of % cumulative neighbor count versus % cumulative CID count (Figure ). One can see that 20% of the chemical structures account for more than 80% of the "Similar Conformers" neighbor pairs.
Count of "Similar Conformers" per compound. The frequency of unique 3-D compound neighbor counts per PubChem Compound record (CID) [blue diamond], percent cumulative CID count [red square], and percent cumulative 3-D neighbor count [green triangle].
Most compounds have few 3-D neighbors. More than 80% of all CIDs have only 20% of 3-D neighbors.
The chemical structures on the extreme end, with more than 30,000 neighbors each, have a common motif of two substituted aromatic ring systems separated by different linkers. Figure depicts a single-linkage clustering of all 324 chemical structures with more than 30,000 3-D neighbors performed with the PubChem Structure Clustering tool using the PubChem 2-D dictionary-based binary fingerprint and Eq. (1) to help highlight the different chemical series represented. The most prevalent of these are based on N-phenylbenzamide (CID 7168). Neighboring reflects the contents of PubChem. If there is a large subpopulation of chemical structures very similar to each other, those chemical structures will interrelate; however, one advantage of 3-D "Similar Conformers" neighboring is that it relates chemical structures that have similar shape and features, which can be somewhat orthogonal to a chemical series identified by 2-D "Similar Compound" neighboring (to be discussed in more detail in the next section).
Figure 3 Compounds with the most 3-D neighbors. The PubChem Structure Clustering analysis of the 324 PubChem Compound records with more than 30,000 neighbors shows a common structural motif of two (aromatic) rings separated by a linker. N-phenylbenzamide (CID ...
The 1.12 million CIDs without a neighbor pair, other than self, account for a large percentage of all cases where the count of atoms or features is high, as depicted in Figure . The lack of a 3-D neighbor means that these larger compounds lack a 3-D complement, which is not surprising given that shape space grows exponentially and PubChem3D limits consideration to chemical structures with fifty or fewer non-hydrogen atoms, making it increasingly less likely that a suitable neighbor can be found as volume increases. Otherwise, the profile of chemical structures without neighbors is much like that for a set of 26,157,365 CIDs that represent the entire "live" PubChem3D contents as of October 2010 (designated as the Search set), representing a small minority of chemical structures with unique shape and feature profiles. For the first and second diverse conformers per compound, respectively, there are 1.31 million and 4.77 million cases where only the self neighbor is found. Employing a second diverse conformer allows 0.19 million additional CIDs to have a compound neighbor beyond self. The large increase in self-only neighbor pairs for the second diverse conformer, which represents the conformer most dissimilar to the first in a conformer ensemble, is notable; however, it is too early to say definitively whether these counts of no-neighbor conformers will remain high as more diverse conformers per compound are considered.
Figure 4 Molecules without neighbors. Non-hydrogen atom count and feature atom count profiles for the 1.12 million CIDs without a neighbor pair (other than the self-neighbor pair) compared to those for all 26.1 million neighbored CIDs (Search set), showing "no ...
3. Comparison of 2-D and 3-D similarity neighbors
For a given molecule, PubChem provides a "Similar Compounds" 2-D neighboring relationship, computed using a 2-D binary fingerprint and a threshold of 0.9 Tanimoto similarity using Eq. (1). It is interesting to see how one can find related biological annotation information using the 3-D "Similar Conformers" neighboring relationship as opposed to the 2-D "Similar Compounds" relationship. To demonstrate this, three well-known molecules of biomedical interest are selected: caffeine (CID 2519), aspirin (CID 2244), and morphine (CID 5288826). The overlap of three primary types of annotation is examined. The metrics used are the unique and common counts of neighbors with links to: Medical Subject Headings (MeSH) [18], through which one can locate scientific literature about a similar chemical structure in PubMed [19]; the PubChem BioAssay database [3], where one can find biological and experimental data, including protein binding inhibition values; and protein 3-D structures [20], representing 3-D structures of a discrete protein with a bound ligand, determined by X-ray crystallography or NMR spectroscopy. Figure gives the overlaps found between 2-D and 3-D neighboring relationships. As one can see, caffeine has 1,231 2-D neighbors, but only 302 of these are in common with its 2,298 3-D neighbors. The non-overlapping parts of the 2-D and 3-D neighbor sets show how each relationship locates similar, yet unique, chemical space. The unique 3-D neighbors expand the available biomedical annotation that may be related and relevant beyond that of the 2-D neighbors, with an additional 23 MeSH links, 274 biological experiments, and a doubling of the protein 3-D structures to consider. A similar result is found in the case of aspirin and morphine. It appears clear that in these cases 3-D similarity complements 2-D similarity with a mostly unique set of chemical structures that help one to discover connections between small molecules that might otherwise be missed. While this near orthogonality of neighbor sets will not hold for all chemical structures, it can be helpful to locate and relate available information in a vast data system such as PubChem.
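The overlap counts discussed above amount to simple set arithmetic on two neighbor lists. The sketch below uses synthetic integer identifiers sized to match the caffeine counts quoted in the text (1,231 2-D neighbors, 2,298 3-D neighbors, 302 in common); the identifiers are placeholders, not real CIDs, and the function name is hypothetical.

```python
def overlap_summary(neighbors_2d, neighbors_3d):
    """Count neighbors common to both lists and unique to each."""
    set_2d, set_3d = set(neighbors_2d), set(neighbors_3d)
    return {
        "common": len(set_2d & set_3d),
        "only_2d": len(set_2d - set_3d),
        "only_3d": len(set_3d - set_2d),
    }


# Synthetic IDs arranged so that exactly 302 fall in both sets.
n_2d, n_3d, n_common = 1231, 2298, 302
ids_2d = range(n_2d)
ids_3d = range(n_2d - n_common, n_2d - n_common + n_3d)
caffeine_like = overlap_summary(ids_2d, ids_3d)
```

For these counts, 929 neighbors are found only by the 2-D relationship and 1,996 only by the 3-D relationship, so the large majority of the 3-D neighbors are not reachable through 2-D neighboring alone.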
Figure 5 2-D neighbors versus 3-D neighbors. Comparison of the 2-D "Similar Compound" and 3-D "Similar Conformer" neighboring relationships using three well known small molecules, caffeine, aspirin, and morphine, demonstrates how each neighboring relationship ...
To further emphasize how the 3-D "Similar Conformers" neighboring relationship may complement the 2-D "Similar Compounds" neighboring relationship, the 2-D and 3-D similarity scores of eight drug molecules with the same mechanism of action are compared in Figure , and the 3-D alignments for particular compound pairs, whose 2-D and 3-D similarity differences are relatively large, are depicted in Figure . All eight drugs are known inhibitors of prostaglandin synthase [21] and were carefully selected for illustrative purposes from the PubChem Compound database via the MeSH pharmacological action of "anti-inflammatory agents, non-steroidal" (MeSH ID 68000894), also known as NSAIDs. While the 2-D similarity between drug molecules is calculated using the PubChem subgraph fingerprint [8], the 3-D similarity scores represent the best ST and CT similarity values from all possible combinations of the first ten diverse conformers of each compound pair. Although all eight molecules inhibit the same target, only one molecule pair (CIDs 3332 and 3394) is identified as a 2-D neighbor, as shown in the lower triangle of the similarity score matrix. The 3-D similarity approach, however, identified 11 molecule pairs as 3-D neighbors. For example, although the 2-D similarity score between CIDs 1302 and 2581 is 0.43, there are significant 3-D shape and feature overlaps (ST = 0.92 and CT = 0.55) between them (Figure ). If fewer conformers are used, the number of resulting 3-D "Similar Conformers" neighbor pairs is reduced. When using 2, 3, 5, 7, and 10 diverse conformers, a total of 2, 3, 9, 11, and 11 compound pairs and 2, 3, 14, 22, and 27 conformer pairs, respectively, met the 3-D neighboring criteria for the eight drug molecules.
Figure 6 Similarity score matrix for selected non-steroidal anti-inflammatory drugs. The lower triangle of the score matrix corresponds to the 2-D similarity scores computed using the PubChem fingerprint, and the upper triangle corresponds to the 3-D similarity ...
3-D superposition of selected 3-D "Similar Conformers" pairs. Although there is little 2-D similarity, using the PubChem fingerprint, significant 3-D similarity is found between selected non-steroidal anti-inflammatory drugs.
While not all eight selected NSAID drug molecules are 3-D neighbors of each other, examining the 3-D neighbors of the 3-D neighbors shows that each of the eight drug molecules is related to one or more of the others, effectively forming a cluster of related drugs that are highly similar in terms of shape and pharmacophore features but rather dissimilar in terms of 2-D graph similarity. In fact, this "cluster" of NSAID drugs presented in Figure is part of a larger 3-D cluster, with only eight of thirteen members being selected for clarity and demonstrative purposes. In addition, this is only one of several NSAID drug "clusters" that one can find using 3-D similarity. For the purposes of brevity and focus, only the NSAID drug class is explored, but similarly demonstrative examples can be found for other drug target classes.
If a molecule has known bioactivity, there is a reasonable expectation [26] that its similarity neighbors may also be similarly bioactive. As demonstrated in Figure and , the 3-D "Similar Conformers" relationship can be useful to identify structurally similar molecules that may be completely missed when only the 2-D "Similar Compounds" relationship is exploited. Therefore, one might consider using PubChem's precomputed 2-D and 3-D neighboring relationships as complementary virtual screening tools or to help understand how chemical structures relate to each other relative to their biological efficacy.
4. Effect of using multiple conformers
Taking into account all conformers of each CID for 3-D neighboring using the current methodology is simply not practical. The PubChem "Similar Conformers" neighboring relationship described here considers (at the time of writing) only two diverse conformers per compound (with a third conformer per compound soon to be released). One may wonder, as more conformers are considered, does one locate more chemical structures and, if so, to what extent? Is there a point of "diminishing returns", where a plateau forms in the curve of unique neighbor count as a function of diverse conformer count? Indirect evidence addressing aspects of these questions can be found in the 3-D neighboring data PubChem provides.
PubChem assigns different unique compound identifiers (CIDs) for different isotopomers of the same chemical structure. For example, CID 2244 and CID 450661 are both aspirin (Figure ), but they differ from each other in the mass of one of the carbonyl carbon atoms. Although they are effectively identical for 3-D neighboring purposes, the conformer generation processing employed in PubChem3D resulted in different "default" conformers that are effectively mirror images of each other, with an insignificant energy difference of less than 0.5 kcal/mol. Superposition of the default conformers of these two CIDs yields a ST of 0.83, meeting the ST neighboring threshold; however, the CT at this superposition is only 0.27, which is not similar enough to satisfy the "Similar Conformers" 3-D neighboring threshold. As shown in Figure and Table , the neighbors for the first three diverse conformers of CID 2244 and CID 450661 each have some degree of overlap, and, in some cases, this overlap is significant. For example, 62% (775 of 1,251) of the 3-D neighbors for the first diverse conformer of CID 2244 are identical to the 3-D neighbors found for the second diverse conformer of CID 450661. Similarly, 63% (812 out of 1,296) of the 3-D neighbors of the second diverse conformer of CID 2244 overlap with those of the first diverse conformer of CID 450661, while the third diverse conformer of CID 2244 shares 60% (730 out of 1,214) of its neighbors with the third conformer of CID 450661. Although there is a great deal of similarity between different chosen conformers of aspirin, they still identify a sizeable population of unique 3-D neighbors between CID 2244 and CID 450661, and, thus, unique shape/feature space. 
This demonstrates the sensitivity of the conformers used during neighboring processing, even for simple chemical structures like aspirin; however, considering PubChem is using a diverse conformer scheme, as more conformers are used in neighboring, the coverage of the conformational variation improves. This leaves one to wonder, how many diverse conformers per compound might be necessary to saturate this coverage and moderate the effects of this sensitivity?
Figure 8 Sensitivity of conformer choice in 3-D neighboring. Independent conformer processing for CID 2244 and CID 450661, which differ by a single isotope, resulted in default conformers that are effectively mirror images. The 3-D neighbors are different, but ...
Sensitivity of conformer choice in 3-D neighboring.
To help address this question more directly, 4,218 compounds were 3-D neighbored against all of PubChem3D. This set of 4,218 compounds was selected using a query of the PubChem Compound database ("has pharm"[Filter] AND "has 3d conformer"[Filter] AND 0[AtomChiralUndefCount] AND 0[BondChiralUndefCount]). This query means that the selected chemical structures have known pharmacological action as annotated by MeSH [18], have a conformer model in PubChem3D, and have zero undefined sp2/sp3 stereo centers. (The last criterion is utilized solely to limit the count of chemical structures considered and should have no bearing on the results of this test.) The PubChem CIDs for the selected chemical structures are available in Additional file 1.
These molecules were selected as they are among the most biologically relevant small molecule chemical structures known, being heavily studied in the biomedical literature and consisting, in large part, of most known drugs. Of the very broad range of 367 pharmacological actions defined for the 4,218 small molecules, the three with greatest compound count were enzyme inhibitors (336), anti-bacterial agents (237), and antineoplastic agents (230). These small molecules with known biological action (Query set) were neighbored against 26,157,365 compound records (Search set), representing the entire "live" PubChem3D contents as of Oct. 2010, using up to 1, 3, 5, 7, and 10 diverse conformers per compound for both compound sets. As shown in Table , the average conformer counts between the Query set and Search set are similar, with the query set being slightly less flexible. The non-hydrogen atom count and feature count profiles depicted in Figure for the Query set are also comparable to those found for the Search set.
Effect of using multiple conformers per compound on 3-D neighboring.
Query and Search set profile comparison. Frequency plot of the counts of non-hydrogen atoms and features for the 4,218 chemical structures with known pharmacological action (Query) and all 26,157,365 PubChem3D Compound records (Search).
Looking at Table , one can see that the average counts of neighbors per conformer and those per compound increase as a function of diverse conformer count. Interestingly, as shown in Figure , the average count of compound neighbors per compound appears highly correlated with the logarithm of total conformer pairs considered by neighboring. This suggests one must exponentially increase the count of conformer pairs to achieve a complementary linear increase in unique compound pairs.
Figure 10 Compound 3-D neighbor count correlated to Log(Conformer pair count). Plot of count of 3-D neighbors per compound and neighbors per conformer [left Y-axis] and a plot of Log10(total conformer pairs) [right Y-axis] as a function of diverse conformer count ...
It is not completely clear why this should be so, but one consideration comes to mind. It may be an artefact of the nature of the diverse conformer relationship, whereby a default conformer is chosen as the first, the conformer most dissimilar to the default conformer is then selected as the second, and each subsequent diverse conformer must be furthest away from the previously selected diverse conformers. This means that the most diverse conformers for a chemical structure are always considered first. Subsequently, each additional diverse conformer will increasingly resemble the previous diverse conformers, potentially yielding compound neighbors found previously by the other conformers for the same chemical structure. This is reflected by the ratio of conformer and compound 3-D neighbors. At three, five, and seven diverse conformers, 38%, 53%, and 61%, respectively, of the conformer neighbors point to the same compound neighbors. By ten diverse conformers, 68% of the conformer neighbors point to the same compound neighbors. With this said, one thing is clear. Neighboring more diverse conformers per compound will result in more compound neighbors per compound; however, the computational effort expended to do this grows exponentially, as an increasing proportion of conformer neighbors merely reveal more ways in which two already-related compounds are interrelated.
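The ordering described above resembles a greedy "farthest-first" selection. The sketch below is a generic illustration of that idea under stated assumptions, not the PubChem3D procedure itself (which is described in the Materials and Methods): the `distance` callable is hypothetical and could be any conformer dissimilarity measure, and the toy data are plain numbers standing in for conformers.

```python
def diverse_order(items, distance, default_index=0, limit=None):
    """Greedy farthest-first ordering starting from a default item.

    Each step picks the unchosen item whose nearest already-chosen
    item is farthest away.
    """
    n = len(items)
    if n == 0:
        return []
    limit = n if limit is None else min(limit, n)
    chosen = [default_index]
    # Distance from every item to its nearest chosen item so far.
    min_dist = [distance(items[default_index], x) for x in items]
    while len(chosen) < limit:
        candidates = [i for i in range(n) if i not in chosen]
        nxt = max(candidates, key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i in range(n):
            min_dist[i] = min(min_dist[i], distance(items[nxt], items[i]))
    return chosen


# Toy "conformers" on a line; distance is the absolute difference.
toy_items = [0.0, 1.0, 2.0, 10.0, 5.0]
toy_order = diverse_order(toy_items, lambda a, b: abs(a - b))
```

In the toy ordering, the item farthest from the default is taken second and later picks fall progressively closer to items already chosen, which mirrors the argument that each added diverse conformer increasingly resembles the earlier ones.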
One interesting aspect of Table and Figure is that the average conformer neighbor count per conformer grows very slowly. A ten times growth in conformers, corresponding to a 68 times increase in conformer pairs considered, results in only a 70% increase in the average conformer neighbor count. This is somewhat surprising given the argument above. It appears to suggest that each added diverse conformer of a chemical structure is also adding a significant portion of unique shape/feature space. This is seen in Table , whereby the conformer neighbors of each of the first three diverse conformers of aspirin (CID 2244 or CID 450661) mostly had very little overlap, typically less than 20%, of similar conformer neighbors with other diverse conformers of the same chemical structure. While the degree of unique shape/feature space being added may diminish as more diverse conformers are added, it would still appear to be rather substantial even at ten diverse conformers per compound. Eventually, one may expect, as even more diverse conformers are considered, that the average count of conformer neighbors per conformer may grow substantially, as conformers increasingly yield similar neighbor lists, but clearly this point is not yet reached at ten conformers per compound, as reflected by the continued growth in average count of compound neighbors per compound. Perhaps, for most chemical structures, this point may be reached by twenty diverse conformers. Using the computers and algorithms of today, and as reflected in the total search time in Table , twenty diverse conformers per compound is still a mountain too high to climb for a collection of the size of PubChem.
5. Efficiency of 3-D neighboring scheme
Although the overall speed of 3-D neighboring depends on various factors, such as atom count and the use of a precomputed shape grid approach, a modern computer processor core can process on the order of 10^2 to 10^3 3-D conformer pair superpositions per second when using a Gaussian-based shape definition. In theory, 26.1 million compounds with two diverse conformers per compound would require more than a quadrillion (10^15) pair-wise conformer superposition determinations, corresponding to more than 40,000 years of processor core computation; however, PubChem 3-D neighboring processing was completed in about two months using ~2,500 computer processing cores, meaning it took ~400 years of compute server time. (This figure reflects the elapsed-time throughput achieved on a somewhat chaotic and unstable shared compute cluster rather than actual CPU time.) How was this achieved?
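The scale of these estimates can be checked with a few lines of arithmetic. The sketch assumes the upper superposition rate quoted in the text (10^3 per second per core), treats "two months" as one sixth of a year, and counts each unordered conformer pair once; the variable names are illustrative.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def unordered_pairs(n: int) -> int:
    """Number of distinct pairs among n items."""
    return n * (n - 1) // 2


compounds = 26_153_061
conformers = 2 * compounds                    # two diverse conformers each
pairs = unordered_pairs(conformers)           # about 1.4e15

# Brute force at 1,000 superpositions per second on a single core.
naive_core_years = pairs / 1_000 / SECONDS_PER_YEAR

# Reported effort: ~2,500 cores for about two months.
actual_core_years = 2_500 * (2 / 12)

# Implied effective rate of conformer pairs handled per core-second.
effective_rate = pairs / (actual_core_years * SECONDS_PER_YEAR)
```

Under these assumptions the brute-force estimate is roughly 43,000 core-years against roughly 420 core-years actually spent, which implies an effective rate on the order of 10^5 conformer pairs per core-second, about two orders of magnitude faster than superposing every pair.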
To demonstrate the efficiency of the PubChem3D neighboring system, and reusing the previous example of querying 4,218 known bioactive small molecules against all of PubChem, Table gives the percentage of conformer pairs excluded by filter type and the percentage of time spent in each stage of the neighboring processing. In the first stage, a series of three filters is utilized to screen out conformer pairs incapable of achieving the ST and CT thresholds of 0.8 and 0.5, respectively, required to be a neighbor. The most effective of these is the CT feature filter, with 65% efficiency for this test set, which is to say that more than half of all conformer pairs encountered can be effectively ignored. One nice aspect is that this CT feature filter operates on compound pairs, as opposed to conformer pairs. The other two filters at this stage check for incompatible shape or feature volume between conformer pairs. The total CPU time spent performing these three filters is less than 1%, yet they are effective, removing 68% of all conformer pairs from further consideration.
Performance of 3-D neighboring.
Alignment recycling is the next stage after filtering. This methodology consists of comparing the shape fingerprints of the two conformers, locating the reference shapes they have in common, and then reusing each conformer's alignment to a common reference shape, with the shape overlap and the feature overlap computed at that recycled alignment. This is repeated for each common reference shape, and only the best superposition is kept.
Alignment recycling provides two opportunities to further remove conformer pairs from consideration. If a reference shape cannot be found in common, the conformers are considered to be too different to be a neighbor. This alignment recycling fingerprint filter removes an additional 4% of all conformer pairs (14% of all conformer pairs not already filtered). If the pre-optimized best overlap from alignment recycling is not sufficiently large (yielding an ST of at least 0.735), the conformer pair is considered to be incapable of being a neighbor. This alignment recycling overlap filter removes 27% of all conformer pairs (96% of all conformer pairs not already filtered) but consumes 86% of CPU time. Together, all filtering steps remove 99.8% of conformer pairs prior to optimization of the conformer superposition at the recycled alignment. The final shape optimization step consumes 10% of the CPU time, retaining less than 0.6% of optimized conformer pairs as neighbor pairs. About 66% of conformer pairs shape-optimized are rejected due to an insufficient ST value (<0.795) to become a neighbor and the remainder rejected due to insufficient CT value (<0.495) at the shape-optimized superposition.
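The staged processing described above follows a common pattern: cheap tests run first, and the expensive superposition optimization runs only on the pairs that survive. The sketch below shows that control flow with stand-in predicates. The filter names, the dictionary keys, and the toy values are all hypothetical; only the thresholds 0.735, 0.795, and 0.495 come from the text.

```python
from collections import Counter

ST_MIN, CT_MIN = 0.795, 0.495   # rounding-adjusted neighbor thresholds
RECYCLED_ST_MIN = 0.735         # minimum ST at the recycled alignment


def run_cascade(pairs, filters, optimize):
    """Apply filters in order; optimize only the pairs that pass them all."""
    stats = Counter()
    neighbors = []
    for pair in pairs:
        for name, can_be_neighbor in filters:
            if not can_be_neighbor(pair):
                stats[name] += 1          # record which filter excluded it
                break
        else:
            st, ct = optimize(pair)       # expensive step, reached rarely
            if st < ST_MIN:
                stats["rejected_st"] += 1
            elif ct < CT_MIN:
                stats["rejected_ct"] += 1
            else:
                stats["neighbor"] += 1
                neighbors.append(pair)
    return neighbors, stats


# Stand-in filters reading precomputed toy values from each pair.
toy_filters = [
    ("feature_filter", lambda p: p["ct_upper"] >= CT_MIN),
    ("fingerprint_filter", lambda p: p["shared_refs"] > 0),
    ("recycled_overlap_filter", lambda p: p["st_recycled"] >= RECYCLED_ST_MIN),
]


def toy_pair(ct_upper, shared_refs, st_recycled, st, ct):
    return {"ct_upper": ct_upper, "shared_refs": shared_refs,
            "st_recycled": st_recycled, "st": st, "ct": ct}


toy_pairs_in = [
    toy_pair(0.30, 2, 0.80, 0.85, 0.30),   # stopped by the feature filter
    toy_pair(0.90, 0, 0.00, 0.00, 0.00),   # no common reference shape
    toy_pair(0.90, 2, 0.60, 0.70, 0.60),   # recycled overlap too small
    toy_pair(0.90, 2, 0.78, 0.76, 0.60),   # optimized, ST too low
    toy_pair(0.90, 2, 0.80, 0.85, 0.30),   # optimized, CT too low
    toy_pair(0.90, 2, 0.80, 0.85, 0.60),   # neighbor
]
toy_neighbors, toy_stats = run_cascade(
    toy_pairs_in, toy_filters, lambda p: (p["st"], p["ct"])
)
```

Keeping a per-stage counter, as here, is one simple way to produce the kind of per-filter exclusion percentages reported for the real system.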
The overall throughput of the 3-D neighboring methodology is consistent across the range of diverse conformers considered, at a rate of ~150,000 conformers per second. The other overhead reported in Table mostly reflects the billions to trillions of timing measurements taken but also involves some memory allocation. In reality, with timing statistics turned off, there is very little other overhead to the method. While the total size of the input binary data files grows as a function of diverse conformer count, ranging from 19 GB to 159 GB, the computational density is more than sufficient to avoid making input of these search files a bottleneck, provided at least four conformers are being queried simultaneously. If fewer than four conformers are queried at a time, and the input binary files are not memory resident, input can be a bottleneck.
6. Alignment recycling
The alignment recycling methodology [14] was extended to cover non-hydrogen atom counts from 0 to 50 and rotatable bond counts from 0 to 15. This was achieved by leveraging our recent study on the diversity of shape space [15], where shape space was shown to grow gradually as a function of conformer volume and a dynamic shape similarity threshold for a relatively constant count of reference shapes. This curve (the Unique-Shape Tanimoto in Figure ) was used to effectively partition shape space into seven regions. Each fingerprint region has a distinct shape similarity threshold (the Fingerprint Tanimoto in Figure ) and covers the entire shape diversity of a given conformer volume range. As Table shows, there are a total of 3,311 reference shapes across all seven regions, representing the entire shape diversity of 5.2 billion conformers for the entire contents (live and non-live) of the PubChem3D system (more than 45.9 million small molecules).
Figure 11 Shape fingerprint design. Plot of conformer count (blue line) [left Y-axis], cumulative % conformers (red line) [right Y-axis], unique-shape Tanimoto (green line) [right Y-axis], and fingerprint Tanimoto (purple line) [right Y-axis] as a function of conformer ...
When computing the shape fingerprint of a conformer, if a reference shape has a shape-optimized superposition with an ST greater than or equal to the fingerprint shape similarity threshold, the corresponding 3-D fingerprint bit is set. Although there are 3,311 reference shapes, relatively few reference shapes are utilized per conformer. As shown in Figure , for the first ten diverse conformers from the 26.1 million compounds (246 million conformers) covered in the study of 4,218 small molecules of biomedical interest, there are at most a total of 129 reference shapes used per conformer, with an average and standard deviation of 39 +/- 13. This sparseness is to be expected, as the shape fingerprint primarily identifies a specific region of shape space. Figure depicts the count of set bits per fingerprint region across the 246 million conformers. As Table shows, each fingerprint area covers a specific volume range. So, one should not expect a conformer with a volume of 100 Å³ to have reference shapes in the conformer volume range 433-999 Å³, and vice versa. In fact, while each conformer has at least one reference shape set, many of the 246 million conformers considered do not have any reference shapes set in one or more of the seven different fingerprint regions. For the fingerprint reference shape volume (Å³) ranges 1-165, 166-199, 200-238, 239-285, 286-344, 345-432, and 433-999, a total of 83.2%, 62.4%, 35.1%, 11.6%, 2.4%, 2.6%, and 4.1% of the 246 million conformers, respectively, are not using the fingerprint region. This is reflected in the relatively high counts of conformers with no reference shapes, as depicted in the magnified section of Figure .
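The bit-setting rule and the "common reference shape" test used by alignment recycling can be sketched with Python sets. This is only an illustration of the logic: the reference shape identifiers, the ST values, and the threshold of 0.75 are invented, and a real fingerprint would apply a separate threshold per volume region.

```python
def shape_fingerprint(st_to_references, threshold):
    """Return the set of reference shapes whose optimized ST meets the threshold."""
    return {ref for ref, st in st_to_references.items() if st >= threshold}


def common_references(fp_a, fp_b):
    """Reference shapes set in both fingerprints; empty means 'too different'."""
    return fp_a & fp_b


# Invented ST values of two conformers against four reference shapes.
st_a = {"ref_1": 0.82, "ref_2": 0.77, "ref_3": 0.40, "ref_4": 0.10}
st_b = {"ref_1": 0.55, "ref_2": 0.79, "ref_3": 0.76, "ref_4": 0.05}

fp_a = shape_fingerprint(st_a, threshold=0.75)
fp_b = shape_fingerprint(st_b, threshold=0.75)
shared = common_references(fp_a, fp_b)
```

Because each conformer sets only a small number of bits, a sparse representation such as a set of reference identifiers is a natural fit, and an empty intersection is an inexpensive signal that a pair cannot be a neighbor.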
Figure 12 Shape fingerprint bits are sparsely set. Frequency plot of the total count of fingerprint reference shapes set per conformer for the first ten conformers of the 26,157,365 PubChem3D Compound records in the Search set, corresponding to 246,874,949 conformers. ...
Figure 13 Some shape fingerprint volume regions are mostly unused. Plot of the frequency of the shape fingerprint bit counts per fingerprint volume region for the first ten conformers of the 26,157,365 PubChem3D Compound records in the Search set, corresponding ...
The relative popularity of each of the 3,311 discrete reference shapes varies markedly. Figure shows the frequency of use of each reference shape defined in a given fingerprint volume range for the 246 million conformers. In each fingerprint volume range, a very small number of reference shapes clearly stand out as being used most often. Beyond these, the use of individual reference shapes falls off sharply and then gradually, until only rarely used peripheral reference shapes are left. This motif is seen for all fingerprint volume regions and may reflect the relative uniqueness (or lack thereof) of shapes across the first ten diverse conformers in PubChem.
Figure 14 Frequency of fingerprint reference shape use. The frequency of use of the 3,311 fingerprint reference shape bits, separated by fingerprint volume region, by the first ten conformers of the 26,157,365 PubChem3D Compound records in the Search set, corresponding ...
7. Superposition storage
Superposition of two conformers requires modification of the coordinates of one conformer relative to the other. Retention of the rotational matrix and translation vector is a practical approach to retain a superposition between conformers to avoid having to re-compute a superposition or store modified coordinates of a conformer.
Storage of superposition results in PubChem3D involves identification of: the two conformers involved, often with one of the two conformers implicitly identified (e.g., by storing the superposition as a subordinate property of a conformer); the 3 × 3 rotation matrix; and the 3 × 1 translation vector. The PubChem3D conformer ID is often represented either as a 64-bit unsigned integer (sometimes stored in 16-character hex form), with the high 32 bits representing the PubChem Compound identifier (CID) and the low 32 bits representing the local conformer ID (LID), or as two numbers separated by "." (e.g., CID.LID). Storage of the rotation and translation parts represents more of a challenge, given that there are twelve floating point numbers to convey. To provide for a more compact superposition representation, the ability to pack/unpack the rotation and translation into a 64-bit unsigned integer was developed. While described in more detail in the Materials and Methods section below, this involves transforming the rotation matrix into a quaternion and packing the four (Qw, Qx, Qy, Qz) components into 32 bits, 8 bits each. The remaining 32 bits are used to encode the translation vector.
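The packing ideas above can be illustrated with a small sketch. Two assumptions are made and should be read as such: first, that the 64-bit conformer ID places the CID in the upper 32 bits and the LID in the lower 32 bits, consistent with the 16-character hex form; second, that each quaternion component is linearly quantized from [-1, 1] to one byte. The quantization shown is a generic scheme chosen for illustration and is not claimed to be the encoding PubChem uses; the translation vector is omitted because its encoding is not described in this section.

```python
import math


def pack_conformer_id(cid: int, lid: int) -> int:
    """Combine CID (upper 32 bits) and LID (lower 32 bits) into one integer."""
    return ((cid & 0xFFFFFFFF) << 32) | (lid & 0xFFFFFFFF)


def unpack_conformer_id(conformer_id: int):
    """Split a 64-bit conformer ID back into (CID, LID)."""
    return conformer_id >> 32, conformer_id & 0xFFFFFFFF


def pack_quaternion(q) -> int:
    """Quantize four components in [-1, 1] to 8 bits each (illustrative scheme)."""
    packed = 0
    for c in q:
        clamped = max(-1.0, min(1.0, c))
        packed = (packed << 8) | round((clamped + 1.0) * 127.5)
    return packed


def unpack_quaternion(packed: int):
    """Recover an approximate unit quaternion from the 32-bit packed form."""
    comps = [((packed >> shift) & 0xFF) / 127.5 - 1.0 for shift in (24, 16, 8, 0)]
    norm = math.sqrt(sum(c * c for c in comps))
    return tuple(c / norm for c in comps)


aspirin_conf = pack_conformer_id(2244, 1)
aspirin_hex = f"{aspirin_conf:016X}"

# A unit quaternion for a rotation about the axis (0.6, 0.8, 0).
half_angle = 0.3
q_in = (math.cos(half_angle), 0.6 * math.sin(half_angle),
        0.8 * math.sin(half_angle), 0.0)
q_out = unpack_quaternion(pack_quaternion(q_in))
```

With only 8 bits per component, each decoded value can differ from the original by a few thousandths, which is the kind of small rotational error whose effect on ST and CT is examined next.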
To study the loss in accuracy due to encoding/decoding the conformer superposition information into a 64-bit integer, 1.85 billion unique conformer neighbor pairs in the 0-20 million CID range, involving conformers that are the first diverse conformer of a compound, were used. For each conformer pair: the chemical structure and 3-D coordinates were downloaded from the PubChem3D data system; the superposition between the conformers was optimized, yielding a "before" ST/CT value pair; the superposition rotation and translation were encoded, decoded, and applied to the original downloaded conformer pair coordinates; and then single-point ST and CT values were computed, yielding an "after" ST/CT value pair. The differences in the before/after ST and CT values were binned in 0.001 increments, and the populations of the occupied bins are plotted in Figure and summarized in Table .
Figure 15 Effect of superposition packing on ST/CT. The difference in the ST/CT scores (binned in 0.001 increments) before and after packing superposition translation/rotation information into an unsigned 64-bit integer. A positive difference indicates an enhancement ...
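The binning step of this experiment is straightforward to sketch. The example below assumes paired lists of "before" and "after" scores and rounds each difference to the nearest multiple of the bin width; the function name and the sample values are invented for illustration.

```python
from collections import Counter


def bin_differences(before, after, width=0.001):
    """Histogram of (after - before), keyed by the nearest bin index.

    A key k corresponds to a difference of about k * width; positive keys
    mean the score improved after the encode/decode round trip.
    """
    counts = Counter()
    for b, a in zip(before, after):
        counts[round((a - b) / width)] += 1
    return counts


st_before = [0.800, 0.850, 0.900, 0.950]
st_after = [0.801, 0.850, 0.898, 0.951]
st_hist = bin_differences(st_before, st_after)
```

Keying the histogram by integer bin index avoids accumulating floating point error in the bin labels; multiplying a key by the width recovers the approximate difference.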
Effect of superposition packing on ST/CT.
Perhaps most remarkable, the superposition encode/decode procedure is just as likely to enhance the ST and CT values as to detract from them. Also interesting is that the CT error curves are much broader, reflecting, in part, the much greater positional sensitivity of the CT measure. Small deviations in rotation have an increasing effect the further an atom is from the molecule center. Fictitious feature atoms are relatively sparse, have small atomic radii, and are often close to the periphery of the chemical structure. Shape similarity, on the other hand, is not as sensitive, as real atoms are relatively dense and most atoms in the molecule are typically near the steric center; thus, fewer atoms are affected by rotation encoding effects. As a whole, the use of a 64-bit integer to store a conformer pair superposition results in relatively few cases where the Tanimoto difference (after minus before) is less than -0.025, with the chances of this occurring for ST and CT being 1 in 14.6 million and 1 in 955, respectively. If the error from being off by a small fraction of a degree from the original superposition is too much, one could simply re-optimize the conformer superposition provided by PubChem; the benefits of the packed form in terms of ease of storage are considerable.