PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of jcheminfoBioMed CentralBiomed Central Web Sitesearchsubmit a manuscriptregisterthis articleJournal of CheminformaticsJournal Front Page
 
J Cheminform. 2011; 3: 26.
Published online 2011 July 22. doi:  10.1186/1758-2946-3-26
PMCID: PMC3223603

PubChem3D: Biologically relevant 3-D similarity

Abstract

Background

The use of 3-D similarity techniques in the analysis of biological data and virtual screening is pervasive, but what is a biologically meaningful 3-D similarity value? Can one find statistically significant separation between "active/active" and "active/inactive" spaces? These questions are explored using 734,486 biologically tested chemical structures, 1,389 biological assay data sets, and six different 3-D similarity types utilized by PubChem analysis tools.

Results

The similarity value distributions of 269.7 billion unique conformer pairs from 734,486 biologically tested compounds (all-against-all) from PubChem were utilized to help work towards an answer to the question: what is a biologically meaningful 3-D similarity score? The average and standard deviation for the six similarity measures STST-opt, CTST-opt, ComboTST-opt, STCT-opt, CTCT-opt, and ComboTCT-opt were 0.54 ± 0.10, 0.07 ± 0.05, 0.62 ± 0.13, 0.41 ± 0.11, 0.18 ± 0.06, and 0.59 ± 0.14, respectively. Considering that this random distribution of biologically tested compounds was constructed using a single theoretical conformer per compound (the "default" conformer provided by PubChem), further study may be necessary using multiple diverse conformers per compound; however, given the breadth of the compound set, the single conformer per compound results may still apply to the case of multi-conformer per compound 3-D similarity value distributions. As such, this work is a critical step, covering a very wide corpus of chemical structures and biological assays, creating a statistical framework to build upon.

The second part of this study explored the question of whether it was possible to realize a statistically meaningful 3-D similarity value separation between reputed biological assay "inactives" and "actives". Using the terminology of noninactive-noninactive (NN) pairs and the noninactive-inactive (NI) pairs to represent comparison of the "active/active" and "active/inactive" spaces, respectively, each of the 1,389 biological assays was examined by their 3-D similarity score differences between the NN and NI pairs and analyzed across all assays and by assay category types. While a consistent trend of separation was observed, this result was not statistically unambiguous after considering the respective standard deviations. While not all "actives" in a biological assay are amenable to this type of analysis, e.g., due to different mechanisms of action or binding configurations, the ambiguous separation may also be due to employing a single conformer per compound in this study. With that said, there were a subset of biological assays where a clear separation between the NN and NI pairs found. In addition, use of combo Tanimoto (ComboT) alone, independent of superposition optimization type, appears to be the most efficient 3-D score type in identifying these cases.

Conclusion

This study provides a statistical guideline for analyzing biological assay data in terms of 3-D similarity and PubChem structure-activity analysis tools. When using a single conformer per compound, a relatively small number of assays appear to be able to separate "active/active" space from "active/inactive" space.

Background

Recent advances in combinatorial chemistry [1-6] and high-throughput screening technology [7-17] have made the synthesis and screening of diverse chemical compounds easier, helping to create a demand in the biomedical research community for archives of publicly available screening data. To help satisfy this demand, the U.S. National Institutes of Health launched the PubChem project (http://pubchem.ncbi.nlm.nih.gov) [18-21] as a part of its Molecular Libraries Roadmap Initiative. PubChem archives contributed biological screening data and chemical information from various data sources in academia and industry, and offers its contents free of charge to biomedical researchers, helping to facilitate scientific discovery.

PubChem consists of three primary databases: Substance, Compound, and BioAssay. While the PubChem Substance database (unique identifier SID) contains information provided by individual depositors, the PubChem Compound database (unique identifier CID) contains the unique standardized chemical structure contents extracted from the PubChem Substance database. PubChem provides various analysis tools to relate chemical structures to the biological activity data stored in the PubChem BioAssay database (unique identifier AID).

The PubChem3D project [22-25], launched, in part, to help users identify useful structure-activity relationships, generates a theoretical 3-D conformer model [22,23] for each molecule in the PubChem Compound database, whenever it is possible. An all-against-all 3-D neighboring relationship (known as "Similar Conformers") [24] is pre-computed to help users to locate related data in the archive, augmenting the complementary "Similar Compounds" relationship, based on 2-D similarity of the PubChem subgraph binary fingerprint [26].

PubChem3D uses two 3-D similarity measures: shape-Tanimoto (ST) [24,27-30] and color-Tanimoto (CT) [24,27,28]. The ST score is a measure of shape similarity, which is defined as the following:

equation image
(1)

where VAA and VBB are the self-overlap volume of conformers A and B and VAB is the common overlap volume between them. The CT score, given by Equation (2), quantifies the similarity of 3-D orientation of functional groups used to define pharmacophores (henceforth referred to simply as "features") between conformers by checking the overlap of fictitious "color" atoms [28] used to represent the six functional group types: hydrogen-bond donors, hydrogen-bond acceptors, cation, anion, hydrophobes, and rings.

equation image
(2)

where, the index "f" indicates any of the six independent fictitious feature atom types, An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i3.gif and An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i4.gif are the self-overlap volumes for feature atom type f and An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i5.gif is the overlap volume of conformers A and B for feature atom type f. The ST and CT scores range between 0 (for no similarity) and 1 (for identical molecules). These similarity metrics can be combined to create a Combo-Tanimoto (ComboT), as specified by Equation (3):

equation image
(3)

The ST and CT similarity metrics attempt to cover key aspects important for locating chemical structures that may have similar biological activity. ST helps to identify molecules that can adopt a particular 3-D shape, e.g., of an inhibitor bound in a particular conformational orientation in a protein binding pocket. Considering that a hydrocarbon and a drug molecule could adopt the same shape, CT helps to identify molecules with similar 3-D orientation of features, e.g., necessary for making binding interactions between a small molecule and protein binding pocket. This suggests that two molecules with highly similar 3-D shape and 3-D feature orientations may also have similar biological activity. It should be no small wonder that such similarity metrics have garnered widespread use in virtual screening [31,32]. It leads one to wonder: what is a statistically meaningful 3-D similarity score? Or, in other words, if one was to examine the 3-D similarities between biologically tested compounds, what does the distribution look like? In the case of 2-D similarity, one only needs the molecule graph to make a comparison but, in the case of 3-D similarity, molecules can potentially adopt a number of different conformations. Is it sufficient to use only a single conformer per compound and still realize a statistically meaningful difference or separation between the 3-D similarities of reputed actives and inactives from a biological test?

In the present paper, two important questions concerning ST, CT, and ComboT as 3-D similarity measures are investigated. The first question is "if we randomly select any two conformers from the PubChem Compound database, what values of ST, CT, and ComboT scores will be expected on the average?" With knowledge of these values, one can evaluate a statistical significance of the similarity score between any two conformers in PubChem (e.g., if their similarity score becomes greater than what one expects for a random conformer pair, it may be statistically more meaningful).

The second question we seek to answer in this study is "for a given bioassay in PubChem, what is the average difference in similarity scores between the noninactive-noninactive (NN) pairs and the noninactive-inactive (NI) pairs, when a single conformer per compound is used for 3-D similarity computation?" The choice of terminology of NN and NI are necessary considering that the definition of an "active" is not always specified in PubChem. Therefore, for the purposes of this study, we consider "active space" to be anything not specified to be "inactive", thus the term "noninactive" is used in place of "active". This may help provide users with an idea on the separation in the 3-D shape and feature spaces between the active and inactive compounds tested in a given bioassay. An additional question we will answer is: does an optimization type affect the similarity scores? Currently, the PubChem 3-D neighboring involves a shape superposition optimization that maximizes the ST scores [24], but it may be possible to optimize a feature superposition that maximizes the CT score. Will the ST-optimization and CT-optimization make any changes in a 3-D similarity-based bioassay data analysis?

Results and Discussion

A. Notations

In the present study, we consider six different similarity measures: ST, CT, and ComboT for two different optimization types (either ST-optimized or CT-optimized). They are denoted with a superscript, which represents the optimization type (either "ST-opt" or "CT-opt"), and a subscript, which specifies the type of CID pairs ("NN" for the NN pairs and "NI" for the NI pairs). The subscript "NN-NI" is used for the similarity score difference between the NN and NI pairs. For example, An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i7.gif and An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i8.gif indicate the ST-optimized ComboT scores for the NN and NI pairs, respectively, while An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i9.gif means the difference between the two. The word "XT" is used when we refer to any of the similarity measures (i.e., ST, CT, and ComboT), or a similarity score in a general sense.

In the second part of this study, we analyze the average and standard deviation of the similarity scores of CID pairs for a given AID, and these per-AID average and standard deviation are denoted with Greek letters μ and σ, respectively, followed by the corresponding similarity measure in parentheses [e.g., An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i10.gif and An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i11.gif]. The per-AID average and standard deviation of the similarity score difference between the NN and NI pairs for a given AID are computed using the following equations:

equation image
(4)
equation image
(5)

where XT is one of the six similarity measures (i.e., STST-opt, CTST-opt, ComboTST-opt, STCT-opt, CTCT-opt, and ComboTCT-opt), and nNN and nNI are the number of the NN pairs and NI pairs for the AID, respectively. When we refer to the average and standard deviation of the per-AID statistical parameters over a set of AIDs, we use additional Greek letters μ and σ, respectively, followed by the corresponding statistical parameter in brackets. For example, An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i14.gif and An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i15.gif represent the overall average and standard deviation of An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i16.gif over a set of AIDs, while An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i17.gif and An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i18.gif indicate the overall average and standard deviation of An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i19.gif.

B. 3-D similarity score distribution of random conformer pairs

B-1. Structural and chemical characteristics of the biologically tested molecules

As of January 2010, the PubChem BioAssay database had 2,008 bioassay records, (ranging from AID 1 to AID 2310) and 734,486 molecules with a 3-D conformer model were tested in at least one of these bioassays. The structural and chemical characteristics of these biologically tested molecules are shown in Figures Figures1,1, ,22 and and3,3, and they are compared with those of the entire PubChem3D contents (26,157,365 CIDs as of September 2010) in Table Table1.1. The average and standard deviation of the heavy atom count per-CID are 24.6 ± 6.4, slightly less than those across the entire PubChem3D contents (26.3 ± 7.0). The conformer monopole volume (V) and three components of the shape quadrupole moments (Qx, Qy, and Qz, which give a sense of the conformer length, width, and height dimensions, respectively) [25] of the biologically tested molecules default conformer are also slightly less than those across the entire PubChem3D contents (474.1 ± 124.0 Å3 vs. 509.0 ± 137.1 Å3 for V, 12.6 ± 7.0 Å5 vs. 13.6 ± 7.8 Å5 for Qx, 3.3 ± 1.6 Å5 vs. 3.6 ± 1.8 Å5 for Qy, 1.3 ± 0.6 Å vs. 1.5 ± 0.6 Å5 for Qz). As shown in Figure 1(b) and Table Table1,1, the 734,486 biologically tested molecules have 8.1 ± 2.6 features on average, slightly less than the entire PubChem3D contents does (8.5 ± 2.7). The count for each of the six feature types of the biologically tested molecules is equal to or slightly less than those of the entire PubChem3D contents.

Table 1
Summary statistics of chemical structure descriptors.
Figure 1
Atom and feature count histograms of biologically tested compounds. Frequency (blue) and percent cumulative frequency (red) of (a) heavy atom count and (b) total feature count for the 734,486 molecules tested in at least one bioassay archived in the PubChem ...
Figure 2
Conformer volume and quadrupole histograms of biologically tested compounds. Frequency (blue) and percent cumulative frequency (red) of (a) volume, (b) Qx, (c) Qy, and (d) Qz for the 734,486 molecules tested in at least one bioassay archived in the PubChem ...
Figure 3
Individual feature histograms of biologically tested compounds. Frequency (blue) and percent cumulative frequency (red) of respective feature atom count for the 734,486 molecules tested in at least one bioassay archived in the PubChem BioAssay database: ...

B-2. Distribution of 3-D similarity scores for biologically tested molecules

One key question this study attempts to answer is: what are statistically meaningful 3-D similarity values for biologically tested molecules? By using the entire set of 734,486 biologically tested molecules in PubChem (as of late January 2010) and their 269,734,474,855 unique CID pairs, we believe this to be a sufficient corpus to make such a determination in a general sense. What may be questionable (to some) is the intention to use only a single conformer per compound for each of the CID pairs.

The reasons for this choice are rather practical. The use of two diverse conformers per compound yields four times more unique conformer pairs and using three diverse conformers per compound makes the unique conformer pair set nine times larger and so on. In other words, the problem size scales as a square of the conformers per compound considered. We could sample the 734,486 compounds into a smaller set, to say ten percent of the original dataset and then consider three diverse conformers per compound to yield approximately the same count of conformer pairs, but are three diverse conformers per compound sufficient? If we down sampled to 1% of the biologically tested compounds and used ten diverse conformers per compound, would ten diverse conformers per compound be sufficient and would the random 1% of the compound set be sufficient to represent biologically tested compounds? For the purposes of this study, we will ignore the multiple conformer representation issue and consider a single conformer per compound to be sufficiently random to provide a useful set of statistically meaningful 3-D similarity thresholds; however, a more detailed study may be necessary to determine the full effect of using multiple conformers per compound, e.g., when picking the best conformer pair per compound pair.

To investigate the average values of ST, CT, and ComboT for random conformer pairs, we downloaded all 734,486 biologically tested molecules from PubChem that had a theoretical 3-D description, and the six similarity scores [i.e., STST-opt, CTST-opt, ComboTST-opt, STCT-opt, CTCT-opt, and ComboTCT-opt] were computed for all 269,734,474,855 unique CID pairs arising from all possible combination of the 734,486 CIDs, using a single conformer per-CID. The distribution of these scores represents the 3-D similarity scores one would get from any two conformers randomly selected from the PubChem database. The distributions of the similarity scores, binned in 0.01 increments, are shown in Figure Figure44 and their statistics are summarized in Table Table2.2. The average and standard deviation for STST-opt, CTST-opt, ComboTST-opt, STCT-opt, CTCT-opt, and ComboTCT-opt were 0.54 ± 0.10, 0.07 ± 0.05, 0.62 ± 0.13, 0.41 ± 0.11, 0.18 ± 0.06, and 0.59 ± 0.14, respectively. The conformer pairs whose similarity scores are equal to or smaller than μ+σ account for 85% to 87% of the 269.7 billion CID pairs, and the corresponding fractions for the μ+2σ threshold range from 96% to 98%. This information may be used to evaluate the statistical significance of the similarity score between any two conformers. For example, if the STST-opt value between two conformers is 0.74, the probability of randomly getting a STST-opt score equal to or higher than 0.74 is only 2%, and hence, one may consider that the two conformers have statistically meaningful similarity in terms of STST-opt.

Table 2
Summary statistics for 3-D similarity over all biologically tested compounds.
Figure 4
Overall 3-D similarity statistics between biologically tested compounds. Distribution of 3-D similarity scores of 269,734,474,855 conformer pairs, arising from the 734,486 molecules tested in at least one bioassay archived in the PubChem BioAssay database: ...

Note that the PubChem "Similar Conformers" 3-D neighboring requires the STST-opt ≥ 0.8 and CTST-opt ≥ 0.5 for two molecules to become neighbors of each other. The conformer pairs whose ST value is smaller than 0.80 correspond to 99.32% of the random ST score distribution. Similarly, the conformer pairs with CTST-opt < 0.50 correspond to 99.98% of the random CT score distribution. Therefore, if the STST-opt and CTST-opt scores are assumed to be independent of each other, the probability of two conformers being identified as 3-D "Similar Conformers" neighbors of each other by chance is (100 - 99.32) × (100 - 99.98) = 0.0136% (or 1 in 7,353). Note that the CTST-opt score is not completely independent of the STST-opt score because it is evaluated at the ST-optimized alignment. Therefore, the probability of random conformers being identified as PubChem 3-D neighbors will be higher than the estimated value of 0.0136%, but it will still be smaller than 1%.

Figures Figures5,5, ,6,6, and and77 show the distribution of the average and standard deviation of the 3-D similarity scores per-CID (computed from the similarity scores between one CID of the 734-K conformer set and all the other conformers in the set) for ST, CT, and ComboT for both ST-optimized and CT-optimized superpositions, representing the similarity scores that one may expect when a conformer in PubChem is compared with a randomly selected conformer. Most conformers have the average and standard deviation similar to those for the random conformers listed in Table Table2.2. However, in the case of STST-opt [Figure 5 (a)] there is a bit of skew in the distribution of average ST value per CID towards the maximum value, peaking at 0.58, as opposed to the overall average of 0.54. Also of interest in Figure 5 (a), the ST average per-CID rapidly drops off as the ST average approaches 0.65. Note that a small fraction of biologically tested CIDs in PubChem have low average similarity scores per-CID, which indicates their relative uniqueness in the 3-D shape space (i.e., their 3-D shape and/or feature orientations may be very different from most biologically tested molecules in PubChem, resulting in low similarity scores on average).

Figure 5
Per-CID shape similarity statistics of biologically tested compounds. Distribution of the average and standard deviation of the ST scores for each of the 734,486 molecules tested in at least one bioassay archived in the PubChem BioAssay database: (a) ...
Figure 6
Per-CID feature similarity statistics of biologically tested compounds. Distribution of the average and standard deviation of the CT scores for each of 734,486 molecules tested in at least one bioassay archived in the PubChem BioAssay database: (a) ST-optimized ...
Figure 7
Per-CID shape plus feature similarity statistics of biologically tested compounds. Distribution of the average and standard deviation of the ComboT scores for each of the 734,486 molecules tested in at least one bioassay archived in the PubChem BioAssay ...

Potentially surprising when looking at feature similarity statistics in Table Table22 is that standard deviation values for CT are about half that found for ST. When looking at the per-CID statistics in Figure Figure6,6, one sees that the range of standard deviation of CT is comparable to that of ST, although with a significant population of CIDs on the lower end of the standard deviation. Why is this so? Presumably, the 3-D orientation of features is substantially more diverse than the 3-D molecular shape, keeping both the average and standard deviation values low when compared to all other biologically tested compounds.

An important observation is that the overall ComboTST-opt and ComboTCT-opt scores have very similar average values, as shown in Table Table2.2. Whereas the STST-opt average was greater by 0.13 than the STCT-opt average, the CT-optimization results in an average CTCT-opt score greater by 0.11 than that of CTST-opt. As a result, the difference in averages between ComboTST-opt and ComboTCT-opt were only 0.03, implying that the ComboT score is not very sensitive to the type of optimization. A similar optimization-type dependency of the ST, CT, and ComboT scores was observed in Figures Figures5,5, ,66 and and7.7. That is, whereas the ST-optimization results in an increased ST and decreased CT scores, the CT-optimization gives a decreased ST and increased CT scores, resulting in the average ComboT score that is relatively constant regardless of the optimization type employed. However, as shown in Figure Figure7,7, the ComboTCT-opt data had a narrower range of standard deviation variation per-CID than ComboTST-opt and the standard deviation for ComboTCT-opt per-CID appeared to linearly increase as a function of the per-CID average value.

C. 3-D similarity score differences for the NN and NI pairs

The second part of this study examines the question: is it sufficient to only use a single conformer per compound and still realize a statistically meaningful difference or separation between the 3-D similarities of reputed actives and inactives? Or, to say this in another way, are noninactive and inactive compounds in a given bioassay well separated in 3-D shape/feature space? If so, one would expect to see some statistically significant separation in 3-D similarity scores between the partitioned noninactive-noninactive (NN) pairs and noninactive-inactive (NI) pairs. This requires 3-D similarity scores for both the NN pairs and NI pairs for each assay considered. This information is already available in the all-by-all similarity score matrices for the 734-K biologically tested molecules computed in the first part of this study. A detailed procedure for extracting the 3-D similarity scores from these matrices on the per-AID basis was described in the Materials and Methods section.

It is important to note that 3-D similarity methodologies (or other analysis methodologies, for that matter) are not expected to work for all biological assay data sets. A tacit assumption of 3-D methodologies is that chemical structures with similar shape and binding features will have similar (if not the same) mode of action of "activity", e.g., of binding to a protein binding pocket in the same fashion. In reality, some assays in PubChem do not have a well-defined target, e.g., being a whole cell, meaning that there could be a number of targets and a number of different mechanisms of action per target for the observed activity in a single assay. In other cases, many chemical structures are active for reasons that have little to do with binding to a protein target, being aggregators, covalent binders, cytotoxic, or some other unintended mode giving rise to the measured "activity" during the biological test (so called "false positives"). As such, 3-D methodology cannot be expected to work for false positives, as reputed "active" molecules may not have any apparent 3-D correlation to each other. This is also true of cases of molecules that would be "active" if not for solubility or some other issue during the biological experiment performed (so called "false negatives"). These issues with biological tests will be nearly completely ignored for the purpose of this analysis. Instead, by looking across a wide set of assays and assay types, there is an expectation that, if there is some effect whereby 3-D similarity averages between "actives" will be greater than the averages between "actives" and "inactives" using a single conformer per compound, a certain subpopulation of assays will show this behavior.

C-1. Selection of AIDs from the PubChem BioAssay database

Among the 2,008 AIDs archived in the PubChem BioAssay database at the time of project initiation (January 2010), 1,744 AIDs had at least one molecule with a 3-D theoretical description. The bioassays in the PubChem BioAssay database can be classified into four categories, according to user-provided assay types (i.e., screening, confirmatory, summary, and other) and the assay count for each category in the 1,744 AIDs is shown in Figure 8 (a). Note that there is another category, "Unspecified", because the assay-type attribute for these AID records are not provided. There were 523 screening assays (30%), 867 confirmatory assays (50%), 57 summary assays (3%), 192 other assays (11%), and 105 unspecified (6%).

Figure 8
Assay counts by category. Assay count for each assay-type category in the PubChem BioAssay database: (a) for assays that have at least one tested molecule with 3-D information (as of January 28, 2010), (b) for assays that have at least one noninactive-noninactive ...

For a given AID, comparison of the 3-D similarity scores for the NN pairs with those for the NI pairs requires that the AID has at least one NN pair and one NI pair. Among the 1,744 AIDs, there were 1,441 AIDs that satisfy this condition [Figure 8 (b)]. Further filtering was necessary to remove AIDs in which the number of NN or NI pairs is too small, because these AIDs may yield biased results. On the contrary, we did not want to filter out more summary assays, if it could be avoided, as there were only nine summary assays at this point. [Summary assays are final stages of lead/probe screening processes and, as such, they have a significantly smaller number of molecules provided (and hence, a smaller number of the NN and NI pairs), compared to other assay types.] Among the nine summary assays in Figure 8 (b), AID 1844 had the smallest number of the NN pairs, which was six, and this number was used as a threshold for further filtering (i.e., AIDs with less than six NN pairs or less than six NI pairs were excluded in any subsequent analysis). After requiring an assay to have a minimum of six compound pairs for each of the NN and NI pairs (that is, 12 pairs per-AID in total), 1,389 AIDs resulted. As shown in Figure 8 (c), there were 444 primary screenings (32%), 742 confirmatory screenings (53%), 9 summary assays (1%), 97 other assays (7%), and 97 unspecified (7%).

C-2. Differences between the 3-D similarity scores of NN and NI pairs

With the set of 1,389 AIDs decided, the average and standard deviation [i.e., μ(XT) and σ(XT), respectively] of the six different similarity values were determined for the NN and NI pairs per-AID. The complete set of per-AID results is available in Additional File 1, and the distributions of the per-AID average similarity scores for the NN and NI pairs [i.e., μ(XTNN) and μ(XTNI), respectively] across the 1,389 AIDs are shown in Figure Figure9.9. The corresponding distributions of differences between the average similarity scores for NN and NI pairs per-AID [i.e., μ(XTNN-NI)] are provided in Figure Figure10,10, while Table Table33 and Table Table44 summarize by similarity optimization type the per-AID statistics across all 1,389 AIDs [i.e.,μ[μ(XT)], σ[μ(XT)],μ[σ(XT)] and σ[σ(XT)]], with further break out by assay type category.

Table 3
Summary statistics per-AID for shape-Tanimoto (ST) optimized 3-D similarity.
Table 4
Summary statistics per-AID for color-Tanimoto (CT) optimized 3-D similarity.
Figure 9
μ(XT) per-AID similarity histogram. The distribution of the average similarity scores for noninactive-noninactive (NN) pairs and noninactive-inactive (NI) pairs of 1,389 AIDs in the PubChem BioAssay database: (a) shape-Tanimoto (ST), (b) color-Tanimoto ...
Figure 10
μ(XTNN-NI) per-AID similarity statistics. The distribution of the difference of the average similarity scores for noninactive-noninactive (NN) pairs and noninactive-inactive (NI) pairs of 1,389 AIDs in the PubChem BioAssay database: (a) shape-Tanimoto ...

When looking at the distributions in Figure Figure99 of the per-AID results, it is interesting to see, for a single conformer per compound anyway, that the per-AID average similarity distribution of NN pairs (primarily corresponding to the reputed "active/active" compound space) overlaps extensively with those of the NI pairs (essentially the reputed "active/inactive" compound space). The original hope was that there might be two clearly separated distributions, as this would be a clear signal that 3-D similarity using a single conformer per compound is able to distinguish between "actives" and "inactives" across all PubChem assays, but this is clearly not the case. The average and standard deviation of the An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i20.gif and An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i21.gif values per-AID over the 1,389 AIDs in Table Table33 were 0.58 ± 0.05 and 0.57 ± 0.04, respectively. The corresponding values for An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i22.gif and An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i23.gif were 0.11 ± 0.07 and 0.09 ± 0.05, respectively. The small differences in these overall averages between the NN and NI pairs per-AID should not be considered statistically significant, considering their standard deviations. In fact, the average of averages per-AID for the NN and NI pairs are not significantly different from the STST-opt and CTST-opt values for random conformers (0.54 ± 0.10 and 0.07 ± 0.05, respectively), listed in Table Table2.2. For the same reason, the ComboTST-opt differences between the NN and NI pairs are also not statistically significant. Note that, although the An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i25.gif values primarily increase from primary screenings to confirmatory assays to summary assays in general, this increase should also not be interpreted to be statistically meaningful, considering that the An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i26.gif values increases even more rapidly, as shown in Tables Tables33--4.4. The optimization type (i.e., either ST- or CT-optimization) was also found to not make significant difference in μ(XTNN-NI) values.

Despite the significant overlap between the distributions for the NN and NI pairs in Figure Figure9,9, there are very subtle differences between them; for all six similarity scores, the NN-pair distributions, compared to the NI-pair distributions, have smaller AID counts at the peak and greater AID counts at the upper-tail region, indicating a small shift of the NN-pair distribution toward high similarity scores. This shift is also reflected in sharp, (mostly) normal distributions of μ(XTNN-NI), centered on the positive side just above zero in all cases (Figure (Figure10).10). This suggests that single conformer per compound 3-D similarity is showing some of the anticipated effect of the "similarity principle", which states that structurally similar molecules are likely to have similar biological activities [33-36], such that the "active/active" space is separated from the "active/inactive" space; however, for most assays in PubChem, this effect is simply not large enough to be unambiguous for all biological assays, as reflected in the μ[μ(XTNN-NI)] values smaller than σ[μ(XTNN-NI)] for all six similarity measures. Tables Tables33 and and44 also clearly show that, in general, there is no clear statistically meaningful separation across assays or assay category type using a single conformer per compound. For example, while there is clearly a positive average of NN-NI difference across all similarity score types for all assays and all assay categories, ranging from 0.00-0.13 for μ[μ(XTNN-NI)], the corresponding standard deviation of the average [i.e., "σ[μ(XTNN-NI)]"] is consistently larger than the average value.

These results lead to a number of questions. Why isn't there a greater, unambiguous separation in the 3-D similarity scores between the NN and NI pairs? Is it that we are employing a single conformer per compound in the analysis? After all, the current PubChem3D theoretical conformer generation approach does not guarantee that the single (default) conformer used for each molecule in the NN pairs is a (or "the") bioactive conformation. A general premise of the interpretation of 3-D similarity between a NN pair requires a "bioactive" conformation surrogate for both noninactive molecules. Estimating 3-D similarity between "non-bioactive" conformers of both molecules, or between a "bioactive" conformer of one molecule and a "non-bioactive" conformer of the other, is essentially identical to 3-D similarity comparison for the NI pairs. Therefore, the use of a single conformer per compound is not likely to result in enough similarity score difference between the NN and NI pairs across a wide set of assays. Using multiple conformers per compound may result in a greater separation in similarity scores between the NN and NI pairs, but performing the same analysis using multiple conformers per compound is prohibitively expensive, considering that we are dealing with 269.7 billion conformer pairs arising from 734 thousand compounds and optimizing each conformer pair by ST and then by CT (9 TB of data gzip compressed). Any increase in the count of conformers also increases the computational complexity (and data storage requirements) by the square of the number of conformers per compound considered.

From a gross statistical approach, there is not sufficient separation across the averages of assays for a single conformer per compound to say definitively there is a clear separation between NN and NI pairs. It could be that, by considering multiple conformers per compound (and picking the best similarity conformer pair per compound pair), a clearer separation may occur, but this is a study for another day (and a bigger computer cluster and a bigger data storage system). There are, however, clear examples where some AIDs do show a clear separation, as shown in the tail regions of Figure Figure10,10, using only a single conformer per compound.

C-3. Outliers

Although the overall average differences in similarity scores between the NN and NI pairs were not statistically significant, some AIDs do have substantial (and statistically meaningful) NN-NI differences. These "outlier" cases correspond to the tail regions of the distribution curves in Figure Figure10.10. For each of the six similarity measures, the AIDs that lie outside the An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i29.gif region were extracted and are henceforth defined as "outliers". Figure Figure1111 shows Venn diagrams detailing the outlier overlap as a function of 3-D similarity score type. To aid in discussion, the AIDs that have a statistically significant positive value of average NN-NI difference are deemed "upper-bound" cases [Figure 11(b) and 11(d)] and the AIDs that have a statistically significant negative value of average NN-NI difference are deemed "lower-bound" cases [Figure 11(a) and 11(c)].

Figure 11
Assay μ[μ(XTNN-NI)] outlier commonality by 3-D similarity type. The Venn diagrams show the number of AIDs whose difference of the average similarity scores for noninactive-noninactive (NN) pairs and noninactive-inactive (NI) pairs of 1,389 ...

The lower-bound cases are when the average 3-D similarity scores for "active/inactive" compound pairs are greater than for "active/active" compound pairs, a counter result to the whole notion of chemical similarity. While the opposite of what one might expect, it can readily occur from a set of chemical structures that are predominately 3-D similar, being on both sides of that subjective and (at times) arbitrary line of being "active" or "inactive", and where most compounds in the compound series are considered "inactive", as can be the case with well defined "activity cliffs" [34,37-40].

Among the 109 unique, lower-bound An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i30.gif outliers, 102 (94%) are An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i31.gif outliers whereas only 7 (6%) are unique to An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i32.gif outliers [Figure 11 (a)]. A similar trend is found in the case of lower-bound An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i33.gif outliers [Figure 11 (c)]. Perhaps this should not be a surprise as shape alone (ignoring features) might not be expected to be a good discriminator of "actives" and "inactives". On the other hand, as shown in Figure 11 (b), there are relatively few unique upper-bound outlier cases solely attributable to An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i34.gif and An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i35.gif, being only 39 (24%) and 6 (4%), respectively, of the total. Rather, there is significant overlap between all three 3-D similarity score types, An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i36.gif, An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i37.gif, and An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i38.gif, with 120 of the 165 (73%) unique upper-bound outlier cases common to An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i39.gif. Again, a similar trend is found for upper-bound An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i40.gif outliers [Figure 11 (d)]. This suggests, for the upper-bound AID outlier cases, use of ComboT similarity score is most efficient at finding most of the outlier cases when using a single conformer per compound.

Figure Figure1212 compares the An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i41.gif and An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i42.gif AID outlier cases. There are 120 and 125 upper-bound AID outliers for An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i43.gif and An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i44.gif, respectively, and 116 are common to both. In contrast, there are 26 AIDs common in the An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i45.gif and An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i46.gif lower-bound outliers, while about half that value are unique to each. This shows that the upper-bound AID outliers are predominately conformer superposition optimization type independent.

Figure 12
Assay μ[μ(ComboTNN-NI)] outlier commonality by superposition optimization type. The Venn diagrams show the number of AIDs whose difference of the average ComboT similarity scores for noninactive-noninactive (NN) pairs and noninactive-inactive ...

Table Table55 gives the top 25% of the common ComboTNN-NI upper-bound AID outliers, yielding the largest magnitude difference in average NN-NI separation, and Table Table66 gives all common ComboTNN-NI lower-bound AID outliers. Table Table77 lists the count of assay outliers broken down by optimization type and similarity metric type. Exploring the top five assays in Table Table5,5, the first three represent trivial examples of a compound series easily identifiable using 2-D similarity or 3-D similarity or by eye. AID 672, with the fourth largest NN-NI positive difference found, is somewhat more interesting.

Table 5
Top 25% of common upper-bound assay μ(ComboTNN-NI) outliers.
Table 6
Common lower-bound assay μ(ComboTNN-NI) outliers.
Table 7
Outlier breakdown by optimization type and similarity type.

AID 672 is a secondary confirmatory assay with four active compounds, shown in Figure 13 (a), that comprise the NN pairs. Of these four structures, three have a similar substructure but only two of the structures (CIDs 647501 and 653297) might be considered "similar" with a 0.76 2-D similarity using the PubChem subgraph fingerprint [Figure 13 (b)]; however, using ComboTST-opt 3-D similarity, all four compounds have pair-wise similarity beyond random (i.e., ComboTST-opt > { μ + σ } = 0.74 from Table Table2)2) except for one compound pair (CIDs 66541 and 787437). An example of one of these pair-wise superpositions [Figure 13 (c)] shows one way these different chemical structures can be superimposed relative to their shape and feature complements. While a relatively small example, and easy to examine in detail, there readily exists much larger examples.

Figure 13
Separation between actives and inactives. An example of clear separation between An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i51.gif 3-D similarities of 0.45 (see Table 5), the four active compounds from AID 672: (a) shows 2-D and 3-D similarity dendrograms generated using the PubChem Structure Clustering ...

AID 2230, also a secondary confirmatory assay and fifth in the list found in Table Table5,5, possesses a much larger NN set with 92 compounds. When examining these by 2-D cluster analysis using the PubChem Structure Clustering tool, as shown in Figure Figure14,14, there are clearly two compound series, one with 51 compounds and the other with 31 compounds, representing the majority of the "active" chemical structures. Switching to 3-D ComboT similarity, all but four of the 92 compounds, as shown in Figure Figure15,15, are inter-related at a ComboTCT-opt value above 1.04. As shown in Table Table2,2, a value of 1.04 is more than three standard deviations away from the random average of 0.59 for ComboTCT-opt. As one goes to a ComboTCT-opt value of 1.2, several different clusters appear with the largest containing 46 compounds and second largest containing 20 compounds. This demonstrates how 3-D similarity is able to relate chemical series distinct in 2-D similarity, as representing similar shape and feature space even with a single conformer per compound.

Figure 14
2-D similarity isolates related chemical series. Dendrogram from the PubChem Structure Clustering tool for 88 of the 92 noninactive pairs from AID 2230 showing two primary clusters (containing 51 and 31 compounds, respectively) at 0.8 Tanimoto using 2-D ...
Figure 15
3-D similarity interrelates chemical series. Dendrogram from the PubChem Structure Clustering tool for 88 of the 92 noninactive pairs from AID 2230 showing three primary clusters (containing 46, 20, and 13 compounds, respectively) at 1.2 combo Tanimoto ...

Conclusion

Six 3-D similarity measures (STST-opt, CTST-opt, ComboTST-opt, STCT-opt, CTCT-opt, and ComboTCT-opt) in conjunction with 734,486 biologically tested compounds from PubChem were utilized to help answer the question: what is a biologically meaningful 3-D similarity score? The distribution of the six similarity measures for biologically tested compound pairs, resulting from computation of all-against-all similarity scores (269.7 billion unique conformer pairs), yielded an average and standard deviation for STST-opt, CTST-opt, ComboTST-opt, STCT-opt, CTCT-opt, and ComboTCT-opt of 0.54 ± 0.10, 0.07 ± 0.05, 0.62 ± 0.13, 0.41 ± 0.11, 0.18 ± 0.06, and 0.59 ± 0.14, respectively. These values represent valuable benchmarks for the 3-D similarity values provided by PubChem and those computed by some commercial software packages. One can now know when a statistically meaningful superposition between a conformer pair occurs, potentially helping to improve their ability to analyze bioactivity information.

This random distribution of biologically tested compounds was constructed using a single theoretical conformer per compound (the "default" conformer provided by PubChem). If one were to use multiple diverse conformers per compound and pick the best 3-D similarity score, the average random distribution values may well be higher (perhaps significantly so); however, if one considers the continuum of all similarity values produced in the use of multiple diverse conformers per compound to yield a similar random distribution values, the averages (and standard deviations) above may still be applicable or, perhaps, treated as a conservative lower bound result. Further study is clearly warranted using multiple diverse conformers per compound. This work is a critical first step covering a very wide corpus of chemical structures and biological assays and creating a statistical framework to build upon.

The second part of this study explored the question of whether it was possible to realize a statistically meaningful 3-D similarity value separation between reputed biological assay "inactives" and "actives". Using the terminology of noninactive-noninactive (NN) pairs and the noninactive-inactive (NI) pairs to represent comparison of the "active/active" and "active/inactive" spaces, respectively, each of the 1,389 biological assays were examined by their 3-D similarity score differences between the NN and NI pairs and analyzed across all assays and assay category types. Regardless of the optimization type employed (i.e., either of ST- or CT-optimization), the overall average difference between the μ(XTNN) and μ(XTNI) values, while consistently positive (as hoped), were not statistically unambiguous after considering their large standard deviations. Similarly, an increase in the An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i47.gif values upon going from primary screenings to confirmatory assays to summary assays was also not statistically meaningful, due to an even more rapid increase in the An external file that holds a picture, illustration, etc.
Object name is 1758-2946-3-26-i48.gif values.

The negligible difference in 3-D similarity between the NN and NI pairs may be due to employing a single conformer per compound in this study. Conceivably the 3-D similarity between two noninactive molecules should be evaluated using the "bioactive" conformer for each molecule, being the conformer giving rise to the observed biological activity; however, the single conformers per compound used in the present study are not guaranteed to be sufficiently similar to the bioactive conformers, and the average similarity scores per-AID for the NN pairs were not much different than those from the NI pairs. Considering the negligible difference in the 3-D similarity scores between the NN and NI pairs, it may not be appropriate to analyze bioassay data with a single conformer per compound in a general sense. With that said, there were a subset of biological assays where a clear separation between the NN and NI pairs were found. In addition, use of combo Tanimoto (ComboT) alone, independent of superposition optimization type, appears to be the most efficient 3-D score type in identifying these cases.

Materials and methods

1. Datasets

At the time of project initiation (late January of 2010), there were 2,008 bioassays (unique identifier AID) deposited in the PubChem BioAssay database, ranging from AID 1 to AID 2310. Among the chemical structures tested in these assays, those with associated PubChem Compound records (unique identifier CID) with theoretical 3-D conformer models available [22] were considered in the present study. Note that the 3-D information is only available for CIDs that satisfy the following restrictions [22,23]:

(1) is a single covalent component.

(2) contains only organic [H, C, N, O, F, P, S, Cl, Br, and I] elements

(3) possess only typical bonding situation (e.g., no hyper valent situations)

(4) not too big (e.g., 50 non-hydrogen atoms or less) and not too flexible (e.g., 15 effective rotors or less)

(5) have five undefined stereocenters or less

There are 734,486 CIDs satisfying the above conditions for the 2,008 AIDs. All data is accessible from the PubChem website (http://pubchem.ncbi.nlm.nih.gov). Bulk download of data is also available from the PubChem FTP site (ftp://ftp.ncbi.nlm.nih.gov/pubchem). The AIDs considered are provided in Additional File 1 with per-AID statistics of 3-D similarity scores for the NN and NI pairs.

2. Similarity Score Computation

In the first part of this study, the first diverse conformer [24] for each of the 734,486 CIDs were downloaded. A total of six different 3-D similarity scores were computed, resulting from three different similarity metrics computed for conformer pairs superpositions optimized in two different ways. The three similarity metrics are: shape Tanimoto [ST, Equation (1)], measuring the shape similarity; color Tanimoto [CT, Equation (2)], measuring the similarity of 3-D orientation of functional groups used to defined pharmacophores (specified simply as features); and combo Tanimoto (ComboT), the simple sum of ST and CT [Equation (3)]. The two conformer superposition methods used optimize: by shape similarity (ST-optimized), where conformer shape overlap is maximized; and feature similarity (CT-optimized), where conformer feature overlap is maximized. Feature definitions and all similarities were computed using the C++ Shape toolkit [28] from OpenEye Scientific Software, Inc.

There were a total of 269,734,474,855 conformer pair similarity sets from all possible unique combinations of the 734,486 conformers. Histograms of the computed similarity scores were generated after binning all similarity scores in 0.01 increments [using the C function "rint(float)"]. Note that we used only the first diverse conformer for each compound, being the PubChem default conformer. Considering the total size of data files (9.0 TB compressed, when storing only the two conformer IDs, the two similarity scores, the 3 × 3 rotational matrix, and translation vector per conformer pair computed), employing additional conformers per compound in this study would quickly overwhelm the available computational resources and disk space to consider.

Many of the compounds in the present study were biologically tested in multiple assays, and hence, a substantial fraction of conformer pairs appear in multiple assays. Therefore, since consideration is given to one assay at a time, extracting the similarity scores for the conformer pairs tested in each AID from the all-by-all similarity score matrices computed and stored in the first part of study is described in Figure Figure1616.

Figure 16
Analysis method overview. Pseudo code that describes the process by which the average and standard deviation of the ST-optimized similarity scores for noninactive-inactive (NI) pairs for individual bioassay were computed. This process was repeated for ...

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

EEB computed the similarity score matrices. SK analyzed the data and wrote the first draft. SHB reviewed the final manuscript. All authors read and approved the final manuscript.

Supplementary Material

Additional file 1:

Similarity Scores Statistical parameters of similarity scores for each AID.

Acknowledgements

We are grateful to the NCBI Systems staff, especially Ron Patterson, Charlie Cook, and Don Preuss, whose efforts helped make the PubChem3D project possible. This research was supported (in part) by the Intramural Research Program of the National Library of Medicine, National Institutes of Health, U.S. Department of Health and Human Services. This study utilized the high-performance computational capabilities of the Biowulf Linux cluster at the National Institutes of Health, Bethesda, MD. (http://biowulf.nih.gov).

References

  • Aina OH, Liu RW, Sutcliffe JL, Marik J, Pan CX, Lam KS. From combinatorial chemistry to cancer-targeting peptides. Mol Pharm. 2007;4:631–651. doi: 10.1021/mp700073y. [PubMed] [Cross Ref]
  • Pettersson S, Clotet-Codina I, Este JA, Borrell JI, Teixido J. Recent advances in combinatorial chemistry applied to development of anti-HIV drugs. Mini-Rev Med Chem. 2006;6:91–108. doi: 10.2174/138955706775197820. [PubMed] [Cross Ref]
  • Corbett PT, Leclaire J, Vial L, West KR, Wietor JL, Sanders JKM, Otto S. Dynamic combinatorial chemistry. Chem Rev. 2006;106:3652–3711. doi: 10.1021/cr020452p. [PubMed] [Cross Ref]
  • Rupasinghe CN, Spaller MR. The interplay between structure-based design and combinatorial chemistry. Curr Opin Chem Biol. 2006;10:188–193. doi: 10.1016/j.cbpa.2006.03.014. [PubMed] [Cross Ref]
  • Diller DJ. The synergy between combinatorial chemistry and high-throughput screening. Curr Opin Drug Discov Dev. 2008;11:346–355. [PubMed]
  • Moos WH, Hurt CR, Morales GA. Combinatorial chemistry: oh what a decade or two can do. Mol Divers. 2009;13:241–245. doi: 10.1007/s11030-009-9127-y. [PubMed] [Cross Ref]
  • Dunlop J, Bowlby M, Peri R, Vasilyev D, Arias R. High-throughput electrophysiology: an emerging paradigm for ion-channel screening and physiology. Nat Rev Drug Discov. 2008;7:358–368. doi: 10.1038/nrd2552. [PubMed] [Cross Ref]
  • Inglese J, Johnson RL, Simeonov A, Xia MH, Zheng W, Austin CP, Auld DS. High-throughput screening assays for the identification of chemical probes. Nat Chem Biol. 2007;3:466–479. doi: 10.1038/nchembio.2007.17. [PubMed] [Cross Ref]
  • Echeverri CJ, Perrimon N. High-throughput RNAi screening in cultured cells: a user's guide. Nat Rev Genet. 2006;7:373–384. doi: 10.1038/nrg1836. [PubMed] [Cross Ref]
  • Malo N, Hanley JA, Cerquozzi S, Pelletier J, Nadon R. Statistical practice in high-throughput screening data analysis. Nat Biotechnol. 2006;24:167–175. doi: 10.1038/nbt1186. [PubMed] [Cross Ref]
  • Bajorath F. Integration of virtual and high-throughput screening. Nat Rev Drug Discov. 2002;1:882–894. doi: 10.1038/nrd941. [PubMed] [Cross Ref]
  • Goddard JP, Reymond JL. Enzyme assays for high-throughput screening. Curr Opin Biotechnol. 2004;15:314–322. doi: 10.1016/j.copbio.2004.06.008. [PubMed] [Cross Ref]
  • Edwards BS, Oprea T, Prossnitz ER, Sklar LA. Flow cytometry for high-throughput, high-content screening. Curr Opin Chem Biol. 2004;8:392–398. doi: 10.1016/j.cbpa.2004.06.007. [PubMed] [Cross Ref]
  • Chen P. Electrospray ionization tandem mass spectrometry in high-throughput screening of homogeneous catalysts. Angew Chem Int Ed. 2003;42:2832–2847. doi: 10.1002/anie.200200560. [PubMed] [Cross Ref]
  • Hertzberg RP, Pope AJ. High-throughput screening: new technology for the 21st century. Curr Opin Chem Biol. 2000;4:445–451. doi: 10.1016/S1367-5931(00)00110-1. [PubMed] [Cross Ref]
  • White RE. High-throughput screening in drug metabolism and pharmacokinetic support of drug discovery. Annu Rev Pharmacol Toxicol. 2000;40:133–157. doi: 10.1146/annurev.pharmtox.40.1.133. [PubMed] [Cross Ref]
  • Sundberg SA. High-throughput and ultra-high-throughput screening: solution- and cell-based approaches. Curr Opin Biotechnol. 2000;11:47–53. doi: 10.1016/S0958-1669(99)00051-8. [PubMed] [Cross Ref]
  • Bolton EE, Wang Y, Thiessen PA, Bryant SH. In: Annual Reports in Computational Chemistry.Volume. 4. Ralph AW, David CS, editor. Elsevier; 2008. PubChem: integrated platform of small molecules and biological activities; pp. 217–241. [Cross Ref]
  • Wang YL, Xiao JW, Suzek TO, Zhang J, Wang JY, Bryant SH. PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res. 2009;37:W623–W633. doi: 10.1093/nar/gkp456. [PMC free article] [PubMed] [Cross Ref]
  • Wang YL, Bolton E, Dracheva S, Karapetyan K, Shoemaker BA, Suzek TO, Wang JY, Xiao JW, Zhang J, Bryant SH. An overview of the PubChem BioAssay resource. Nucleic Acids Res. 2010;38:D255–D266. doi: 10.1093/nar/gkp965. [PMC free article] [PubMed] [Cross Ref]
  • Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Federhen S. et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2010;38:D5–D16. doi: 10.1093/nar/gkp967. [PMC free article] [PubMed] [Cross Ref]
  • Bolton EE, Kim S, Bryant SH. PubChem3D: conformer generation. J Cheminformatics. 2011;3:4. doi: 10.1186/1758-2946-3-4. [PMC free article] [PubMed] [Cross Ref]
  • Bolton EE, Kim S, Bryant SH. PubChem3D: diversity of shape. J Cheminformatics. 2011;3:9. doi: 10.1186/1758-2946-3-9. [PMC free article] [PubMed] [Cross Ref]
  • Bolton EE, Kim S, Bryant SH. PubChem3D: similar conformers. J Cheminformatics. 2011;3:13. doi: 10.1186/1758-2946-3-13. [PMC free article] [PubMed] [Cross Ref]
  • Kim S, Bolton EE, Bryant SH. PubChem3D: shape compatibility filtering using molecular shape quadrupoles. J Cheminformatics. 2011;3:25. doi: 10.1186/1758-2946-3-25. [PMC free article] [PubMed] [Cross Ref]
  • PubChem substructure fingerprint description. ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.pdf
  • ROCS - Rapid Overlay of Chemical Structures. Version 3.0.0, OpenEye Scientific Software, Inc.: Santa Fe, NM; 2009.
  • ShapeTK-C++ Version 1.8.0, OpenEye Scientific Software, Inc.: Santa Fe, NM; 2010.
  • Grant JA, Gallardo MA, Pickup BT. A fast method of molecular shape comparison: a simple application of a Gaussian description of molecular shape. J Comput Chem. 1996;17:1653–1666. doi: 10.1002/(SICI)1096-987X(19961115)17:14<1653::AID-JCC7>3.0.CO;2-K. [Cross Ref]
  • Rush TS, Grant JA, Mosyak L, Nicholls A. A shape-based 3-D scaffold hopping method and its application to a bacterial protein-protein interaction. J Med Chem. 2005;48:1489–1495. doi: 10.1021/jm040163o. [PubMed] [Cross Ref]
  • Nicholls A, McGaughey GB, Sheridan RP, Good AC, Warren G, Mathieu M, Muchmore SW, Brown SP, Grant JA, Haigh JA. et al. Molecular shape and medicinal chemistry: a perspective. J Med Chem. 2010;53:3862–3886. doi: 10.1021/jm900818s. [PMC free article] [PubMed] [Cross Ref]
  • McGaughey GB, Sheridan RP, Bayly CI, Culberson JC, Kreatsoulas C, Lindsley S, Maiorov V, Truchon JF, Cornell WD. Comparison of topological, shape, and docking methods in virtual screening. J Chem Inf Model. 2007;47:1504–1519. doi: 10.1021/ci700052x. [PubMed] [Cross Ref]
  • Johnson MA, Maggiora GM, (Eds.) Concepts and Applications of Molecular Similarity. New York, NY: John Wiley & Sons, Inc; 1990.
  • Maggiora GM. On outliers and activity cliffs - why QSAR often disappoints. J Chem Inf Model. 2006;46:1535–1535. doi: 10.1021/ci060117s. [PubMed] [Cross Ref]
  • Martin YC, Kofron JL, Traphagen LM. Do structurally similar molecules have similar biological activity? J Med Chem. 2002;45:4350–4358. doi: 10.1021/jm020155c. [PubMed] [Cross Ref]
  • Willett P. Similarity methods in chemoinformatics. Annu Rev Inform Sci Technol. 2009;43:3–71.
  • Dimova D, Wawer M, Wassermann AM, Bajorath J. Design of multitarget activity landscapes that capture hierarchical activity cliff distributions. J Chem Inf Model. 2011;51:258–266. doi: 10.1021/ci100477m. [PubMed] [Cross Ref]
  • Wassermann AM, Bajorath J. Chemical substitutions that introduce activity cliffs across different compound classes and biological targets. J Chem Inf Model. 2010;50:1248–1256. doi: 10.1021/ci1001845. [PubMed] [Cross Ref]
  • Medina-Franco JL, Martinez-Mayorga K, Bender A, Marin RM, Giulianotti MA, Pinilla C, Houghten RA. Characterization of activity landscapes using 2D and 3D similarity methods: consensus activity cliffs. J Chem Inf Model. 2009;49:477–491. doi: 10.1021/ci800379q. [PubMed] [Cross Ref]
  • LeDonne N, Rissolo K, Bulgarelli J, Tini L. Use of structure-activity landscape index curves and curve integrals to evaluate the performance of multiple machine learning prediction models. J Cheminformatics. 2011;3:7. doi: 10.1186/1758-2946-3-7. [PMC free article] [PubMed] [Cross Ref]

Articles from Journal of Cheminformatics are provided here courtesy of BioMed Central