High-throughput screening (HTS) is the dominant method of lead discovery in pharmaceutical research and chemical biology. A plurality of the new chemical entities in clinical trials may have their origins in this technique, as do at least two drugs.1
Whereas these screens have been productive against traditional drug targets, such as GPCRs, ligand-gated ion channels, and kinases, screening libraries of synthetic molecules has been problematic for others, such as antimicrobial targets and those identified from genomic studies. The reasons for these successes and failures have been widely debated.2-5
From a theoretical perspective, however, one might wonder not that screens of 10⁶ molecules sometimes fail, but rather that they ever succeed. Chemical space, i.e. all possible molecules, is estimated to be greater than 10⁶⁰ molecules with 30 or fewer heavy atoms.6 At 10 μg of each, their combined mass would exceed the mass of the observable universe. This figure will diminish if criteria for synthetic accessibility and drug-likeness are taken into account and increase steeply if up to 35 heavy atoms, about 500 Daltons, are allowed. Positing even a modest specificity of proteins for their ligands, the odds of a hit in a random selection of 10⁶ molecules from this space seem negligible.
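The scale of these odds can be made concrete with a short calculation. The sketch below is illustrative only: it takes the two sizes quoted above (10⁶ screened molecules, at least 10⁶⁰ molecules in the space) and assumes that each randomly chosen molecule binds independently with some probability `binder_fraction`. That parameter and the three values tried for it are hypothetical, not estimates from the literature.

```python
import math

SPACE_SIZE = 10**60    # lower-bound estimate of chemical space quoted in the text
LIBRARY_SIZE = 10**6   # screening library size quoted in the text


def prob_at_least_one_hit(library_size: int, binder_fraction: float) -> float:
    """Chance of >= 1 hit if each random pick binds with probability binder_fraction."""
    # 1 - (1 - f)^N, evaluated with log1p/expm1 so that very small f
    # is not lost to floating-point rounding.
    return -math.expm1(library_size * math.log1p(-binder_fraction))


# Fraction of chemical space that a random library of this size samples.
sampled_fraction = LIBRARY_SIZE / SPACE_SIZE

# Number of binders the space must contain for one expected hit per screen:
# expected hits = N * (binders / space) = 1  =>  binders = space / N.
binders_for_one_expected_hit = SPACE_SIZE // LIBRARY_SIZE

if __name__ == "__main__":
    print(f"sampled fraction of space: {sampled_fraction:.1e}")
    exponent = len(str(binders_for_one_expected_hit)) - 1
    print(f"binders needed for one expected hit: 10^{exponent}")
    # Hypothetical binder fractions, chosen only to illustrate the scaling.
    for f in (1e-12, 1e-9, 1e-6):
        p = prob_at_least_one_hit(LIBRARY_SIZE, f)
        print(f"f = {f:.0e}: P(at least one hit) = {p:.3g}")
```

Under these assumptions, a random library is likely to yield a hit only if roughly one molecule in a million binds the target, which for a space of 10⁶⁰ molecules would require about 10⁵⁴ binders.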
HTS nevertheless does return active molecules for many targets; how does it overcome the odds stacked against it? One might hazard two hypotheses. First, molecules that are formally chemically different can be degenerate to a target, and many derivatives of a chemotype may have little effect on affinity. This behavior, and the polypharmacology of small molecules,7-9 undoubtedly contributes to screening hit rates. Such chemical degeneracy seems unlikely, however, to overcome the long odds against screening. A second explanation is that screening libraries are far from random selections, but rather are biased toward molecules likely to be recognized by biological targets. This second hypothesis seems more plausible, as many accessible molecules are likely to resemble or derive from metabolites and natural products. Some of these will have been synthesized to resemble such biogenic molecules, while others will have used biogenic molecules as a starting material. The role of bias in screening has been mooted before,10-13 and indeed methods to measure metabolite- or natural product-likeness have been reported, permitting the design of these features into screening libraries.14,15 How such bias might be quantified relative to what one would expect for an unbiased collection, and thus its extent and impact on screening and discovery, has remained unexplored.
Quantifying library bias requires three sets of molecules: one that represents all of chemical space, one that represents molecules that proteins are intrinsically likely to recognize (defining the optimal bias), and one that represents screening libraries. The set representing chemical space previously seemed inaccessible. Recently, however, Fink and Reymond have calculated all of the synthetically accessible molecules with 11 or fewer non-hydrogen (heavy) atoms composed of first row elements (C, N, O, and F); there are over 26 million of these, not allowing for stereochemistry.16 Whereas these molecules are small compared to most biologically interesting compounds, this Generated DataBase (GDB) is comprehensive, giving us the full unbiased set within its boundary criteria. For the molecules that proteins are likely to bind (defining the bias), several sets are possible, such as those molecules that have become drugs. Indeed, several investigators have adopted this approach when asking “what is drug-likeness and how can libraries be biased towards it?”17,18 Here, however, we ask why one should expect to find any hits from screening, and so need a reference set that captures protein recognition in general. For this purpose drugs are imperfect, reflecting many other criteria, like bioavailability, and are backward-looking, capturing information only on a small number of targets. We therefore chose metabolites and natural products from the KEGG (2 018 molecules) and the Dictionary of Natural Products (141 985 molecules) databases, respectively. These molecules are recognized by at least one protein in the biosphere, often many, and are out-group molecules, uninfluenced by human invention. For the set of molecules representing screening libraries we use those molecules that are commercially available, reasoning that most HTS libraries, even in the pharmaceutical industry, are largely composed of molecules that have been purchased from commercial vendors, or closely resemble them (for the MLSMR, the US national screening collection, almost all of the ~300 000 molecules are commercially sourced). To compare the commercially available molecules to those of the GDB, we restrict the former by the same criteria: only purchasable molecules with 11 or fewer heavy atoms composed of first row elements are considered. There are 25 810 such molecules in the ZINC database of commercially available molecules (http://zinc.docking.org); we refer to these as the purchasable-GDB.
Figure: Overlap between commercially available molecules and the GDB gives the purchasable-GDB.
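The size and element restriction applied to the commercially available molecules can be illustrated with a small filter. This is a minimal sketch and not the procedure used to build the purchasable-GDB: it works on flat molecular formulas rather than structures, the function names and the example catalogue are hypothetical, and it ignores the synthetic-accessibility rules that the GDB also applies. A faithful version would operate on structures and intersect them with the GDB itself.

```python
import re
from collections import Counter

ALLOWED_HEAVY_ELEMENTS = {"C", "N", "O", "F"}  # element set of the GDB, per the text
MAX_HEAVY_ATOMS = 11                            # size limit of the GDB, per the text

_FORMULA = re.compile(r"(?:[A-Z][a-z]?\d*)+")
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula: str) -> Counter:
    """Turn a flat molecular formula such as 'C6H12O6' into element counts."""
    if not _FORMULA.fullmatch(formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    counts = Counter()
    for element, number in _TOKEN.findall(formula):
        counts[element] += int(number) if number else 1
    return counts


def within_gdb_bounds(formula: str) -> bool:
    """True if the formula has 1 to 11 heavy atoms, all drawn from C, N, O and F."""
    counts = parse_formula(formula)
    heavy = {el: n for el, n in counts.items() if el != "H"}
    return (
        set(heavy) <= ALLOWED_HEAVY_ELEMENTS
        and 0 < sum(heavy.values()) <= MAX_HEAVY_ATOMS
    )


# Hypothetical vendor catalogue entries (name -> formula), for illustration only.
catalogue = {
    "benzene": "C6H6",
    "glucose": "C6H12O6",
    "chloroacetic acid": "C2H3ClO2",
    "phenethylamine": "C8H11N",
    "aspirin": "C9H8O4",
}
purchasable_subset = [
    name for name, formula in catalogue.items() if within_gdb_bounds(formula)
]

if __name__ == "__main__":
    print(purchasable_subset)
```

In this toy catalogue, glucose and aspirin are excluded for having more than 11 heavy atoms, and chloroacetic acid for containing chlorine.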
As we will show, when metabolites are compared to both the purchasable-GDB and the full GDB, the purchasable subset is almost 1000-fold more similar to metabolites than the overall GDB, our proxy to full chemical space. The same bias is observed when the two sets are compared to natural products. The bias grows dramatically with molecular size, suggesting that this bias will be greater still among larger “lead-like” or “drug-like” molecules in screening. This is consistent with the idea that these libraries are massively and productively biased toward biogenic molecules. We leverage this observation to ask what scaffolds occur among biogenic molecules but are absent from those commercially available. Almost 1300 ring-scaffolds are found among natural products that are missing from commercial libraries—these scaffolds provide criteria that could be used to further increase the bias in screening libraries toward those molecules that proteins have evolved to recognize.
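One way to express such a bias as a single number is a fold enrichment: the fraction of a library that resembles some reference molecule, divided by the same fraction for the unbiased set. The sketch below is an assumption-laden illustration, not the analysis behind the figures quoted above. It assumes each molecule is represented by a set of integer features (a hypothetical fingerprint), uses the Tanimoto coefficient as the similarity measure, and picks an arbitrary threshold; none of these choices is specified in this section, and the data are toy values.

```python
from typing import FrozenSet, Sequence

Features = FrozenSet[int]  # hypothetical fingerprint: a set of feature identifiers


def tanimoto(a: Features, b: Features) -> float:
    """Tanimoto coefficient: shared features divided by total distinct features."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def fraction_similar(
    library: Sequence[Features], reference: Sequence[Features], threshold: float
) -> float:
    """Fraction of library molecules with at least one reference neighbour."""
    if not library:
        raise ValueError("library is empty")
    hits = sum(
        1
        for mol in library
        if any(tanimoto(mol, ref) >= threshold for ref in reference)
    )
    return hits / len(library)


def fold_enrichment(
    biased: Sequence[Features],
    unbiased: Sequence[Features],
    reference: Sequence[Features],
    threshold: float,
) -> float:
    """Ratio of the similar fraction in the biased set to that in the unbiased set."""
    baseline = fraction_similar(unbiased, reference, threshold)
    if baseline == 0.0:
        raise ValueError("no unbiased molecule is similar; the ratio is undefined")
    return fraction_similar(biased, reference, threshold) / baseline


# Toy data, invented for illustration only.
reference_set = [frozenset({1, 2, 3, 4})]
unbiased_set = [
    frozenset({1, 2, 3, 5}),
    frozenset({5, 6, 7, 8}),
    frozenset({1, 6, 7, 8}),
    frozenset({9, 10}),
    frozenset({2, 9, 10, 11}),
]
biased_set = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 4, 6}),
    frozenset({5, 6, 7, 8}),
    frozenset({2, 3, 4}),
]

fold = fold_enrichment(biased_set, unbiased_set, reference_set, threshold=0.5)

if __name__ == "__main__":
    print(f"fold enrichment: {fold:.2f}")
```

With these toy sets, three of four biased molecules but only one of five unbiased molecules pass the threshold, so the enrichment is about 3.75-fold. The ratio depends strongly on the chosen threshold and representation, so any real comparison would need to report both.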