Structure-based virtual screening has had several important successes in recent years [1–10] and is now a common technique in early-stage drug discovery at most pharmaceutical companies as well as in some university groups. Unfortunately, virtual screening techniques continue to require expert knowledge and extensive infrastructure, and they remain out of reach for many medicinally and biologically oriented investigators who might otherwise be able to exploit them. Among the steepest barriers to entry is the lack of a suitable database of small molecules with which to screen. These databases are either expensive to acquire or time-consuming and difficult to prepare and curate. To be useful for structure-based screening, 3D structures must be calculated for each available molecule. The structures must be linked to the supplier information, itself requiring some database design. More difficult are the problems of calculating multiple protonation, stereo- and regiochemical, tautomeric, and conformational states for the database molecules. Computing these multiple molecular states is challenging and is the focus of ongoing research [11–13]. Finally, as supplier catalogs are often updated monthly, considerable curatorial work is required to remain current.
The “gold standard” for docking databases, at least in academic groups, has been the Available Chemicals Directory (ACD) from Molecular Design Limited (http://www.mdli.com, San Leandro, CA). This database contains about 250 000 purchasable compounds, while the screening compound analogue, the ACD-SC, has over 2.3 million compounds. The ACD has been extensively curated for chemical correctness, is compatible with corporate information systems using Oracle, and has molecules that are available as 3D models. Even the ACD, however, requires extensive further curation before docking. For instance, correct protonation states for docking must be assigned. Many ACD molecules have counterions that must be removed prior to docking. Each molecule is present in only one tautomeric form, sometimes not the biologically relevant one. Only a single conformation of each molecule is available. Finally, the ACD is a commercial product that is often too expensive for nonspecialist labs to purchase and maintain. Similarly, the ChemNavigator database (http://www.chemnavigator.com) contains over 10 million unique purchasable drug-like compounds but is neither entirely ready for docking nor free. Again, dealing with protonation, charge, tautomeric forms, and salts is left to the user.
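To make the counterion problem concrete, the following minimal sketch shows one crude way to strip salts from a SMILES string, in which disconnected components are separated by a dot. The function name `strip_counterions` and the size heuristic (counting alphabetic characters as a rough proxy for atom count) are assumptions made for illustration; they are not the procedure used by any of the databases discussed here, and a production pipeline would rely on a cheminformatics toolkit instead.

```python
def strip_counterions(smiles: str) -> str:
    """Return the largest dot-separated component of a SMILES string.

    Hypothetical helper: "largest" is approximated by the number of
    alphabetic characters, a rough stand-in for the atom count.
    """
    fragments = [frag for frag in smiles.split(".") if frag]
    if not fragments:
        raise ValueError("empty SMILES string")
    # On a tie, max() keeps the first fragment encountered.
    return max(fragments, key=lambda frag: sum(ch.isalpha() for ch in frag))


if __name__ == "__main__":
    # Sodium acetate: the acetate anion is kept, the sodium cation is dropped.
    print(strip_counterions("CC(=O)[O-].[Na+]"))
    # Trimethylammonium chloride: the organic cation is kept.
    print(strip_counterions("[Cl-].C[NH+](C)C"))
```

Note that the surviving fragment keeps whatever charge it carried in the salt; assigning a protonation state appropriate for docking is a separate step.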
Several free collections of small molecules are available, though none is entirely satisfactory for docking. The Ligand.Info database (http://Ligand.info) [14] contains about 1 million compounds from various free databases. Although it contains 3D structures, there has been little effort so far to tautomerize, protonate, and charge them for docking. Furthermore, many of its compounds are not purchasable. The ChemBank project [15] (http://chembank.med.harvard.edu) currently contains about 900 000 molecules, many annotated for function. As this is a 2D database, it is unsuitable for structure-based screening as is.
In principle, a virtual screening library could be built from the 2D compound information provided by many compound suppliers. Indeed, this is what we have ultimately done with ZINC. To do so, a 3D structure must be generated, typically from the 2D molecular description supplied by the vendors; these descriptions often contain stereo- or regioisomeric ambiguities. The correct protonation state, charge, and tautomeric forms must be enumerated or chosen. To avoid wasted effort, insoluble, reactive, and aggregating compounds should be eliminated.
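The cost of stereochemical ambiguity can be illustrated with a short sketch. Assuming each unspecified stereocenter can be either R or S, a molecule with n unspecified centers expands into 2^n candidate structures. The function `enumerate_stereo` and the center labels below are hypothetical and serve only to show the combinatorics; this is not a description of how ZINC performs the enumeration.

```python
from itertools import product


def enumerate_stereo(centers):
    """Yield every complete R/S assignment for a molecule.

    `centers` maps a (hypothetical) stereocenter label to "R", "S",
    or None when the vendor record leaves the center unspecified.
    """
    unknown = [name for name, label in centers.items() if label is None]
    for combo in product("RS", repeat=len(unknown)):
        assignment = dict(centers)          # copy the specified centers
        assignment.update(zip(unknown, combo))
        yield assignment


if __name__ == "__main__":
    # One defined center and two undefined ones: 2**2 = 4 candidates.
    record = {"C2": "R", "C5": None, "C7": None}
    for candidate in enumerate_stereo(record):
        print(candidate)
```

A fully specified record yields exactly one assignment, so the same code path handles stereochemically pure compounds.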
In an effort to make virtual screening more accessible to a broad community, we describe here a free database of purchasable molecules, many of them “drug-like” or “lead-like”, available in several 3D formats immediately usable by many popular docking programs. The salient criteria for our database, ZINC, an acronym for “ZINC is not commercial”, are as follows. Compounds should be purchasable for rapid testing of docking hypotheses. Subsets of molecules with variable properties such as functional groups, molecular weight, and calculated logP should be easy to create and manipulate. The database must support multiple protonation models, tautomeric forms, stereochemistries (e.g., racemic mixtures as well as stereochemically pure compounds), regioisomeric forms (E/Z isomerism), suppliers, and 3D conformational sampling. It should be possible to annotate molecules using both numeric and alphanumeric data. It should be easy to add new molecules, to tag or remove those that are no longer available, and to fix those that have errors. The database should be quick to search and download, and it should be straightforward to obtain regular updates.
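Several of these criteria (property-based subsets, supplier links, and mixed numeric and alphanumeric annotations) can be sketched with a small in-memory record type. Everything below is an assumption made for illustration: the `Compound` class, its field names, the supplier names, the property values, and the cutoffs passed to `subset` are hypothetical and do not reflect ZINC's actual schema or its definitions of "drug-like" or "lead-like".

```python
from dataclasses import dataclass, field


@dataclass
class Compound:
    compound_id: str                                  # hypothetical identifier
    smiles: str
    mol_weight: float                                 # g/mol
    clogp: float                                      # calculated logP
    suppliers: list = field(default_factory=list)
    annotations: dict = field(default_factory=dict)   # numeric or text values


def subset(compounds, max_mw, max_clogp, purchasable_only=True):
    """Return compounds within the property cutoffs, in input order."""
    return [
        c for c in compounds
        if c.mol_weight <= max_mw
        and c.clogp <= max_clogp
        and (c.suppliers or not purchasable_only)
    ]


# Illustrative records; the property values are approximate placeholders.
library = [
    Compound("example-1", "c1ccccc1", 78.11, 2.1,
             ["SupplierA"], {"note": "fragment-like"}),
    Compound("example-2", "CC(=O)Oc1ccccc1C(=O)O", 180.16, 1.2,
             ["SupplierA", "SupplierB"], {"assay_hits": 3}),
    Compound("example-3", "CCCCCCCCCCCCCCCCCC(=O)O", 284.48, 8.2, [], {}),
]

if __name__ == "__main__":
    for compound in subset(library, max_mw=350, max_clogp=3.5):
        print(compound.compound_id, compound.suppliers, compound.annotations)
```

Keeping suppliers and annotations as open-ended containers means a compound can be tagged as unavailable, or given a new annotation, without changing the record layout.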