We present the relational database EDULISS (EDinburgh University Ligand Selection System), which stores structural, physicochemical and pharmacophoric properties of small molecules. The database comprises a collection of over 4 million commercially available compounds from 28 different suppliers. A user-friendly web-based interface for EDULISS (available at http://eduliss.bch.ed.ac.uk/) has been established providing a number of data-mining possibilities. For each compound a single 3D conformer is stored along with over 1600 calculated descriptor values (molecular properties). A very efficient method for unique compound recognition, especially for a large scale database, is demonstrated by making use of small subgroups of the descriptors. Many of the shape and distance descriptors are held as pre-calculated bit strings permitting fast and efficient similarity and pharmacophore searches which can be used to identify families of related compounds for biological testing. Two ligand searching applications are given to demonstrate how EDULISS can be used to extract families of molecules with selected structural and biophysical features.
The high throughput screening regimes of the past 20 years led by big pharma and more recently developed by screening centres through the Molecular Libraries Roadmap program are providing increasing amounts of publicly available biological information. The bioassay and compound databases in PubChem (1) contain information on over 25 million structures and on over 60 million data points from thousands of assays. Smaller but well annotated databases like ChEMBLdb with over 500000 entries provide information on the properties and activities of drug-like molecules and their targets (2). This explosion of data linking compounds to biological activity should provide a means for predicting new biological effects for large numbers of classes of small drug-like molecules using bioinformatic and database mining approaches (3).
In order to test such in silico predictions it is important to have databases of available compounds. It is only relatively recently that searchable interactive small molecule databases have become available to non-commercial research groups. One such resource is ChemDB (4), a searchable chemical database containing nearly 5 million small molecules with their stereoisomers. Interactive databases like ZINC (5) provide large and well annotated collections with some searching capacity. Such databases can contain a variety of structurally related information stored as SMILES strings, InChI or Daylight fingerprints (6). 3D coordinates may also be used as input for structure-based virtual screening (7–9) or pharmacophore searching (10). The idea of relating the activity of a molecule to the spatial distribution of a number of functional groups (11) has been widely used in QSAR (12) and structure-based studies as implemented in programs like GRID (13), LigandScout (14) and Catalyst (15).
The EDULISS database stores 3D atomic coordinates for each molecule along with over 1600 calculated molecular properties. These so-called molecular descriptors provide a numerical profile for each molecule consisting of calculated values such as molecular weight, surface area and number of rotatable bonds. By using a selection of descriptors it is possible to rapidly select small related families of molecules from the database. An extension of this selection procedure provides a very efficient way of identifying unique compounds. The database also stores a range of interatomic distances between various atom types for each molecule. The overall statistics of these interatomic distances are used in an ultrafast shape searching algorithm (16). A specific subset of interatomic distances between all hydrogen bond donor and acceptor atoms, halogens, phosphorus and sulphur atoms provides what we call the Interatomic Pharmacophore Profile (IPP). All such distance information is stored for each molecule in pre-calculated bit strings, which form the basis of a wide range of pharmacophore searching routines and are also used to identify similarly shaped molecules. The EDULISS database is therefore a useful tool for identifying commercially available molecules based on similarity or pharmacophore searches. It is distinguished from other web resources by having over 1600 descriptors for each compound and the ability to carry out unique 3D and 2D searches. There are also convenient links for a subset of compounds to the PubChem database allowing easy access to biological data.
Currently, EDULISS stores over 5.5 million (over 4 million unique) compounds in total, containing data from 28 different commercial and other smaller specialist compound catalogues (Supplementary Data S1). 2D and 3D coordinates for each molecule are stored with over 1600 topological, geometrical, physicochemical and toxicological descriptors per compound. In this database, over 3.9 million compounds fit Lipinski's rule of five (17) and a total of 3.4 million fit the Oprea lead-like criteria (18), that is: molecular weight ≤460, number of rotatable bonds ≤10, calculated Log P between −4 and 4.2, number of hydrogen bond acceptors ≤9, number of hydrogen bond donors ≤5 and number of rings ≤4. The database also contains over 520000 compounds with molecular weight <250 Da that potentially fit the needs of fragment-based screening (19).
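The lead-like thresholds quoted above can be written as a simple filter. The following is a minimal sketch, assuming a hypothetical per-compound record; the `Descriptors` class and its field names are illustrative and are not taken from the EDULISS schema, and the bounds on Log P are assumed to be inclusive.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Hypothetical per-compound record; field names are illustrative only."""
    mol_weight: float      # molecular weight in Da
    rotatable_bonds: int
    log_p: float           # calculated Log P
    h_acceptors: int       # hydrogen bond acceptors
    h_donors: int          # hydrogen bond donors
    rings: int


def is_lead_like(d: Descriptors) -> bool:
    """Apply the Oprea lead-like thresholds as quoted in the text."""
    return (
        d.mol_weight <= 460
        and d.rotatable_bonds <= 10
        and -4 <= d.log_p <= 4.2   # inclusive bounds are an assumption
        and d.h_acceptors <= 9
        and d.h_donors <= 5
        and d.rings <= 4
    )


def is_fragment_like(d: Descriptors) -> bool:
    """Molecular weight cut-off quoted for fragment-based screening."""
    return d.mol_weight < 250
```

A compound with molecular weight 300, four rotatable bonds, Log P of 2.1, five acceptors, two donors and three rings would pass `is_lead_like` but not `is_fragment_like`.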
The biological properties of a subset of 291000 compounds stored in EDULISS have been retrieved from four other databases, PubChem, BindingDB (20), ChemBank (21) and DrugBank (22), by identifying identical molecules using the Maximum Common Subgraph algorithm (23). The identity of these compounds in the external databases has been obtained and stored in the EDULISS database. A direct link between EDULISS and the external database has been implemented on the search result pages. Once a compound that is identical to one of the PubChem compounds has been hit by either a 3D/2D similarity search or a molecule ID search, the link in the ‘Chemical Properties’ box leads users to the appropriate PubChem web page. Certain catalogues (e.g. the National Cancer Institute) contain many compounds for which there are a lot of biological data and most hits will have links to the relevant PubChem bioassay summary page.
The EDULISS database is held in a MySQL server. The web-based interface of EDULISS uses Java Servlet technology (see http://java.sun.com/products/servlet/) and JavaServer Pages (JSP, see http://java.sun.com/products/jsp/) to build the web pages (Figure 1). The web site uses Apache Tomcat as the web server and as the runtime environment for the Java technologies mentioned above. For molecule drawing and visualization, the EDULISS web pages include JME (http://www.molinspiration.com/jme/) and Jmol, two Java applications that provide interactive features. On the query result page, users can download the SDfile of hit compounds with their descriptor values. To date, this database has been used freely by researchers from over 20 countries via its web-based interface.
Regardless of the source catalogue, all compounds used for EDULISS were collected in 2D SDfile format and then converted into 3D atomic coordinates using CONCORD software. After the conversion process, the molecules were processed by DRAGON 5.4 (http://www.talete.mi.it/) and DEREK (http://www.lhasalimited.org) software to calculate 1664 physicochemical and potential toxicity properties for each compound.
As EDULISS holds millions of compounds from various suppliers, it is useful to be able to determine the number of unique compounds in the collection. A 2D graph theory algorithm, Maximum Common Subgraph, MCS (23), has been implemented. Although the MCS is able to precisely identify isomorphous compounds, the number of pair-wise comparisons increases as N×(N−1), where N is the number of compounds, and the run time grows dramatically from 1 h to 1 day when the dataset increases from 800 to 3200 compounds (Supplementary Data S2). Thus, it is impractical to go through the whole EDULISS collection using this method.
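The quadratic growth can be checked with a few lines of arithmetic. This is a small illustrative sketch, not part of the EDULISS software; the function name is hypothetical and the only inputs are the dataset sizes quoted above.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of pair-wise comparisons, N x (N - 1), as quoted in the text."""
    return n * (n - 1)


small = pairwise_comparisons(800)    # comparisons for the 800-compound set
large = pairwise_comparisons(3200)   # comparisons for the 3200-compound set
growth = large / small               # factor by which the workload grows

print(f"{small} -> {large} comparisons (x{growth:.1f})")
```

Quadrupling the dataset multiplies the number of comparisons by roughly 16, which is of the same order as the reported increase in run time from about an hour to about a day.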
We have developed a method to efficiently identify unique compounds by clustering according to specific descriptor values (molecular properties). Using this approach the required graphical comparisons can be considerably decreased. Preliminary studies using molecular weight and atom type were not very useful as only 6% of the compounds in EDULISS could be uniquely identified. However, a number of other molecular descriptors show much better discrimination: W3D [Wiener 3D index (24)], Whete [Wiener-type index from electronegativity weighted distance matrix (25)] and Vu [a molecular size descriptor which is one of the Weighted Holistic Invariant Molecular descriptors (26)]. The combination of these three descriptors alone was sufficient to identify 3117625 unique compounds (out of a total of 4011697 unique compounds present in EDULISS). The remaining 2 million compounds were grouped using the three descriptors (W3D, Whete and Vu) into 845193 clusters. The compounds in these clusters with identical descriptors were then compared using MCS. This procedure reduces the number of required pair-wise comparisons using MCS down to 6495096, which can be carried out in 20 h.
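The grouping step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: compounds are given as hypothetical `(id, W3D, Whete, Vu)` tuples, descriptor values are compared exactly as stored, and the expensive MCS comparison itself is left out. Only compounds that share all three descriptor values are paired for a later graph comparison.

```python
from collections import defaultdict
from itertools import combinations


def group_by_descriptors(compounds):
    """Group compound IDs by their (W3D, Whete, Vu) descriptor triple.

    `compounds` is an iterable of (compound_id, w3d, whete, vu) tuples;
    this record layout is an assumption made for the example.
    """
    groups = defaultdict(list)
    for compound_id, w3d, whete, vu in compounds:
        groups[(w3d, whete, vu)].append(compound_id)
    return groups


def split_unique(groups):
    """Separate compounds with a unique descriptor triple from shared clusters."""
    unique = [ids[0] for ids in groups.values() if len(ids) == 1]
    clusters = [ids for ids in groups.values() if len(ids) > 1]
    return unique, clusters


def candidate_pairs(clusters):
    """Yield only the within-cluster pairs that still need a graph comparison."""
    for ids in clusters:
        yield from combinations(ids, 2)
```

Note that `combinations` yields each unordered pair once, whereas the N×(N−1) count quoted earlier counts every ordered pair, so the numbers produced by this sketch are not directly comparable with those in the text.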
EDULISS stores more than 1600 molecular descriptors for each compound and users can select a series of descriptor items as a query to identify a subset of molecules that share common properties. Molecular descriptors are organized into 20 groups according to their attributes, so that users can conveniently choose and set preferred values for the query. For example, it is a simple matter to extract from the Sigma-Aldrich catalogue the 164913 out of 199492 compounds that fit the Lipinski rule of five and the 142660 that comply with the Oprea lead-like criteria.
EDULISS also provides geometrical similarity searches based on a 3D similarity measurement called Ultra Fast Shape Recognition with Atom Types (UFSRAT). UFSRAT uses pre-generated geometric descriptors for molecules within EDULISS to discriminate between molecules on the basis of their overall geometric, hydrophobic and electrostatic shape.
The IPP for each molecule in the database consists of interatomic distances calculated between 8 different atom classes, namely hydrogen bond donor atoms (HDon), hydrogen bond acceptor atoms (HAcc), halogens (fluorine, chlorine, bromine and iodine), sulphur and phosphorus atoms. This gives rise to 15 possible types of interatomic distance for each molecule. Distances are stored in strings 128 bits long as Boolean values (1, true; 0, false). The first bit represents a distance less than or equal to 2.50 Å, the next bit covers the following 0.25 Å interval (i.e. >2.50 and ≤2.75 Å) and so forth until the last bit, which represents any distance >34.00 Å. Thus, there are 15 bit strings for each molecule representing the 15 types of possible distance pairs. Figure 2 illustrates the composition of three bit strings showing distances between HAcc and HDon.
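The binning scheme described above can be sketched in a few lines. This is an illustrative reconstruction from the text, not the EDULISS code: the function names are hypothetical, a Python integer stands in for the stored bit string, and a small epsilon is an added assumption to keep distances that fall exactly on a bin edge in the lower bin despite floating-point round-off.

```python
import math

N_BITS = 128        # length of each bit string
FIRST_EDGE = 2.50   # Å; the first bit covers distances <= 2.50 Å
BIN_WIDTH = 0.25    # Å; width of each intermediate bin
LAST_EDGE = 34.00   # Å; the last bit covers distances > 34.00 Å


def distance_to_bit(distance: float) -> int:
    """Return the 0-based bit index for a distance in Å."""
    if distance <= FIRST_EDGE:
        return 0
    if distance > LAST_EDGE:
        return N_BITS - 1
    # Bins are (lower, upper], so round up to the enclosing bin.
    index = math.ceil((distance - FIRST_EDGE) / BIN_WIDTH - 1e-9)
    return max(1, index)


def encode_distances(distances) -> int:
    """Set one bit for every distance observed for a given atom-type pair."""
    bits = 0
    for d in distances:
        bits |= 1 << distance_to_bit(d)
    return bits


def to_bit_string(bits: int) -> str:
    """Render the integer as 128 characters with the first bit on the left."""
    return format(bits, f"0{N_BITS}b")[::-1]
```

With these edges the 126 intermediate bins span 2.50 to 34.00 Å, which together with the first and last bits gives exactly 128 bits.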
This facility enables compounds to be identified that have a specific geometric arrangement of atoms (‘pharmacophore’) as defined by pair-wise distances of hydrogen bond donors, acceptors, halogens, phosphorus or sulphur atoms (where the searches are restricted to S and P atoms that form double bonds to oxygen). For pharmacophore searching, a bit string is generated for the user-defined query distances and is then compared to that of each compound in the database. If a specific true bit in the query matches, the distance criterion is met. A user can perform a multi-distance query in a single search. Apart from very efficient storage, bit strings also provide a very fast searching method as the necessary Boolean operations can be carried out very quickly. Users can specify the query by defining preferred distances between selected atom types using the web-based interface. The results are then displayed and hits may be downloaded.
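The comparison of a query against stored bit strings reduces to a bitwise AND. The following is a minimal sketch under stated assumptions, not the EDULISS implementation: bit strings are held as Python integers keyed by a hypothetical atom-type pair such as `("HAcc", "HDon")`, each query criterion is a distance range in Å, and a multi-distance query is assumed to require every criterion to be met.

```python
import math

N_BITS = 128


def distance_to_bit(distance: float) -> int:
    """Bit index for a distance in Å (2.50 Å first edge, 0.25 Å bins, >34.00 Å last)."""
    if distance <= 2.50:
        return 0
    if distance > 34.00:
        return N_BITS - 1
    return max(1, math.ceil((distance - 2.50) / 0.25 - 1e-9))


def range_to_mask(low: float, high: float) -> int:
    """Build a query bit string with every bin between low and high set."""
    mask = 0
    for i in range(distance_to_bit(low), distance_to_bit(high) + 1):
        mask |= 1 << i
    return mask


def matches(compound_bits: dict, query: dict) -> bool:
    """True if every query criterion shares at least one true bit with the compound.

    compound_bits maps an atom-type pair to its stored bit string (as an int);
    query maps an atom-type pair to a (low, high) distance range in Å.
    """
    return all(
        compound_bits.get(pair, 0) & range_to_mask(low, high)
        for pair, (low, high) in query.items()
    )
```

For example, a compound with an HAcc to HDon distance of 4.1 Å satisfies the criterion `{("HAcc", "HDon"): (4.0, 4.5)}`, because the query mask and the stored bit string share a true bit.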
As a test for the pharmacophore searching routine we used the eight available structures of CDK complexes stored in the PDBbind database (27) with PDB codes 1AQ1, 1DI8, 1DM2, 1E1V, 1E1X, 1FVV, 1UNH and 2A4L. Three nitrogen atoms of the adenine ring of ATP were used as a template to search for ATP-analogues (Figure 3). Applying the three interatomic distance criteria as shown in Figure 3b, four out of the eight ligands (1DM2, 1E1V, 1E1X and 1UNH) were identified as illustrated in Figure 3c–f. The other four ligands were not recognized as they do not have the adenine-like pharmacophore.
The glycolytic enzyme pyruvate kinase (PYK) is a drug target against trypanosomatid infection (28). Fructose-2,6-bisphosphate (F-2,6-BP) acts as an allosteric activator (29). We are interested in identifying analogue molecules which interfere with allosteric regulation. Figure 4a schematically shows selected interatomic contacts and distances between five atoms in F-2,6-BP and three water molecules selected for pharmacophore searches. Two example search motifs are shown in Figure 4b and c. A series of tolerances from 10 to 25% was applied to each interatomic distance. The numbers of hit compounds over this range of tolerances are tabulated in Figure 4. We selected eight compounds (Sigma-Aldrich ID: N9002, L0144, P3504, 201332, 244813, D5021, H2516 and 86170) for further experimental assay based on visual inspection of the docked pose and on calculated solubility. Of the eight selected compounds, five significantly affected the PYK enzyme kinetics. Figure 4d shows that both pharmacophore search models match atoms of the hit compound P3504, which showed 33% inhibition of enzyme activity. A complex of P3504 with Leishmania mexicana PYK (LmPYK) has been crystallized and solved at a resolution of 2.7 Å (H. P. Morgan, personal communication), showing that the molecule binds at the effector site.
Supplementary Data are available at NAR Online.
The Wellcome Trust and the Scottish University Life Sciences Alliance for the use of Edinburgh Protein Production Facility. K.-Y.H., who did most of the work, was supported by an Edinburgh University departmental scholarship. This work was not directly supported by any granting agencies or by commercial companies.
Conflict of interest statement. None declared.
The authors would like to thank the staff at the Centre of Translational and Chemical Biology.