High-throughput screening (HTS) is an automated technique that has been used effectively to test the activity of large numbers of compounds rapidly [1]. Advanced technologies and the availability of large-scale chemical libraries allow hundreds of thousands of compounds to be examined in a day via HTS. Although extensive libraries containing several million compounds can be screened in a matter of days, only a small fraction of compounds can be selected for confirmatory screening. Hits verified in the secondary dose-response assay are eventually winnowed to a few that proceed to the medicinal chemistry phase for lead optimization [4]. The very low success rate of hit-to-lead development makes it a great challenge to select promising hits from the HTS assay in the earlier screening phase [4]. Thus, the study of HTS assay data and the development of a systematic, knowledge-driven model are in demand to facilitate the understanding of the relationship between a chemical structure and its biological activities.
In the past, HTS data have been analyzed by various cheminformatics methods [6], such as cluster analysis [10], selection of structural homologs [11], and data partitioning [13]. However, most of the available methods for HTS data analysis are designed for the study of a small, relatively diverse set of compounds in order to derive a Quantitative Structure-Activity Relationship (QSAR) [18] model, which gives direction on how the original collection of compounds could be expanded for subsequent screening. This "smart screening" works in an iterative way for hit selection, especially for selecting compounds with a specific structural scaffold [22]. With the advances in HTS screening, activity data for hundreds of thousands of compounds can be obtained in a single assay. The huge amount of information and the significant erroneous data produced by HTS screening together pose a great challenge to the computational analysis of such biological activity information, and this volume may overwhelm many approaches that were primarily designed for the analysis of sequential screening. Thus, when dealing with large numbers of chemicals and their bioactivity information, it remains an open problem to interpret the drug-target interaction mechanism and to support the rapid and efficient discovery of drug leads, which is one of the central topics in computer-aided drug design [23].
Although (Quantitative) Structure-Activity Relationship ((Q)SAR) methods have been successfully applied in the regression analysis of leads and their activities [18], they are generally used to analyze HTS results for compounds sharing certain structural commonalities. When dealing with hundreds of thousands of compounds in an HTS screen, however, the construction of SAR equations can be both complicated and impractical to describe explicitly.
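To make the regression setting concrete: in its simplest form, a QSAR model is a regression from numerical molecular descriptors to activity. The sketch below fits a one-descriptor linear model by closed-form least squares; the descriptor and activity values are invented toy numbers, not data from the assays discussed in this work.

```python
# Toy one-descriptor QSAR: activity ~ a * descriptor + b, fit by
# closed-form least squares. All values below are invented for
# illustration only.
xs = [0.5, 1.0, 1.5, 2.0]   # hypothetical descriptor (e.g. logP-like)
ys = [1.1, 2.0, 3.1, 3.9]   # hypothetical measured activity (e.g. pIC50)

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
     / sum((x - mx) ** 2 for x in xs))
b = my - a * mx
print(f"activity = {a:.2f} * descriptor + {b:.2f}")
# prints: activity = 1.90 * descriptor + 0.15
```

With a handful of structurally related compounds such an equation is easy to fit and interpret; with hundreds of thousands of diverse HTS compounds, no single equation of this form describes the data, which is the limitation the text notes.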
Molecular docking is another widely used approach to study the relationship between targets and their inhibitors, either by simulating the interactions and binding activities of receptor-ligand systems or by developing a relationship between their structural profiles and activities [31]. However, because it takes the interactions between the compounds and the target into consideration, it has mainly been used for virtual screening rather than for extracting knowledge from experimental activities.
The Decision Tree (DT) is a popular machine learning algorithm for data mining and pattern recognition. Compared with many other machine learning approaches, such as neural networks, support vector machines, and instance-centric methods, DT is simple and produces readable, interpretable rules that provide insight into problematic domains. DT has been demonstrated to be useful for common medical clinical problems where uncertainties are unlikely [33]. It has been applied to various bioinformatics and cheminformatics problems, such as the characterization of leiomyomatous tumours [38], prediction of drug response [39], classification of antagonists of dopamine and serotonin receptors [40], and virtual screening of natural products [41].
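The interpretability claimed above comes from the fact that every root-to-leaf path of a tree is an IF-THEN rule. As a schematic illustration (the tiny tree and the feature names below are invented, not fingerprint bits or rules from this study), such rules can be enumerated mechanically:

```python
# Schematic only: the feature names and this two-level tree are
# invented for illustration; they are not from the study's models.
# A node is (bit_name, subtree_if_absent, subtree_if_present);
# a leaf is just a class label string.
tree = ("ring_aromatic",
        "inactive",                      # bit absent  -> inactive
        ("N_heterocycle",
         "inactive",                     # bit absent  -> inactive
         "active"))                      # both present -> active

def extract_rules(node, conditions=()):
    """Enumerate root-to-leaf paths as human-readable IF-THEN rules."""
    if isinstance(node, str):            # leaf: emit one rule
        cond = " AND ".join(conditions) or "TRUE"
        return [f"IF {cond} THEN {node}"]
    bit, absent, present = node
    return (extract_rules(absent,  conditions + (f"{bit}=0",)) +
            extract_rules(present, conditions + (f"{bit}=1",)))

for rule in extract_rules(tree):
    print(rule)
```

Each printed rule is directly readable by a chemist, which is the property that distinguishes DT from black-box learners such as neural networks.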
In this study, we propose a DT-based model to generalize feature commonalities among active compounds tested in HTS screening. We chose DT as the basis for the model because it has been successfully applied to many biological problems, and because it can generate, from the active compounds, a set of rules that can then be used to filter untested compounds likely to be active in the biological system of interest. Moreover, it can handle an arbitrary degree of non-linearity among structurally diversified compounds.
Many elegant algorithms for building decision tree models have been introduced and applied to real-life problems, and C4.5 [42] is one of the best-known programs for constructing decision trees. In this work, the DT-based model was developed on the basis of the C4.5 decision tree algorithm [42].
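At each node, C4.5 selects the attribute with the highest gain ratio, i.e. information gain normalized by the split's intrinsic entropy. A minimal sketch of that criterion for binary attributes, using made-up toy data rather than real fingerprint bits:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a collection of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    """C4.5 split criterion: information gain / split information."""
    n = len(labels)
    parts = {}
    for row, y in zip(rows, labels):
        parts.setdefault(row[attr], []).append(y)
    gain = entropy(labels) - sum(len(p) / n * entropy(p)
                                 for p in parts.values())
    split_info = entropy([row[attr] for row in rows])
    return gain / split_info if split_info else 0.0

# Toy data: two binary "fingerprint bits" and an activity label.
rows = [{"bit_a": 1, "bit_b": 0}, {"bit_a": 1, "bit_b": 1},
        {"bit_a": 0, "bit_b": 1}, {"bit_a": 0, "bit_b": 0}]
labels = ["active", "active", "inactive", "inactive"]
print(gain_ratio(rows, labels, "bit_a"))  # separates classes perfectly
print(gain_ratio(rows, labels, "bit_b"))  # carries no class information
```

Here `bit_a` scores 1.0 and `bit_b` scores 0.0, so C4.5 would split on `bit_a`; the normalization by split information is what distinguishes C4.5's criterion from the plain information gain used in its predecessor ID3.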
Molecular structures are represented by the PubChem fingerprint system. The DT-based model was further examined with four assays deposited in the PubChem BioAssay database: the HTS assay for 5-hydroxytryptamine receptor subtype 1a (5HT1a) antagonists (PubChem AID 612), the HTS assay for 5HT1a agonists (PubChem AID 567), and two other assays (PubChem AID 565 and 372) for screening HIV-1 RT-RNase H inhibitors. The results of 10-fold cross-validation (CV) over these HTS assays suggest the self-consistency of the DT models. Since a model simply provides rules based on the profiles of active compounds in a specific HTS assay, the computationally generated models were further examined using two HTS assays that tested the same HIV RNase target but used different compound libraries and were performed independently by two individual research laboratories. Our results suggest that these models could be used to validate HTS assay data for noise reduction and to identify hits through virtual screening of additional, larger compound libraries.
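The 10-fold cross-validation used for the self-consistency check partitions the assay compounds into ten disjoint folds, trains on nine, and tests on the held-out fold in turn. A minimal fold-assignment sketch, independent of any particular learner (the sample count of 100 is arbitrary):

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k near-equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)  # fixed seed for reproducibility
    return [idx[i::k] for i in range(k)]

# Each of the 10 rounds trains on 9 folds and tests on the held-out one.
folds = k_fold_indices(100)
for held_out in folds:
    train = [i for fold in folds if fold is not held_out for i in fold]
    # ...fit the decision tree on `train`, evaluate on `held_out`...
    assert len(train) + len(held_out) == 100
```

Because every compound appears in exactly one test fold, the pooled predictions give an unbiased estimate of how the tree generalizes within the assay, which is what the self-consistency claim rests on.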