We have developed a scoring function for the identification of the active compounds in the AuPosSOM clustered dataset. The results demonstrate that the AuPosSOM contact analysis followed by scoring of the compounds using the scoring function developed can provide a high level of selection for active compounds. For three datasets, the clustering scoring method gave better ROC curves than the best energy scoring functions and for five datasets the efficiency was approximately the same. Thus, this new approach represents an alternative to the energy-based scoring functions and can be more efficient than conventional techniques. Construction of subtrees for the best leaves and cross validation of scoring for different types of contacts proved to be a powerful method for increasing the level of identification of active compounds.
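As an illustration of how ROC-based comparisons of this kind can be quantified, the following minimal sketch computes the area under the ROC curve from per-compound scores and activity labels. It is not the evaluation code used in this work; the function name `roc_auc` and the toy scores are assumptions introduced only for the example.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen active (label 1) outscores a randomly chosen decoy (label 0).
    Ties count as one half. Higher score is assumed to mean 'more likely active'.
    """
    actives = [s for s, y in zip(scores, labels) if y == 1]
    decoys = [s for s, y in zip(scores, labels) if y == 0]
    if not actives or not decoys:
        raise ValueError("need at least one active and one decoy")

    wins = 0.0
    for a in actives:
        for d in decoys:
            if a > d:
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(actives) * len(decoys))


if __name__ == "__main__":
    # Hypothetical scores for four compounds: two actives, two decoys.
    print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
```

An area of 1.0 corresponds to perfect separation of active compounds from decoys, and 0.5 to random ranking. For energy-based scores, where lower values are better, the scores would need to be negated before applying this sketch.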
Analysis of the results revealed that clustering efficiency depends strongly on the dataset and contact selection. AuPosSOM ROC curves for the filtered coulomb contact selection were better than or comparable to those of the conventional scoring functions for eight out of nine datasets, indicating that the clustering approach is more robust than the energy-based one. Score values may be used to estimate the probability of efficient clustering, while information regarding the receptor binding site’s properties can also be helpful.
The clustering was efficient in cases where the difference between contact sets of active compounds and decoys was significant, while the presence of well-populated contacts for active compounds was additionally required for successful scoring. These conditions were satisfied for HB and coulomb contact selections for eight tested datasets. The only exception was the HSP90 target, for which active compounds were characterized by the absence of well-populated, selective HB and coulomb contacts. This may point to the need to search for new types of contact selection. At the same time, energy scoring did not provide good results for this target either. Altogether, these data may indicate that docking failed for HSP90.
Various contact selections were tested for clustering. They characterized two types of contacts: polar and lipophilic. Additionally, selections including all contacts were probed. Polar contact datasets provided significantly better clustering than lipophilic and all-contact selections. Even for hydrophobic active sites, like that of COX-1, lipophilic datasets were not superior. The selection of the atoms by their partial charges appeared to be an efficient method for the evaluation of coulomb interactions, providing the best results for most of the targets. All-atom selections failed as a result of masking of specific contacts by a high number of nonspecific interactions.
An important difference between the AuPosSOM clustering and scoring approach and the energy scoring approach is that the former takes information about the contacts of all poses of the docked compound into account simultaneously. This allows docking imperfections to be averaged out and avoids errors related to the best pose search. A weakness of this approach might be its inability to evaluate the results correctly when the number of poses with correct contact sets is low. In this setting, the energy scoring-based approach may be used to extract the right pose by energy estimation. Remarkably, according to our results, the scoring functions used in the tests were not efficient for most of the difficult targets. Another important point is that the contact-based approach does not take the conformation of the pose into consideration. This greatly simplifies the analysis, as the main requirement for successful clustering is the presence of a unique set of contacts for active compounds rather than the correct overall conformation of the pose. The latter is often hard to achieve, especially for ligands that were not obtained from the receptor’s crystal structure used for docking.
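The idea of using the contacts of all poses at once can be sketched as a contact population vector: for each receptor contact, the fraction of a compound's docked poses in which that contact is observed. The sketch below is an illustrative assumption about one way to build such a vector, not the AuPosSOM implementation; the function name `contact_populations` and the contact labels are hypothetical.

```python
def contact_populations(poses, contact_ids):
    """Return, for each contact in contact_ids, the fraction of poses
    in which that contact is present.

    poses       -- list of sets; each set holds the contact labels seen in one pose
    contact_ids -- ordered list of contact labels defining the vector layout
    """
    if not poses:
        raise ValueError("at least one pose is required")
    n_poses = len(poses)
    return [sum(1 for pose in poses if c in pose) / n_poses for c in contact_ids]


if __name__ == "__main__":
    # Hypothetical contact labels; real labels would come from the docking output.
    contacts = ["resA:HB", "resB:HB", "resC:HB"]
    poses = [
        {"resA:HB", "resB:HB"},
        {"resA:HB"},
        {"resA:HB", "resC:HB"},
        set(),  # a pose that forms none of the selected contacts
    ]
    print(contact_populations(poses, contacts))
```

Because every pose contributes to the vector, a single badly placed pose only lowers the populations slightly, whereas a method that relies on one best pose depends entirely on that pose being correct.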
Fingerprint heat map data representation allows for the identification of key contacts for groups of compounds, as well as easy comparison of contact sets for compounds of different structural families. The examples of the implementation of the AuPosSOM software demonstrate the possibility of its utilization for pharmacophore characterization and its applicability to CAR analysis. The analysis of a contact set of compounds with known activity can directly provide the requirements needed for a search of compounds with the highest binding affinity. The information about key contacts and their populations may also be utilized as a filter for screening large libraries of ligands. One of the recent uses of AuPosSOM clustering is the integrative computational protocol for the discovery of the inhibitors of the Helicobacter pylori nickel response regulator.30
It is necessary to emphasize that the DUD database contains ligands and decoys that have similar physicochemical properties, and thus represents a challenging test case for CAR analysis. For AuPosSOM contact-based clustering and scoring, the search for active compounds in libraries containing compounds with highly diverse properties should be a much easier problem to manage than DUD datasets. Active compounds have a high affinity for the receptor and are expected to form a larger number of highly populated contacts than the decoys. Consequently, clusters with these vectors will be assigned the highest scores. Additionally, these clusters can be identified by visual analysis of the heat maps. Good efficiency of clustering for libraries of compounds with highly diverse properties was demonstrated in our previous publication for Thrombin and HIV Protease targets.19
Version 2.0 of AuPosSOM is available online (http://www.aupossom.com). Further improvement of clustering and scoring efficiency is in progress.