Aqueous solubility is one of the most fundamental physicochemical properties of drug candidates.(1) Highly active compounds can be totally silent due to the lack of desirable solubility, which is directly relevant to absorption and eventual bioavailability.2,3
Thus, eliminating compounds with unfavorable solubility as early as possible at the screening stage reduces the cost and time of drug discovery. However, solubility measurement can be laborious, especially for a large library of compounds. Therefore, considerable effort has been devoted to developing computational tools for fast and accurate estimation of solubility.3,4
Recent modeling studies of solubility (Supporting Information, Table S1) have employed methods such as artificial neural networks (ANN), multilinear regression (MLR), support vector machines (SVM), partial least-squares (PLS), random forest (RF), k-nearest neighbor (KNN), and recursive partitioning (RP).5−18
Though less prevalent, there are also solubility classification studies in which a class label (e.g., soluble or insoluble) is assigned to a given compound.19−23
A common feature of the above studies is that they are based on relatively small data sets. For example, the largest data set used so far consists of fewer than 6000 compounds (Supporting Information, Table S1). Though good results can still be achieved, a small data set limits data diversity. As a result, the real predictive power of a derived model on an independent test set is also weakened. We note that the data sets used in most previous studies are derived primarily or at least partially from two commercial databases (AQUASOL and PHYSPROP) or from in-house collections, which often makes it difficult to conduct comparative evaluations on the same data. Public data sets, on the other hand, are becoming increasingly popular because they are readily available to all researchers. Results obtained on public data sets in different studies can therefore be compared on an equal footing.
Unlike previous studies, we took advantage of a high-quality data set containing over 46 000 compounds with known solubility, which we believe to be the largest public one to date. In this study, we considered the binary classification of solubility using the SVM, an established machine learning method that has succeeded in many areas, such as pattern recognition and pharmacokinetic property prediction.24−29
Our SVM model was optimized in conjunction with a reduction and recombination feature selection strategy.(30) In particular, we constructed a hybrid fingerprint from three existing structural and/or physicochemical fingerprints. Our best model employing this fingerprint produced promising results not only in cross-validation but also in the prediction of two independent test sets.
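As a rough illustration of the hybrid fingerprint idea, the sketch below concatenates three bit-vector fingerprints into a single feature vector that a classifier such as an SVM could take as input. The function name `build_hybrid_fingerprint`, the three toy fingerprints, and their lengths are hypothetical and chosen only for illustration. Plain concatenation is an assumption here; it is not the reduction and recombination feature selection procedure itself, and it does not reproduce the fingerprints used in this work.

```python
from typing import List, Sequence


def build_hybrid_fingerprint(fingerprints: Sequence[Sequence[int]]) -> List[int]:
    """Concatenate several 0/1 fingerprints into one hybrid feature vector.

    Assumption: each fingerprint is a fixed-length sequence of bits and the
    hybrid is formed by simple concatenation, in the order given.
    """
    hybrid: List[int] = []
    for fp in fingerprints:
        # Reject anything that is not a bit, so malformed input fails early.
        if any(bit not in (0, 1) for bit in fp):
            raise ValueError("fingerprints must contain only 0/1 bits")
        hybrid.extend(fp)
    return hybrid


# Hypothetical toy fingerprints for a single compound (not real descriptors).
structural_keys = [1, 0, 1, 1]   # e.g. presence/absence of substructures
path_based = [0, 0, 1]           # e.g. hashed path features
physchem_bins = [1, 0]           # e.g. binned physicochemical properties

hybrid = build_hybrid_fingerprint([structural_keys, path_based, physchem_bins])
print(hybrid)  # [1, 0, 1, 1, 0, 0, 1, 1, 0]
```

In practice, a feature selection step would then retain only a subset of the concatenated bits before the classifier is trained; the selection criterion is left out of this sketch.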