Aqueous solubility is recognized as a critical parameter in both early- and late-stage drug discovery. Therefore, in silico modeling of solubility has attracted extensive interest in recent years. Most previous studies have been limited to relatively small data sets with limited diversity, which in turn limits the predictability of the derived models. In this work, we present a support vector machine (SVM) model for the binary classification of solubility by taking advantage of the largest known public data set, which contains over 46000 compounds with experimental solubility. Our model was optimized in combination with a reduction and recombination feature selection strategy. The best model demonstrated robust performance in both cross-validation and prediction of two independent test sets, indicating it could be a practical tool to select soluble compounds for screening, purchasing, and synthesizing. Moreover, because it relies on completely public resources, our work may be used for comparative evaluation of solubility classification studies.
Aqueous solubility is one of the most fundamental physicochemical properties of drug candidates.(1) Highly active compounds can be totally silent due to the lack of desirable solubility, which is directly relevant to absorption and eventual bioavailability.2,3 Thus, eliminating compounds with unfavorable solubility as early as possible at the screening stage will reduce costs and save time for drug discovery. However, solubility measurement can be laborious, especially when dealing with a large library of compounds. Therefore, considerable efforts have been devoted to developing computational tools for fast and accurate estimation of solubility.3,4
Recent modeling studies of solubility (Supporting Information, Table S1) have employed methods, such as artificial neural networks (ANN), multilinear regression (MLR), support vector machines (SVM), partial least-squares (PLS), random forest (RF), k-nearest neighbor (KNN), and recursive partitioning (RP).5−18 Though less prevalent, there are also solubility classification studies in which a class label (e.g., soluble or insoluble) is assigned to a given compound.19−23
A common feature of the above studies is that they are based on relatively small data sets. For example, the largest data set ever used consists of fewer than 6000 compounds (Supporting Information, Table S1). Though good results can still be achieved, data diversity is limited when a small data set is used. As a result, the real predictive power of the derived model for an independent test set is also weakened. We notice that the data sets used in most previous studies are derived primarily or at least partially from two commercial databases (AQUASOL and PHYSPROP) or from in-house collections, which often makes it difficult to conduct comparative evaluation using the same data sets. On the other hand, public data sets are becoming increasingly popular, as they are readily available to all researchers. Therefore, results obtained on public data sets in different studies can be compared on the same ground.
Unlike previous studies, we took advantage of a high-quality data set containing over 46000 compounds with known solubility, which is believed to be so far the largest public one. In this study, we considered the binary classification of solubility by using the SVM, an established machine learning method that has succeeded in many areas, such as pattern recognition and pharmacokinetic property prediction.24−29 Our SVM model was optimized in conjunction with a reduction and recombination feature selection strategy.(30) In particular, we constructed a hybrid fingerprint from three existing structural and/or physicochemical fingerprints. Our best model employing this fingerprint produced promising results not only in cross-validation but also in the prediction of two independent test sets.
The Burnham Center for Chemical Genomics (BCCG) has launched a screening campaign for aqueous solubility against the NIH Molecular Libraries Small Molecule Repository (MLSMR), which contains more than 350000 compounds. The resultant bioassay (PubChem AID: 1996) was deposited publicly in the PubChem BioAssay database.(31) As of June 18, 2010, this bioassay stored experimental solubility data for 47567 compounds. The solubility data can be downloaded from the PubChem FTP site (ftp://ftp.ncbi.nlm.nih.gov/pubchem/Bioassay/). All compounds were measured using a standard protocol under the same conditions.(32) We consider that a data set compiled from a single source, such as the one used in this work, is more advantageous for statistical studies than those compiled from various sources (Supporting Information, Table S1).
The 47567 compounds were processed as follows: First, compounds with multiple components, such as mixtures and salts, were discarded. Second, conflicting or redundant entries were removed. For instance, if two compounds were characterized by the same fingerprint and their solubility class labels (soluble or insoluble) were inconsistent, then both compounds were discarded to avoid conflict; if their solubility class labels were identical, then only one compound was retained to avoid redundancy. In total, 41501 compounds were compiled and used as the training set for SVM model construction (Table 1, data set I). The solubility of each compound is expressed in μg/mL. As we considered the binary classification of solubility in this study, compounds with solubility ≥10 μg/mL were regarded as soluble, while those <10 μg/mL were regarded as insoluble. This criterion is in accordance with that specified by the original BCCG depositors, although there is considerable debate in the literature on defining the boundary between the soluble and insoluble classes.(33)
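The labeling and de-duplication rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (identifier, fingerprint as a hashable tuple of bits, solubility in μg/mL), the function names, and the toy values are hypothetical, the rule stated for pairs of compounds is generalized here to groups sharing a fingerprint, and the removal of mixtures and salts is not shown.

```python
from collections import defaultdict

SOLUBLE_CUTOFF_UG_PER_ML = 10.0  # binary cutoff stated in the text


def label_solubility(solubility_ug_per_ml):
    """Return 1 (soluble) if solubility >= 10 ug/mL, else 0 (insoluble)."""
    return 1 if solubility_ug_per_ml >= SOLUBLE_CUTOFF_UG_PER_ML else 0


def curate(records):
    """Remove conflicting and redundant records that share a fingerprint.

    `records` is a list of (compound_id, fingerprint, solubility) tuples;
    this layout is an assumption made for the sketch.
    """
    groups = defaultdict(list)
    for compound_id, fingerprint, solubility in records:
        groups[fingerprint].append((compound_id, label_solubility(solubility)))

    curated = []
    for fingerprint, members in groups.items():
        labels = {label for _, label in members}
        if len(labels) > 1:
            continue  # inconsistent labels: discard every member of the group
        compound_id, label = members[0]  # identical labels: keep one member
        curated.append((compound_id, fingerprint, label))
    return curated


# Toy records (hypothetical values, not taken from PubChem AID 1996).
toy = [
    ("A", (1, 0, 1), 25.0),  # soluble
    ("B", (1, 0, 1), 40.0),  # same fingerprint and label as A: redundant
    ("C", (0, 1, 1), 3.0),   # insoluble
    ("D", (0, 1, 1), 12.0),  # same fingerprint as C, other label: conflict
    ("E", (1, 1, 0), 10.0),  # exactly at the cutoff: soluble
]
curated = curate(toy)
print(curated)
```

In this toy run only compounds A and E survive: B duplicates A, and C and D conflict with each other.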
While this manuscript was in preparation, another 4795 compounds with experimental solubility data were added to the PubChem BioAssay database under the same bioassay (PubChem AID: 1996, updated on July 15, 2010). They were processed as above and served as an internal test set (4510 compounds in total) to assess the performance of our SVM model (Table 1, data set II). In addition, 32 drug-like compounds with reliably measured intrinsic solubility from a recent solubility prediction challenge(34) were used as an external test set (Table 1, data set III) to provide a comparative evaluation of our model against previous methods. The same criterion as above was applied to classify soluble and insoluble compounds.
Molecular fingerprints are widely applied in substructure/similarity searching,(35) compound clustering,(36) and classification.(22) In this study, considering both their popularity and public availability, we adopted the MDL MACCS key(37) and the PubChem fingerprint.(38) The MACCS key is a binary vector of 166 structural and/or physicochemical features (MACCS166), while the PubChem fingerprint represents the presence/absence of 881 substructures (PC881). We also considered one additional fingerprint consisting of six physicochemical properties (ADD6), which were previously found to be relevant to solubility modeling.22,39 With respect to these physicochemical properties, data sets I and II are rather diverse (Figure 1). Regardless of the minimal and maximal values, both data sets have similar distributions with respect to most of these properties. This is probably because the MLSMR is a compound library designed for screening purposes. Data set III demonstrates a better diversity in terms of these properties. The PC881 and ADD6 fingerprints were downloaded from the PubChem Compound database. The MACCS166 keys were generated using Open Babel.(40)
Feature selection has been broadly applied to select a subset of features from a given fingerprint.41−44 In this study, we adopted a simple strategy based on the F-score, which measures the discrimination of two sets of numbers.(45) Given a binary classification task and a data set in which each compound is characterized by an m-feature fingerprint, the F-score of the ith feature is defined as

$$F_i = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$

where $n_+$ and $n_-$ are the numbers of soluble and insoluble samples within a data set; $\bar{x}_i$, $\bar{x}_i^{(+)}$, and $\bar{x}_i^{(-)}$ are the averages of the ith feature over all, soluble, and insoluble samples, respectively; and $x_{k,i}^{(+)}$ and $x_{k,i}^{(-)}$ are the ith feature of the kth soluble and insoluble samples, respectively.
In principle, the larger the F-score, the more discriminative a feature is likely to be. In this study, for each of the three fingerprints MACCS166, PC881, and ADD6, the F-score of each feature was calculated from the distribution of soluble and insoluble samples in data set I. Features were ranked in descending order of their F-scores. Our aim was to select the most discriminative features so that computational efficiency could be improved, though some information may be lost. Balancing these two considerations, we chose an F-score of 0.001 as the threshold to select only the top-ranked features from each parent fingerprint. We adopted this F-score-based feature selection because it is straightforward to implement and generally quite effective.(45) Besides, the F-score can be calculated in advance and thus is independent of the chosen classifier.
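As an illustration of the ranking step, the following Python sketch computes the F-score of each feature and keeps those above the 0.001 threshold. It assumes the usual F-score definition of ref 45 (squared deviations of the two class means from the overall mean, divided by the sum of the within-class sample variances); the function names and the toy data are hypothetical, and returning infinity for a feature that is constant within each class but differs between classes is a choice made for this sketch only.

```python
def f_score(pos_values, neg_values):
    """F-score of one feature from its values in the two classes."""
    n_pos, n_neg = len(pos_values), len(neg_values)
    mean_pos = sum(pos_values) / n_pos
    mean_neg = sum(neg_values) / n_neg
    mean_all = (sum(pos_values) + sum(neg_values)) / (n_pos + n_neg)

    # Numerator: separation of the class means from the overall mean.
    between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    # Denominator: sum of the within-class sample variances.
    var_pos = sum((x - mean_pos) ** 2 for x in pos_values) / (n_pos - 1)
    var_neg = sum((x - mean_neg) ** 2 for x in neg_values) / (n_neg - 1)
    within = var_pos + var_neg

    if within == 0:
        # Sketch-only convention for features with no within-class spread.
        return float("inf") if between > 0 else 0.0
    return between / within


def rank_and_select(rows, labels, threshold=0.001):
    """Return (scores, selected), where `selected` lists the indices of
    features with F-score above `threshold`, in descending score order."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, y in zip(rows, labels) if y == 1]
        neg = [row[j] for row, y in zip(rows, labels) if y == 0]
        scores.append(f_score(pos, neg))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    selected = [j for j in ranked if scores[j] > threshold]
    return scores, selected


# Toy fingerprint matrix: 6 compounds x 3 binary features (hypothetical).
rows = [[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0], [0, 0, 0], [1, 0, 0]]
labels = [1, 1, 1, 0, 0, 0]  # 1 = soluble, 0 = insoluble
scores, selected = rank_and_select(rows, labels)
print(scores, selected)
```

Here the second feature has identical class means and is dropped, while the other two are retained and ranked.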
All SVM calculations in this work were conducted using LIBSVM.(46) A 10-fold cross-validation was applied to evaluate model performance. Briefly, data set I was randomly split into 10 folds in a stratified way so that the ratio of soluble/insoluble samples in each fold was kept identical. In each round, one fold was chosen as a test subset, while the remaining nine folds were combined into a training subset. An SVM model was then built using this training subset and used to predict the test subset. This procedure was repeated for each of the 10 folds. The results from the 10 rounds were averaged to give a final assessment of model performance. In addition, two independent test sets (data sets II and III) were used to provide additional evaluations. The following metrics were calculated:

$$\text{sensitivity} = \frac{TP}{TP + FN}, \qquad \text{specificity} = \frac{TN}{TN + FP}, \qquad \text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$
where TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives, respectively. In addition, the G-mean, which rewards accuracy on both classes simultaneously, was also calculated:

$$\text{G-mean} = \sqrt{\text{sensitivity} \times \text{specificity}}$$
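For concreteness, the metrics can be computed from the four confusion-matrix counts as in the sketch below. It assumes the standard definitions of sensitivity, specificity, and accuracy, and takes the G-mean as the geometric mean of sensitivity and specificity; the function name and the counts in the example are hypothetical and are not results from this study.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy, and G-mean from raw counts.

    The soluble class is treated as positive (an assumption of this sketch).
    """
    sensitivity = tp / (tp + fn)   # fraction of soluble compounds recovered
    specificity = tn / (tn + fp)   # fraction of insoluble compounds recovered
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    g_mean = math.sqrt(sensitivity * specificity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "g_mean": g_mean,
    }


# Hypothetical counts for 100 soluble and 100 insoluble compounds.
metrics = classification_metrics(tp=80, fp=10, tn=90, fn=20)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.1f}%")
```

Because the G-mean is a product of the two per-class rates, a classifier that labels every compound as the majority class scores zero, which is why it is a useful criterion for imbalanced data.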
The linear kernel and the radial basis function (RBF) kernel are two common kernel functions in LIBSVM. To get an overview of their general performance, we first investigated a few simple SVM models with the default LIBSVM parameters. The 10-fold cross-validation results given by these models are listed in Table 2. While others have found that SVM models with the RBF kernel outperform those with the linear kernel,47,48 we observed that they were similar in performance. For example, when MACCS166 was employed, both kernels reported comparable G-means (69.7 vs 70.9%). The linear kernel gave marginally better results than the RBF kernel when PC881 was used (76.6 vs 74.7%). We consider that the default parameters in LIBSVM might not be suitable for the RBF kernel in this case. Indeed, several studies have shown that selecting optimal parameters is critical for the RBF kernel.49,50 Some researchers also indicate that the linear kernel is a special case of the RBF kernel for some parameters.(51) Therefore, the RBF kernel is more commonly used and was also adopted by us.
When there are multiple fingerprints available, it is important to choose an appropriate one. It is clear from Table 2 that the SVM models employing PC881 perform markedly better than those employing MACCS166. For example, PC881 outperformed MACCS166 by nearly 4 percentage points (74.7 vs 70.9%) when the RBF kernel was applied. This is understandable since the PC881 fingerprint is more than five times (881/166) as long as the MACCS166 key. The much longer PC881 is believed to be more information-rich, making it more discriminative than MACCS166 in our binary classification.
Data imbalance is known to have a great impact on most classifiers, including SVM.47,48,52 As shown in Table 1, data set I is moderately imbalanced, with a ratio of soluble to insoluble samples of 2.30. To address this issue, biased weights were assigned to the soluble and insoluble classes during model construction. The weights were determined from the proportion of soluble to insoluble samples in data set I, imposing a larger penalty on classification errors for the minority class (0.435 and 1.000 for the soluble and insoluble classes, respectively). As seen in Table 2, there was a significant decrease in performance (10−20%) if data imbalance was not taken into account. In the following analysis, data imbalance was always considered.
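The reported weights are consistent with scaling each class inversely to its size while fixing the minority class at 1.0, since 1/2.30 ≈ 0.435. The sketch below reproduces that arithmetic. The class counts are hypothetical (chosen only to give the reported 2.30 ratio), the label coding (+1 soluble, −1 insoluble) is an assumption, and the option string uses LIBSVM's per-class weight flag `-wi`, which multiplies the penalty parameter C for class i; how the weights were actually passed in this study is not stated in the text.

```python
def class_weights(n_soluble, n_insoluble):
    """Weight each class inversely to its size; the minority class gets 1.0."""
    smaller = min(n_soluble, n_insoluble)
    return {
        "soluble": smaller / n_soluble,
        "insoluble": smaller / n_insoluble,
    }


# Hypothetical counts with a soluble/insoluble ratio of 2.30.
weights = class_weights(n_soluble=230, n_insoluble=100)

# Assumed label coding: +1 = soluble, -1 = insoluble.
libsvm_options = f"-w1 {weights['soluble']:.3f} -w-1 {weights['insoluble']:.3f}"
print(libsvm_options)
```

With these weights, misclassifying an insoluble compound costs about 2.3 times as much as misclassifying a soluble one, which pushes the decision boundary away from the minority class.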
The reduction and recombination feature selection strategy successfully enhanced compound recall and structural diversity for hits discovery.(30) This inspired us to mix the three fingerprints of MACCS166, PC881, and ADD6. The underlying assumption is that different fingerprints can encode different aspects of information for the problem of interest, so they may complement each other to yield better performance.
We first investigated the fingerprint combination strategy (without reduction). Table 2 shows the four different combinations of MACCS166, PC881, and ADD6. As one can see, the SVM models employing a combined fingerprint consistently outperformed those employing an individual fingerprint. For instance, the SVM model employing MACCS166 + ADD6 outperformed the one employing MACCS166 by about 2% (72.8 vs 70.9%). This supports previous findings that the six additional physicochemical properties comprised in the ADD6 fingerprint are relevant to solubility.22,39 An interesting observation is that model performance increased as the combined fingerprint became longer. This is in line with our previous observation that the longer PC881 performed better than MACCS166. On the other hand, the performance of the SVM models tended to converge as the length of the combined fingerprint increased. For example, Table 2 shows that the gain in performance was merely 1% on extending PC881 to PC881 + MACCS166 + ADD6 (74.7 vs 75.7%). Therefore, elongating a fingerprint by incorporating more features may not necessarily improve a model effectively. Moreover, issues such as feature intercorrelation and feature redundancy may arise when integrating different fingerprints.
The best result of 10-fold cross-validation was given by the SVM model employing PC881 + MACCS166 + ADD6 (75.7%, Table 2). However, using such a long fingerprint (1053 features) would be computationally expensive, especially in the grid search for the optimal parameters of the RBF kernel. Therefore, we utilized the reduction and recombination strategy to make a shorter fingerprint from existing ones. We believed this strategy could alleviate, if not fully solve, the issue of feature redundancy. Only the top-ranked features with F-score above 0.001 from each of PC881, MACCS166, and ADD6 were retained, which resulted in three truncated fingerprints: PC307, MACCS90, and ADD5. They were then recombined to yield a new fingerprint: PC307 + MACCS90 + ADD5. Compared to its full-length parent PC881 + MACCS166 + ADD6, there is only negligible information loss (75.6 vs 75.7%, Table 2), and this new fingerprint is much shorter (402 features). The above results gave us confidence to use the reduction and recombination strategy for feature selection. Further optimization of the SVM model was based on this new PC307 + MACCS90 + ADD5.
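The reduction and recombination step amounts to filtering each parent fingerprint by its own F-scores and concatenating what remains. A minimal sketch is given below; the dictionary layout, the function name, and the toy F-scores and feature values are hypothetical, and only the 0.001 threshold and the fingerprint lengths in the final checks come from the text.

```python
def reduce_and_recombine(fingerprints, f_scores, threshold=0.001):
    """Keep features with F-score above `threshold` in each parent
    fingerprint, then concatenate the survivors into one hybrid vector.

    `fingerprints` maps a fingerprint name to one compound's feature values;
    `f_scores` maps the same names to per-feature F-scores.
    Returns (hybrid_vector, origin), where origin[i] is the
    (fingerprint name, original index) of hybrid feature i.
    """
    hybrid, origin = [], []
    for name, values in fingerprints.items():
        for index, (value, score) in enumerate(zip(values, f_scores[name])):
            if score > threshold:
                hybrid.append(value)
                origin.append((name, index))
    return hybrid, origin


# Toy parents: 4 + 3 + 2 features for a single compound (hypothetical).
fingerprints = {
    "PC": [1, 0, 1, 1],
    "MACCS": [0, 1, 1],
    "ADD": [312.4, 2.1],
}
f_scores = {
    "PC": [0.0200, 0.0004, 0.0050, 0.0000],
    "MACCS": [0.0100, 0.0009, 0.0030],
    "ADD": [0.0400, 0.0002],
}
hybrid, origin = reduce_and_recombine(fingerprints, f_scores)
print(hybrid)
print(origin)

# Lengths reported in the text for the parent and reduced fingerprints.
assert 881 + 166 + 6 == 1053
assert 307 + 90 + 5 == 402
```

Keeping the `origin` list alongside the hybrid vector makes it possible to map any feature of the reduced fingerprint back to its parent when interpreting the model.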
The two parameters C and γ of the RBF kernel are critical to an SVM model.49,50 To seek the optimal pair of C and γ, a grid search in the parameter space was conducted along with five-fold cross-validation, which turned out to be the most time-consuming step of model construction. In this work, it took about two hours to accomplish a typical five-fold cross-validation task on a 16 CPU × 2.60 GHz Linux cluster (using only one CPU) with a maximal memory usage of 224 MB. We started from a coarse grid (C [0, 12] and γ [−12, 0], both in log2 units) with a grid spacing of 1.0. A subregion (C [1, 3] and γ [−7, −5]) showing relatively better performance was identified. Further grid search was restricted to this subregion with a finer grid spacing of 0.25 to identify an even better subregion. This procedure was repeated until the optimal parameters (C = 2.43 and γ = −6.34) were determined. Built on these two parameters, our final SVM model achieved a G-mean of 80.3% by 10-fold cross-validation.
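The coarse-to-fine search can be sketched as a loop that evaluates a grid, recenters on the best point, and shrinks the spacing. In the sketch below, only the initial ranges, the 1.0 and 0.25 spacings, and the ±1 subregion follow the text; the further fourfold shrinking, the stopping spacing, and the function names are assumptions. The scoring function is a hypothetical smooth stand-in; in practice it would be the five-fold cross-validated G-mean of an RBF-kernel SVM trained with the given log2 C and log2 γ.

```python
def grid(lo, hi, step):
    """Evenly spaced values from lo to hi inclusive."""
    count = int(round((hi - lo) / step))
    return [lo + i * step for i in range(count + 1)]


def coarse_to_fine_search(score, c_range, g_range, step=1.0, min_step=0.01):
    """Grid search in (log2 C, log2 gamma) with successive refinement.

    At each level the best grid point is located, the search window is
    narrowed to +/- one grid spacing around it, and the spacing is divided
    by four (1.0, 0.25, ...; the schedule beyond 0.25 is an assumption).
    """
    best = None
    while step >= min_step:
        candidates = [
            (score(c, g), c, g)
            for c in grid(*c_range, step)
            for g in grid(*g_range, step)
        ]
        best = max(candidates)
        _, c_best, g_best = best
        c_range = (c_best - step, c_best + step)
        g_range = (g_best - step, g_best + step)
        step /= 4.0
    return best


def toy_score(log2_c, log2_gamma):
    """Hypothetical stand-in for a cross-validated G-mean surface."""
    return -((log2_c - 2.43) ** 2 + (log2_gamma + 6.34) ** 2)


best_score, best_c, best_g = coarse_to_fine_search(
    toy_score, c_range=(0.0, 12.0), g_range=(-12.0, 0.0)
)
print(best_c, best_g)
```

On this toy surface the first level selects (2, −6), so the second level searches C [1, 3] and γ [−7, −5], mirroring the subregion described above. Each level costs a full cross-validation per grid point, which is why the coarse grid is kept sparse.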
More often than not, a model fails to predict an independent test set, although it can perform extremely well during training. A common mistake in the application of feature selection, as pointed out by Smialowski et al.,(53) is that some researchers first use the whole data set for feature selection and then split it into training and test sets, with the former used to build a classifier and the latter to evaluate model performance. We strongly agree that a more rigorous evaluation should be provided, since in such a procedure the trained classifier has already taken advantage of information leaked from the test set.
In this study, data set II, which was excluded entirely from feature selection and model construction, was used as an internal test set to evaluate the performance of our final SVM model. We preprocessed this test set using the 402 features of PC307 + MACCS90 + ADD5 as applied to data set I. The prediction results of our model are listed in Table 3. The G-mean was 83.1%, which is close to that of the 10-fold cross-validation (80.3%), indicating the robustness of our model. As for soluble compounds, our model successfully recognized 2622 out of the 3177 soluble compounds, giving a sensitivity of 82.5%. This result may not be surprising since our data sets are imbalanced toward soluble compounds (Table 1), and classifiers tend to label samples as the majority class.(54) Nevertheless, when focusing on the classification of insoluble compounds, our model also gave a low false positive rate (16.4%). This can be ascribed to the application of biased weights to the soluble/insoluble classes during model training. As a result, the hyperplane of the SVM classifier was pushed toward the minority class (insoluble samples), giving a promising specificity (83.6%). In addition, by using the G-mean as the quality criterion in cross-validation, the performance of our SVM model was maximized for both soluble and insoluble compounds. The overall classification accuracy is 82.9%, which is comparable to those reported in previous studies (Supporting Information, Table S1). This level of performance is satisfactory, considering the large test set used here.
In the above analysis, the optimal parameters C and γ were applied to the SVM model employing a selected subset of features (PC307 + MACCS90 + ADD5). What if the same parameters were applied to the SVM model employing the full-length PC881 + MACCS166 + ADD6? One can see from Table 3 that slightly better results were obtained in terms of accuracy and G-mean. Therefore, information loss occurred after feature selection, but it was rather marginal. For example, the reported G-means for the SVM models with and without feature selection are 83.1 and 83.2%, respectively. It is thus interesting to observe that the optimal parameters derived from the SVM model with feature selection are also applicable to that without feature selection, although they may not be truly optimal for the latter. This might also indicate that SVM models are more sensitive to the chosen parameters than to the particular features of a fingerprint that are employed, which may be responsible for the widespread success of SVM applications.
This data set consists of 32 pharmaceutical chemicals from a recent solubility prediction challenge(34) and was used to provide an external evaluation of our SVM model. A number of previous studies have reported their predictions for the same test set,16,55 making it possible to compare our model with theirs on the same ground. The comparative results are also listed in Table 3. Our SVM model employing PC307 + MACCS90 + ADD5 gave a moderate accuracy of 75.0%, while slightly better results were obtained when PC881 + MACCS166 + ADD6 was applied. It should be noted that four compounds (Supporting Information, Table S2) in this test set were also contained in data set I (i.e., the training set), making the prediction not completely independent. Comparable or slightly better results were obtained when these four common compounds were removed from data set III. It is notable that one compound (PubChem CID: 3108) was incorrectly classified even though it was included in data set I. Further investigation indicates that this compound was reported as soluble in data set I but insoluble in data set III. Therefore, the inconsistency in the experimental determination of solubility for this compound led to the misclassification by our model. This again indicates the importance of data quality, especially when compiling data from multiple sources. In comparison, our model achieved performance comparable to that of some previous methods (e.g., ChemSilico and SPARC). Relevant discussion is given below.
In this work, we have employed the reduction and recombination feature selection strategy to select the most discriminative features. It would thus be helpful to interpret the predictability as well as the physical meaning of our SVM model from the perspective of these features. The description, F-score, and weight of all 1053 features from PC881, MACCS166, and ADD6 are provided (Supporting Information, Excel file). The weight (i.e., relative contribution to classification) of each feature was derived from a linear SVM model by using svm-weight.(56) In particular, the top 10 features that contributed most to classification are listed in Table 4. A greater positive weight indicates a larger contribution of this feature to the classification of soluble samples and vice versa. As one can see, these top 10 features came from PC881, MACCS166, or ADD6, implying that all three fingerprints indeed played a key role in our model. It can be observed in Table 4 that the most significant feature for the classification of soluble samples is the 1053rd feature (topological polar surface area). This is anticipated because the larger the polar surface area of a compound, the more likely it is to be soluble in water. Similarly, compounds containing the 944th feature (nitroso group) also tend to be soluble, which is in accordance with previous findings that this functional group makes a negative contribution to hydrophobicity.57,58 Likewise, the 1048th feature (molecular weight) contributes most to insolubility classification. This is true for many chemicals. For example, the solubility of alcohols in water decreases as the molecular size increases. However, the relationship between molecular weight and solubility is not always that straightforward. Other features, such as the 510th feature, are also interpretable for insolubility classification since they basically encode hydrophobic substructures.
Nevertheless, this does not mean that compounds containing the negatively contributing features suggested in this work are necessarily insoluble, or vice versa. Solubility should always be judged by considering the molecule as a whole.
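For a linear-kernel SVM, per-feature weights of the kind discussed above can be obtained as the weighted sum of the support vectors, w = Σ (α_k y_k) x_k. The sketch below shows this calculation on toy values. It is an assumption that svm-weight(56) follows exactly this procedure; the function name, the support vectors, and the dual coefficients are hypothetical, and a positive weight is taken to favor the soluble class (assumed to be coded +1).

```python
def linear_svm_weights(support_vectors, dual_coefs):
    """Feature weights of a linear SVM: w = sum_k coef_k * x_k, where
    coef_k = alpha_k * y_k is the signed dual coefficient of support
    vector x_k."""
    n_features = len(support_vectors[0])
    weights = [0.0] * n_features
    for vector, coef in zip(support_vectors, dual_coefs):
        for j, value in enumerate(vector):
            weights[j] += coef * value
    return weights


# Toy model: three support vectors over three features (hypothetical).
support_vectors = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
dual_coefs = [0.5, -0.75, 0.25]

weights = linear_svm_weights(support_vectors, dual_coefs)
# Rank features by the magnitude of their contribution.
ranking = sorted(range(len(weights)), key=lambda j: abs(weights[j]), reverse=True)
print(weights, ranking)
```

In this toy model the first feature carries the largest positive weight and the second the largest negative one. Such weights describe a linear model only; they are not directly defined for the RBF-kernel model.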
Data diversity should always be addressed when building a computational model. That is the reason why we emphasized the use of large data sets in this work. We plotted in Figure 2A the chemical space of data sets I−III, which is defined by molecular weight and topological polar surface area. These two coordinates were chosen because they were found in the above analysis to be relevant to solubility classification. As one can see, both data sets II and III (test sets) share a similar chemical space with data set I (training set), which may account for the reasonably good prediction of our SVM model on both test sets. However, data sets that are within a similar low-dimensional chemical space may not necessarily be distributed similarly in a higher-dimensional chemical space. As shown in Figure 2B, the experimental solubility of data set III is more sparsely scattered than that of data set II, implying that the former is a more challenging test set for our SVM model as well as for other methods.34,59 This is in accordance with the relatively lower performance of our SVM model for data set III. In contrast, some other methods, such as MLR and ANN (Table 3), were calibrated by using the 100 compounds from the training set of the solubility prediction challenge,(34) whose chemical space (Figure 2B) is more similar to that of data set III. This might contribute to their relatively better performance than ours for data set III. Another possible reason is that the choice of 10 μg/mL as a binary cutoff for solubility classification may not be suitable for data set III, as the solubility of the compounds therein was measured using a completely different experiment. Data set I covers a very small portion of the chemical space of the MLSMR and an even smaller portion of the chemical space of the PubChem BioAssay database (Supporting Information, Figure S1).
Thus, the predictability of our model for a data set that is beyond the training chemical space of our model should not be anticipated without caution, which is true for any supervised machine learning methods.
In this study, we have presented a binary classification model of aqueous solubility using the SVM. A reduction and recombination feature selection strategy was applied to design a new fingerprint by selecting and recombining the most discriminative features from three existing fingerprints. Based on this new fingerprint (PC307 + MACCS90 + ADD5), an SVM model was constructed and optimized using a large and diverse training set (data set I, N = 41501). For an internal test set (data set II, N = 4510), our model classified both soluble and insoluble samples with an overall accuracy of 82.9%. For an external drug-like test set (data set III, N = 32), the performance of our SVM model was found to be comparable to that of some other methods, such as MLR and ANN. Therefore, our model may be used as a practical tool for fast and accurate classification of solubility for untested compounds, which may facilitate compound selection and library design at the early stage of drug discovery. Our study may also provide insights into building predictive models based on very large data sets. In addition, the use of completely public resources (data sets, software, and methods) in this work will help others to reproduce our results or compare their own against them. The performance of our SVM classification model may be further improved when more experimental solubility data become available.
We thank the Intramural Research Program of the National Institutes of Health (NIH), National Library of Medicine (NLM) for funding support. We also thank the NIH Fellows Editorial Board (FEB) for manuscript revision.
Recent modeling studies of aqueous solubility. Four common compounds in data set I and III. Chemical space for the compounds in data sets I−III as well as for those in the MLSMR and the PubChem BioAssay database. Definition, F-score, and weight of all the 1053 features from PC881, MACCS166, and ADD6. This material is available free of charge via the Internet at http://pubs.acs.org.