We first built and evaluated Laplacian-modified Naïve Bayesian KA-KI classifiers. For this we selected all datasets from the pre-processed (aggregated) KKB with a minimum of 10 active compounds, where active is defined as a p-transformed activity value of ≥ 6 (i.e. IC50 < 1 μM). A total of 188 kinase datasets qualify; the minimum dataset size (actives plus inactives) was 26. The classifiers were cross-validated by a leave-one-out analysis and by 10 repetitions of a randomized 75/25 training/test split (see methods). For both cross-validation methods, ROC scores and enrichment factors show that these KA-KI classifiers perform well for most of the datasets (supporting table S1). Good results are obtained for datasets with more than 40 active compounds (minimum 61 compounds total), which corresponds to 130 datasets with a ROC score (leave-one-out) of 0.7 or larger; the one exception is a dataset (MST1R) that has only two inactives, cannot be meaningfully evaluated, and was therefore excluded from further analysis (leaving 129 datasets). Raising the minimum number of actives to 70 further improves the results to ROC scores of 0.84 or greater (111 datasets with at least 113 total samples). ROC scores based on the leave-one-out and 75/25 train/test cross-validation procedures are well correlated for datasets with at least 40 actives (R2 = 0.73), and the correlation increases further for datasets with 70 actives or more (R2 = 0.77), as shown in figure S1 (supporting material). The correlation of the ROC scores of the two cross-validation procedures increases markedly for datasets with a ratio of total to active compounds of two or greater: for the datasets with at least 40 actives R2 is greater than 0.85, and for datasets with ≥70 actives R2 improves to ≥0.93; however, this reduces the number of datasets to 57 and 47, respectively. In general, the classifiers for the larger and more balanced datasets performed better.
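A minimal sketch of the Laplacian-corrected Naïve Bayes scoring scheme over binary fingerprint bit sets may help illustrate the method; function names and the smoothing convention k = 1/P_base are ours, and the actual models used ECFP-style fingerprints as described in the methods:

```python
import math

def train_lnb(samples):
    """Train a Laplacian-corrected Naive Bayes scorer.

    samples: list of (feature_set, is_active) pairs, where feature_set is a
    set of hashed fingerprint bits (e.g. ECFP4 bit indices).
    Returns a dict of per-feature log-weights.
    Assumes the training set contains at least one active.
    """
    total = len(samples)
    actives = sum(1 for _, y in samples if y)
    p_base = actives / total                      # baseline hit rate
    feat_total, feat_active = {}, {}
    for feats, y in samples:
        for f in feats:
            feat_total[f] = feat_total.get(f, 0) + 1
            if y:
                feat_active[f] = feat_active.get(f, 0) + 1
    k = 1.0 / p_base   # smoothing: shrinks rarely seen features toward baseline
    weights = {}
    for f, t in feat_total.items():
        a = feat_active.get(f, 0)
        p_corr = (a + p_base * k) / (t + k)       # Laplacian-corrected estimate
        weights[f] = math.log(p_corr / p_base)
    return weights

def score(weights, feats):
    """Additive log-likelihood score; higher = more likely active.
    Features never seen in training contribute zero."""
    return sum(weights.get(f, 0.0) for f in feats)
```

Features enriched among actives receive positive weights, features enriched among inactives negative ones; a compound's score is the sum over its fingerprint bits.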
For the 129 datasets with >40 actives, enrichment factors (EF) at 10% tested samples varied between ~1 and ~8 for both the leave-one-out and train/test cross-validation procedures. The maximum enrichment factor (EFmax) depends on the fraction of actives in the dataset. For all 129 datasets the actual enrichment factors reach the maximum possible enrichment, indicating well-performing classifiers. However, EF is of limited usefulness for characterizing classifiers based on balanced datasets. The nature of the datasets reflects what is typically reported in the literature, which emphasizes active compounds and reports only few (often structurally related) inactive compounds. In contrast, high-throughput screening (HTS) results include mostly inactive/negative results. Therefore, in practice, the KA-KI classifiers – although highly predictive – may be of limited utility. Highly miniaturized HTS allows a throughput of several hundred thousand to millions of compounds. Consequently, to recover a sizable fraction of hits by screening only a few hundred to a thousand compounds, 100- to 1000-fold enrichment or even greater would be desirable.
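The enrichment factor and its maximum can be computed as a simple ranking exercise; a sketch (names are ours; the normalized enrichment used later is the quotient EF/EFmax):

```python
def enrichment_factor(scores, labels, fraction):
    """EF at a given screened fraction, plus the maximum achievable EF.

    scores: predicted scores (higher = more likely active); labels: 1/0.
    EF compares the hit rate in the top-ranked `fraction` of the ranked
    list to the hit rate of the whole dataset. EF_max is reached when
    every top slot (or every active, if fewer) is a true active.
    """
    n = len(scores)
    n_top = max(1, round(n * fraction))
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    tp = sum(y for _, y in ranked[:n_top])        # true actives in the top
    actives = sum(labels)
    hit_rate = actives / n                        # dataset baseline
    ef = (tp / n_top) / hit_rate
    ef_max = (min(n_top, actives) / n_top) / hit_rate
    return ef, ef_max
```

For a balanced dataset, EFmax is small (near 2), which is why EF saturates and carries little information there, whereas for an HTS-like dataset with 0.1% actives EFmax at 0.1% screened can reach 1000.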
We investigated such a scenario by building KA-PI classifiers for the 189 kinase datasets with at least 10 actives as defined above. To build these classifiers we employed all (489,373) KKB kinase molecules as decoys; i.e. we presumed as inactive all compounds that were not specifically annotated as active for any given kinase. This resulted in 189 datasets, each consisting of 489,373 samples including 10 to 6,388 actives. Their size and unbalanced nature correspond quite well to real HTS datasets, for example those in PubChem.34
Laplacian-modified Naïve Bayesian classification is a suitable method to model our datasets of almost half a million unique kinase-related compounds (see methods), which can bind at different sites – for example ATP-competitive compounds, but also allosteric inhibitors – and with different binding modes, for example type I and type II inhibitors.35
Although many of the presumed inactive compounds did not have specific activity annotations, they were all reported in patents and journal publications that focus on kinase inhibitors. They are thus closely related in terms of their biological focus and in many cases structurally.
Similar to the KA-KI classification models, the KA-PI classifiers were validated by leave-one-out and 10 repetitions of train/test cross validation. We report ROC scores and enrichment factors at various percentages of samples screened. In addition to the 75/25 split we also evaluated the classifiers in a 50/50 and 25/75 training/test cross validation (10 repetitions averaged, see methods).
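A minimal sketch of the repeated randomized-split validation described above (names are ours; `fit` and `predict` stand in for the actual Bayesian model training and scoring):

```python
import random

def roc_auc(scores, labels):
    """ROC AUC via the rank-sum (Mann-Whitney) statistic: the probability
    that a randomly chosen active outranks a randomly chosen inactive."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def repeated_split_auc(samples, fit, predict, frac_train=0.75, reps=10, seed=0):
    """Average AUC over `reps` randomized train/test splits (e.g. 75/25).

    samples: list of (features, label) pairs. `fit` and `predict` are
    caller-supplied callables. Assumes every split contains both classes
    (in practice one would stratify the split).
    """
    rng = random.Random(seed)
    aucs = []
    for _ in range(reps):
        data = list(samples)
        rng.shuffle(data)
        cut = int(len(data) * frac_train)
        train, test = data[:cut], data[cut:]
        model = fit(train)
        aucs.append(roc_auc([predict(model, x) for x, _ in test],
                            [y for _, y in test]))
    return sum(aucs) / len(aucs)
```

The 50/50 and 25/75 evaluations correspond to `frac_train=0.5` and `frac_train=0.25`, respectively.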
Figure 1 shows the ROC score (leave-one-out) as a function of the number of active samples for all 141 kinase KA-PI classifiers with at least 25 actives. The 141 kinases cover all major groups of the human kinome and also several non-protein kinases. All 189 models are shown in supporting figure S2. As a general trend, the quality of the classifiers increases with the number of active samples. In particular, datasets with more than about 50 actives gave much improved results compared to those with fewer actives. For classifiers with more than a few hundred actives there appeared to be no further improvement in ROC score as the number of actives increased further. For the majority of models the ROC scores are very high (>0.96). All details are provided in supporting table S4, which shows ROC scores and enrichment factors for the leave-one-out and 75/25 train/test cross-validation procedures for all 189 KA-PI kinase classifiers, along with dataset statistics. Supporting figure S3 illustrates the relationship of ROC scores for leave-one-out vs. 75/25 train/test cross validation for KA-PI classifiers based on datasets with at least 10 active samples vs. datasets with at least 50 actives. As with the KA-KI models, the leave-one-out ROC score estimate was closely related to the ROC score obtained by 75/25 train/test validation, in particular for the datasets with a greater number of actives. Supporting table S7 shows the ROC plots for the KA-PI classifiers based on datasets with >50 actives using a 75/25 train/test split. While table S7 shows one ROC plot for each kinase/dataset, it should be noted that the ROC scores reported in table S4 are averaged over 10 repetitions.
Figure 1 Characterization of 141 kinase KA-PI protein and non-protein kinase classifiers, including all major protein kinase groups. ROC scores are shown as a function of active samples. Shape by protein vs. non-protein kinase, color-coded by kinase group, scaled …
While the ROC score is a good measure of a predictor's overall performance, enrichment, in particular at low percentages of screened compounds, is an important measure of a predictor's practical applicability. Enrichment factors (EF) at very low percentages of screened samples were very high for most kinase KA-PI classifiers. Figure 2 illustrates the normalized enrichment factor, that is the ratio EF/EFmax (see methods), at 0.1% tested samples for all 189 kinase classifiers based on randomized 75/25 train/test cross validation averaged over 10 repetitions. Most classifiers retrieve true actives very well if the number of actives in the dataset is greater than about 50. Enrichment for these classifiers is generally greater than 50% of EFmax, suggesting that the kinase classifiers presented here are practically applicable for virtual screening. Enrichment increased significantly for classifiers based on datasets of more than ~50 actives. This indicates a minimum required size of the active class of the training set to reliably retrieve true positives from the test set, which is also reflected in the ROC scores (compare Figure 1). Absolute EF values for 0.1, 0.5 and 1.0% of tested compounds are provided in supporting figure S4 and in supporting table S4.
Figure 2 Normalized enrichment factors for 189 KA-PI kinase classifiers at 0.1% screened samples for 75/25 (train/test) cross validation (10 repetitions averaged) as a function of active compounds in the dataset. Shape by protein vs. non-protein kinase, color-coded …
Figure 2 suggests the highest enrichments for datasets that have between 50 and 250 active molecules. The normalized enrichment factors were slightly lower and relatively stable around 0.5 for datasets with more than a few hundred actives. A possible reason for the higher enrichment in datasets with smaller numbers of active compounds, which are derived from only a few studies, may lie in an overrepresentation of scaffolds (analog bias; see below for domain applicability results). However, this is less likely for the larger datasets, which have been extracted from a large number of articles and patents. It should also be emphasized that the decoy (presumed inactive) compounds were all derived from the same kinase literature as the active compounds and are therefore closely related in terms of their biological focus and in many cases also structurally. Normalized enrichment factors obtained by 75/25 train/test cross validation at 0.1, 0.5 and 1.0% tested samples are shown in supporting figure S5. As the percentage of tested samples increases, a larger fraction of classifiers show a very high ratio EF/EFmax (>0.8). Although expected, because the maximum possible enrichment decreases, these results also indicate that it is more difficult to retrieve a certain fraction of true positives in a smaller set of sampled compounds than in a larger one, i.e. at 0.1% vs. 0.5% or 1.0% tested compounds.
To further validate the KA-PI classifiers, we followed the standard procedure of randomizing the kinase activity labels (maintaining the number of actives for each kinase dataset). Leave-one-out cross validation then resulted in ROC scores close to 0.5 and EF values of approximately the ratio of actives to the overall number of samples. As expected, this corresponds to random classification.
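The label-randomization (y-scrambling) control can be sketched as follows (names are ours); the key property is that the number of actives per dataset is preserved while any structure-activity relationship is destroyed:

```python
import random

def y_scramble(labels, seed=0):
    """Randomize activity labels while keeping the number of actives fixed.

    A standard negative control: retraining a classifier on scrambled
    labels should drop the ROC score to ~0.5 and the enrichment factor
    to the random baseline (the dataset's active fraction).
    """
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)   # permutes labels only
    return shuffled
```
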
We also investigated how the ratio of training to test data influenced the ROC and enrichment cross-validation results for the various kinase datasets. In addition to 75/25 (supporting table S4) we split the data 50/50 and 25/75 (supporting table S5). ROC scores were slightly higher for larger compared to smaller training sets (supporting figure S6), indicating improved overall predictability, which is also consistent with the general trend of the ROC scores as a function of the number of active samples (Figure 1 and supporting figure S2). More specifically, as the ratio of training to test compounds decreased, the number of active compounds required in a dataset to give a very good ROC score (>0.96) increased. This suggests a threshold number of active (training) compounds required to develop very good predictors.
In contrast to the increased overall predictivity (measured by ROC score) for larger train/test ratios, enrichment factors at very low percentages of tested compounds (0.1% and 0.5%) increased slightly with lower ratios of train/test sets. Figure 3 illustrates the average ratio EF/EFmax at 0.1% tested samples for the different train/test ratios, across datasets binned by ranges of numbers of active compounds. Except for the lowest bin of active compounds (0 to 400), enrichment at 0.1% increases as the train/test ratio decreases. Supporting figures S7 and S8 show EF and EFmax for each individual dataset as a function of active compounds for the three train/test ratios at 0.1 and 0.5% tested samples, respectively. Supporting figure S9 shows the normalized enrichment factors (EF/EFmax) for each dataset at 0.1% tested samples, illustrating the same trend. Although the ratio of active to inactive compounds is the same among the different train/test splits, these results suggest that it is “easier” for the classifier to select true actives from a test set with a larger (in absolute numbers) pool of active compounds. Because the trend holds for the datasets with the highest numbers of actives, it is likely not a trivial analog bias, although the classifiers are based on structural features and true positive test compounds are therefore by definition structurally related to the active training compounds. More importantly, the results again indicate that once a certain number of active compounds is available, classifier performance does not increase significantly with additional data. This is consistent with the validation results described above based on ROC and EF.
Figure 3 Average normalized enrichment (EF/EFmax) at 0.1% tested samples for different train/test ratios binned by ranges of active compounds across 141 kinase datasets with at least 25 actives (EF averaged over 10 repetitions).
To evaluate how the similarity of test to training compounds affects model performance we performed a simple domain applicability study (Figure 4, compare methods). Datasets with at least 500 active compounds were selected (53 kinase datasets, representative of most of the human kinome) and the actives of each dataset were clustered into 10 series. Ten models were generated per dataset, each using one cluster as the test set and the remaining nine clusters combined as the training set. This way predictions are made for compounds that are structurally dissimilar to the training compounds. In total, 530 models were generated and evaluated. Figure 4A shows a histogram of closest test- to training-set similarities by ROC score; results are shown for datasets with at least 10 active test compounds (443 models). Specifically, we investigated ROC score as a function of the similarity of the test set cluster center to the closest training compound (Figure 4B), the average of the closest similarities of the test set compounds to the training set compounds (Figure 4C), and the similarity of the closest training compound to any test compound (Figure 4D). The best criterion of model applicability is the average closest similarity of the test compounds to the training compounds (panel C). The models show good predictivity (ROC > 0.8) even for relatively low average closest similarities (> 0.5).
Figure 4 Model domain applicability. KA-PI model performance as measured by ROC score as a function of similarity of test to training sets. 443 models shown for 53 kinases (representative of most of the human kinome). A) Average closest similarities by binned …
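The leave-one-cluster-out scheme and the applicability criterion of panel C can be sketched as follows (a pure-Python sketch over fingerprint bit sets; names are ours):

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprint bit sets (e.g. ECFP4 bits)."""
    inter = len(a & b)
    union = len(a) + len(b) - inter
    return inter / union if union else 0.0

def leave_cluster_out(clusters):
    """Yield (train, test) pairs: each cluster once as the test set, the
    remaining clusters combined as the training set."""
    for i, test in enumerate(clusters):
        train = [fp for j, c in enumerate(clusters) if j != i for fp in c]
        yield train, test

def avg_closest_similarity(test_fps, train_fps):
    """Average, over test compounds, of the similarity to the single
    closest training compound (the criterion of panel C)."""
    return sum(max(tanimoto(t, tr) for tr in train_fps)
               for t in test_fps) / len(test_fps)
```

Because each test cluster is held out as a whole, the average closest similarity stays well below what a random split would give, which is what makes the study a probe of the models' applicability domain.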
In addition to the ROC score for each model (cluster), we recorded the predictions of all active compounds over all 530 models (98,731 predictions total) and evaluated the hit rate as a function of the similarity of the predicted test compound to the closest training compound (supporting figure S10). Figure S10A shows the true positive rate (TPR) and figure S10B the distribution of true positives and actives as a function of the closest similarity of the test compound to the training set. TPR drops off sharply as the similarity decreases below 0.5, and TPR is greater than 0.8 for compounds with a Tanimoto similarity of > 0.6 (ECFP4).
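The TPR-by-similarity analysis can be sketched as follows (names are ours; the bins and records are illustrative):

```python
def tpr_by_similarity(records, edges):
    """Bin active-compound predictions by closest training-set similarity
    and compute the true positive rate (TPR) per bin.

    records: list of (closest_similarity, predicted_active) pairs for
    compounds that are truly active; edges: ascending bin edges,
    e.g. [0.0, 0.5, 0.6, 1.0]. Returns (lo, hi, tpr) triples, with
    tpr None for empty bins; the last bin includes its upper edge.
    """
    out = []
    for lo, hi in zip(edges, edges[1:]):
        in_bin = [p for s, p in records
                  if lo <= s < hi or (hi == edges[-1] and s == hi)]
        out.append((lo, hi, sum(in_bin) / len(in_bin) if in_bin else None))
    return out
```
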
To illustrate the potential applicability of the KA-PI models for virtual screening, Figure 5 shows the enrichment results at 0.1% tested samples as the fraction of true positives obtained from the test set vs. the fraction of actives in the entire dataset. As expected, the true positive rate increases with the ratio of active to total compounds until the latter reaches 0.1%, after which the fraction of true positives identified from the test set remains relatively constant between 40 and 80 percent. In this plot, the enrichment factor achieved by each classifier at 0.1% is the quotient of the y and x values. In practical terms, were the KA-PI classification models applied to prioritize 500 compounds from a library of about 500,000, one may expect to recover anywhere from 10 to 400 actives, depending on the number of actual actives for a given kinase target. These models thus appear practically applicable, assuming that our datasets represent the kinase inhibitor chemical space reasonably well.
Figure 5 Fraction of true positives (TP) of the number of tested compounds in 0.1% of the test set as a function of the ratio of actives to total compounds for different ratios of training to test data. Also indicated is the enrichment factor as the size of the …
The performance of the Laplacian-modified Naïve Bayesian kinase KA-PI classifiers, based on ROC scores and enrichment at very low percentages of tested compounds, indicates that the datasets employed here are well suited to building highly predictive kinase classification models. The results consistently suggest that the best classifiers are obtained if a conservative minimum of 50 active training compounds is available against a large decoy set. Performance of the models can be further improved with additional active compounds up to a few hundred, but does not improve much beyond that. Given the low ratio of structure-activity data points to unique compounds in the KKB, it is likely that some of the presumed inactive (decoy) compounds are in fact active against one or several kinases. It is therefore plausible that the kinase KA-PI classifiers can be further improved as more kinase profiling datasets become available.
In addition to the Naïve Bayes binary classifiers we also wanted to evaluate how suitable the datasets are for the development of quantitative predictors. We chose Partial Least Squares (PLS) and k-Nearest Neighbor (kNN) regression as two fairly different learning methods to quantitatively predict a continuous property. PLS and kNN regression are considered most applicable for datasets of congeneric (structurally similar) molecules. The methods can be sensitive to outliers and require high-quality data. Here, 168 kinase datasets with at least 20 exact molecule-activity data points were extracted from the standardized and aggregated KKB datasets (see methods). PLS and kNN QSAR models were built as described and evaluated by 10-fold cross validation, which was further repeated 10 times, randomly partitioning the data each time. All cross-validation results (including q2 and RMSE) and dataset statistics are summarized in supporting table S2. Figure 6 illustrates q2 values for the kNN vs. PLS models along with the number of structure-data points for each kinase. The kNN method in general outperformed PLS, and the performance (measured by q2) of the two methods correlated reasonably well. From Figure 6 one can also discern a trend in which the predictive quality of both the PLS and kNN models generally improved with the size of the datasets. Although this trend was not as strict as what we observed for the Naïve Bayes classifiers, a cutoff of kNN q2 ≥ 0.4 and PLS q2 ≥ 0.25 leaves 91 kinase datasets, all except three having at least 50 structure-data points (also compare Figure 8). For kinases with 500 or more data points, all but one model have kNN q2 values of > 0.5.
Figure 6 Quantitative regression models developed from 168 kinase datasets. q2 values of kNN vs. PLS regression models and the number of kinase activity data points indicated by the circle size and color (see text).
Figure 8 kNN and PLS activity predictors for 91 kinases (q2 kNN > 0.4 and q2 PLS > 0.25) by the number of data points; datasets include protein and non-protein kinases and all major kinase groups. kNN q2 is shown by number of (unique) structure-data …
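As an illustration of the quantities involved, a minimal kNN regressor and the q2 statistic can be sketched as follows (a pure-Python sketch with our own names; the actual models used the descriptors and implementations described in the methods):

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprint bit sets."""
    inter = len(a & b)
    union = len(a) + len(b) - inter
    return inter / union if union else 0.0

def knn_predict(train, fp, k=3):
    """kNN regression: mean activity of the k training compounds most
    similar to the query fingerprint.

    train: list of (fingerprint_bit_set, activity) pairs.
    """
    nearest = sorted(train, key=lambda item: -tanimoto(item[0], fp))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def q_squared(y_true, y_pred, y_train_mean):
    """Cross-validated q^2 = 1 - PRESS/SS, where PRESS is the predictive
    residual sum of squares and SS the total sum of squares about the
    training mean. 1.0 is perfect; <= 0 means no better than always
    predicting the mean activity."""
    press = sum((p - t) ** 2 for p, t in zip(y_pred, y_true))
    ss = sum((t - y_train_mean) ** 2 for t in y_true)
    return 1.0 - press / ss
```

Because SS is taken about the mean, datasets spanning a narrow activity range have a small denominator, which is one reason q2 is harder to achieve for them (see the activity-range analysis below).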
Another important characteristic influencing the quality of both the PLS and kNN models was the activity range of the datasets. Figure 7 illustrates the average q2 for both kNN and PLS as a function of the p-transformed activity range, which corresponds to the orders of magnitude between the most and least active compound. PLS and kNN q2 values continuously increased from 0.1 to > 0.8 as the activity range increased from 1 to 12 orders of magnitude. To evaluate the activity range across structural series and to see how structural series distribute across the different kinase datasets, we clustered the compounds into 336 clusters and calculated activity ranges for each kinase for each cluster (see methods and supporting table S3). Supporting figure S11 shows the total number of kinases and the activity range of each cluster. These results show that many structural series span a wide activity range and many kinases, suggesting that scaffold bias should not be a general concern for models built using these datasets.
Figure 7 Average q2 for kNN and PLS regression cross-validation results as a function of the p-value range (defined as p-valuemax − p-valuemin) of all 168 kinase datasets.
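For concreteness, the p-transformation and activity range referred to above are simply (a small sketch; function names are ours):

```python
import math

def p_transform(ic50_molar):
    """p-transformed activity: pIC50 = -log10(IC50 in mol/L).
    The activity cutoff used in this work, IC50 < 1 uM, corresponds
    to a p-value > 6."""
    return -math.log10(ic50_molar)

def activity_range(p_values):
    """Activity range of a dataset in orders of magnitude:
    p-value_max minus p-value_min."""
    return max(p_values) - min(p_values)
```
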
Figure 8 shows the kNN and PLS models (characterized by q2 and the number of structure-data points) for the 91 kinases (selected from 182 total models) with kNN q2 ≥ 0.4 and PLS q2 ≥ 0.25. These include protein and non-protein kinases and all major kinase groups. The quality of the majority of the models is quite good as indicated by their q2 and also R2 values (compare supporting table S2). PLS and kNN regression are very different machine learning methods. For many of the kinase datasets we obtained high-quality models with both methods, in particular for the larger datasets and distinctly for those spanning a wider activity range. These kinase data appear particularly well suited for quantitative modeling and, by that measure, they are of high quality.
Application of classification models to LINCS kinase profiling data
Because of their exceptional performance, we tested the KA-PI classifiers on data recently generated in the NIH LINCS project (see introduction). The recent KINOMEScan screening technology11,13 is used at the Harvard Medical School LINCS center to profile compounds against 486 targets. We applied the KA-PI classification models to the 43 compounds with known structures. Figure 9 illustrates the probability that a compound is active against a kinase and the corresponding actual activity based on the KINOMEScan profiling results. Shown are all data with greater than 10% probability of activity for targets that could be mapped to the KINOMEScan results (see methods). All compounds, targets, predictions and percent inhibition values are given in supporting table S8. All but three activity predictions are confirmed by these screening results given a probability cutoff of 10% and an activity cutoff of >60% inhibition; the majority of actives even show greater than 90% inhibition. At a 60 percent probability cutoff, there is only one outlier, shown in Figure 9 (CSK): predicted as highly likely active, but with 0 percent inhibition. This appears to be a false positive; PD173074, an FGFR inhibitor, has a reported IC50 of ~20 μM for CSK (KKB, Eidogen-Sertanty). Overwhelmingly, however, the kinase model activity predictions are in agreement with the KINOMEScan profiling results.
Figure 9 Probability (EstPGood) of compounds predicted active against a kinase based on KA-PI kinase classifiers and actual KINOMEScan percent inhibition values (at 10 μM); compare supporting table S8. Kinases classified by groups and protein vs. non-protein …
In addition to the correct predictions of active kinase inhibitors, the classifiers’ predicted activities overall correspond very well to the LINCS KINOMEScan results. Figure 10 illustrates the aggregated (summed) probabilities of activity (by kinase category) for all mapped compound-target combinations (see methods), binned by percent inhibition. The predicted kinase activity probabilities are much greater for compounds reported as active in the profiling results, and in particular for the most active category (>90% inhibition). Supporting figure S12 shows the histogram of mapped KINOMEScan activity results, with the vast majority being inactive (< 10% inhibition) and corresponding to very low predicted probabilities of activity (Figure 10).
Figure 10 Aggregated predicted probabilities (EstPGood) of compounds being active against kinases (based on the KA-PI classifiers) as a function of the actual KINOMEScan percent inhibition ranges; by category of kinase group and protein vs. non-protein kinase; …
These results illustrate the very good performance of the classifiers. Although expected, it should be noted that, because KINOMEScan profiling was performed at a relatively high screening concentration of 10 μM, a relatively large percentage of compounds appear active (figure S12); this is in contrast to the predictions, which are based on models with an activity cutoff of IC50 ≤ 1 μM. Nonetheless, the KINOMEScan results are in very good concordance with the KA-PI model predictions.