The complete performance statistics of each scoring function against either Set1 or Set2 is reported in . The correlation plot of each scoring function and the distribution of predictions are shown in and . The IDs of complexes with their predicted scores (or absolute pK

_{d} values) are reported in

Table S1 and the IDs of complexes whose binding affinities are under-predicted or over-predicted are reported in

Table S2 and

Table S3 for each scoring function. We will explain the performance of each scoring function in the following section and discuss the chemical moieties and protein families which tend to point to the complexes that are being under-predicted or over-predicted ().

| **Table 2**The statistics (R^{2}, R_{0}^{2}, MAE, RMSE_{0}, and RMDSE_{0} as well as number of complexes predicted) for predicting Set1 and Set2 with respective QSBAR models, MedusaScore, and the consensus approach. The description of metrics can be found in Methods. |

| **Table 4**Some of the chemical features associated with the under-predicted or over-predicted complexes |

MedusaScore

We calculate MedusaScore for both Set1 and Set2. We use the VDWR-excluded protocol with no additional parameter adjustment. There are four complexes that contain ligand atom types that are not yet parameterized by MedusaScore (trimethylsulfonium groups in complex #183 in set1, and #249 and #74 in set2, as well as the phosphoramide group in complex #18 in set2). The R^{2} values are 0.34 and 0.47, for Set1 and Set2, respectively. We also test the effect of adding the VDWR term in MedusaScore. The R^{2} are slightly decreased to 0.30 and 0.44 for Set1 and Set2, respectively. The slight decrease of accuracy is consistent with the previous observation^{6} that the VDWR term is more sensitive to small deviations in the complex structure, causing uncertainty for binding energy estimation. The observation that accuracy only slightly decreases after including the VDWR term for prediction also verifies that the CSAR data sets are of high quality and only minimal steric clashes exist in the structure of the complexes. The largest VDWR interaction energy is found to be 29.7 kcal/mol for complex #154 in Set1. There are other three complexes in Set2 (complex #225, #222, and #92) that also have VDWR term larger than 20 kcal/mol. These complexes are found to be under-predicted by MedusaScore.

In total, there are 31.3% of Set1 complexes and 34.7% of Set2 complexes considered as outliers based on the definition described in Methods. A majority of complexes belonging to the glutamate-related family (glutamate receptor 1, 2, 3, 4, 6) are under-predicted by MedusaScore. Closely inspecting the protein-ligand interactions in those complexes, we find salt-bridge interactions, which are ignored in the current version of MedusaScore, are dominant. Moreover, both of the two complexes in the family of ADAM17 are under-predicted. This might be due to ignoring of metal-mediated interactions (the catalytic Zinc) in the binding pocket, where metals directly contribute to the ligand binding.

We have also analyzed the structural fragments based on their tendency of occurring in outliers in comparison with the normal group. We find that the combination of thiolane/thiophene moiety and the sulfonamide (or amide) group tends to contribute to the under-prediction of certain complexes (). For example, the four protease complexes (1:158, 1:159, 1:160, 1:161) and three coagulation factor X complexes (1:52, 1:141, 1:196) are under-predicted by MedusaScore. The most interesting chemical scaffold is the thiazole group, which seems to be strongly associated with the under-prediction of binding affinity. The thiazole group can be found in the four protease complexes (*vide supra*) and two carbonic anhydrase-related complexes (1:206 and 1:222). Moreover, MedusaScore tends to over-predict complexes which contain phosphate groups connected to a sugar moiety (usually in a nucleoside ligand).

QSBAR models

After removing descriptors with high inter-correlation and low variance, there are 422 and 377 descriptors (out of 1108 descriptors) used in modeling building and validation of Set1 and Set2, respectively. The results of external n-fold cross validation from CSAR data set modeling are reported in . The average external n-fold cross-validation R^{2} is 0.45 for Set1 and 0.53 for Set2. Since each fold has rather small size (around 17 complexes), R^{2} values could have large fluctuation due to the random distribution of prediction outliers among folds. Therefore, we also take MAE and RMSE values into account in the evaluation of prediction accuracy. We analyze the outliers in the fold(-s) with the worst MAE and RMSE values (i.e., fold#2 in Set1 and fold#1 in Set2). We find that some of the outliers have special moieties (and thus, could be viewed as structural outliers), for example, the N^{5}-[(R)-amino(sulfoamino)phosphoryl] group (2:18) and the hydroxy(oxo)phosphoniumolate group (1:25), or the whole family (Lipocalin) of complexes (1:207 and 1:208) is not present in the modeling set. On the other hand, in spite of having close neighbors in the descriptor space, some complexes are still predicted poorly (e.g., 2:126), suggesting that further improvement of protein-ligand interfacial descriptors is needed.

| **Table 3**The statistics (R^{2}, MAE, and RMSE) for external n-fold validation sets using QSBAR models built from Set1 and Set2. |

The validated Set1 (Set2) models are applied to predict Set2 (Set1). The results are reported in . The prediction accuracy of Set2 using Set1 models is higher than the prediction accuracy of Set1 using Set2 models (i.e., R^{2} value is 0.44 *vs.* 0.53, respectively). It is an expected outcome, because QSAR-based models have difficulty extrapolating data points under-represented in the training set, and, indeed, Set1 has more data points at the extremes of the binding affinity distribution.

We analyze the prediction outliers as described in Methods. About 29.5% of Set1 complexes and 23.3% of Set2 complexes are considered ill-behaved (i.e., outliers) by QSBAR models. Around 1000 ISIDA fragments are generated for Set1 and Set2. After removing fragments with low variance or high correlation, around 600 fragments in either Set1 or Set2 remain for the permutation test. Upon the analysis, we find that the ligands, which contain the flavan moiety or the combination of thiolane/thiophene moiety and the amide group, tend to be under-predicted by QSBAR models (). The flavan moiety occurs in the ligand complexes of particular protein families. For example, the complexes belong to the estrogen receptor-β (1:42, 1:43) and estrogen receptor (1:33) family. Coincidentally, the features of thiolane/thiophene and sulfonamide group are also found to contribute to the under-prediction by MedusaScore (*vide supra*). On the other hand, the ligands with naphthalene moiety tend to be over-predicted only by QSBAR models (e.g., 2:19, 2:23, 2:44, and 2:77). Moreover, the carboxyalkyl phosphate scaffold (with or without metal coordination) is found to be associated with over- or under-prediction. We find that complexes whose ligands contain large hydrophobic moieties (e.g., flavan and naphthalene) in a hydrophobic environment tend to be mis-predicted; this points to the underlying assumption of PL/MCT-tess descriptors that protein-ligand binding is driven mostly by charge transfer interactions. Moreover, the hybridization of carbon is not taken into account in the current implementation of PL/MCT-tess and EnTess descriptors. These factors may contribute to the low accuracy of prediction for compounds containing large hydrophobic moieties.

Comparing prediction outliers from QSBAR models and from MedusaScore scoring function, we find that these groups do not completely overlap (

Figure S1) and the corresponding structural features associated with QSBAR outliers are distinct from the ones for MedusaScore. This outcome is not unexpected since these two types of scoring function employ completely different principles toward representing protein-ligand interactions. This also implies the possibility of improving overall prediction accuracy by combining the two scoring functions.

Distribution of chemical fragments of ligands in CSAR data set We also analyze the distribution of ligand chemical features (represented by ISIDA fragments) in the entire CSAR data set. shows the occurrence of each chemical fragment (in % to that of the CSAR data set) in Set1 and Set2 ligands. The fragments are sorted by predominance of occurring in Set1. Overall, Set1 is chemically more diverse than Set2. Approximately 70% of chemical fragments are more prominent in Set1 than in Set2, and around 4% are unique for Set1. On the contrary, all of the fragments predominant in Set2, though under-represented, can still be found in Set1. The fragments marked by circles or squares () are associated with previously identified prediction outliers (e.g. flavan, thiolane/thiophene and sulfonamide ligand features). As expected, these chemical fragments are not represented equally in Set1 and Set2. This analysis suggests that the predictive power of Set2 models can be improved by extending the Set2 data set.

Interpretation of descriptors selected by QSBAR models We analyze the descriptors selected by either Set1 or Set2 models (q^{2} ≥ 0.5 and R^{2} ≥ 0.6) based on their frequency of occurrence in the respective models. For each descriptor we calculate Z-score based on the frequency distribution of all selected descriptors in Set1 (Set2) models. shows descriptors sorted by the difference of their Z-scores in Set1 and Set2. We find that the descriptors whose tetrahedral type includes a metal are frequent in Set1 models (i.e. high Z-score) but not in Set2 models. This can explain some mispredictions of Set1 by Set2 models, because metal interactions are under-represented in Set2 data set. Moreover, those descriptors, whose tetrahedral type is related to the under-predicted outliers of Set1 (see scaffolds in ), are selected less frequently by Set2 models. Therefore, we expect that by expanding Set2 the prediction accuracy of the corresponding QSBAR models should improve.

Consensus scoring function

We optimize the b

_{1}, b

_{2}, and b

_{3} parameters of the combined scoring equation (see Methods:

Eq. 3) using Set1 predictions by QSBAR models and by MedusaScore. The R

^{2} value between the fitted combined score and the experimental binding affinity is 0.45 and the respective parameters (b

_{1}, b

_{2}, and b

_{3}) are 0.58, −0.03, and 0.82. Applying the trained scoring function to Set2 gives R

^{2} of 0.58, which is higher than the R

^{2} value by using QSBAR models and significantly higher than MedusaScore alone (p < 0.05 by permutation test, N=10,000) ). This suggests the complementarity of these two types of scoring functions. Consequently, we apply the same procedure to optimize the combined scoring equation using Set2 predicted scores. The resulting R

^{2} value is 0.58 and b

_{1}, b

_{2}, and b

_{3} parameters are 0.002, −0.03, and 0.87, respectively. Applying the trained scoring function to Set1 gives R

^{2} of 0.45, which is slightly higher than R

^{2} by QSBAR and significantly higher than MedusaScore alone (p < 0.05). The relatively limited improvement over the individual QSBAR model might be due to the poorer performance on Set1 by each of the individual scoring functions.

We also analyze the prediction outliers of the combined scoring function. There are about 27.2% of Set1 complexes and 33.5% of Set2 complexes considered as prediction outliers. The percentage of outliers in Set2 for the combined scoring function is not as low as in the case of QSBAR models despite the fact that the overall performance of the combined scoring function for Set2 is better. By analyzing chemical features of outliers, we find characteristic moieties that correspond to those obtained for QSBAR models. For example, the thiolane/thiophene moiety with sulfonamide group and flavan-related scaffolds are found in the ligands of under-predicted complexes and the naphthalene moiety is in over-predicted ones.