Performance evaluation of subsite prediction methods
Multiple sequence alignments of 20 families (the validation dataset) were used to identify actual subsites. The 195 experimentally supported subsites from these families served as the gold standard for evaluating five prediction methods: SPEER [20], GroupSim [16], MultiRELIEF [17], SDPpred [12] and SPEL [18]. The prediction sensitivities of these five methods are shown (Figure ) as Receiver Operating Characteristic (ROC) curves, where sensitivity is plotted against the error rate (percentage of false positives). ROCn statistics for the individual methods are provided in Table 1. As can be seen from Figure and Table 1, SPEER, GroupSim and MultiRELIEF clearly perform better than the other two methods, with sensitivities at a 5% error rate of 54%, 38% and 40%, respectively (Additional file 1). A similar trend is observed in the precision-recall (PR) curves, where precision (TP/(TP+FP)) for each method is plotted on the y-axis and recall (TP/(TP+FN)) on the x-axis (Additional file 2). It should be noted that SPEL does not take full advantage of the curated subfamily clustering provided in the validation testset, because SPEL performs the clustering automatically along with the subsite identification. When no information on subfamilies is available, automatic clustering is advantageous, but analyzing such cases is beyond the scope of this paper.
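The ROC-style measure quoted above (sensitivity at a fixed error rate) can be illustrated with a minimal sketch. It assumes that each method returns a ranked list covering all alignment columns and that the error rate is the fraction of non-subsite columns predicted so far; the function name and the toy data are hypothetical and are not taken from any of the evaluated tools.

```python
def sensitivity_at_error_rate(ranked_sites, actual_sites, max_error_rate):
    """Fraction of actual subsites recovered while walking down a ranked
    list, stopping once the false-positive rate exceeds max_error_rate.

    Assumption: ranked_sites contains every alignment column exactly once,
    best-scoring first, so the negatives are all columns not in actual_sites.
    """
    actual = set(actual_sites)
    n_pos = len(actual)
    n_neg = sum(1 for site in ranked_sites if site not in actual)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one actual subsite and one other site")

    tp = fp = 0
    for site in ranked_sites:
        if site in actual:
            tp += 1
        else:
            fp += 1
            if fp / n_neg > max_error_rate:
                break  # this false positive pushes us past the allowed rate
    return tp / n_pos


# Hypothetical ranked prediction for a ten-column alignment
ranked = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]
actual = {"A", "C", "F"}
print(sensitivity_at_error_rate(ranked, actual, 0.15))
```

Sweeping `max_error_rate` from 0 to 1 and plotting the returned values would trace out one curve of the kind described in the text.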
Table 1. Comparison of ROCn statistics for different methods (see Methods for definition).
Prediction of potential subsites and their structural properties
Based on the performance assessment on the validation dataset (195 subsites from 20 family alignments), the three best-performing methods, namely SPEER [20], GroupSim [16] and MultiRELIEF [17], were further employed to identify new potential subsites. Results (top 15 predicted sites excluding the actual subsites) from these three methods were compared, and sites that were predicted by all three methods (C3 sites) or by any pair of methods (C2 sites) were selected as new potential subsites. Additional file 3 lists the 264 new potential subsites (135 C3 sites, 129 C2 sites) for all families.
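The consensus selection described above can be sketched as a simple voting scheme. This is an illustration under stated assumptions, not the authors' implementation: actual subsites are removed before the top 15 are taken, C2 is read as "exactly two methods" (so that C3 and C2 do not overlap), and the site labels and function name are hypothetical.

```python
from collections import Counter


def consensus_sites(predictions, actual_sites, top_n=15):
    """Split novel predicted sites into C3 (all methods) and C2 (two methods).

    predictions: dict mapping a method name to its ranked list of sites.
    Assumption: exactly three methods are supplied, as in the text.
    """
    actual = set(actual_sites)
    votes = Counter()
    for ranked in predictions.values():
        # Drop known subsites first, then keep the top_n remaining sites
        novel = [site for site in ranked if site not in actual][:top_n]
        votes.update(set(novel))

    n_methods = len(predictions)
    c3 = {site for site, n in votes.items() if n == n_methods}
    c2 = {site for site, n in votes.items() if n == 2 and n != n_methods}
    return c3, c2


# Hypothetical alignment columns ranked by three methods
toy_predictions = {
    "SPEER": [12, 40, 7, 55, 23],
    "GroupSim": [40, 12, 55, 31, 7],
    "MultiRELIEF": [55, 12, 23, 40, 64],
}
c3_sites, c2_sites = consensus_sites(toy_predictions, actual_sites={7}, top_n=3)
print(sorted(c3_sites), sorted(c2_sites))
```

Because the sets carry no combined score, the output is unranked, which is the limitation the text goes on to discuss.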
Since the sets of C3 and C2 sites do not include actual subsites and are not assigned a combined rank or score (this would require combining scores from different methods, which is a non-trivial task), it is difficult to validate the performance of the ensemble approach directly. To estimate the performance, we defined subsites predicted by three or two methods among the top 15 predicted sites including actual subsites (C3' and C2' sites). Altogether we identified 141 such C3' and 129 C2' sites, calculated the PR statistics and compared them with those of each individual method (Additional file 4). As expected, C3' and C2' sites are stronger in reliability (precision) than in sensitivity (recall) when compared with the individual prediction methods.
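Since the consensus sets are unranked, their PR statistics reduce to a single precision/recall pair per set. The sketch below applies the definitions precision = TP/(TP+FP) and recall = TP/(TP+FN) to sets of sites; the toy numbers are hypothetical and only illustrate how a stricter consensus can gain precision while losing recall.

```python
def precision_recall(predicted, actual):
    """Precision and recall of an unranked set of predicted sites."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)   # predicted and experimentally supported
    fp = len(predicted - actual)   # predicted but not supported
    fn = len(actual - predicted)   # supported but missed
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if actual else 0.0
    return precision, recall


# Hypothetical example: a consensus set versus one method's top list
actual_subsites = {1, 2, 3, 4}
consensus_set = {1, 2, 9}
single_method = {1, 2, 3, 9, 10, 11}

print(precision_recall(consensus_set, actual_subsites))
print(precision_recall(single_method, actual_subsites))
```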
Distribution of spatial distances
Experimental validation is the most reliable way to verify predicted subsites. In the absence of such a rigorous protocol, an alternative is to examine structural features that are characteristic of actual subsites (such as the distribution of their spatial distances, solvent accessibility, secondary structure content and hydrogen bonding patterns) and to compare them with the corresponding features of predicted subsites.
Figure shows the distributions of spatial distances between actual subsites and between potential subsites (Figure ), and of the distances from actual and potential subsites to the specific ligand/substrate (Figure ). As can be seen from Figure , the mode of the pairwise distance distribution of the actual subsites is shifted toward shorter distances compared with C3-C3 distances, and this shift is more pronounced with respect to C2-C2 distances. Indeed, the majority of site pairs fall within 20 Å, and within this distance range the distribution means are statistically different (p-value << 10^-5). Interestingly, for distances below 20 Å, the more reliable the prediction (C3 rather than C2 sites), the closer the potential subsites are to each other and the closer their distance distribution is to that of the actual subsites. For distances above 20 Å the situation is different: the actual subsite distance distribution has a longer tail, corresponding to subsites located far from each other.
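A pairwise distance distribution of this kind can be computed as in the following minimal sketch. It assumes each site is represented by a single 3D coordinate in Å (for example a C-alpha atom, which is an assumption here since the text does not specify the atom choice); the coordinates and function names are hypothetical.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available from Python 3.8


def pairwise_distances(coords):
    """All pairwise Euclidean distances between site coordinates."""
    return [dist(a, b) for a, b in combinations(coords, 2)]


def fraction_within(distances, cutoff):
    """Share of distances that do not exceed the cutoff (same unit)."""
    if not distances:
        return 0.0
    return sum(d <= cutoff for d in distances) / len(distances)


# Hypothetical coordinates (in Å) for three sites of one family
site_coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 30.0)]
site_distances = pairwise_distances(site_coords)
print(fraction_within(site_distances, 20.0))
```

Running the same two functions separately on actual, C3 and C2 sites would give the three distributions that the text compares.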
Figure shows the spatial distances of actual and potential subsites from the specific substrates/ligands. As can be seen from this figure, a larger fraction (66%) of actual subsites is in close contact (<= 10 Å) with substrates/ligands compared with C3 and C2 sites (52% and 46%, respectively). This difference is even more prominent at closer range (<= 5 Å), where 43% of actual subsites are found compared with only 17% of C3 and 12% of C2 sites. This might indicate that C3 and C2 sites interact indirectly with the specific substrates/ligands. The results also show that sites supported by more methods (C3 sites) agree better with the actual subsite-ligand distances, another indication that the analysis of distance distribution patterns can provide a means to validate prediction accuracy.
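The site-to-ligand fractions quoted above can be reproduced in outline as follows. This sketch assumes that the distance of a site to a ligand is the minimum distance from one representative site coordinate to any ligand atom; that definition, the function names and the coordinates are illustrative assumptions rather than the procedure used in the paper.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def min_distance_to_ligand(site_coord, ligand_atoms):
    """Shortest distance from one site to any atom of the ligand."""
    return min(dist(site_coord, atom) for atom in ligand_atoms)


def fraction_near_ligand(site_coords, ligand_atoms, cutoff):
    """Share of sites whose minimum ligand distance is within the cutoff."""
    if not site_coords:
        return 0.0
    near = sum(
        min_distance_to_ligand(site, ligand_atoms) <= cutoff
        for site in site_coords
    )
    return near / len(site_coords)


# Hypothetical ligand atoms and site coordinates (in Å)
ligand = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
sites = [(4.0, 0.0, 0.0), (0.0, 8.0, 0.0), (0.0, 0.0, 25.0)]
print(fraction_near_ligand(sites, ligand, 10.0))
print(fraction_near_ligand(sites, ligand, 5.0))
```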
Structural properties of actual and predicted subsites
Important structural characteristics such as solvent accessibility, secondary structure content and hydrogen bonding patterns of actual and predicted subsites were analyzed and compared. Figure shows the solvent accessibility, secondary structure content and hydrogen bonding patterns of actual subsites (a), C3 sites (b) and C2 sites (c). Overall, the distributions of structural properties of potential subsites are not very different from those observed for actual subsites or all sites. As can be seen from this figure, subsite prediction methods tend to overpredict sites in beta-strands and underpredict sites in solvent-accessible areas and coils, which are less evolutionarily conserved than protein cores.
Examples of predicted subsites
Actual and potential subsites are shown for four protein families in Figure . For the IDH_IMDH family, SPEER, GroupSim and MultiRELIEF identified 10, 8 and 6 actual subsites, respectively, at a 15% error rate. In addition, three other sites (N305, H229 and A323) were predicted by all three methods (within the top 15 predicted sites excluding actual subsites). Figure maps the actual subsites, along with sites predicted by all three methods (three C3 sites; colored green) or by any two methods (nine C2 sites; colored blue), onto the 3D structure of a representative protein from the IDH_IMDH family. Spatial mapping of the potential subsites shows that two of the three C3 sites (N305 and A323) lie within 10 Å of the specific cofactor NADP (shown in cyan) or the specific ligand isocitrate (shown in purple). In addition, five C2 sites (G101, L103, T104, E154 and Y308) are less than 10 Å from the NADP or isocitrate molecule.
For the nucleotidyl cyclase family, both actual subsites were identified by SPEER, GroupSim and MultiRELIEF within a 15% error rate (Figure ). Eight potential C3 sites and three C2 sites fall within 10 Å of the specific activator (forskolin; shown in purple) or the P-site inhibitor molecules (2'-deoxy-3'-AMP and pyrophosphate; shown in cyan).
SPEER and GroupSim successfully predicted both actual subsites (D189 and A221) for the serine protease family, while MultiRELIEF failed to identify one subsite (D189) within a 15% error rate. In addition to the actual subsites, seven sites were predicted by all three methods. Figure provides a representative structure of trypsin with the actual subsites and the commonly predicted subsites (C3 and C2 sites). All C3 sites lie less than 10 Å from the specificity-determining serine residue (marked in purple), whereas three C2 sites lie within 5 Å of the serine residue.
Finally, nine C3 and seven C2 sites were identified for the lactate-malate dehydrogenase (LDH_MDH) family. Figure shows a representative structure of lactate dehydrogenase complexed with the cofactor NAD (marked in cyan in Figure ) and the ligand oxamate. Predicted C3 and C2 sites were also projected onto the lactate dehydrogenase structure. 3D structural images were generated using the PyMOL software [23].
Prediction of potential subsites using automatic family clustering
To check whether the use of automatic family clustering and the lack of manual curation affect subsite prediction accuracy, we predicted subsites for six protein families obtained from the Proteinkeys database (Additional file 5; prediction dataset) that have automatically defined subgroups with at least three protein sequences. The three best-performing prediction methods (SPEER, GroupSim and MultiRELIEF) were applied to this testset to identify new candidate subsites for specificity determination (Additional file 6). Since there is no information on the actual subsite locations for the automatically determined alignments from the prediction testset, we applied structural analysis of C3 and C2 sites, which, as shown in the previous section, may be used indirectly to validate subsite prediction accuracy. Potential subsites for the six families, as suggested by the common prediction of all three methods or of any two methods, are listed in Additional file 6. In total, 24 C3 and 47 C2 sites were identified for the six families. These C3 and C2 sites could be important in determining specificity and are therefore suitable targets for mutagenesis experiments. Figure shows these predicted C3 and C2 sites projected onto representative structures from the six families. Commonly predicted C3 and C2 sites are shown as space-filling models and are colored green and blue, respectively. 3D structural images were generated using the PyMOL software [23].
Spatial distances among the C3 and C2 sites were also analyzed. We observed that 90% of C3 sites are located within 20 Å of each other (Additional file 7), whereas 80% of C2 sites lie within 20 Å. Overall, we observed similar distributions of structural properties for potential (C3 and C2) subsites from the prediction testset and for C3 and C2 sites identified from the validation testset (Figure , Additional file 8). One exception is solvent accessibility, which tends to be larger for potential sites from the prediction testset.