Despite considerable algorithmic advances in the field, none of the prediction tools for PKA-dependent phosphorylation previously described in the literature achieves specificity and sensitivity rates both above 90%. In our view, several biological and computational aspects contribute to this development. Among them, there are several problems: (a) with the availability of experimental data, (b) with serine/threonine-rich regions, (c) with the incorporation of the available physico-chemical and biological knowledge into the scoring function used to discriminate between productively interacting substrates from non-permissive sequence stretches, (d) with the issue of accessibility of the phosphorylation motif within the protein's three-dimensional structure (intrinsically disordered regions surrounding the site) [27
] and (e) with the issue of hidden signals (proteins that would be phosphorylated if they were in contact with PKA but which never have the appropriate biological context during their life cycle) [41
With regard to point (a), little experimental data is available for most kinases with regard to the sequence variability of substrates, structural detail of kinase-substrate complexes, kinetic or energetic aspects of the interaction. Only a few kinases including PKA are reasonably well studied in this respect. For example, the number of sequentially dissimilar substrate sequences for the reliably parameterization of the profile term was estimated at least 200 in [50
]. This number is reached for PKA even in the neighbor jackknife test (among the 239 sequence examples, the number of excluded sequences has never been above 10) and the results of this test show that stable profile parameterization has almost been achieved.
The issue of many serines/threonines in the sequence (b) is especially challenging since ST-rich regions are common in intra- and extracellular proteins. To detect phosphorylated proteins on a large-scale basis, every single potential site in a sequence must be taken into consideration. If Sp
(measured as value between 0 and 1) is the rate of correct rejection of a site and if there are n
serine/threonine residues in a query sequence, the specificity of the task for classification of query proteins decreases significantly (to
<< 1). Considering the difficulties associated with prediction of potentially multiple sites in ST-rich regions, it is clear that very high accuracies are needed if such algorithms are to be applied routinely on a large-scale proteome basis.
The incorporation of the heterogeneous knowledge about the PKA-substrate protein relationship into the scoring function (issue c) is a non-trivial problem. Experimental reports usually do not provide the knowledge in the form that is necessary for formulating algorithms. There are two ways to deal with this problem – either to take the information as is and to hope that machine learning procedures filter the aspects of the data that are relevant for prediction, or to formulate a physically reasonable model of productive binding events with the kinase directly. Machine learning approaches have shown their usability in a variety of applications, especially in cases where lots of uniform data are available. The classical example is SIGNALP, for the derivation of which learning sets in the size of thousands of substrate proteins were collected [42
In many other biological applications, the data situation is by far not that comfortable. In such circumstances, the usage of machine learning software packages as "black boxes" for autonomous extraction of score function parameters without human interference and without explicitly considering the physico-chemical and biological realities of the problem under study can become dangerous. In their letter to the editors of the Biophysics Journal in 1994 [51
], Frank Darius and Raul Rojas analyze the difficulties arising from the discrepancy between the very high dimension of the parameter space in modern machine learning approaches and scarce data when exemplarily criticizing an alternative signalpeptide predictor. To summarize, the problem is that the calculated parameters are not reliable and it is not clear whether the correlations found are numerical noise of the data or biologically meaningful. If the data are scarce and no one tells the "black box" how the substrate protein interacts with the receptor, then the box would indeed need to be a "magic box" to know about it in order to pick the correct significant parameters.
As a direct consequence, human involvement and additional biological knowledge are indispensable for dimensionality reduction. In contrast to machine learning approaches, a physically justified model of productive binding with the kinase already provides a reasonable analytical form of scoring function terms. In this context, it is not so important whether this form can be further improved. We wish to emphasize that the number of parameters to be determined with the help of learning data is dramatically reduced.
In our approach, we consequently follow these considerations and try to incorporate all biologically relevant information into the analytical form of the prediction function. For example, it is utterly important that among all sequence positions, which carry relevant information, as many as possible are considered for the prediction procedure. In the case of PKA, we found the region to occupy the segment -18...+23. Also, we have very few parameters: a profile term over 13 positions centered around the putative site and 14 physical property terms Tj (typically involving 3 parameters: the mean and the standard deviation of an amino acid index averaged over some sequence region as well as a weight factor for the whole term). Especially the latter set of parameters is determined with high significance given the 239 positive examples. We think that the simplicity of our algorithm is its big strength since the output of the decision function clearly indicates what kind of property supports or prevents the prediction of a query as PKA substrate. In the process of predictor development, human interference can assure that only the biologically meaningful among the significant correlations enter the decision function.
The influence of the structure of the protein on the accessibility of the phosphorylation motif to the kinase (issue d) is difficult to estimate at present. By demanding an excess of hydrophilic, small and flexible residues in the region -18...+23 with the physical property terms, it becomes quite unlikely that a sequence region hit by our predictor is actually part of a 3D structure but rather represents an intrinsically disordered segment. Nevertheless, it cannot be excluded that sequence stretches that are not included in the motif definition might cause the entire protein to fold in such a way that the potential phosphorylation site is not accessible to its modifying kinase.
Finally, the cellular context is important (issue e). In the case of some translocation signals, it has been experimentally shown that even the presence of an in vivo
functional motif does not mean that the carrying protein is also imported into the corresponding subcellular compartment [41
]. This discovery of hidden sequence signals highlights the significance of cellular hierarchies for small functional protein motifs. Hence, current phosphorylation predictors including pkaPS do not really predict phosphorylation, but the potential of a sequence stretch to interact productively with the modifying kinase. This seemingly small detail may appear negligible but it is important to be considered in all predictions. E.g. a targeting signal located far away from a functional phosphorylation site on the same protein may lead to a removal from the cellular compartment of the respective kinase, thereby overriding the phosphorylation motif. This means that the analyzed motif is not the sole sequence stretch on the protein which is responsible for the modification.
These considerations mean that a phosphorylation predictor is not the only source of information that must be consulted when evaluating the phosphorylation state of a protein. Moreover, the number of apparently wrong predictions (if only the physiologically relevant cases are counted) of an algorithm is not only determined by the imperfection of its design since the predictor focuses on the query sequence stretch. Hence, even if all permissive amino acid permutations of the substrate motif are known, the theoretical accuracy of any post-translational modification predictor will have an upper limit clearly below 100%.