Improvements to Predikin. We have recently made some improvements to the Predikin algorithm and website. These include the ability to select different substitution matrices, the streamlining of the website to allow easier prediction of potential kinases given a substrate and the ability to perform whole-proteome analysis.
Substrate-to-Kinases Predictions. There are two fundamental questions a researcher may wish to ask about phosphorylation: which proteins will be phosphorylated by kinase X, and which kinases will phosphorylate protein Y? These are essentially the same problem seen from two different directions, and Predikin is able to answer both. Predikin's approach is the same regardless: analyse the kinase to produce a position weight matrix and then use this to score a potential phosphorylation site. However, in previous versions of the web server, submitting one substrate and many kinases was not very practical. The web server has been redesigned so that a researcher now only needs to submit a single sequence file, the content of which determines the type of analysis: the file may contain one kinase and multiple substrates to identify likely targets of the kinase, or it may contain multiple kinases and one substrate to identify the most likely kinase. Multiple kinases with multiple substrates may also be submitted for larger analyses. Predikin attempts to align each sequence to a hidden Markov model describing the kinase catalytic domain. This information is used to identify which sequences in the submitted file are kinases, so a researcher need not specify which sequences are kinases and which are substrates. All submitted sequences are treated as potential substrates, and thus auto-phosphorylation, or phosphorylation by another kinase, can also be detected.
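As a minimal sketch of the substrate-to-kinases direction, the following Python ranks several kinases against one candidate site. The kinase names, the toy 7-column matrices, the helper names and the log-probability scoring rule are all assumptions made for illustration; they are not Predikin's actual matrices or scoring function.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def uniform_column():
    """Column with no residue preference (0.05 for each amino acid)."""
    return {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def biased_column(preferred, p=0.62):
    """Column giving probability p to one residue and sharing the rest equally."""
    rest = (1.0 - p) / (len(AMINO_ACIDS) - 1)
    return {aa: (p if aa == preferred else rest) for aa in AMINO_ACIDS}


def phospho_column():
    """Central column of a hypothetical Ser/Thr kinase: only S or T accepted."""
    return {aa: (0.5 if aa in "ST" else 0.0) for aa in AMINO_ACIDS}


def score_site(pwm, window, floor=1e-6):
    """Assumed scoring rule: sum of log-probabilities over the window."""
    if len(window) != len(pwm):
        raise ValueError("window and PWM must have the same length")
    return sum(math.log(max(col.get(res, 0.0), floor))
               for col, res in zip(pwm, window))


def rank_kinases(pwms, window):
    """Return (kinase, score) pairs, best-scoring kinase first."""
    scored = [(name, score_site(pwm, window)) for name, pwm in pwms.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Two made-up kinases; the phospho-residue sits at index 3 of the window.
toy_pwms = {
    "kinase_A": [biased_column("R")] + [uniform_column()] * 2
                + [phospho_column()] + [uniform_column()] * 3,
    "kinase_B": [uniform_column()] * 3 + [phospho_column()]
                + [biased_column("P")] + [uniform_column()] * 2,
}

ranking = rank_kinases(toy_pwms, "RLASVAL")
print(ranking[0][0])
```

The same `score_site` function serves the kinase-to-substrates direction: hold one matrix fixed and loop over candidate windows instead of over kinases.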
Whole-Proteome Analysis. The Predikin web server has also been adapted to allow large-scale analyses to be conducted. This makes it possible to scan whole proteomes rather than a subset of selected proteins. As Predikin identifies kinase sequences, it is possible to submit a whole proteome in FASTA format and allow Predikin to identify the kinases and score each one against every potential phosphorylation site in the proteome. This type of analysis can be very time-consuming (depending on the number of sequences, number of kinases and number of potential phosphorylation sites); therefore, these jobs are queued and the results emailed to the researcher. Smaller jobs are still run on demand and the results are presented to the researcher through the website. In the future, we intend to make the results of whole-proteome analyses available on the Predikin website. This will enable researchers to access the results of common queries much faster.
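The cost of a whole-proteome scan grows with the number of candidate sites, which the sketch below enumerates from FASTA text. The flank width of 3, the padding character `X` and the function names are assumptions for illustration and do not describe the window Predikin actually uses.

```python
def parse_fasta(text):
    """Parse FASTA text into {identifier: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def candidate_sites(sequence, flank=3, residues="STY", pad="X"):
    """Yield (1-based position, window) for every S, T or Y in the sequence."""
    padded = pad * flank + sequence + pad * flank
    for i, res in enumerate(sequence):
        if res in residues:
            # Residue i of the sequence sits at index i + flank of `padded`,
            # so this slice is centred on it.
            yield i + 1, padded[i:i + 2 * flank + 1]


toy_proteome = parse_fasta(">protA\nMASTK\n>protB\nGGYG\n")
for protein, seq in toy_proteome.items():
    for position, window in candidate_sites(seq):
        print(protein, position, window)
```

Every window produced this way would then be scored against every kinase matrix, so the total work is roughly the number of kinases multiplied by the number of candidate sites.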
Extending Predikin's Reach. Prior to submitting predictions to DREAM4, Predikin was unable to build valid weight matrices for two of the three protein kinases in the DREAM4 challenge. This led to three changes to the system, all of which increase the number of protein kinases for which Predikin can build position weight matrices: updating PredikinDB, changing substitution matrices and changing substitution matrix cut-off values.
PredikinDB has continued to be updated from the latest UniProtKB releases. It now also incorporates data from PhosphoELM. Including further data sources has significantly increased the number of protein kinase-substrate interactions in PredikinDB: it now contains 5127 phosphorylation sites that are linked to a specific kinase, 2260 from PhosphoELM and 2867 from UniProtKB. This increases the chances of building a valid frequency matrix (see Methods); therefore, Predikin is now able to make predictions for a much broader range of protein kinases.
To assess the ability of these new features to increase the number of protein kinases for which Predikin can make predictions, and to evaluate their effect on accuracy, a published data set of 61 protein kinases from yeast was used. For each of these kinases, a position weight matrix describing the sequence specificity surrounding the phospho-residue had been experimentally determined.
To successfully build a position weight matrix, the Predikin method relies on identifying similar specificity-determining residues, and this, in turn, is reliant on the substitution matrix used. Testing has shown that the use of different substitution matrices can enable Predikin to build position weight matrices for more protein kinases (by altering what Predikin considers similar to a specificity-determining residue). To analyse the benefits of using different substitution matrices, we attempted to build position weight matrices for each of the yeast protein kinases using various BLOSUM matrices. To assess the quality of Predikin's position weight matrices we used the same evaluation method as the DREAM4 challenge: similarity to an experimentally mapped position weight matrix using the distance induced by the Frobenius norm (Frobenius distance; see Methods). The DREAM4 challenge also provided a p-value for each Frobenius distance, defined as the probability that a random position weight matrix has the same or a smaller Frobenius distance, and we have applied the same method to calculate p-values for the yeast kinase predictions.
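The evaluation measure can be sketched in a few lines of Python. The Frobenius distance is the square root of the sum of squared element-wise differences between two matrices. The empirical p-value below follows the definition given in the text, but the way random matrices are drawn (independent uniform values, normalised per position) is an assumption; the null model used by DREAM4 may differ.

```python
import math
import random


def frobenius_distance(a, b):
    """Square root of the summed squared differences between two matrices."""
    return math.sqrt(sum((x - y) ** 2
                         for row_a, row_b in zip(a, b)
                         for x, y in zip(row_a, row_b)))


def random_pwm(n_positions, n_residues, rng):
    """Assumed null model: random rows normalised to sum to 1."""
    rows = []
    for _ in range(n_positions):
        raw = [rng.random() for _ in range(n_residues)]
        total = sum(raw)
        rows.append([v / total for v in raw])
    return rows


def empirical_p_value(predicted, experimental, n_random=1000, seed=0):
    """Fraction of random matrices at least as close to the experimental one."""
    rng = random.Random(seed)
    observed = frobenius_distance(predicted, experimental)
    n_pos, n_res = len(experimental), len(experimental[0])
    hits = sum(
        frobenius_distance(random_pwm(n_pos, n_res, rng), experimental) <= observed
        for _ in range(n_random)
    )
    return hits / n_random


# Toy 2-position, 2-residue matrices (rows are positions).
experimental = [[0.9, 0.1], [0.2, 0.8]]
close_prediction = [[0.85, 0.15], [0.25, 0.75]]
far_prediction = [[0.1, 0.9], [0.8, 0.2]]
print(frobenius_distance(close_prediction, experimental))
print(empirical_p_value(close_prediction, experimental))
```

With only `n_random` samples the smallest non-zero p-value that can be resolved is `1 / n_random`, so very small p-values such as those reported for the challenge would require an analytical or much larger sampling approach.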
Of the 16 BLOSUM matrices tested, BLOSUM30 clearly stands out as providing the most position weight matrices (). An important question, however, is whether the position weight matrices produced with this matrix are as accurate as those built with Predikin's default substitution matrix, BLOSUM62.
Number of position weight matrices built using each BLOSUM matrix.
We calculated the Frobenius distance for the 12 protein kinases for which a position weight matrix can be built using all of the substitution matrices. For any given kinase, the distance produced does not vary greatly as the BLOSUM matrix changes (). These results also show that there is no single best substitution matrix: the best matrix to use is dependent on the kinase, and there is no way of knowing in advance which matrix will perform best. While we may not select the best matrix for individual kinases every time, the difference in the prediction is likely to be very small.
Using different BLOSUM matrices does not adversely affect Frobenius distance.
Together these results show that we are able to increase the number of kinases for which Predikin can build position weight matrices by changing the substitution matrix, and that BLOSUM30 captures the most kinases. We have also shown that the distance to the experimentally derived position weight matrix is not adversely affected by the use of BLOSUM30. We have also found that altering the substitution matrix cut-off value affects the number of position weight matrices that can be built. BLOSUM62 contains numbers ranging from −4 to 11, with higher numbers indicating more likely substitutions. By default, Predikin uses a cut-off value of 1, meaning that any substitution with a positive score is allowed; however, using a cut-off value of 0 greatly increases the number of kinases for which position weight matrices can be built, without affecting the accuracy of those position weight matrices. By using a cut-off value of 0, Predikin is able to build position weight matrices for many more protein kinases ().
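To illustrate what the cut-off changes, the sketch below lists the residues that would count as similar to serine under each cut-off. The scores are the serine row of the standard BLOSUM62 matrix; the rule "allowed if score >= cut-off" is inferred from the description above, and the function name is hypothetical.

```python
# Serine row of the standard BLOSUM62 matrix (score of replacing S by each residue).
BLOSUM62_SER = {
    "A": 1, "R": -1, "N": 1, "D": 0, "C": -1, "Q": 0, "E": 0, "G": 0,
    "H": -1, "I": -2, "L": -2, "K": 0, "M": -1, "F": -2, "P": -1,
    "S": 4, "T": 1, "W": -3, "Y": -2, "V": -2,
}


def allowed_substitutions(scores, cutoff):
    """Residues whose substitution score meets the cut-off (assumed rule)."""
    return sorted(aa for aa, score in scores.items() if score >= cutoff)


print(allowed_substitutions(BLOSUM62_SER, cutoff=1))
print(allowed_substitutions(BLOSUM62_SER, cutoff=0))
```

Lowering the cut-off from 1 to 0 admits the zero-scoring substitutions as well, which widens the pool of database kinases that can contribute to a matrix.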
We also asked whether using a cut-off value of 0 adversely affected the distances we obtained compared with using a value of 1. We calculated the distance from the experimentally derived position weight matrix for 12 kinases using cut-off values of both 1 and 0. In four cases, the smallest distance was produced with a cut-off value of 1 (Cdc5, Gcn2, Hrr25 and Ste20) and, in a further four cases, a cut-off value of 0 gave the smallest distance (Tpk1, Tpk2, Tpk3 and Ypk1). In the remaining four cases (Cla4, Ipl1, Pkh2 and Prk1) the distance was the same for both cut-off values (). These results show that using a substitution cut-off value of 0 does not adversely affect the majority of cases, and in some cases it even improves the Frobenius distance obtained. Again, the advantages of extending the range of Predikin are significant, while the disadvantages are slight, as in most cases the increase in distance is itself very small.
Using a BLOSUM cut-off value of 0 instead of 1 does not adversely affect Frobenius distance.
shows the effect of applying various new options of Predikin to the yeast kinases characterised by Mok et al. The leftmost distribution, showing output from the original version of Predikin, shows that while all predictions made had good p-values (<1e-6), Predikin was only able to make predictions for 25% of the kinases. By updating PredikinDB, but still using BLOSUM62 and a cut-off value of 1, Predikin more than doubles the number of kinases for which predictions can be made. The updated database also causes the median p-value to drop quite significantly (<1e-24). This trend is repeated when we use BLOSUM62 with a cut-off value of 0: the median p-value drops below 1e-30 and the coverage of kinases for which Predikin can make predictions rises to 80%. When we switch to BLOSUM30 we see a similar effect, with the final distribution (far right) showing results using BLOSUM30 and a cut-off value of 0. Here the median p-value drops to 1e-42 and the coverage reaches over 90%. When we use the updated version of PredikinDB, the predictions generally improve, but we also see some outliers starting to appear. These always correspond to kinases for which Predikin was previously unable to make predictions. We consider the benefits of smaller Frobenius distances for most kinases and significantly greater coverage of kinases to greatly outweigh the disadvantages of a small number of larger distances.
Effect of various BLOSUM matrices and cut-off values on Predikin's performance.
There remained five kinases that Predikin was unable to build specificity matrices for under any circumstances: Cak1, Kin1, Psk1, Sky1 and Ypl141c. Two of these (Cak1 and Sky1) are CMGC (a family of kinases including cyclin-dependent kinases, mitogen-activated kinases, CDK-like kinases and glycogen synthase kinases) kinases and the others are calmodulin-dependent kinases (CaMK). These are the two most represented groups in the kinases (37% CaMK and 25% CMGC kinases), and there are no consistent patterns with the specificity-determining residues of the kinases; therefore, we believe that the inability of Predikin to make predictions for these kinases is simply due to a lack of kinases with similar specificity-determining residues in PredikinDB, and that this will be rectified in time as our knowledge of kinase-substrate interactions grows.
New Style Position Weight Matrices
During the course of our investigations, a different method of converting a frequency matrix to a position weight matrix was devised (see Methods). shows the Frobenius distances for the yeast protein kinases used above where the position weight matrices have been built with both the old (submitted to the DREAM4 challenge) and new methods. For all kinases except one, Cak1, there is a decrease in distance. We believe that the reason for this improvement is that no adjustments for the background amino acid frequencies were made to the experimental data; therefore, by also not accounting for them, our predictions more closely mimic the experimental results (see Methods).
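The difference between the two styles can be pictured with a small Python sketch. The new-style conversion below simply normalises the frequencies at a position. The background-adjusted variant (divide by background frequency, then renormalise) is only one plausible form of such a correction and is an assumption; the exact formulas are those given in Methods, and the counts and background values here are invented.

```python
def to_probabilities(counts):
    """New-style sketch: relative frequencies with no background correction."""
    total = sum(counts.values())
    return {aa: c / total for aa, c in counts.items()}


def background_adjusted(counts, background):
    """Assumed old-style sketch: weight by 1/background, then renormalise."""
    probs = to_probabilities(counts)
    ratios = {aa: probs[aa] / background[aa] for aa in probs}
    total = sum(ratios.values())
    return {aa: r / total for aa, r in ratios.items()}


# Invented counts for one position and invented background frequencies.
counts = {"R": 6, "K": 3, "A": 1}
background = {"R": 0.05, "K": 0.06, "A": 0.08}

print(to_probabilities(counts))
print(background_adjusted(counts, background))
```

In this sketch, residues that are rare in the background gain weight under the adjusted form, so a matrix built that way drifts away from an experimental matrix that reports raw preferences.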
Comparison of old- and new-style position weight matrices.
The newer style matrices show a general trend to lower Frobenius distances, and hence lower p-values. As the primary purpose of Predikin is to enable predictions of phosphorylation events, we investigated whether this decrease in Frobenius distance correlates with an increase in predictive power. ROC analysis shows that there is almost no difference in predictive power between the two styles of position weight matrix (). This result demonstrates that while the Frobenius distance may be useful in determining which of several predicted position weight matrices is closest to an experimentally determined position weight matrix, it does not necessarily correlate well with the predictive power of those position weight matrices.
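One way to see why distance and predictive power can diverge is that the area under the ROC curve depends only on how sites are ranked, not on the score values themselves. The sketch below computes the area with the pairwise (Mann-Whitney) formulation; the scores and labels are invented and the function name is hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve: chance that a positive outranks a negative."""
    positives = [s for s, is_site in zip(scores, labels) if is_site]
    negatives = [s for s, is_site in zip(scores, labels) if not is_site]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Invented scores for five sites from two matrix styles; 1 marks a true site.
labels = [1, 1, 0, 1, 0]
old_style_scores = [2.1, 1.7, 0.9, 0.4, 0.2]
new_style_scores = [4.0, 3.1, 1.5, 1.0, 0.3]

print(roc_auc(old_style_scores, labels))
print(roc_auc(new_style_scores, labels))
```

The two score lists differ in magnitude but order the sites identically, so they give the same area; a change in matrix values that lowers the Frobenius distance without reordering sites leaves the ROC result untouched.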
Predictive performance of old- and new-style position weight matrices.
We further investigated the usefulness of the Frobenius distance and associated p-values by testing artificial position weight matrices that show no sequence preference against the protein kinases from the DREAM4 challenge. We constructed three position weight matrices that had equal probabilities for all amino acids in all positions (values of 0.05 represent equal probability between the 20 amino acids) except for the phospho-residue position. One weight matrix had probabilities of 0.05 for all amino acids, the second had probabilities of 0.5 for serine and threonine and 0 for all other amino acids in the phosphorylated position, and the third had probabilities of 0.33 for serine, threonine and tyrosine and 0 for all other amino acids in the phosphorylated position. The lowest Frobenius distances were obtained by assuming only that the phospho-residue is either serine or threonine; the p-values for these matrices are all lower than the ones obtained by Predikin in the DREAM4 challenge ().
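The three artificial matrices are simple to construct, as the following sketch shows. The probabilities are those stated in the text; the window length of 7 positions, the central placement of the phospho-residue and the function names are assumptions for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def flat_column():
    """No preference: 0.05 for each of the 20 amino acids."""
    return {aa: 0.05 for aa in AMINO_ACIDS}


def phospho_column(allowed, probability):
    """Phospho-residue column: `probability` for allowed residues, 0 otherwise."""
    return {aa: (probability if aa in allowed else 0.0) for aa in AMINO_ACIDS}


def low_specificity_pwm(centre_column, n_positions=7):
    """Flat matrix with a chosen column at the central (phospho) position."""
    centre = n_positions // 2
    return [dict(centre_column) if i == centre else flat_column()
            for i in range(n_positions)]


uniform_pwm = low_specificity_pwm(flat_column())
st_pwm = low_specificity_pwm(phospho_column("ST", 0.5))
sty_pwm = low_specificity_pwm(phospho_column("STY", 0.33))

print(st_pwm[3]["S"], st_pwm[3]["Y"], sty_pwm[3]["Y"])
```

Because none of these matrices encodes any kinase-specific information outside the central column, a low Frobenius distance for them indicates that the measure rewards getting the phospho-residue right far more than it rewards capturing flanking preferences.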
Frobenius distances and p-values for low specificity position weight matrices.
It is important to remember that some protein kinases are less specific than others, and that in situations involving these kinases a position weight matrix where many of the probabilities are close to 0.05 may be entirely appropriate. To see if this was the case for the kinases in the DREAM4 challenge we produced sequence logos based on the predicted and experimental position weight matrices (). All of the kinases in the DREAM4 challenge have positions on either side of the phospho-residue that do not have significant amino acid preferences and, apart from the phospho-residue position, only one or two other positions have a significant effect on specificity.
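The height of a column in a sequence logo is its information content, which can be computed as in the sketch below. This uses the standard Shannon formulation without a small-sample correction; the function name and example columns are illustrative.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def information_content(column, alphabet_size=20):
    """Bits of information at one position: log2(alphabet) minus entropy."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return math.log2(alphabet_size) - entropy


flat = {aa: 0.05 for aa in AMINO_ACIDS}
ser_thr = {aa: (0.5 if aa in "ST" else 0.0) for aa in AMINO_ACIDS}

print(information_content(flat))
print(information_content(ser_thr))
```

A position with no preference carries essentially zero bits and is nearly invisible in a logo, whereas a phospho-residue column restricted to serine or threonine carries log2(20) − 1 bits, which is why it dominates the logos of low-specificity kinases.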
Sequence logos based on predicted and experimental position weight matrices for the kinases in the DREAM4 challenge.
Predikin's Performance in DREAM4
The Predikin algorithm entered the recent DREAM4 challenge and was declared “best performer” in the protein kinase section of the Peptide Recognition Domain specificity prediction category. In the following discussion, it should be noted that the DREAM4 predictions were made before some of the new features of Predikin described above had been implemented and before the evaluations with the yeast kinases had been completed. We were, therefore, unable to take full advantage of the knowledge subsequently gained.
There were three protein kinases in the Peptide Recognition Domain specificity section of the challenge: MELK, BIKE and CaMKK2. In all three cases, the Frobenius distance produced from Predikin's position weight matrix was the lowest achieved by any of the challenge entrants. By default, Predikin used BLOSUM62 as its substitution matrix with a cut-off value of 1. For some of the kinases in the DREAM4 challenge we had to adjust these settings. We used the following: BLOSUM62 with a cut-off value of 1 for CaMKK2, BLOSUM62 with a cut-off value of 0 for MELK and BLOSUM35 with a cut-off value of 0 for BIKE. shows Predikin's results from the DREAM4 evaluation; the p-values associated with each distance show that Predikin is producing position weight matrices that are significantly closer to the experimental position weight matrices than would be expected by chance. also compares the distances achieved with the new form of position weight matrix described above with the distances from the position weight matrices submitted to the DREAM4 challenge. There is considerable improvement for two of the three kinases, but there is a small increase in distance for CaMKK2. The original position weight matrix for CaMKK2 did not distinguish between serine and threonine and gave them equal weight, whereas the new position weight matrix weights serine higher than threonine. The experimental position weight matrix for CaMKK2 shows a very strong preference for threonine as the phosphorylated residue, so the new weighting is incorrect. This error in identifying the phospho-residue preference accounts for the slight increase in distance for the new predicted position weight matrix compared to the original.
Frobenius distances for Predikin position weight matrices built with the submitted and new method.