We have presented here a model for cyclin-dependent kinase substrates. The model first defines a bioinformatic representation of the Cdk phosphorylation motif, either as a regular expression or as a PSSM. In addition, the model proposes that a significant proportion of Cdk phosphorylation occurs on proteins that contain multiple phosphorylation sites. The non-random clustering of potential Cdk sites in particular proteins serves as evidence of a biological function that has been subject to natural selection.
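As a rough illustration of the regular-expression representation, the sketch below counts matches to the widely cited Cdk consensus S/T-P-x-K/R in a protein sequence. The pattern `[ST]P.[KR]` and the function names are assumptions made for illustration; the exact expression and scoring used in this study may differ.

```python
import re

# Assumed canonical Cdk consensus: S/T-P-x-K/R.
# This is an illustrative pattern, not necessarily the expression used in the study.
CDK_MOTIF = re.compile(r"[ST]P.[KR]")


def find_cdk_sites(sequence):
    """Return 0-based positions of the S/T residue for each motif match."""
    return [m.start() for m in CDK_MOTIF.finditer(sequence.upper())]


def count_cdk_sites(sequence):
    """Number of candidate Cdk sites in one protein (hypothetical helper)."""
    return len(find_cdk_sites(sequence))


if __name__ == "__main__":
    toy = "MSPAKAAATPLRGGSPQQ"  # invented sequence, not a real protein
    print(find_cdk_sites(toy), count_cdk_sites(toy))
```

A protein-level clustering score could then be built on the per-protein count, with the threshold chosen against a randomized background.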
The canonical motif and PSSM strategies, combined, define a set of 91 candidate Cdk substrate proteins comprising 1.5% of the yeast proteome. Of these, 46 (0.73% of the yeast proteome) were defined as strong candidates, either being detected by the canonical-motif scoring method or scoring above the upper cutoff with the PSSM-motif method. Twenty-seven were detected using only the canonical-motif method, 8 using only the PSSM-motif method, and 11 by both methods. The remaining 45 predicted candidates (0.72% of the yeast proteome) were “borderline” PSSM candidates only.
By comparison, only 0.10% of the sequences in the randomized mock proteome scored above the threshold for inclusion as strong candidates, and 0.45% of the sequences in the mock proteome met the score criteria for borderline PSSM candidates (but not strong candidates). The ratio of the fraction of candidates detected in the mock proteome to the fraction detected in the yeast proteome yields an estimated false positive rate of 14% for the strong candidates and 63% for the borderline candidates. These values indicate that there is indeed clustering at the sequence level beyond what would be expected by chance. From them we can infer that ~40 of the 46 strong candidates and ~17 of the 45 borderline candidates are bona fide Cdk substrates. Thus, although the false positive rate for the borderline candidates is high, that subset is nevertheless not inconsequential to biological researchers, since more than 1 in 3 are likely to be bona fide substrates.
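The arithmetic behind these estimates can be checked with a short sketch. It assumes, as the figures above imply, that the false positive rate is the mock-proteome hit fraction divided by the yeast-proteome hit fraction; the function names are hypothetical.

```python
def false_positive_rate(mock_fraction, real_fraction):
    """Estimated FPR: hit fraction in the mock proteome over hit fraction in yeast."""
    return mock_fraction / real_fraction


def expected_true_positives(n_candidates, fpr):
    """Number of candidates expected to be genuine substrates."""
    return n_candidates * (1.0 - fpr)


# Percentages quoted in the text (strong: 0.10% vs 0.73%; borderline: 0.45% vs 0.72%).
strong_fpr = false_positive_rate(0.10, 0.73)
borderline_fpr = false_positive_rate(0.45, 0.72)

strong_true = expected_true_positives(46, strong_fpr)
borderline_true = expected_true_positives(45, borderline_fpr)

if __name__ == "__main__":
    print(f"strong:     FPR {strong_fpr:.3f}, expected true {strong_true:.1f}")
    print(f"borderline: FPR {borderline_fpr:.3f}, expected true {borderline_true:.1f}")
```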
Out of the total set of 91 candidate substrates, 13 proteins (14%) are contained in the set of experimentally characterized in vivo substrates. To our knowledge, at the time of writing there are 26 proteins in that set (Table S2); thus 50% of the currently known substrates were detected as candidates. For reasons detailed below, we do not expect this method to be comprehensive, but rather to yield a set of likely candidate substrates useful for biological researchers while maintaining a reasonably low false positive rate. Extrapolating from our false positive and false negative rates, we expect there to be approximately 114 total proteins (1.9% of the yeast proteome) that are Cdc28 substrates.
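One reading of this extrapolation that reproduces the figure of 114 is sketched below: the expected number of genuine substrates among the candidates (~40 strong plus ~17 borderline) is divided by the fraction of known substrates recovered (13 of 26). This is a reconstruction from the numbers in the text, not necessarily the exact calculation performed; the function name is hypothetical.

```python
def estimate_total_substrates(expected_true, known_detected, known_total):
    """Scale the expected true positives by the observed detection rate."""
    sensitivity = known_detected / known_total  # 13 / 26 = 0.5
    return expected_true / sensitivity


# ~40 strong + ~17 borderline candidates expected to be genuine substrates.
total = estimate_total_substrates(40 + 17, known_detected=13, known_total=26)

if __name__ == "__main__":
    print(total)
```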
Many of our candidate substrates were also predicted to contain Cdk phosphorylation sites by other leading phosphorylation detection algorithms, such as Scansite and NetPhosK. Scansite, using a threshold setting of “high”, returns 265 yeast proteins (4.2% of the proteome) as candidate Cdk substrates. Of these, 35 are contained in our set of 91 candidate substrates (38%). Scansite predicts 8 of the 24 well-characterized candidate substrates (33%), as compared to the 50% hit rate using our method. When Scansite was run on our random sequence database, 2.8% of the sequences were detected as candidate Cdk substrates, corresponding to a false positive rate of 67% for Scansite in Cdk substrate prediction on this dataset. Therefore, although the present method was only somewhat more comprehensive than Scansite with respect to true positive detection (50% versus 33%), it was much more accurate in terms of false positive rate. Our method generates a set of strong candidates with an estimated false positive rate of 14%, while Scansite, even when set to high stringency, yields a false positive rate of 67%, similar to that of the borderline candidates (63%) generated using the current method.
NetPhosK detected 88 of our 91 candidates (97%) as containing Cdk phosphorylation sites, using a scoring threshold of 0.60, a true positive rate similar to that of Scansite. However, our simulations indicate that fully 21% of the proteome, or 1300 proteins, is predicted by NetPhosK to be Cdk substrates, and so the false positive rate is expected to be even higher for NetPhosK than for Scansite. Thus, protein-level motif clustering, the major difference between the method presented here and two leading phosphorylation prediction methods, is reflected in increased accuracy as measured by a reduced false positive rate.
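The false positive estimates in this comparison depend on scoring a randomized sequence database with each predictor. The sketch below shows one simple way such a background could be built and scored. It assumes per-protein residue shuffling, which preserves each protein's length and amino acid composition; the randomization scheme actually used may differ. The function names and the `is_candidate` callback are hypothetical.

```python
import random


def shuffle_sequence(sequence, rng):
    """Shuffle residues within one sequence, keeping length and composition."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def make_mock_proteome(proteome, seed=0):
    """Build a mock proteome by shuffling every sequence (assumed scheme)."""
    rng = random.Random(seed)
    return {name: shuffle_sequence(seq, rng) for name, seq in proteome.items()}


def fraction_flagged(proteome, is_candidate):
    """Fraction of sequences that a predictor (any callable) flags as candidates."""
    if not proteome:
        return 0.0
    flagged = sum(1 for seq in proteome.values() if is_candidate(seq))
    return flagged / len(proteome)


if __name__ == "__main__":
    toy = {"P1": "MSPAKTPLRSPEK", "P2": "MAAAGGGLLL"}  # invented sequences
    mock = make_mock_proteome(toy, seed=1)
    print(mock)
```

Dividing `fraction_flagged` on the mock proteome by the same quantity on the real proteome gives the kind of false positive estimate quoted above for each method.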
Our method predicts approximately half of the known yeast Cdk substrates. Therefore, in this study, we make no claim to completeness. Instead, we show the utility of a targeted bioinformatic tool that produces a set of predictions that can be validated using experimental techniques. Our pilot proteomic study, in which we assayed for in vivo phosphorylation using hypothesis-driven mass spectrometry, confirms a number of our predictions. In addition, our predictions are consistent with many of the high-scoring proteins from the high-throughput in vitro phosphorylation study by Ubersax et al., although most of these are as yet unconfirmed in vivo.
Our model, as it stands, is particularly useful for organisms with small proteomes, such as S. cerevisiae. Larger proteomes may be problematic because the false positive rate will likely increase with the number and size of proteins. Extending this procedure effectively may require additional filtering steps. For example, phosphorylation sites are largely expected to occur on solvent-accessible portions of proteins, particularly loops, so an additional weight could be given to motifs that are expected to occur in such regions, as determined by existing secondary structure prediction or homology modeling algorithms. Incorporating the conservation of phosphorylation motifs across related species into the model might also increase its specificity by adding further biological constraints. However, this has not proven straightforward: orthologous candidate substrates show homologous regions that are enriched for Cdk motifs, but in many cases the number and precise positioning of the motifs are not well conserved. Supplemental Table S3 shows some examples of the imperfect conservation of Cdk motifs across taxa in Cdk substrates. New algorithms are needed to properly account for these factors when performing multiple alignments of Cdk substrates.
Furthermore, the semi-processive physical model of Cdk phosphorylation also suggests that the clustering of sites likely occurs on contiguous surfaces or individual domains of proteins. The average spacing between motifs for candidate substrates identified in our study is 103 ± 63 amino acid residues (mean ± standard deviation) by canonical motif scoring and 69 ± 46 residues by PSSM scoring. For the subset of candidate substrates that overlaps with known, experimentally characterized Cdk substrates, the average spacing was smaller (63 ± 37 residues for canonical motif scoring and 38 ± 20 for PSSM scoring) but statistically indistinguishable from that of the overall set of candidate substrates. Such large spaces between sites suggest that three-dimensional, domain-level proximity, rather than simply linear spacing, plays an important role in the processivity of Cdk2. Further exploration is necessary to determine the feasibility of using spacing data or 3-D data to increase the selectivity of the procedure.
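A spacing statistic of this kind can be computed as sketched below. The sketch assumes that spacing means the distance in residues between consecutive motif positions within a protein, pooled over all proteins, and that the sample standard deviation is reported; the definition used in the study may differ. The function names and example positions are hypothetical.

```python
from statistics import mean, stdev


def motif_spacings(positions):
    """Distances between consecutive motif positions in one protein."""
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def spacing_summary(site_positions_by_protein):
    """Mean and sample standard deviation of spacings pooled across proteins."""
    gaps = [
        gap
        for positions in site_positions_by_protein.values()
        for gap in motif_spacings(positions)
    ]
    if len(gaps) < 2:
        raise ValueError("need at least two spacings to estimate a deviation")
    return mean(gaps), stdev(gaps)


if __name__ == "__main__":
    # Invented motif positions for two toy proteins.
    example = {"A": [10, 50, 110], "B": [5, 25]}
    print(spacing_summary(example))
```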
The algorithm missed certain known yeast substrates, such as Cdc23, that are thought to contain single phosphorylation sites. However, Cdc23 is present in cells in complex with the proteins Cdc16 and Cdc27, both of which have multiple putative Cdk phosphorylation sites. Therefore, it is reasonable to hypothesize that the kinase recognizes and phosphorylates a surface of the entire complex that is formed by the junction of all three proteins. As data on protein complexes become more comprehensive and reliable, it may become feasible to statistically analyze the presence of Cdk motifs within complexes in a manner similar to that done for individual proteins. We note that the domain-level clustering of motifs described here likely differs from the local clustering observed in the substrates of kinases such as the casein kinases and SR-specific protein kinases, where multiple phosphorylation sites are observed within a single extended motif or repeat region.
The success of the computational procedure presented here stresses the importance of not being limited to local sequence characteristics for functional prediction. The difficulty in predicting post-translational modifications, and phosphorylation in particular, is that short, local sequences (even those that match an extremely well-defined consensus) can occur frequently by random sequence drift. In the present study, we exploited the fact that Cdk substrates not only have consensus motifs that have been well studied and can be quite precisely defined, but also have the characteristic of site clustering. We incorporated both global and local sequence characteristics of Cdk substrates into a bioinformatic model that proved successful in predicting a significant number of putative substrates. A substantial amount of experimental information obtained by us and others leads us to believe that this set of putative substrates is, in fact, highly enriched for bona fide Cdk substrates. This set of proteins includes a substantial proportion of known substrates from previous in vivo and in vitro studies, as well as substrates whose in vivo phosphorylation was confirmed by mass spectrometry. In the future, approaches of this type, which incorporate biochemical details into bioinformatics and interface bioinformatics with experimental testing, should prove to be a useful strategy in predictive computational biology.